Sorting Data

In this tutorial, we'll explore how to efficiently retrieve the top or bottom rows of a Pandas DataFrame using the nlargest and nsmallest methods. These methods are particularly useful when you only need the n rows with the largest or smallest values in one or more columns, rather than a full sort of the DataFrame.

1. Importing Necessary Libraries and Loading Data

First, let's import the Pandas library and load our dataset. We'll specify the columns we're interested in, which are the series title, rating, and the number of votes.

import pandas as pd

# Define columns of interest
columns = ['title', 'rating', 'num_votes']

# Load dataset
df = pd.read_csv('your_dataset.csv', usecols=columns)

Replace 'your_dataset.csv' with the path to your dataset file.
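
If you don't have a dataset at hand, the loading step can be tried on a small in-memory CSV. This is only a sketch: the titles, the extra year column, and all values below are made up for illustration, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'your_dataset.csv'
csv_text = """title,year,rating,num_votes
Show A,2010,9.1,1200
Show B,2015,8.7,45000
Show C,2018,9.1,300
Show D,2020,7.5,98000
"""

# Define columns of interest; 'year' is left out on purpose
columns = ['title', 'rating', 'num_votes']

# usecols drops every column not listed, so 'year' is never loaded
df = pd.read_csv(io.StringIO(csv_text), usecols=columns)
print(df.columns.tolist())
print(df.shape)
```

Loading only the columns you need keeps the DataFrame smaller, which matters more as the dataset grows.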

2. Sorting with nlargest

2.1 Sorting by a Single Column

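To get the rows with the largest values in a single column, such as the number of votes, we use the nlargest method. It takes the number of rows to return and the column to rank by, and returns those rows in descending order of that column.

# Top 10 rows by number of votes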
top_10_votes = df.nlargest(10, 'num_votes')
print(top_10_votes)

2.2 Sorting by Multiple Columns

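To rank by multiple columns, such as rating and number of votes, we pass a list of columns to the nlargest method. Rows are ranked by the first column, and each later column is only used to break ties in the columns before it.

# Top 10 by rating, with ties broken by number of votes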
top_rated = df.nlargest(10, ['rating', 'num_votes'])
print(top_rated)
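
To see how ties are resolved, here is a minimal sketch on a made-up DataFrame (the titles and numbers are assumptions for illustration only). Three rows share the top rating, so the number of votes decides which two of them are returned.

```python
import pandas as pd

# Hypothetical data: A, B and D all tie on rating
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [9.0, 9.0, 8.5, 9.0],
    'num_votes': [100, 500, 900, 300],
})

# rating is compared first; num_votes only breaks the three-way tie
top_2 = df.nlargest(2, ['rating', 'num_votes'])
print(top_2)
```

Rows B and D are selected: C has the most votes overall, but its lower rating rules it out before votes are ever considered.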

3. Sorting with nsmallest

Sorting with nsmallest works similarly to nlargest, but it retrieves the smallest values instead.

# Get the 10 rows with the fewest votes
bottom_10_votes = df.nsmallest(10, 'num_votes')
print(bottom_10_votes)
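
As a quick check of the ordering, the sketch below runs nsmallest on a small made-up DataFrame (values are illustrative assumptions). The result comes back in ascending order of the chosen column, and each row keeps its original index label.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E'],
    'num_votes': [500, 20, 900, 75, 310],
})

# The 3 rows with the fewest votes, smallest first
bottom_3 = df.nsmallest(3, 'num_votes')
print(bottom_3)
```

Because the original index labels are preserved, you can still use them to look the rows up in the full DataFrame afterwards.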

4. Handling Duplicates with the keep Parameter

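The keep parameter determines which rows are returned when several rows tie at the cutoff value. It accepts 'first' (the default, which keeps the tied rows that appear first in the DataFrame), 'last' (which keeps the tied rows that appear last), and 'all' (which keeps every tied row, so the result can contain more than n rows).

# Keep every row tied at the cutoff for the top-rated titles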
all_duplicates = df.nlargest(10, 'rating', keep='all')
print(all_duplicates)
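
The three keep options can be compared side by side. This is a minimal sketch on made-up data in which rows B and C tie for second place, so asking for the top 2 forces a choice between them.

```python
import pandas as pd

# Hypothetical data: B and C tie at the cutoff for the top 2
df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'rating': [9.0, 8.0, 8.0, 7.0],
})

keep_first = df.nlargest(2, 'rating', keep='first')  # tie goes to the earlier row
keep_last = df.nlargest(2, 'rating', keep='last')    # tie goes to the later row
keep_all = df.nlargest(2, 'rating', keep='all')      # both tied rows are kept

print(keep_first['title'].tolist())
print(keep_last['title'].tolist())
print(keep_all['title'].tolist())
```

Note that keep='all' returns three rows here even though only two were requested, so don't rely on the result having exactly n rows when using it.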

5. Using sort_values as an Alternative

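You can also use sort_values followed by head to achieve the same result. This sorts the entire DataFrame before discarding all but the first rows, so it is generally less efficient than nlargest when you only need a few rows.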

# Sorting using sort_values (less efficient)
sorted_df = df.sort_values(by=['rating', 'num_votes'], ascending=False).head(10)
print(sorted_df)
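
To confirm that the two approaches agree, we can compare their results on randomly generated data. This is a sketch under stated assumptions: the data is synthetic, and num_votes is built as a permutation so that no two rows tie on both columns (with exact ties, the two approaches may pick different rows).

```python
import numpy as np
import pandas as pd

# Synthetic data: ratings repeat, but num_votes values are all distinct
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'rating': rng.integers(1, 11, size=1000),
    'num_votes': rng.permutation(1000),
})

via_nlargest = df.nlargest(10, ['rating', 'num_votes'])
via_sort = df.sort_values(by=['rating', 'num_votes'], ascending=False).head(10)

# Same rows, same order, same index labels
print(via_nlargest.equals(via_sort))
```

If you want to measure the speed difference on your own data, timing both calls with the standard timeit module is a reasonable way to do it.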