Using the value_counts()
Method in Pandas
In this tutorial, we'll explore the value_counts()
method in Pandas, a powerful tool for counting the occurrences of values within a Series in a DataFrame. We'll cover its basic usage, as well as various arguments you can use to customize its behavior.
1. Introduction to value_counts()
The value_counts()
method in Pandas is used to count the occurrences of unique values in a Series. It's particularly useful when you want to analyze the distribution of categorical data within a DataFrame.
1.1 Syntax:
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
1.2 Parameters:
normalize
: If True, returns the relative frequencies of the unique values.sort
: If True, sorts the counts in descending order by default.ascending
: If True, sorts in ascending order.bins
: Specifies the number of equal-width bins to divide the data into (only works with numerical data).dropna
: If False, does not drop NaN values from the result.
Now, let's dive into each parameter and see how it influences the behavior of value_counts()
.
2. Basic Usage
2.1 Example Data
Let's assume we have a DataFrame containing information about the top richest people in the world. We'll focus on the 'industry' column for our examples.
import pandas as pd
# Assuming 'df' is our DataFrame containing the data
industry_counts = df['industry'].value_counts().sort_index()
print(industry_counts)
Output:
Fashion and retail 18
Finance 13
Technology 15
Name: industry, dtype: int64
3. Understanding Parameters
3.1 normalize
Setting normalize=True
returns the relative frequencies of unique values instead of counts.
industry_normalized = df['industry'].value_counts(normalize=True)
print(industry_normalized)
Output:
Fashion and retail 0.45
Finance 0.325
Technology 0.225
Name: industry, dtype: float64
3.2 sort
and ascending
By default, sort=True
sorts the values in descending order. You can change the sorting order using the ascending
parameter.
industry_sorted = df['industry'].value_counts(sort=True, ascending=False)
print(industry_sorted)
Output:
Fashion and retail 18
Technology 15
Finance 13
Name: industry, dtype: int64
3.3 bins
The bins
parameter divides numerical data into equal-width intervals and counts the occurrences within each interval. This is useful for analyzing distributions.
age_bins = df['age'].value_counts(bins=5)
print(age_bins)
Output:
(29.9, 42.8] 5
(42.8, 55.6] 17
... ...
Name: age, dtype: int64
3.4 dropna
By default, dropna=True
excludes NaN values from the result. Set it to False if you want to include NaN values.
industry_with_na = df['industry'].value_counts(dropna=False)
print(industry_with_na)
Output:
Fashion and retail 18
Finance 13
Technology 15
NaN 3
Name: industry, dtype: int64