concat()
The concat() function in Pandas combines multiple DataFrames or Series along a chosen axis. It offers flexible options for handling datasets of different sizes, columns, or indices. This documentation walks through its functionality, use cases, and practical examples.
Introduction
In real-world scenarios, data often comes from various sources and may not have consistent structures. You might encounter datasets with:
- Different indices.
- Varying column names.
- Overlapping or completely distinct data points.
The concat() function combines these datasets into a single DataFrame, whether or not their rows and columns overlap.
Syntax
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, ...) -> DataFrame or Series
Parameters:
- objs: List or tuple of DataFrames or Series to concatenate.
- axis: 0 (default) for vertical concatenation, 1 for horizontal concatenation.
- join: Specifies how to handle the labels on the other axis (columns when axis=0, indices when axis=1):
  - "outer" (default): Union of all columns or indices.
  - "inner": Intersection of columns or indices.
- ignore_index: Boolean, False by default. If True, resets the index to default integer indexing.
- keys: List of keys to create a multi-level index.
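As a small illustration of the objs and axis parameters, the sketch below concatenates two Series. The Series names "x" and "y" are made up for this example. It suggests that the return type depends on the inputs and the axis: stacking Series vertically gives a Series, while placing them side by side gives a DataFrame.

```python
import pandas as pd

# Two illustrative Series (names "x" and "y" are assumptions for this sketch)
s1 = pd.Series([1, 2], name="x")
s2 = pd.Series([3, 4], name="y")

# axis=0 (default): values are stacked, and the result is still a Series
stacked = pd.concat([s1, s2])

# axis=1: each Series becomes a column, and the result is a DataFrame
side_by_side = pd.concat([s1, s2], axis=1)

print(type(stacked).__name__)
print(type(side_by_side).__name__)
print(list(side_by_side.columns))
```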
Examples
1. Basic Concatenation
Combine two DataFrames vertically (default behavior):
import pandas as pd
# Define DataFrames
data1 = {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
data2 = {"A": [10, 11, 12], "B": [13, 14, 15], "C": [16, 17, 18]}
# Convert to DataFrames
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# Concatenate
result = pd.concat([df1, df2])
print(result)
Output:
    A   B   C
0   1   4   7
1   2   5   8
2   3   6   9
0  10  13  16
1  11  14  17
2  12  15  18
Notice that the index labels 0 to 2 are repeated, because each DataFrame keeps its original index.
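Repeated index labels can cause surprises later, so it may help to see their effect. The following sketch rebuilds the same two DataFrames, shows that a label lookup now returns two rows, and uses the verify_integrity parameter of concat() to raise an error when labels overlap. The variable names rows_with_label_0 and duplicates_detected are chosen only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"A": [10, 11, 12], "B": [13, 14, 15], "C": [16, 17, 18]})

result = pd.concat([df1, df2])

# Label 0 now appears twice, so .loc[0] returns two rows instead of one
rows_with_label_0 = result.loc[0]
print(rows_with_label_0)

# verify_integrity=True makes concat() raise a ValueError on overlapping labels
try:
    pd.concat([df1, df2], verify_integrity=True)
    duplicates_detected = False
except ValueError:
    duplicates_detected = True

print(duplicates_detected)
```

If unique labels matter for later lookups, ignore_index=True or the keys parameter are two ways to avoid the duplication.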
2. Horizontal Concatenation
Concatenate DataFrames along the horizontal axis by setting axis=1:
result = pd.concat([df1, df2], axis=1)
print(result)
Output:
   A  B  C   A   B   C
0  1  4  7  10  13  16
1  2  5  8  11  14  17
2  3  6  9  12  15  18
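The horizontal result above contains each column name twice, which makes column selection ambiguous. The sketch below shows what selecting "A" returns in that case, and one possible way to disambiguate by combining axis=1 with keys. The key labels "first" and "second" are assumptions made for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"A": [10, 11, 12], "B": [13, 14, 15], "C": [16, 17, 18]})

wide = pd.concat([df1, df2], axis=1)

# With duplicate column names, selecting "A" returns a DataFrame with both columns
both_a = wide["A"]
print(both_a.shape)

# Adding keys along axis=1 creates two-level column labels that stay unique
labeled = pd.concat([df1, df2], axis=1, keys=["first", "second"])
second_a = labeled[("second", "A")]
print(second_a.tolist())
```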
3. Adding Keys for Multi-Level Indexing
Use the keys parameter to distinguish between datasets:
result = pd.concat([df1, df2], keys=['data1', 'data2'])
print(result)
Output:
          A   B   C
data1 0   1   4   7
      1   2   5   8
      2   3   6   9
data2 0  10  13  16
      1  11  14  17
      2  12  15  18
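Once keys are attached, the outer index level can be used to get each original dataset back. The sketch below is one way to do that with .loc. It also passes the names parameter of concat() to label the two index levels; the level names "source" and "row" are hypothetical and chosen only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"A": [10, 11, 12], "B": [13, 14, 15], "C": [16, 17, 18]})

# names labels the two index levels ("source" and "row" are illustrative)
result = pd.concat(
    [df1, df2],
    keys=["data1", "data2"],
    names=["source", "row"],
)

# Select every row that came from the first DataFrame
first_block = result.loc["data1"]
print(first_block)

# Select a single cell: row 1 of the second DataFrame, column "B"
single_value = result.loc[("data2", 1), "B"]
print(single_value)
```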
4. Ignoring Index
Reset the index using ignore_index=True:
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
    A   B   C
0   1   4   7
1   2   5   8
2   3   6   9
3  10  13  16
4  11  14  17
5  12  15  18
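Two related behaviors may be worth checking here. The sketch below compares ignore_index=True with calling reset_index(drop=True) after a plain concatenation, and shows that with axis=1 the option renumbers the column labels rather than the row index. The variable names are chosen only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"A": [10, 11, 12], "B": [13, 14, 15], "C": [16, 17, 18]})

# Vertical: ignore_index=True renumbers the rows 0..5
stacked = pd.concat([df1, df2], ignore_index=True)

# An equivalent two-step approach: concatenate, then drop the old index
same = pd.concat([df1, df2]).reset_index(drop=True)
print(stacked.equals(same))

# Horizontal: ignore_index=True applies to the concatenation axis,
# so the column labels are replaced by integers
wide = pd.concat([df1, df2], axis=1, ignore_index=True)
print(list(wide.columns))
```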
5. Different Indices
Handle DataFrames with differing indices:
# Modify indices
df1.index = [1, 2, 3]
df2.index = [4, 5, 6]
result = pd.concat([df1, df2])
print(result)
Output:
    A   B   C
1   1   4   7
2   2   5   8
3   3   6   9
4  10  13  16
5  11  14  17
6  12  15  18
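Differing indices matter more for horizontal concatenation, where rows are aligned by index label. The sketch below uses two small single-column DataFrames whose indices overlap only at label 3. These frames and their index values are assumptions made for this example, not the ones defined above. It compares the default outer join with join="inner".

```python
import pandas as pd

# Illustrative frames whose indices share only the label 3
left = pd.DataFrame({"A": [1, 2, 3]}, index=[1, 2, 3])
right = pd.DataFrame({"B": [10, 11, 12]}, index=[3, 4, 5])

# Outer join (default): union of index labels, NaN where a frame has no row
wide = pd.concat([left, right], axis=1)
print(wide)

# Inner join: only index labels present in both frames are kept
inner = pd.concat([left, right], axis=1, join="inner")
print(inner)
```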
6. Different Column Names
Concatenate DataFrames with mismatched columns:
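# Recreate df1 so that its index is the default 0..2 again
df1 = pd.DataFrame(data1)
# Give the second DataFrame entirely different column names
data2 = {"D": [10, 11, 12], "E": [13, 14, 15], "F": [16, 17, 18]}
df2 = pd.DataFrame(data2)
# Columns missing from one DataFrame are filled with NaN
result = pd.concat([df1, df2], sort=False)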
print(result)
Output:
     A    B    C     D     E     F
0  1.0  4.0  7.0   NaN   NaN   NaN
1  2.0  5.0  8.0   NaN   NaN   NaN
2  3.0  6.0  9.0   NaN   NaN   NaN
0  NaN  NaN  NaN  10.0  13.0  16.0
1  NaN  NaN  NaN  11.0  14.0  17.0
2  NaN  NaN  NaN  12.0  15.0  18.0
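The values in the output above appear as floats (1.0 rather than 1) because NaN cannot be stored in a plain integer column, so the integer columns are converted to float64. The sketch below confirms this by inspecting dtypes and shows one possible way to get integers back with missing values, using DataFrame.convert_dtypes(). Whether nullable dtypes suit a given workflow is a judgment call.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df2 = pd.DataFrame({"D": [10, 11, 12], "E": [13, 14, 15], "F": [16, 17, 18]})

result = pd.concat([df1, df2])

# Every column had to hold NaN, so each one is float64 after concatenation
print(result.dtypes)

# convert_dtypes() switches to nullable dtypes, which can hold integers
# together with missing values
nullable = result.convert_dtypes()
print(nullable.dtypes)
```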
7. Join Options
Outer Join (default):
Includes all columns from every DataFrame and fills the gaps with NaN:
result = pd.concat([df1, df2], join='outer')
print(result)
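Inner Join:
Keeps only the columns that appear in every DataFrame. Because df1 and df2 share no column names at this point, the result here has six rows but no columns: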
result = pd.concat([df1, df2], join='inner')
print(result)
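The difference between the two join options is easier to see when the DataFrames share some columns but not all. The sketch below uses two small frames, left and right, that overlap only on column "B". Both frames are made up for this illustration.

```python
import pandas as pd

# Illustrative frames that share only column "B"
left = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
right = pd.DataFrame({"B": [5, 6], "C": [7, 8]})

# Outer join: union of columns, NaN where a frame lacks the column
outer = pd.concat([left, right], join="outer", ignore_index=True)
print(outer)

# Inner join: only the shared column "B" survives
inner = pd.concat([left, right], join="inner", ignore_index=True)
print(inner)
```

With an inner join no NaN values are introduced, so the integer dtype of "B" is preserved.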
Conclusion
The concat() function in Pandas is essential for combining datasets in flexible and efficient ways. By understanding its parameters and behaviors, you can handle various real-world data integration tasks with ease.