Pandas Merging 101

Merging basics - basic types of joins

When working with pandas, merging DataFrames is a common operation that lets you combine data from different sources based on common columns or indices. The merge function supports several types of joins, selected with its how parameter (the default is 'inner'):

  • Inner Join: This type of join returns only the records that have matching values in both DataFrames.
  • Left Join: This type of join returns all the records from the left DataFrame and the matching records from the right DataFrame. Where a left record has no match in the right DataFrame, the columns coming from the right DataFrame are filled with NaN.
  • Right Join: This type of join returns all the records from the right DataFrame and the matching records from the left DataFrame. Where a right record has no match in the left DataFrame, the columns coming from the left DataFrame are filled with NaN.
  • Full Outer Join: This type of join returns all the records from both DataFrames. Where a record has no match on the other side, the columns coming from the other DataFrame are filled with NaN.

Here are some examples to demonstrate these types of joins:

Inner Join


import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})

df_inner = pd.merge(df1, df2, on='A', how='inner')
print(df_inner)
            

The output of this code will be:


   A  B  C
0  2  b  x
1  3  c  y
            

Only the records with matching values in column A are included in the result.

Left Join


df_left = pd.merge(df1, df2, on='A', how='left')
print(df_left)
            

The output of this code will be:


   A  B    C
0  1  a  NaN
1  2  b    x
2  3  c    y
            

All the records from the left DataFrame are included. The row with A = 1 has no match in the right DataFrame, so its value in column C is NaN.

Right Join


df_right = pd.merge(df1, df2, on='A', how='right')
print(df_right)
            

The output of this code will be:


   A    B  C
0  2    b  x
1  3    c  y
2  4  NaN  z
            

All the records from the right DataFrame are included. The row with A = 4 has no match in the left DataFrame, so its value in column B is NaN.

Full Outer Join


df_outer = pd.merge(df1, df2, on='A', how='outer')
print(df_outer)
            

The output of this code will be:


   A    B    C
0  1    a  NaN
1  2    b    x
2  3    c    y
3  4  NaN    z
            

All the records from both DataFrames are included. Wherever a key exists on only one side (A = 1 and A = 4 here), the columns from the other DataFrame are filled with NaN.
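When inspecting an outer join, it can help to see which side each row came from. The sketch below uses the indicator parameter of merge for that. The variable names df_flagged and unmatched are illustrative only and not part of the examples above.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})

# indicator=True adds a '_merge' column whose values are
# 'left_only', 'right_only' or 'both'
df_flagged = pd.merge(df1, df2, on='A', how='outer', indicator=True)

# Rows whose key appears in only one of the two DataFrames
unmatched = df_flagged[df_flagged['_merge'] != 'both']

print(df_flagged)
print(unmatched['A'].tolist())
```

Filtering on the _merge column is one way to pick out the rows that a left, right, or inner join would have dropped.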

Index-based joins

In addition to joining on columns, you can also perform merges based on indices. To do this, you can use the left_index and right_index parameters of the merge function, instead of specifying the columns to join on:

Merging on Index


df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}, index=[10, 20, 30])
df2 = pd.DataFrame({'C': ['x', 'y', 'z']}, index=[20, 30, 40])

df_index = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')
print(df_index)
            

The output of this code will be:


    A  B  C
20  2  b  x
30  3  c  y
            

Only the rows with matching indices are included in the result.
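For index-on-index joins like this one, DataFrame.join is a common shorthand, since it aligns on the index by default. The following is a minimal sketch under the same data as above. The names df_join and df_merge are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']}, index=[10, 20, 30])
df2 = pd.DataFrame({'C': ['x', 'y', 'z']}, index=[20, 30, 40])

# DataFrame.join matches rows on the index unless told otherwise
df_join = df1.join(df2, how='inner')

# The explicit index-based merge, for comparison
df_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='inner')

print(df_join)
print(df_join.equals(df_merge))
```

Note that join defaults to how='left', whereas merge defaults to how='inner', so the how argument is given explicitly here.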

Generalizing to multiple DataFrames

So far, we have only looked at merging two DataFrames. However, you can also merge more than two by chaining merge calls, since each call returns a new DataFrame. With no how argument, each step is an inner join. Here's an example:


df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})
df3 = pd.DataFrame({'A': [3, 4, 5], 'D': ['p', 'q', 'r']})

df_multi = pd.merge(df1, df2, on='A').merge(df3, on='A')
print(df_multi)
            

The output of this code will be:


   A  B  C  D
0  3  c  y  p
            

The resulting DataFrame contains only the rows with matching values in column A from all three DataFrames.
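When the number of DataFrames is not fixed, the chain can be written as a fold over a list. The sketch below uses functools.reduce from the standard library and assumes every frame shares the key column A. The list name frames is illustrative.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})
df3 = pd.DataFrame({'A': [3, 4, 5], 'D': ['p', 'q', 'r']})

frames = [df1, df2, df3]

# Merge pairwise from left to right: (df1 merge df2) merge df3,
# each step an inner join on column 'A'
df_multi = reduce(
    lambda left, right: pd.merge(left, right, on='A', how='inner'),
    frames,
)
print(df_multi)
```

This produces the same single row as the chained version, and adding a fourth DataFrame only requires appending it to the list.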

Cross join

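In some cases, you might want to perform a cross join, which produces the Cartesian product of the two DataFrames: every row of the left paired with every row of the right. Since pandas 1.2, this can be achieved by setting the how parameter to 'cross', with no on argument. On older versions, the same result is obtained by merging on a temporary constant key column and dropping it afterwards.


# pandas >= 1.2
df_cross = pd.merge(df1, df2, how='cross')
# Older versions: pd.merge(df1.assign(key=1), df2.assign(key=1), on='key').drop('key', axis=1)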
print(df_cross)
            

The output of this code will be:


   A_x  B  A_y  C
0    1  a    2  x
1    1  a    3  y
2    1  a    4  z
3    2  b    2  x
4    2  b    3  y
5    2  b    4  z
6    3  c    2  x
7    3  c    3  y
8    3  c    4  z
            

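The resulting DataFrame contains all possible combinations of rows from both DataFrames (3 × 3 = 9 rows). Because both inputs have a column named A, pandas appends the default suffixes _x and _y to tell them apart.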
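The suffixes attached to clashing column names can be chosen with the suffixes parameter of merge. The following sketch assumes pandas 1.2 or newer (for how='cross'). The suffix strings '_left' and '_right' are arbitrary choices for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [2, 3, 4], 'C': ['x', 'y', 'z']})

# how='cross' needs pandas >= 1.2; suffixes renames the two 'A' columns
df_cross = pd.merge(df1, df2, how='cross', suffixes=('_left', '_right'))

# A Cartesian product has len(left) * len(right) rows
print(len(df_cross) == len(df1) * len(df2))
print(df_cross.columns.tolist())
```

Because the row count is the product of the two input lengths, cross joins grow quickly and are best kept to small inputs.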