How to Join (Merge) Data Frames in R: Inner, Outer, Left, and Right Joins

When working with data frames in R, it is often necessary to combine or merge them based on a common key. This process is similar to joining tables in SQL and allows you to combine data from different sources into a single data frame. In this article, we will explore how to perform different types of joins in R, including inner, outer, left, and right joins.

Example Data Frames


df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))

df1
#  CustomerId Product
#           1 Toaster
#           2 Toaster
#           3 Toaster
#           4   Radio
#           5   Radio
#           6   Radio

df2
#  CustomerId   State
#           2 Alabama
#           4 Alabama
#           6    Ohio
                

Inner Join

An inner join returns only the rows in which the left table has matching keys in the right table. In other words, it combines the rows for which the customer ID exists in both data frames.


inner_join <- merge(df1, df2, by = "CustomerId")
inner_join
#  CustomerId Product   State
#           2 Toaster Alabama
#           4   Radio Alabama
#           6   Radio    Ohio
                

In this example, the inner join combines the rows with CustomerID 2, 4, and 6 from both data frames, resulting in a new data frame with three rows.

Outer Join

An outer join returns all rows from both tables and includes join records from the left table with matching keys in the right table. If there is no match, the missing values are filled with NA.


outer_join <- merge(df1, df2, by = "CustomerId", all = TRUE)
outer_join
#  CustomerId Product   State
#           1 Toaster    <NA>
#           2 Toaster Alabama
#           3 Toaster    <NA>
#           4   Radio Alabama
#           5   Radio    <NA>
#           6   Radio    Ohio
                

In this example, the outer join includes all rows from both data frames. The missing values for CustomerID 1, 3, and 5 in the State column are filled with NA.

Left Outer Join (Left Join)

A left outer join returns all rows from the left table and includes any rows with matching keys from the right table. If there is no match, the missing values are filled with NA.


left_join <- merge(df1, df2, by = "CustomerId", all.x = TRUE)
left_join
#  CustomerId Product   State
#           1 Toaster    <NA>
#           2 Toaster Alabama
#           3 Toaster    <NA>
#           4   Radio Alabama
#           5   Radio    <NA>
#           6   Radio    Ohio
                

In this example, the left join includes all rows from df1 and only includes the matching row from df2 for CustomerID 2 and 4. The missing values for CustomerID 1, 3, and 5 in the State column are filled with NA.

Right Outer Join (Right Join)

A right outer join returns all rows from the right table and includes any rows with matching keys from the left table. If there is no match, the missing values are filled with NA.


right_join <- merge(df1, df2, by = "CustomerId", all.y = TRUE)
right_join
#  CustomerId Product   State
#           2 Toaster Alabama
#           4   Radio Alabama
#           6   Radio    Ohio
#           7    <NA>    Ohio
                

In this example, the right join includes all rows from df2 and only includes the matching rows from df1 for CustomerID 2, 4, and 6. The missing value for CustomerID 7 in the Product column is filled with NA.

SQL Style Select Statement

If you want to perform a SQL-style select statement, you can use the dplyr package in R. Here's an example:


library(dplyr)

select_statement <- inner_join %>%
  select(Product, State)
select_statement
#  Product   State
# Toaster Alabama
#   Radio Alabama
#   Radio    Ohio
                

In this example, we use the select function from the dplyr package to select the Product and State columns from the inner join data frame.