How to Join (Merge) Data Frames in R: Inner, Outer, Left, and Right Joins
When working with data frames in R, it is often necessary to combine or merge them based on a common key. This process is similar to joining tables in SQL and allows you to combine data from different sources into a single data frame. In this article, we will explore how to perform different types of joins in R, including inner, outer, left, and right joins.
Example Data Frames
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
Inner Join
An inner join returns only the rows in which the left table has matching keys in the right table. In other words, it combines the rows for which the customer ID exists in both data frames.
inner_join <- merge(df1, df2, by = "CustomerId")
inner_join
# CustomerId Product State
# 2 Toaster Alabama
# 4 Radio Alabama
# 6 Radio Ohio
In this example, the inner join combines the rows with CustomerID 2, 4, and 6 from both data frames, resulting in a new data frame with three rows.
Outer Join
An outer join returns all rows from both tables and includes join records from the left table with matching keys in the right table. If there is no match, the missing values are filled with NA.
outer_join <- merge(df1, df2, by = "CustomerId", all = TRUE)
outer_join
# CustomerId Product State
# 1 Toaster <NA>
# 2 Toaster Alabama
# 3 Toaster <NA>
# 4 Radio Alabama
# 5 Radio <NA>
# 6 Radio Ohio
In this example, the outer join includes all rows from both data frames. The missing values for CustomerID 1, 3, and 5 in the State column are filled with NA.
Left Outer Join (Left Join)
A left outer join returns all rows from the left table and includes any rows with matching keys from the right table. If there is no match, the missing values are filled with NA.
left_join <- merge(df1, df2, by = "CustomerId", all.x = TRUE)
left_join
# CustomerId Product State
# 1 Toaster <NA>
# 2 Toaster Alabama
# 3 Toaster <NA>
# 4 Radio Alabama
# 5 Radio <NA>
# 6 Radio Ohio
In this example, the left join includes all rows from df1 and only includes the matching row from df2 for CustomerID 2 and 4. The missing values for CustomerID 1, 3, and 5 in the State column are filled with NA.
Right Outer Join (Right Join)
A right outer join returns all rows from the right table and includes any rows with matching keys from the left table. If there is no match, the missing values are filled with NA.
right_join <- merge(df1, df2, by = "CustomerId", all.y = TRUE)
right_join
# CustomerId Product State
# 2 Toaster Alabama
# 4 Radio Alabama
# 6 Radio Ohio
# 7 <NA> Ohio
In this example, the right join includes all rows from df2 and only includes the matching rows from df1 for CustomerID 2, 4, and 6. The missing value for CustomerID 7 in the Product column is filled with NA.
SQL Style Select Statement
If you want to perform a SQL-style select statement, you can use the dplyr
package in R. Here's an example:
library(dplyr)
select_statement <- inner_join %>%
select(Product, State)
select_statement
# Product State
# Toaster Alabama
# Radio Alabama
# Radio Ohio
In this example, we use the select
function from the dplyr
package to select the Product and State columns from the inner join data frame.