How do you join or merge on multiple columns in R? To join data frames on multiple columns in R, use either the base merge() function or the join functions from dplyr. The dplyr functions are usually the better choice, as they generally run faster than the base R approach on larger data frames. The dplyr package provides several functions for joining R data frames, and all of them support joining on multiple columns.
1. Quick Examples
Following are quick examples of joining data frames on multiple columns.
    # Quick Examples

    # Using dplyr
    library(dplyr)
    df2 <- emp_df %>%
      inner_join(dept_df,
                 by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))

    # Using dplyr when columns are same
    df2 <- emp_df %>%
      inner_join(dept_df, by=c('dept_id','dept_branch_id'))

    # Using merge
    df2 <- merge(x=emp_df, y=dept_df,
                 by.x=c("dept_id","dept_branch_id"),
                 by.y=c("dept_id","dept_branch_id"))

    # Using merge when columns are same
    df2 <- merge(x=emp_df, y=dept_df, by=c("dept_id","dept_branch_id"))
Let’s create two data frames that share more than one column name. In the example below, the columns dept_id and dept_branch_id exist in both the emp_df and dept_df data frames.
    # Create emp Data Frame
    emp_df <- data.frame(
      emp_id=c(1,2,3,4,5,6),
      name=c("Smith","Rose","Williams","Jones","Brown","Brown"),
      superior_emp_id=c(-1,1,1,2,2,2),
      dept_id=c(10,20,10,10,40,50),
      dept_branch_id=c(101,102,101,101,104,105)
    )

    # Create dept Data Frame
    dept_df <- data.frame(
      dept_id=c(10,20,30,40),
      dept_name=c("Finance","Marketing","Sales","IT"),
      dept_branch_id=c(101,102,103,104)
    )

    emp_df
    dept_df
Yields below output.
![R Join (Merge) on Multiple Columns - Spark By {Examples} (1) R Join (Merge) on Multiple Columns - Spark By {Examples} (1)](https://i0.wp.com/sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2022/08/r-join-multiple-columns.png?resize=988%2C492&ssl=1)
2. Using dplyr to Join Multiple Columns in R
Using the join functions from the dplyr package is usually the best approach to joining data frames on multiple columns in R. All dplyr join functions, inner_join(), left_join(), right_join(), full_join(), anti_join(), and semi_join(), support joining on multiple columns. The example below uses inner_join().
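As a rough sketch of the point that the other dplyr joins accept the same multi-column by argument, the snippet below runs left_join(), full_join(), semi_join(), and anti_join() on the emp_df and dept_df data from this article. The variable names keys, left_df, full_df, semi_df, and anti_df are only illustrative.

```r
library(dplyr)

# Same data as the emp_df and dept_df created earlier in the article
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4, 5, 6),
  name = c("Smith", "Rose", "Williams", "Jones", "Brown", "Brown"),
  superior_emp_id = c(-1, 1, 1, 2, 2, 2),
  dept_id = c(10, 20, 10, 10, 40, 50),
  dept_branch_id = c(101, 102, 101, 101, 104, 105)
)
dept_df <- data.frame(
  dept_id = c(10, 20, 30, 40),
  dept_name = c("Finance", "Marketing", "Sales", "IT"),
  dept_branch_id = c(101, 102, 103, 104)
)

# Illustrative name for the shared key columns
keys <- c("dept_id", "dept_branch_id")

# Keep every employee; dept_name is NA where no department matches
left_df <- left_join(emp_df, dept_df, by = keys)

# Keep all rows from both sides
full_df <- full_join(emp_df, dept_df, by = keys)

# Employees that have a matching department (only emp_df columns are returned)
semi_df <- semi_join(emp_df, dept_df, by = keys)

# Employees without a matching department
anti_df <- anti_join(emp_df, dept_df, by = keys)

nrow(left_df)
nrow(full_df)
semi_df
anti_df
```

A row counts as a match only when both dept_id and dept_branch_id agree, which is why the employee in department 50 appears in anti_df.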
2.1 Syntax
Following is the syntax of inner_join() and a similar syntax is used for other joins in the dplyr package.
    # Syntax
    inner_join(df1, df2, by=c('x1'='y1', 'x2'='y2'))
Here,
- The value in the x1 column of df1 matches the value in the y1 column of df2.
- The value in the x2 column of df1 matches the value in the y2 column of df2.
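The named form of by is mostly useful when the key columns have different names in the two data frames. Below is a minimal sketch under that assumption: dept_alt is a hypothetical variant of dept_df whose key columns are called id and branch_id, and emp_df is trimmed to the columns needed for the join.

```r
library(dplyr)

# Trimmed-down emp_df with the same key values as in this article
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4, 5, 6),
  dept_id = c(10, 20, 10, 10, 40, 50),
  dept_branch_id = c(101, 102, 101, 101, 104, 105)
)

# Hypothetical department table whose key columns are named differently
dept_alt <- data.frame(
  id = c(10, 20, 30, 40),
  branch_id = c(101, 102, 103, 104),
  dept_name = c("Finance", "Marketing", "Sales", "IT")
)

# Left-hand names come from emp_df, right-hand names from dept_alt
joined <- inner_join(
  emp_df, dept_alt,
  by = c("dept_id" = "id", "dept_branch_id" = "branch_id")
)

names(joined)
nrow(joined)
```

In the result, the key columns keep the names from the first data frame (dept_id and dept_branch_id), and the id and branch_id columns are not repeated.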
2.2 Join Multiple Columns Example
To use dplyr, first install it with install.packages('dplyr') and then load it with library(dplyr).
The dplyr verbs take a data frame as their first argument. With dplyr, you will mostly use the infix operator %>% from magrittr, which passes the left-hand side of the operator as the first argument of the function on the right-hand side. For example, x %>% f(y) is converted into f(x, y), so the result from the left-hand side is “piped” into the right-hand side.
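To make the pipe equivalence concrete, here is a small sketch that runs the same multi-column join once with %>% and once as a plain function call. The data frames left_df and right_df and their columns are made up for this illustration.

```r
library(dplyr)

# Small made-up data frames sharing two key columns, k1 and k2
left_df <- data.frame(k1 = c(1, 2, 3), k2 = c("a", "b", "c"), x = c(10, 20, 30))
right_df <- data.frame(k1 = c(1, 2, 4), k2 = c("a", "b", "d"), y = c("p", "q", "r"))

# Piped form: left_df becomes the first argument of inner_join()
piped <- left_df %>% inner_join(right_df, by = c("k1", "k2"))

# Direct form: the call that the pipe expands to
direct <- inner_join(left_df, right_df, by = c("k1", "k2"))

# Compare the two results
identical(piped, direct)
```

Which form to use is a matter of style; the pipe tends to read better once several steps are chained.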
    # Load dplyr package
    library(dplyr)

    # Join on multiple columns
    df2 <- emp_df %>%
      inner_join(dept_df,
                 by=c('dept_id'='dept_id', 'dept_branch_id'='dept_branch_id'))
    df2
Yields below output.
![R Join (Merge) on Multiple Columns - Spark By {Examples} (2) R Join (Merge) on Multiple Columns - Spark By {Examples} (2)](https://i0.wp.com/sparkbyexamples.com/ezoimgfmt/i0.wp.com/sparkbyexamples.com/wp-content/uploads/2022/08/r-join-multiple-columns2.png?resize=980%2C216&ssl=1)
Since both data frames use the same names for the joining columns, you can simplify the above statement as follows.
    # Load dplyr package
    library(dplyr)

    # Join on multiple columns
    df2 <- emp_df %>%
      inner_join(dept_df, by=c('dept_id','dept_branch_id'))
    df2
3. Using base merge() to Join Multiple Columns
The base R merge() function can also join data frames on multiple columns. To do so, pass by.x a vector of the columns to join on from the left data frame, and pass by.y the corresponding vector of columns from the right data frame.
3.1 Syntax
    # Syntax of merge()
    merge(x=df1, y=df2, by.x=c("x_col1","x_col2"), by.y=c("y_col1","y_col2"))
Here,
- The value in the x_col1 column of df1 matches the value in the y_col1 column of df2.
- The value in the x_col2 column of df1 matches the value in the y_col2 column of df2.
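The by.x and by.y arguments matter most when the two data frames name their key columns differently. The sketch below assumes a hypothetical dept_alt table with key columns id and branch_id, and uses only base R.

```r
# Trimmed-down emp_df with the same key values as in this article
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4, 5, 6),
  dept_id = c(10, 20, 10, 10, 40, 50),
  dept_branch_id = c(101, 102, 101, 101, 104, 105)
)

# Hypothetical department table whose key columns are named differently
dept_alt <- data.frame(
  id = c(10, 20, 30, 40),
  branch_id = c(101, 102, 103, 104),
  dept_name = c("Finance", "Marketing", "Sales", "IT")
)

# Pair dept_id with id and dept_branch_id with branch_id, position by position
merged <- merge(
  x = emp_df, y = dept_alt,
  by.x = c("dept_id", "dept_branch_id"),
  by.y = c("id", "branch_id")
)

names(merged)
nrow(merged)
```

The key columns in the result take their names from x, and merge() places them first.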
3.2 Merge Multiple Columns Example
In this merge() example, emp_df is the left table and dept_df is the right table, and the call performs an inner join on these data frames. If you want to use other join types with merge(), refer to R join data frames.
    # R join multiple columns
    df2 <- merge(x=emp_df, y=dept_df,
                 by.x=c("dept_id","dept_branch_id"),
                 by.y=c("dept_id","dept_branch_id"))
    df2
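This yields the same rows as the dplyr example above, although merge() places the key columns first and, by default, sorts the result by them.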
Similarly, when the joining columns have the same names in both data frames, use the by argument.
    # Using merge with same column names
    df2 <- merge(x=emp_df, y=dept_df, by=c("dept_id","dept_branch_id"))
    df2
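For the other join types mentioned in this section, merge() uses its all.x, all.y, and all arguments. The following is a minimal sketch with the article's data; the result names left_df, right_df, and full_df are only illustrative.

```r
# Same data as the emp_df and dept_df created earlier in the article
emp_df <- data.frame(
  emp_id = c(1, 2, 3, 4, 5, 6),
  name = c("Smith", "Rose", "Williams", "Jones", "Brown", "Brown"),
  superior_emp_id = c(-1, 1, 1, 2, 2, 2),
  dept_id = c(10, 20, 10, 10, 40, 50),
  dept_branch_id = c(101, 102, 101, 101, 104, 105)
)
dept_df <- data.frame(
  dept_id = c(10, 20, 30, 40),
  dept_name = c("Finance", "Marketing", "Sales", "IT"),
  dept_branch_id = c(101, 102, 103, 104)
)

keys <- c("dept_id", "dept_branch_id")

# Left join: keep every row of emp_df
left_df <- merge(emp_df, dept_df, by = keys, all.x = TRUE)

# Right join: keep every row of dept_df
right_df <- merge(emp_df, dept_df, by = keys, all.y = TRUE)

# Full outer join: keep every row from both data frames
full_df <- merge(emp_df, dept_df, by = keys, all = TRUE)

nrow(left_df)
nrow(right_df)
nrow(full_df)
```

Rows without a match on both key columns are filled with NA in the columns that come from the other data frame.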
4. Conclusion
In this article, you have learned how to join or merge data frames on multiple columns using the base R merge() function and the join functions from the dplyr package. The dplyr approach is usually the better choice for larger datasets, as it generally performs more efficiently than base R merge().
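If you want to check the performance difference on your own machine, a rough timing sketch like the one below can help. The data is randomly generated, the names big_df and lookup_df are made up for this illustration, and the timings depend on your hardware and package versions, so treat the numbers as indicative only.

```r
library(dplyr)

set.seed(42)
n <- 100000

# Made-up lookup table with one row per (dept_id, dept_branch_id) pair
lookup_df <- expand.grid(dept_id = 1:100, dept_branch_id = 1:10)
lookup_df$dept_name <- paste0("dept_", seq_len(nrow(lookup_df)))

# Made-up large table whose keys always exist in lookup_df
big_df <- data.frame(
  emp_id = seq_len(n),
  dept_id = sample(1:100, n, replace = TRUE),
  dept_branch_id = sample(1:10, n, replace = TRUE)
)

keys <- c("dept_id", "dept_branch_id")

# Time the same inner join with base R and with dplyr
time_merge <- system.time(res_merge <- merge(big_df, lookup_df, by = keys))
time_dplyr <- system.time(res_dplyr <- inner_join(big_df, lookup_df, by = keys))

time_merge["elapsed"]
time_dplyr["elapsed"]

# Both approaches should return the same number of rows
nrow(res_merge) == nrow(res_dplyr)
```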