6 min read · 2018/01/03


Another Exciting Project

Recently, I started a new project with NIA to find topics and their trends over time (2005~2017) in news articles: around 15,000,000 articles in total, amounting to several gigabytes of csv files. That is a lovely size for analysis!

Anyway, I received a Dropbox link containing those csv files, easily transferred them to my account, and then downloaded them to my desktop to start playing around with R.

As expected, they are kept in several folders, and I was about to load them into RStudio Server, since it has sufficient memory to deal with them.

As is my habit, I googled and double-checked whether there is any better NEW way in R to load all of the files at once and merge them into one huge data frame. Guess what? There is a WOW package (readbulk) to do exactly that!!! And as always, thank god I realize how stupid I am, every day!


Importing and Merging Multiple csv files into One Data Frame - 3 Ways

As such, I would like to summarize three ways to merge a bunch of csv files into one huge data frame in R, including read_bulk() from the readbulk package: (1) fread(), (2) spark_read_csv(), and (3) read_bulk().

(1) data.table::fread()

All the csv files are saved in the data/2005-2017 folder on my side, so I use that as the path to tell R where my csv files are located. Since I need to load all of them, I set the pattern to match any (*) csv file, keeping their paths as full names. Then I use rbindlist() from the data.table package together with fread() to load all the files at once via lapply(). Finally, the merged object is converted to a tibble named DATA_ALL.
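The steps above can be sketched as follows; the data/2005-2017 path comes from the post, while the object names are illustrative:

```r
library(data.table)
library(dplyr)

# All csv files under data/2005-2017, with their paths as full names
files <- list.files(path = "data/2005-2017",
                    pattern = "\\.csv$", full.names = TRUE)

# Read every file with fread() and stack them into one data.table,
# then convert the result to a tibble
DATA_ALL <- rbindlist(lapply(files, fread)) %>% as_tibble()
```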

(2) sparklyr::spark_read_csv()

First things first. To take full advantage of Spark with sparklyr in R, we first need to connect to an already installed Spark; spark_install() takes care of the easy installation. I assume we have already done that and proceed to import the csv files from the folder directly into Spark. This way, we need not worry about the memory issues we may encounter when dealing with tons of files.

Using the handy spark_read_csv() with a simple wildcard (*) in the path, all the csv files in that folder are easily loaded into Spark. As always, we can quickly check the imported data with sdf_dim(), and even with glimpse() from dplyr.
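A minimal sketch of the connection and import steps; the local master and the folder path are assumptions:

```r
library(sparklyr)
library(dplyr)

# spark_install()   # one-time installation, if not done already
sc <- spark_connect(master = "local")

# The wildcard (*) picks up every csv file in the folder
DATA_ALL <- spark_read_csv(sc, name = "data_all",
                           path = "data/2005-2017/*.csv")

sdf_dim(DATA_ALL)   # dimensions of the Spark DataFrame
glimpse(DATA_ALL)   # quick structural peek via dplyr
```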

Prior to saving the combined data to disk, we might want to decide how many csv files we want to end up with. In this case, I need only one big csv file, so I set partitions = 1 using the sdf_coalesce() function. Finally, with spark_write_csv(), the DATA_ALL object is saved under the "data" folder as DATA_ALL.csv. This is a simple but very effective approach for combining multiple files if you have at least two cores in your local machine: distribute the work across your cores!
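The coalesce-and-write step might look like this sketch, assuming DATA_ALL is the Spark DataFrame imported earlier:

```r
library(sparklyr)
library(dplyr)

# Collapse to a single partition so Spark writes one csv file,
# then save it under the data folder
DATA_ALL %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv(path = "data/DATA_ALL.csv")
```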

(3) read_bulk() with SQLite

Now, I will use the recently discovered read_bulk() together with the handy SQLite() driver from the RSQLite package. SQLite is a really handy database that requires no separate installation or configuration.

We first create and connect to a database named "DATA_DB" with dbConnect(). A DATA_DB.sqlite file is then created in your working directory, and all the csv files to be combined are inserted into it. In the meantime, you will be amazed by the size of the DATA_DB object in R: only a few kilobytes. This means you are free of memory issues in R, since the actual data are stored directly in DATA_DB.sqlite instead. R here works as an intermediary liaison between the database and the csv files.

One thing to note is that I always measure the time a task takes (i.e. with system.time()) when working with large volumes of data, including but not limited to model fitting. Once I know how long a specific task will take, it becomes much easier to manage multiple tasks at the same time. The clock is ticking!

The fun = argument in read_bulk() allows you to use any function to import the files, so we can set it to readr::read_csv, rio::import(), or utils::read.csv, to name a few. data.table::fread() is renowned for its import speed, so it is my choice of the day.
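Putting the pieces together, here is a sketch of the timed read_bulk() import and the hand-off to SQLite; the folder path and the table name are assumptions:

```r
library(readbulk)
library(data.table)
library(DBI)
library(RSQLite)

# Create (or connect to) DATA_DB.sqlite in the working directory
DATA_DB <- dbConnect(SQLite(), "DATA_DB.sqlite")

# Import every csv in the folder with fread(), timing the whole thing
system.time(
  DATA_ALL <- read_bulk(directory = "data/2005-2017",
                        extension = ".csv",
                        fun = data.table::fread)
)

# Store the combined data in the database rather than keeping it in R
dbWriteTable(DATA_DB, "DATA_ALL", DATA_ALL, overwrite = TRUE)
```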


The beauty of dplyr also applies to SQL. Though not all SQL commands are available through dplyr, it is sufficient for getting the data we want (i.e. select, filter). After playing around with the data a few times, use collect() to bring it into R (memory) for subsequent analysis. I made sure I kept it by saving DATA_TO_R as an Rda file to disk.
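A sketch of querying the database with dplyr and pulling the result into R; the table name and the column names are hypothetical:

```r
library(DBI)
library(RSQLite)
library(dplyr)

DATA_DB <- dbConnect(SQLite(), "DATA_DB.sqlite")

# select/filter are translated to SQL and run inside SQLite;
# nothing is loaded into R until collect()
DATA_TO_R <- tbl(DATA_DB, "DATA_ALL") %>%
  select(date, title, body) %>%        # hypothetical columns
  filter(date >= "2005-01-01") %>%
  collect()                            # now pull the result into memory

# Keep a copy on disk for later sessions
save(DATA_TO_R, file = "DATA_TO_R.Rda")
```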

Wrap up

As always, tons of data (or files) need to be massaged (migrated, integrated, etc.) before we can analyze them the way we want. Of course, with R there are usually multiple alternative ways to do the same task. Here, this post looked at the very first stage of importing and combining a bunch of csv files into one dataset, with three simple approaches (that I know of): (1) fread(), (2) spark_read_csv(), and (3) read_bulk().

Again, any comment will be appreciated, and adding your own way of doing this job will also be greatly appreciated! Cheers!

The merge function in R allows you to combine two data frames, much like the join operations used in SQL to combine data tables. Merge, however, does not allow more than two data frames to be joined at once, requiring several lines of code to join multiple data frames.

This post explains the methodology behind merging multiple data frames in one line of code using base R. We will be using the Reduce function, part of the funprog family in base R v3.4.3, which contains a suite of higher-order functions providing simple alternatives to laborious, long-winded coding solutions.

The merge function

As described, merge is essentially the “join” of the R world. Whilst this post is not about the fine workings of merge, I will give a brief introduction.

Merge takes two data frames, x and y, and combines them based on one or more shared columns. Rows are combined where the data of these shared columns are equal, meaning we can combine columns from different data frames that refer to the same piece of data. For instance, take the following two data frames:
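The original example data frames are not shown here, so the following is a reconstruction under assumed names and values, keyed by a shared character column:

```r
# Hypothetical example data: two data frames describing the same characters
height <- data.frame(character = c("Homer", "Marge", "Bart"),
                     height = c(183, 168, 110))

gender <- data.frame(character = c("Homer", "Marge", "Bart"),
                     gender = c("male", "female", "male"))
```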

It is clear that the two data frames refer to the same characters; however, it may be more useful to us if the two were combined into a single data frame. This is where merge comes in. Merge takes the following structure:
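In its simplest form:

```r
# x and y are the two data frames; "by" names the shared column(s)
merge(x, y, by = "shared_column")
```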

Here, we are looking to combine the height and gender data frames where the character columns are equal. To continue the SQL analogy, x is the left-hand table and y is the right-hand table; by default merge behaves like an INNER JOIN (set all.x = TRUE for a LEFT JOIN). The “by” component is our “ON” clause. For example:
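Using the height and gender data frames named in the post (their contents are assumed), the call would be:

```r
merge(x = height, y = gender, by = "character")
```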

Running this merge function gives us the following output:


This is the result we were expecting, but what if we introduce a third data frame?
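The third data frame from the post is eyeColour; a reconstruction with assumed values:

```r
eyeColour <- data.frame(character = c("Homer", "Marge", "Bart"),
                        eyeColour = c("brown", "blue", "blue"))
```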

Sadly, merge does not allow us to simply add our eyeColour data frame as a third input (we only have x and y parameters available). That’s where Reduce comes in.

The Reduce function

Reduce takes a function and sequentially applies it to a given list of inputs, in our case a list of data frames. For example, imagine we have a function f which accepts two arguments, and a list of objects (a, b, c). Then Reduce(f, list(a, b, c)) would perform the following action:

f(f(a, b), c)

where the function f is first applied to data frames a and b, and is then applied to the output of that first step together with data frame c. This allows us to avoid running and saving f(a, b) ourselves, like this:
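The manual alternative that Reduce spares us would look something like this sketch, where f stands for any two-argument function:

```r
# Without Reduce: save each pairwise result by hand
step1  <- f(a, b)
result <- f(step1, c)

# With Reduce: the same fold happens internally
result <- Reduce(f, list(a, b, c))
```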

Applying Reduce to merge

In merge we have an example of a function that performs an action on two inputs. Reduce takes two main parameters: f, the function to apply, and x, the list (or vector) of inputs. Reduce will sequentially apply the function f to the elements of x.

In our example, the function that we want to apply is merge, and the vector which we want to apply it to is a list of our data frames. First off, let’s try the following:
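In its simplest form, using the three data frames named in the post:

```r
Reduce(merge, list(height, gender, eyeColour))
```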

Perfect! But what if we wanted to specify the parameters within our merge function call? Well, we could define our own function which merges two data frames with specified parameters:
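A sketch of such a custom function passed to Reduce, with by = "character" as the assumed shared column:

```r
# An anonymous two-argument function wrapping merge with a fixed "by"
Reduce(function(x, y) merge(x, y, by = "character"),
       list(height, gender, eyeColour))
```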

Here, we have specified our f as a custom function, which takes two parameters and applies the merge function to them. Within this custom function, we have specified our by parameter, which may be necessary for longer or more complex uses of Reduce.

Further reading

The function that we passed to Reduce is known in the world of functional programming as a lambda function, or an anonymous function; a single use function that is not named and saved. Functional programming is a principle around which R is built, and can provide many smart and elegant ways to achieve things that would otherwise require large amounts of coding. We may explore more of the functional programming features of R in future blog posts, however for now the following link provides a nice overview of the most used techniques:

by Jon Willis
