At least once a year I meet with a graduate student who has many separate datasets that need to be combined into a single file. The data are usually from a series of data loggers (e.g., iButtons or RFID readers) that record data remotely over a specified time period. The researcher periodically downloads the data from each data logger and then redeploys it for further data collection.
I’m going to set up the background for my particular use case before jumping into the R code to perform this sort of task. Go straight to List all files to read in if you want to get right into R code.
For someone who is at least somewhat familiar with a programming language (e.g., SAS, Python, R), reading many datasets in and combining them into a single file might not seem like a big deal. For example, if I do a quick web search on “r read many datasets” I get at least 5 Stack Overflow posts (with answers) as well as several blog entries. These links show code for relatively simple situations of reading many identical datasets into R (a couple SO examples can be found here and here).
However, in my experience this work doesn’t feel very simple to beginner programmers. Most of the graduate students I meet with have never worked with any sort of programming language prior to entering their degree program. By the time I meet with them they usually have had a very basic introduction to R in their intro statistics courses. They may have read in a dataset into R only a couple of times at most, and now they have hundreds of them to manage. To further complicate things, there is usually more work that needs to be done beyond reading the datasets in, such as adding important identifying information to each dataset.
That is the question I got from the first student I ever advised on this topic. She was collecting data using many data loggers in a field experiment set up as randomized complete block design with repeated measures. She had 300 comma-delimited files that needed to be concatenated together from her first field season and was planning on a second season that was at least twice as long (so would have at least twice as many files).
Her research collaborators had used these loggers previously, and had given her the following algorithm to follow:
But really this gave me a chance to discuss reproducibility and the convenience of using software to do repetitive tasks with the next generation of researchers. While she would need to expend effort understanding R code, the effort would be valuable in the long term given she was going to do a second field season.
To be honest, she was pretty skeptical that it made sense to use a programming language to do the work. From her perspective, doing the work via R looked more difficult and more “black-box” than manually copying and pasting in Excel. It helped that we found mistakes in the files she’d already edited when we were setting up the R code (it’s so easy to make mistakes when copying and pasting 300 times!).
Her skepticism continues to be a good reminder to me of what it feels like to be a beginner in a programming language, where you don’t quite trust that the program is doing what you want it to and you don’t exactly fully understand the code. The manual approach you already know how to do can look pretty darn attractive by comparison.
Today I’ll use package here for relative directory paths, stringr for string manipulation, and purrr for looping through to read all the files.
When reading in many files, my first two tasks are to list all the files to read in and then to work out the reading process for a single file.
Here I’ll start by getting a list of the files. I’m using some toy files I made to mirror what the real files looked like. These CSV files are available on the blog GitHub repository.
Listing all files can be done in R via dir(). I’ll use list.files() here (no reason, just habit, most likely based on what I learned first).
For this particular case I will use four of the arguments in list.files().
The directory containing the files needs to be defined in the 'path' argument of list.files(). I’m working within an RStudio Project, and will use the here package to indicate the directory the files are in relative to the root directory. See Jenny Bryan’s post here on the merits of here and self-contained projects.
The 'pattern' argument is used to tell R which file paths should be listed. If you want to read in all CSV files in a directory, for example, the pattern to match might be '.csv'. In the real scenario, there were additional CSV files in the directory that we didn’t want to read. All the files we wanted to read ended in “AB.csv”, so we first defined the pattern as 'AB.csv'. Later we realized that some file names were all lowercase, so used 'AB.csv|ab.csv'. The vertical pipe, |, stands for “or”.
The 'recursive' argument is used to indicate whether or not child folders in the parent directory should be searched for files to list. It defaults to FALSE. The files in this particular case are not stored in a single folder. Instead they are in child folders within an overall “data” directory. The names of the child folders actually indicate the study units (“Blocks” and “Sites”) the data were collected in.
The 'full.names' argument is used to indicate if the complete file paths should be returned or only the relative file paths. In this case, the only place information about some of the physical units of the study (“Blocks” and “Sites”) appears is in the directory path. We needed the full path names in order to extract that information and add it to the dataset.
The list.files() function returns a vector of file paths.
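Putting those four arguments together, the call can look like the sketch below. The directory tree built here (a temporary “data” folder with one block/site subfolder and one toy file name) is invented purely so the example is runnable; in the real project the 'path' would point at the project’s data directory via here().

```r
# Build a tiny stand-in directory tree so the call is runnable as-is;
# the folder names "BLK1"/"SIT1" and the file name are invented stand-ins
data_dir = file.path(tempdir(), "data")
dir.create(file.path(data_dir, "BLK1", "SIT1"), recursive = TRUE, showWarnings = FALSE)
file.create(file.path(data_dir, "BLK1", "SIT1", "SIT1_17_12_21_5.2_AB.csv"))

# Search all child folders and return full paths to files
# ending in "AB.csv" or "ab.csv"
allfiles = list.files(path = data_dir,
                      pattern = "AB.csv|ab.csv",
                      full.names = TRUE,
                      recursive = TRUE)
allfiles
```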
I find things go more smoothly if I work out the file-reading process with a single file before I try to read a bunch of files. It’s an easy step to want to skip because it feels more efficient to do “everything at once”. I’ve never found that to actually be the case. 👅
I’ll practice with the first file listed in allfiles. The top 6 lines of the raw data file are all extraneous header information, which will be skipped via 'skip'. There are no column headers in the file ('header'), so those need to be added ('col.names').
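The single-file read might look like the sketch below. The six junk lines and the column names “date” and “temperature” are stand-ins for the real logger format.

```r
# Write a toy file that mimics the layout: 6 extraneous header lines,
# then data rows with no column names (the contents are invented)
test_file = tempfile(fileext = ".csv")
writeLines(c(rep("extraneous logger info", 6),
             "12/21/17 01:00,3.5",
             "12/21/17 02:00,3.4"),
           test_file)

# Skip the header block and supply column names manually
test = read.csv(test_file, skip = 6, header = FALSE,
                col.names = c("date", "temperature"))
test
```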
That went pretty smoothly, but things get a little hairy from here. The information on the physical units of the study, “Blocks” and “Sites”, are contained only in the file directory path. These need to be extracted from the file path and added to the dataset.
In addition, the “Plot” information is contained in the file name. Plot names are single numbers that are found directly before the period in the file name. In
allfiles that number is 5 (the file name is “SIT1_17_12_21_5.2_AB.csv”).
Last, the final two characters of the file name are a code to indicate where the data logger was located. This also needs to be added to the dataset. In the toy example these values are all “AB”, but in the larger set of files this wasn’t true.
All the tasks above are string manipulation tasks. I will tackle these with the functions from the stringr package.
Extract “Block” names from the file path
Since some information is located within the file path string, splitting the file path up into separate pieces seems like a reasonable first step. This can be done via str_split() from stringr, using '/' as the symbol to split on. As there is only a single character string to split for each dataset, it is convenient to return a matrix instead of a list via simplify = TRUE.
The result is a matrix containing strings in each column.
The “Block” information is always the third column if counting from the end (it’s the 10th in this case if counting from the beginning). It’s “safer” (i.e., less likely to fail on a different computer) to count from the last column and work backwards here; on a different computer the full file paths may change length but the directory containing the files to read will be the same.
To extract the third column from the end I take the total number of columns and subtract 2.
This can be added to the dataset as a “block” variable.
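Here’s a sketch of the split-and-extract step using a made-up path; the folder names “BLK1” and “SIT1” are stand-ins, and real paths would be longer, but “Block” is always the third piece from the end.

```r
library(stringr)

# A made-up full file path; the block folder is third from the end
test_path = "C:/project/data/BLK1/SIT1/SIT1_17_12_21_5.2_AB.csv"

# simplify = TRUE returns a 1-row matrix of path pieces
allnames = str_split(test_path, pattern = "/", simplify = TRUE)

# Third column from the end = total number of columns minus 2
block = allnames[, ncol(allnames) - 2]
block
```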
Extract “Site” names from the file path
This will be the same as above, except site names are contained in the second-to-last column.
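A quick sketch with the same made-up path as before (“SIT1” is a stand-in site name):

```r
library(stringr)

test_path = "C:/project/data/BLK1/SIT1/SIT1_17_12_21_5.2_AB.csv"
allnames = str_split(test_path, pattern = "/", simplify = TRUE)

# "Site" is the second-to-last piece of the path
site = allnames[, ncol(allnames) - 1]
site
```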
Extract “Plot” names from the file name
The last character string in our matrix is the file name, which contains the plot name and logger location. In the test case the plot name is “5” and the logger location is “AB”.
This can be split on the underscores and periods and the “Plot” information extracted in much the same way as the “Block” information. I think this option can feel more approachable to beginners and is a reasonable way to solve the problem.
Another option is to define which part of the file name we want to pull out. This involves using regular expressions. I personally find regular expressions quite difficult and invariably turn to Stack Overflow to find the answers. I will use them here to demonstrate a wider breadth of options.
A basic introduction to regular expressions is on the second page of the “Work with Strings” cheatsheet from RStudio. A more in-depth set of examples can be found on the UBC STAT 545 page here.
I’ll use str_extract() from stringr with a regular expression for the 'pattern' to extract the plot number. In this case I used a lookaround, following this Stack Overflow answer. These can apparently be costly in terms of performance, but I did not find that to be a deterrent in my case. 😄
The regular expression I use indicates I want to extract a digit ([0-9]) that comes immediately before a period. The (?=\\.) is a positive lookahead, telling R that the digit to match must be followed by a period (the period is escaped because an unescaped period matches any character in a regular expression). The plot names are always just in front of a period, which is why this works.
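Applied to the test file name from earlier, the first match is the plot number:

```r
library(stringr)

test_file = "SIT1_17_12_21_5.2_AB.csv"

# Extract a digit that sits immediately before a literal period;
# str_extract() returns the first match
plot = str_extract(test_file, pattern = "[0-9](?=\\.)")
plot
```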
This can then be added to the dataset.
Extract data logger location from the file name
The last thing to do is extract the data logger location code from the file names. These are the last two digits of the file name, immediately before “.csv”.
The str_sub() function from stringr is useful for extracting characters from fixed locations in a string. The logger location information is in the same position in every file name if counting from the end of the string. The
str_sub() function allows the user to pull information counting from the end of the string as well as from the beginning. Because our file names differ in length due to the way that dates are stored, the location data does not always have the same indices if counting characters from the beginning of the string.
Negative indices are used to extract from the end of the string. The location information is stored in the 5th and 6th positions from the end. The negative number largest in absolute value is passed to
start and the smallest in absolute value to
end. (I forget this pretty much every time I subset strings from the end.)
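On the test file name, the extraction looks like this:

```r
library(stringr)

test_file = "SIT1_17_12_21_5.2_AB.csv"

# ".csv" takes up the last 4 positions, so the location code sits at
# positions 6 and 5 counting from the end of the string
logloc = str_sub(test_file, start = -6, end = -5)
logloc
```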
The location data can then be added to the dataset. Since at least one of the file names is all lowercase, I make sure the data logger location information is converted to all caps via toupper().
Here’s what the test dataset looks like now.
Once the process is worked out for one file, I can “functionize” it (i.e., make a function). This allows me to apply the exact same procedure to every dataset as it is read in.
The function I create below takes a single argument: the file path of the dataset. The function reads the file and then adds all the desired columns. It returns the modified dataset.
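A sketch of what such a function can look like, pulling the string-manipulation steps together. The skip = 6 value and the column names “date” and “temperature” are stand-ins for the real logger format.

```r
library(stringr)

# Sketch of a read-and-modify function; skip = 6 and the column
# names are stand-ins for the real logger format
read_fun = function(path) {
  dat = read.csv(path, skip = 6, header = FALSE,
                 col.names = c("date", "temperature"))
  # Split the path into pieces to pull out block and site
  allnames = str_split(path, pattern = "/", simplify = TRUE)
  dat$block = allnames[, ncol(allnames) - 2]
  dat$site = allnames[, ncol(allnames) - 1]
  # Plot number: the digit directly before a period
  dat$plot = str_extract(path, pattern = "[0-9](?=\\.)")
  # Logger location: the 2 characters before ".csv", forced to caps
  dat$logloc = toupper(str_sub(path, start = -6, end = -5))
  dat
}
```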
I’ll test the function with that first file path again to make sure it works like I expect it to.
All that’s left to do now is to loop through all the file paths in allfiles, read and modify each one with my function, and stack them together into a single dataset. This can be done in base R with a for() loop or an lapply() loop. If using either of those options, the final concatenation step can be done via rbind() within do.call().
These days I’ve been using the
map functions from package purrr for this, mostly because the
map_dfr() variant conveniently binds everything together by rows.
Here’s what using
map_dfr() looks like, looping through each element of
allfiles to read and modify the datasets with the
read_fun() function and then stacking everything together into a final combined dataset.
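The pattern is a one-liner. In the sketch below, two toy files and plain read.csv() stand in for allfiles and read_fun(), just to keep the example self-contained and runnable.

```r
library(purrr)

# Two toy files stand in for allfiles; read.csv stands in for read_fun
f1 = tempfile(fileext = ".csv")
f2 = tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:2), f1, row.names = FALSE)
write.csv(data.frame(x = 3:4), f2, row.names = FALSE)

# Read each file and bind the results together by rows
combined_dat = map_dfr(c(f1, f2), read.csv)
combined_dat
```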
In working with real data, the final “combining” step can lead to errors due to unanticipated complexities. In my experience, this most often happens because some of the datasets are physically different than the rest. I’ve worked on problems where, for example, it turned out some datasets were present in the directory but were empty.
In the real files for this particular example, it turned out some of the files had been previously modified manually to remove the header information. We ended up adding an
if statement to the function to test each file as we read it in. If it had the header information we’d use
skip while reading in the dataset and if it didn’t we wouldn’t. I did something similar in the case where some of the datasets were blank.
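One way to sketch such a check, assuming files that still have the header block start with some known text (the marker “extraneous” and the column names here are stand-ins):

```r
# Peek at the first line to decide whether to skip the header block;
# the marker text "extraneous" and the column names are stand-ins
read_fun2 = function(path) {
  first_line = readLines(path, n = 1)
  if (grepl("extraneous", first_line)) {
    read.csv(path, skip = 6, header = FALSE,
             col.names = c("date", "temperature"))
  } else {
    read.csv(path, header = FALSE,
             col.names = c("date", "temperature"))
  }
}

# One toy file with the header block, one already stripped of it
with_header = tempfile(fileext = ".csv")
writeLines(c(rep("extraneous logger info", 6), "12/21/17 01:00,3.5"), with_header)
no_header = tempfile(fileext = ".csv")
writeLines("12/21/17 01:00,3.5", no_header)

rbind(read_fun2(with_header), read_fun2(no_header))
```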
After the combined dataset has been created, you might want to save it for further data exploration and/or analysis. If working on an interim set of datasets (such as before a field season is over), saving the R object with saveRDS() can be pretty convenient. Saving a final dataset as a CSV may be useful for sharing with collaborators once all datasets have been downloaded and combined, which can be done with, e.g., write.csv().
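Both saves are single calls. A stand-in data frame and temporary file paths are used below so the sketch runs anywhere; in practice you’d write to a project directory.

```r
# Stand-in for the real combined dataset
combined_dat = data.frame(date = c("12/21/17 01:00", "12/21/17 02:00"),
                          temperature = c(3.5, 3.4))

# Interim save as an R object; final save as a CSV for collaborators
saveRDS(combined_dat, file.path(tempdir(), "combined_dat.rds"))
write.csv(combined_dat, file.path(tempdir(), "combined_dat.csv"),
          row.names = FALSE)
```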
Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here.