Selecting columns and filtering rows
We’re going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize(). To select columns of a data frame, use select(). The first argument to this function is the data frame (metadata), and the subsequent arguments are the columns to keep.
dplyr::data_frame(a = 1:3, b = 4:6) Combine vectors into a data frame (optimized).
dplyr::arrange(mtcars, mpg) Order rows by values of a column (low to high).
dplyr::arrange(mtcars, desc(mpg)) Order rows by values of a column (high to low).
dplyr::rename(tb, y = year) Rename the columns of a data frame.
tidyr::spread(pollution, size, amount.
Today, I wanted to talk a little bit about how dplyr 1.0.0 uses the vctrs package. This post explains why vctrs is so important, why we can’t just copy what base R does, how to interpret some of the new error messages that you’ll see, and some of the major changes since the last version.
Update: as of June 1, dplyr 1.0.0 is now available on CRAN! Read all about it or install it now with install.packages("dplyr").
The heart of the reason we’re using vctrs is the need to combine vectors. You’re already familiar with one base R tool for combining vectors, c().
Combining vectors comes up in many places in the tidyverse, e.g.:
dplyr::summarise() has to combine the results from each group.
dplyr::bind_rows() has to combine columns from different data frames.
dplyr::full_join() has to combine the keys from the x and y data frames.
tidyr::pivot_longer() has to combine multiple columns into one.
Our goal is to unify the code that underlies all these various functions so that there’s one consistent, principled approach. We’ve already made the change in tidyr, and now it’s dplyr’s turn.
You might wonder why we can’t just copy the behaviour of c(). Unfortunately, c() has some major downsides:
It doesn’t have a factor method, so it converts factors to their underlying integer levels.
It’s difficult to implement methods when different classes are involved. For example, combining a date (Date) and a date-time (POSIXct) yields an incorrect result because the underlying data is combined without first being translated.
It’s difficult to change how c() works because any changes are likely to break some existing code, and base R is committed to backward compatibility. Additionally, c() isn’t the only way that base R combines vectors. rbind() and unlist() can also be used to perform a similar job, but return different results. This is not to say that the tidyverse has been any better in the past: we have used a variety of ad hoc methods, undoubtedly well more than three different approaches.
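The downsides above can be explored in a short base R session. This is a minimal sketch with made-up values, not the example from the original post. What c() returns for factors and for mixed dates depends on your R version (as far as I know, c() gained a factor method in R 4.1.0), so no printed results are shown.

```r
# Base R only; the values here are hypothetical.
f1 <- factor("a")
f2 <- factor("b")

# Older R versions return the underlying integer codes here;
# newer versions return a factor.
combined_factors <- c(f1, f2)

# With a date and a date-time, c() dispatches on the first argument,
# so swapping the order changes the class of the result.
today <- as.Date("2020-03-24")
now <- as.POSIXct("2020-03-24 10:34", tz = "UTC")
date_first <- c(today, now)
datetime_first <- c(now, today)
class(date_first)
class(datetime_first)

# unlist() also combines vectors, but follows its own rules:
# a list made only of factors is combined into a factor.
unlisted <- unlist(list(f1, f2))
class(unlisted)
```

Comparing class(date_first) with class(datetime_first) shows that the result depends on argument order.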
Given that it’s hard to fix the problem in base R, we’ve come up with our own alternative to c(): vctrs::vec_c(). vec_c()'s behaviour is governed by three main principles:
vec_c(x, y) should return a type as similar as possible to vec_c(y, x). For example, when combining a date and a date-time you always get a date-time.
vec_c(x, y) should return the richer type, where type <x> is richer than type <y> if x can represent all values in y. For example, this implies that combining an integer and a double should return a double, and that combining a date and a date-time should return a date-time.
vec_c(x, y) should error if x and y are of fundamentally different types. For example, this implies that combining a string and a number, or a factor and a date, should error.
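The three principles can be checked directly with vctrs. The following is a minimal sketch with made-up values; the errors are captured with tryCatch() so the script runs to the end, and the exact error wording may differ between vctrs versions.

```r
library(vctrs)

# Hypothetical inputs for illustration.
day <- as.Date("2020-03-24")
moment <- as.POSIXct("2020-03-24 10:34", tz = "UTC")

# Order should not matter: both calls give a date-time.
date_then_datetime <- vec_c(day, moment)
datetime_then_date <- vec_c(moment, day)
class(date_then_datetime)
class(datetime_then_date)

# The richer type wins: integer combined with double gives double.
int_and_dbl <- vec_c(1L, 2.5)
typeof(int_and_dbl)

# Fundamentally different types should fail; capture the condition
# instead of stopping.
string_and_number <- tryCatch(vec_c("a", 1), error = function(e) e)
factor_and_date <- tryCatch(vec_c(factor("a"), day), error = function(e) e)
inherits(string_and_number, "error")
inherits(factor_and_date, "error")
```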
As a data scientist, you don’t really need to know much about the vctrs package, except that it exists and it’s used internally by dplyr. (As a software engineer, you might want to learn about vctrs because it makes it easier to create new types of vectors.) But vctrs is responsible for creating a number of error messages in dplyr, so it’s worth understanding their basic form.
In this first example, we attempt to bind two data frames together where the columns have incompatible types: double and character.
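A minimal sketch of this situation, assuming two hypothetical one-column tibbles, df1 and df2, whose x columns are double and character respectively. The error is captured so that its message can be printed; the exact wording may vary between dplyr and vctrs versions.

```r
library(dplyr)

# Hypothetical data: x is double in df1 and character in df2.
df1 <- tibble(x = 1)
df2 <- tibble(x = "a")

# bind_rows() refuses to combine the two x columns.
result <- tryCatch(bind_rows(df1, df2), error = function(e) e)

# Print the message that vctrs generated, if there was an error.
if (inherits(result, "error")) {
  cat(conditionMessage(result), "\n")
}
```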
Note the components of the error message:
“Can’t combine” means that vctrs can’t combine double and character vectors.
vctrs error messages always put the “type” of the variable in angle brackets, such as <double> or <character>. I’m using type informally here (although it does have a precise definition); for many simple cases it’s the same as the class.
bind_rows() doesn’t have named arguments so vctrs uses ..1 and ..2 to refer to the first and second arguments. You can tell the problem is with the
If after reading the error, you do still want to combine the data frames, you’ll need to make them compatible by manually transforming one of the columns:
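One way to do that, sketched with hypothetical tibbles df1 and df2, is to convert the double column to character before binding. Converting the other way round is equally valid if every value parses as a number.

```r
library(dplyr)

# Hypothetical data: x is double in df1 and character in df2.
df1 <- tibble(x = 1)
df2 <- tibble(x = "a")

# Make the columns compatible by converting df1$x to character first.
combined <- bind_rows(mutate(df1, x = as.character(x)), df2)
combined
```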
Where possible, we attempt to give you more information to solve the problem. For example, if your call to mutate() returns incompatible types, we’ll tell you which groups have the problem:
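A minimal sketch of such a failure, assuming a hypothetical grouped tibble in which the new column is a string for one group and a number for the other. The error is captured so its message can be inspected; the wording depends on the dplyr version.

```r
library(dplyr)

# Hypothetical data with two groups.
df <- tibble(g = c(1, 2))
by_group <- group_by(df, g)

# y is character in the first group and double in the second,
# so the per-group results cannot be combined.
result <- tryCatch(
  mutate(by_group, y = if (g[1] == 1) "a" else 1),
  error = function(e) e
)

if (inherits(result, "error")) {
  cat(conditionMessage(result), "\n")
}
```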
Writing good error messages is hard, so we’ve spent a lot of time trying to make them informative. We expect them to continue to improve as we see more examples from live data analysis code.
If you’re not sure where the errors are coming from, learning how to use the traceback (either traceback() or rlang::last_error()) will be helpful. I’d highly recommend Jenny Bryan’s rstudio::conf keynote on debugging: Object of type ‘closure’ is not subsettable.
Using vctrs in dplyr also causes two behaviour changes. We hope that these don’t affect much existing code because they both previously generated warnings.
When combining factors with different level sets, dplyr previously converted to a character vector with a warning. As of 1.0.0, dplyr will create a factor with the union of the individual levels:
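A minimal sketch of the new behaviour, using two hypothetical tibbles whose factor columns have different level sets:

```r
library(dplyr)

# Hypothetical data: the two factors share no levels.
df1 <- tibble(x = factor("a"))
df2 <- tibble(x = factor("b"))

# With dplyr 1.0.0 or later the result is a factor whose levels
# are the union of the input levels.
combined <- bind_rows(df1, df2)
class(combined$x)
levels(combined$x)
```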
When combining a factor and a character, dplyr previously warned about creating a character vector. It now silently creates a character vector:
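A minimal sketch, again with hypothetical tibbles, this time mixing a factor column with a character column:

```r
library(dplyr)

# Hypothetical data: x is a factor in df1 and character in df2.
df1 <- tibble(x = factor("a"))
df2 <- tibble(x = "b")

# With dplyr 1.0.0 or later the combined column is a character vector.
combined <- bind_rows(df1, df2)
class(combined$x)
```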
These changes are motivated more by pragmatism than by theory. Strictly speaking, one should probably consider 'male' and factor('male') to be incompatible, but this level of strictness causes much pain because character vectors can usually be used interchangeably with factors.
Note that dplyr continues to be stricter than base R when it comes to character conversions:
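The contrast can be sketched by calling vctrs directly, since that is what dplyr uses internally. The inputs are made up, and the vctrs error is captured rather than raised.

```r
library(vctrs)

# Base R silently converts the number to a string.
base_result <- c(1, "2")
class(base_result)

# vctrs treats double and character as incompatible and errors.
vctrs_result <- tryCatch(vec_c(1, "2"), error = function(e) e)
inherits(vctrs_result, "error")
```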
In this case, we don’t know whether you want a character vector or a numeric vector, so you need to decide by manually converting one of the inputs:
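Both choices are sketched below with made-up inputs, again using vctrs::vec_c() directly. Which one is right depends on what the data means, so treat this as an illustration rather than a recommendation.

```r
library(vctrs)

# Hypothetical inputs: a number and a string that happens to hold a number.
number <- 1
text <- "2"

# Option 1: decide that the result should be character.
as_text <- vec_c(as.character(number), text)

# Option 2: decide that the result should be numeric.
# as.numeric() gives NA (with a warning) for strings that are not numbers.
as_number <- vec_c(number, as.numeric(text))

class(as_text)
class(as_number)
```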