Lesser known dplyr 0.7* tricks
This blog post is an update to an older one I wrote in March.
In the post from March, dplyr was at version 0.5.0, but since then a major update introduced some changes that make some of the tips in that post obsolete. So here I revisit the blog post from March using dplyr 0.7.0.
Create new columns with mutate() and case_when()
The basic things, such as selecting columns, renaming them, filtering, etc., did not change with this new version. What did change, however, is creating new columns using case_when().
First, load dplyr and the mtcars dataset:
library("dplyr")
data(mtcars)
This was how it was done in version 0.5.0 (notice the ‘.$’ symbol before the variable ‘carb’):
mtcars %>%
    mutate(carb_new = case_when(.$carb == 1 ~ "one",
                                .$carb == 2 ~ "two",
                                .$carb == 4 ~ "four",
                                TRUE ~ "other")) %>%
    head(5)
## mpg cyl disp hp drat wt qsec vs am gear carb carb_new
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 four
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 four
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 one
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 one
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 two
This has been simplified to:
mtcars %>%
    mutate(carb_new = case_when(carb == 1 ~ "one",
                                carb == 2 ~ "two",
                                carb == 4 ~ "four",
                                TRUE ~ "other")) %>%
    head(5)
## mpg cyl disp hp drat wt qsec vs am gear carb carb_new
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 four
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 four
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 one
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 one
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 two
No need for .$ anymore.
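Since case_when() is an ordinary vectorised function, it can also be tried on a plain vector, which makes it easier to see how the conditions are matched. The sketch below uses a made-up vector named carb_values (not from the post) and shows that conditions are checked from top to bottom, with TRUE acting as the fallback:

```r
library("dplyr")

# Hypothetical input vector, chosen to hit every branch
carb_values <- c(1, 2, 3, 4, 8)

# Conditions are checked in order; the first one that is TRUE decides the label
carb_labels <- case_when(carb_values == 1 ~ "one",
                         carb_values == 2 ~ "two",
                         carb_values == 4 ~ "four",
                         TRUE ~ "other")
carb_labels
## [1] "one"   "two"   "other" "four"  "other"
```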
Apply a function to certain columns only, by rows, with purrrlyr
dplyr wasn’t the only package to get an overhaul; purrr got the same treatment.
In the past, I applied a function to certain columns like this:
mtcars %>%
    select(am, gear, carb) %>%
    purrr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)
Now, by_row() does not exist in purrr anymore. Instead, a new package called purrrlyr was introduced, with functions that don’t really fit inside purrr or dplyr:
mtcars %>%
    select(am, gear, carb) %>%
    purrrlyr::by_row(sum, .collate = "cols", .to = "sum_am_gear_carb") -> mtcars2
head(mtcars2)
## # A tibble: 6 x 4
## am gear carb sum_am_gear_carb
## <dbl> <dbl> <dbl> <dbl>
## 1 1 4 4 9
## 2 1 4 4 9
## 3 1 4 1 6
## 4 0 3 1 4
## 5 0 3 2 5
## 6 0 3 1 4
Think of purrrlyr as purrr’s and dplyr’s love child.
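For a simple row-wise sum like this one, the result can be cross-checked without purrrlyr. The sketch below is not the by_row() approach from above but a plain mutate() with vectorised addition; the name mtcars_check is made up for this example:

```r
library("dplyr")
data(mtcars)

# Swapped-in technique: vectorised addition instead of purrrlyr::by_row()
mtcars %>%
    select(am, gear, carb) %>%
    mutate(sum_am_gear_carb = am + gear + carb) -> mtcars_check
head(mtcars_check)
```

This only works because sum() over three columns can be written as an addition; for an arbitrary function applied by rows, by_row() remains the more general tool.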
Using dplyr functions inside your own functions, or what is tidyeval
Programming with dplyr has been simplified a lot. Before version 0.7.0, one needed to use dplyr in conjunction with lazyeval to use dplyr functions inside one’s own functions. It was not always very easy, especially if you mixed columns and values inside your functions. Here’s the example from the March blog post:
extract_vars <- function(data, some_string){
    data %>%
        select_(lazyeval::interp(~contains(some_string))) -> data
    return(data)
}
extract_vars(mtcars, "spam")
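For comparison, here is a sketch of how the same helper might look without lazyeval. It assumes that, because some_string is an ordinary value rather than a column name, it can be passed straight to the contains() helper inside select(). The name extract_vars_tidy and the search string "ar" are made up for this example:

```r
library("dplyr")
data(mtcars)

# Hypothetical rewrite: some_string is a plain value, so no quoting is needed
extract_vars_tidy <- function(data, some_string){
    data %>%
        select(contains(some_string)) -> data
    return(data)
}

# Keeps the columns whose names contain "ar", here gear and carb
ar_columns <- extract_vars_tidy(mtcars, "ar")
names(ar_columns)
## [1] "gear" "carb"
```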
More examples are available in this other blog post. I will revisit them now with dplyr’s new tidyeval syntax. I’d recommend you read the Tidy evaluation vignette here. This vignette is part of the rlang package, which gets used under the hood by dplyr for all your programming needs.
Here is the function I called simpleFunction(), written with the old dplyr syntax:
simpleFunction <- function(dataset, col_name){
    dataset %>%
        group_by_(col_name) %>%
        summarise(mean_mpg = mean(mpg)) -> dataset
    return(dataset)
}
simpleFunction(mtcars, "cyl")
## # A tibble: 3 x 2
## cyl mean_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
With the new syntax, it must be rewritten a little bit:
simpleFunction <- function(dataset, col_name){
    col_name <- enquo(col_name)
    dataset %>%
        group_by(!!col_name) %>%
        summarise(mean_mpg = mean(mpg)) -> dataset
    return(dataset)
}
simpleFunction(mtcars, cyl)
## # A tibble: 3 x 2
## cyl mean_mpg
## <dbl> <dbl>
## 1 4 26.7
## 2 6 19.7
## 3 8 15.1
What has changed? Forget the underscore versions of the usual functions, such as select_(), group_by_(), etc. Now, you must quote the column name using enquo() (or just quo() if working interactively, outside a function), which returns a quosure. This quosure can then be unquoted, and so evaluated in the context of the data, by putting !! in front of it inside the usual dplyr functions.
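To make the quo() case concrete, here is a minimal sketch of the same grouping done interactively, outside any function. The variable names grouping_col and by_cyl are made up for this example:

```r
library("dplyr")
data(mtcars)

# Outside a function, quo() captures the column name as a quosure
grouping_col <- quo(cyl)

# !! unquotes the quosure, so this behaves like group_by(cyl)
mtcars %>%
    group_by(!!grouping_col) %>%
    summarise(mean_mpg = mean(mpg)) -> by_cyl
by_cyl
```

The result should match the output of simpleFunction(mtcars, cyl) above: one row per value of cyl.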
Let’s look at another example, first written with the old syntax:
simpleFunction <- function(dataset, col_name, value){
    filter_criteria <- lazyeval::interp(~y == x, .values=list(y = as.name(col_name), x = value))
    dataset %>%
        filter_(filter_criteria) %>%
        summarise(mean_cyl = mean(cyl)) -> dataset
    return(dataset)
}
simpleFunction(mtcars, "am", 1)
## mean_cyl
## 1 5.076923
As you can see, it’s a bit more complicated, as you needed to use lazyeval::interp() to make it work.
With the improved dplyr, here’s how it’s done:
simpleFunction <- function(dataset, col_name, value){
    col_name <- enquo(col_name)
    dataset %>%
        filter((!!col_name) == value) %>%
        summarise(mean_cyl = mean(cyl)) -> dataset
    return(dataset)
}
simpleFunction(mtcars, am, 1)
## mean_cyl
## 1 5.076923
Much, much easier! There is something that you must pay attention to though. Notice that I’ve written:
filter((!!col_name) == value)
and not:
filter(!!col_name == value)
I have enclosed !!col_name inside parentheses. I struggled with this, but thanks to help from @dmi3k and @_lionelhenry I was able to understand what was happening (isn’t the #rstats community on twitter great?).
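One way to see why the parentheses matter is to look at how R itself parses the two expressions. In R's grammar, ! binds less tightly than ==, so without parentheses the whole comparison ends up underneath the two ! operators. The sketch below uses only base R's quote(), which returns the parsed expression without evaluating it; the variable names are made up for this example:

```r
# quote() captures the expression as parsed, without evaluating it
without_parens <- quote(!!col_name == value)
with_parens <- quote((!!col_name) == value)

# Outermost call: "!" without parentheses, "==" with them
as.character(without_parens[[1]])
## [1] "!"
as.character(with_parens[[1]])
## [1] "=="

# What sits under the two "!" when the parentheses are missing
deparse(without_parens[[2]][[2]])
## [1] "col_name == value"
```

In other words, without the parentheses the thing being unquoted is col_name == value as a whole, not just col_name.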
One last thing: let’s make this function a bit more general. I hard-coded the variable cyl inside the body of the function, but maybe you’d like the mean of another variable? Easy:
simpleFunction <- function(dataset, group_col, mean_col, value){
    group_col <- enquo(group_col)
    mean_col <- enquo(mean_col)
    dataset %>%
        filter((!!group_col) == value) %>%
        summarise(mean((!!mean_col))) -> dataset
    return(dataset)
}
simpleFunction(mtcars, am, cyl, 1)
## mean(cyl)
## 1 5.076923
«That’s very nice Bruno, but mean(cyl) in the output looks ugly as sin» you might think, and you’d be right. It is possible to set the name of the column in the output using := instead of =:
simpleFunction <- function(dataset, group_col, mean_col, value){
    group_col <- enquo(group_col)
    mean_col <- enquo(mean_col)
    mean_name <- paste0("mean_", mean_col)[2]
    dataset %>%
        filter((!!group_col) == value) %>%
        summarise(!!mean_name := mean((!!mean_col))) -> dataset
    return(dataset)
}
simpleFunction(mtcars, am, cyl, 1)
## mean_cyl
## 1 5.076923
To get the name of the column I added this line:
mean_name <- paste0("mean_", mean_col)[2]
To see what it does, try the following inside an R interpreter (remember to use quo() instead of enquo() outside functions!):
paste0("mean_", quo(cyl))
## [1] "mean_~" "mean_cyl"
enquo() quotes the input, and paste0() converts it to a string that can be used as a column name. However, the ~ is in the way, and the output of paste0() is a vector of two strings: the correct name is contained in the second element, hence the [2]. There might be a more elegant way of doing that, but for now this has been working well for me.
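One candidate for a tidier approach is quo_name() from rlang, which turns a quosure into a string directly and so avoids the [2] indexing. The sketch below is only a suggestion, not what the function above does; the names simpleFunctionNamed and mean_cyl_am1 are made up for this example, and it assumes rlang is installed (it is a dependency of dplyr):

```r
library("dplyr")
data(mtcars)

# Hypothetical variant of simpleFunction() that builds the name with quo_name()
simpleFunctionNamed <- function(dataset, group_col, mean_col, value){
    group_col <- enquo(group_col)
    mean_col <- enquo(mean_col)
    # quo_name() returns the quoted column name as a single string, e.g. "cyl"
    mean_name <- paste0("mean_", rlang::quo_name(mean_col))
    dataset %>%
        filter((!!group_col) == value) %>%
        summarise(!!mean_name := mean((!!mean_col))) -> dataset
    return(dataset)
}
mean_cyl_am1 <- simpleFunctionNamed(mtcars, am, cyl, 1)
mean_cyl_am1
```

The output should be the same single mean_cyl column as before.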
That was it, folks! I do recommend you read the Programming with dplyr vignette here, as well as other blog posts, such as the one recommended to me by @dmi3k here.
Have fun with dplyr 0.7.0!