tidyr::spread() and dplyr::rename_at() in action
RI was recently confronted to a situation that required going from a long dataset to a wide dataset, but with a small twist: there were two datasets, which I had to merge into one. You might wonder what kinda crappy twist that is, right? Well, let’s take a look at the data:
data1; data2
## # A tibble: 20 x 4
## country date variable_1 value
## <chr> <chr> <chr> <int>
## 1 lu 01/01/2005 maybe 22
## 2 lu 01/07/2005 maybe 13
## 3 lu 01/01/2006 maybe 40
## 4 lu 01/07/2006 maybe 25
## 5 lu 01/01/2005 totally_agree 42
## 6 lu 01/07/2005 totally_agree 17
## 7 lu 01/01/2006 totally_agree 25
## 8 lu 01/07/2006 totally_agree 16
## 9 lu 01/01/2005 totally_disagree 39
## 10 lu 01/07/2005 totally_disagree 17
## 11 lu 01/01/2006 totally_disagree 23
## 12 lu 01/07/2006 totally_disagree 21
## 13 lu 01/01/2005 kinda_disagree 69
## 14 lu 01/07/2005 kinda_disagree 12
## 15 lu 01/01/2006 kinda_disagree 10
## 16 lu 01/07/2006 kinda_disagree 9
## 17 lu 01/01/2005 kinda_agree 38
## 18 lu 01/07/2005 kinda_agree 31
## 19 lu 01/01/2006 kinda_agree 19
## 20 lu 01/07/2006 kinda_agree 12
## # A tibble: 20 x 4
## country date variable_2 value
## <chr> <chr> <chr> <int>
## 1 lu 01/01/2005 kinda_agree 22
## 2 lu 01/07/2005 kinda_agree 13
## 3 lu 01/01/2006 kinda_agree 40
## 4 lu 01/07/2006 kinda_agree 25
## 5 lu 01/01/2005 totally_agree 42
## 6 lu 01/07/2005 totally_agree 17
## 7 lu 01/01/2006 totally_agree 25
## 8 lu 01/07/2006 totally_agree 16
## 9 lu 01/01/2005 totally_disagree 39
## 10 lu 01/07/2005 totally_disagree 17
## 11 lu 01/01/2006 totally_disagree 23
## 12 lu 01/07/2006 totally_disagree 21
## 13 lu 01/01/2005 maybe 69
## 14 lu 01/07/2005 maybe 12
## 15 lu 01/01/2006 maybe 10
## 16 lu 01/07/2006 maybe 9
## 17 lu 01/01/2005 kinda_disagree 38
## 18 lu 01/07/2005 kinda_disagree 31
## 19 lu 01/01/2006 kinda_disagree 19
## 20 lu 01/07/2006 kinda_disagree 12
As explained in Hadley (2014), this is how you should keep your data… But for a particular purpose, I had to transform these datasets. What I was asked to do was to merge these into a single wide data frame. Doing this for one dataset is easy:
data1 %>%
spread(variable_1, value)
## # A tibble: 4 x 7
## country date kinda_agree kinda_disagree maybe totally_agree
## <chr> <chr> <int> <int> <int> <int>
## 1 lu 01/01/2005 38 69 22 42
## 2 lu 01/01/2006 19 10 40 25
## 3 lu 01/07/2005 31 12 13 17
## 4 lu 01/07/2006 12 9 25 16
## # ... with 1 more variable: totally_disagree <int>
But because data1
and data2
have the same levels for variable_1
and variable_2
, this would not
work. So the solution I found online, in this SO thread was to use tidyr::spread()
with
dplyr::rename_at()
like this:
data1 <- data1 %>%
spread(variable_1, value) %>%
rename_at(vars(-country, -date), funs(paste0("variable1:", .)))
glimpse(data1)
## Observations: 4
## Variables: 7
## $ country <chr> "lu", "lu", "lu", "lu"
## $ date <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree` <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree` <int> 69, 10, 12, 9
## $ `variable1:maybe` <int> 22, 40, 13, 25
## $ `variable1:totally_agree` <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
data2 <- data2 %>%
spread(variable_2, value) %>%
rename_at(vars(-country, -date), funs(paste0("variable2:", .)))
glimpse(data2)
## Observations: 4
## Variables: 7
## $ country <chr> "lu", "lu", "lu", "lu"
## $ date <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable2:kinda_agree` <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree` <int> 38, 19, 31, 12
## $ `variable2:maybe` <int> 69, 10, 12, 9
## $ `variable2:totally_agree` <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21
rename_at()
needs variables which you pass to vars()
, a helper function to select variables, and
a function that will do the renaming, passed to funs()
. The function I use is simply paste0()
,
which pastes a string, for example “variable1:” with the name of the columns, given by the single ‘.’,
a dummy argument. Now these datasets can be merged:
data1 %>%
full_join(data2) %>%
glimpse()
## Joining, by = c("country", "date")
## Observations: 4
## Variables: 12
## $ country <chr> "lu", "lu", "lu", "lu"
## $ date <chr> "01/01/2005", "01/01/2006", "01/0...
## $ `variable1:kinda_agree` <int> 38, 19, 31, 12
## $ `variable1:kinda_disagree` <int> 69, 10, 12, 9
## $ `variable1:maybe` <int> 22, 40, 13, 25
## $ `variable1:totally_agree` <int> 42, 25, 17, 16
## $ `variable1:totally_disagree` <int> 39, 23, 17, 21
## $ `variable2:kinda_agree` <int> 22, 40, 13, 25
## $ `variable2:kinda_disagree` <int> 38, 19, 31, 12
## $ `variable2:maybe` <int> 69, 10, 12, 9
## $ `variable2:totally_agree` <int> 42, 25, 17, 16
## $ `variable2:totally_disagree` <int> 39, 23, 17, 21
Hope this post helps you understand the difference between long and wide datasets better, as well
as dplyr::rename_at()
!