What's the fastest way to search and replace strings in a data frame?
RI’ve tweeted this:
Just changed like 100 grepl calls to stringi::stri_detect and my pipeline now runs 4 times faster #RStats
— Bruno Rodrigues (@brodriguesco) July 20, 2022
much discussed ensued. Some people were surprised, because in their experience, grepl()
was faster than alternatives, especially if you set the perl
parameter in grepl()
to TRUE
.
My use case was quite simple; I have a relatively large data set (half a million lines) with
one column with several misspelling of city names. So I painstakingly wrote some code
to correct the spelling of the major cities (those that came up often enough to matter. Minor
cities were set to “Other”. Sorry, Wiltz!)
So in this short blog post, I benchmark some code to see if what I did the other day was a fluke.
Maybe something weird with my R installation on my work laptop running Windows 10 somehow
made stri_detect()
run faster than grepl()
? I don’t even know if something like that is
possible. I’m writing these lines on my Linux machine, unlike the code I run at work.
So maybe if I find some differences, they could be due to the different OS running.
I don’t want to have to deal with Windows on my days off (for my blood pressure’s sake),
so I’m not running this benchmark on my work laptop. So that part we’ll never know.
Anyways, let’s start by getting some data. I’m not commenting the code below, because that’s not the point of this post.
library(dplyr)
library(stringi)
library(stringr)
library(re2)
adult <- vroom::vroom(
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
)
adult_colnames <- readLines(
"https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"
)
adult_colnames <- adult_colnames[97:110] %>%
str_extract(".*(?=:)") %>%
str_replace_all("-", "_")
adult_colnames <- c(adult_colnames, "wage")
colnames(adult) <- adult_colnames
adult
## # A tibble: 32,560 × 15
## age workclass fnlwgt educa…¹ educa…² marit…³ occup…⁴ relat…⁵ race sex
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 50 Self-emp-no… 83311 Bachel… 13 Marrie… Exec-m… Husband White Male
## 2 38 Private 215646 HS-grad 9 Divorc… Handle… Not-in… White Male
## 3 53 Private 234721 11th 7 Marrie… Handle… Husband Black Male
## 4 28 Private 338409 Bachel… 13 Marrie… Prof-s… Wife Black Fema…
## 5 37 Private 284582 Masters 14 Marrie… Exec-m… Wife White Fema…
## 6 49 Private 160187 9th 5 Marrie… Other-… Not-in… Black Fema…
## 7 52 Self-emp-no… 209642 HS-grad 9 Marrie… Exec-m… Husband White Male
## 8 31 Private 45781 Masters 14 Never-… Prof-s… Not-in… White Fema…
## 9 42 Private 159449 Bachel… 13 Marrie… Exec-m… Husband White Male
## 10 37 Private 280464 Some-c… 10 Marrie… Exec-m… Husband Black Male
## # … with 32,550 more rows, 5 more variables: capital_gain <dbl>,
## # capital_loss <dbl>, hours_per_week <dbl>, native_country <chr>, wage <chr>,
## # and abbreviated variable names ¹education, ²education_num, ³marital_status,
## # ⁴occupation, ⁵relationship
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Let’s now write the functions used for benchmarking. There will be 5 of them:
- One using
grepl()
without any fancy options; - One using
grepl()
whereperl
is set toTRUE
; - One that uses
stringi::stri_detect()
; - One that uses
stringr::str_detect()
; - One that uses
re2::re2_detect()
.
Below you can read the functions. They’re all pretty much the same, only the function
looking for the string changes. These functions look for a string in the marital_status
variable and create a new variable with a corresponding integer.
with_grepl <- function(dataset){
dataset |>
mutate(married = case_when(
grepl("Married", marital_status) ~ 1,
grepl("married", marital_status) ~ 2,
TRUE ~ 3)
)
}
with_grepl_perl <- function(dataset){
dataset |>
mutate(married = case_when(
grepl("Married", marital_status, perl = TRUE) ~ 1,
grepl("married", marital_status, perl = TRUE) ~ 2,
TRUE ~ 3)
)
}
with_stringi <- function(dataset){
dataset |>
mutate(married = case_when(
stri_detect(marital_status, regex = "Married") ~ 1,
stri_detect(marital_status, regex = "married") ~ 2,
TRUE ~ 3)
)
}
with_stringr <- function(dataset){
dataset |>
mutate(married = case_when(
str_detect(marital_status, "Married") ~ 1,
str_detect(marital_status, "married") ~ 2,
TRUE ~ 3)
)
}
with_re2 <- function(dataset){
dataset |>
mutate(married = case_when(
re2_detect(marital_status, "Married") ~ 1,
re2_detect(marital_status, "married") ~ 2,
TRUE ~ 3)
)
}
Now I make extra sure these functions actually return the exact same thing. So for this
I’m running them once on the data and use testthat::expect_equal()
. It’s a bit
unwieldy, so if you have a better way of doing this, please let me know.
run_grepl <- function(){
with_grepl(adult) %>%
count(married, marital_status)
}
one <- run_grepl()
run_grepl_perl <- function(){
with_grepl_perl(adult) %>%
count(married, marital_status)
}
two <- run_grepl_perl()
run_stringi <- function(){
with_stringi(adult) %>%
count(married, marital_status)
}
three <- run_stringi()
run_stringr <- function(){
with_stringr(adult) %>%
count(married, marital_status)
}
four <- run_stringr()
run_re2 <- function(){
with_re2(adult) %>%
count(married, marital_status)
}
five <- run_re2()
one_eq_two <- testthat::expect_equal(one, two)
one_eq_three <- testthat::expect_equal(one, three)
three_eq_four <- testthat::expect_equal(three, four)
testthat::expect_equal(
one_eq_two,
one_eq_three
)
testthat::expect_equal(
one_eq_three,
three_eq_four
)
testthat::expect_equal(
one,
five)
testthat::expect_equal()
does not complain, so I’m pretty sure my functions, while different,
return the exact same thing. Now, we’re ready for the benchmark itself. Let’s run these
function 500 times using {microbenchmark}
:
microbenchmark::microbenchmark(
run_grepl(),
run_grepl_perl(),
run_stringi(),
run_stringr(),
run_re2(),
times = 500
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## run_grepl() 24.37832 24.89573 26.64820 25.50033 27.05967 115.0769 500
## run_grepl_perl() 19.03446 19.41323 20.91045 19.89093 21.16683 104.3917 500
## run_stringi() 23.01141 23.40151 25.00304 23.82441 24.83598 104.8065 500
## run_stringr() 22.98317 23.44332 25.32851 23.92721 25.18168 145.5861 500
## run_re2() 22.22656 22.60817 24.07254 23.05895 24.22048 108.6825 500
There you have it folks! The winner is grepl()
with perl = TRUE
, and then it’s
pretty much tied between stringi()
, stringr()
and re2()
(maybe there’s a slight edge
for re2()
) and grepl()
without perl = TRUE
is last. But don’t forget that this is running
on my machine with Linux installed on it; maybe you’ll get different results on different
hardware and OSs! So if you rely a lot on grepl()
and other such string manipulation
function, maybe run a benchmark on your hardware first. How come switching from grepl()
(without perl = TRUE
though) to stri_detect()
made my pipeline at work run 4 times
faster I don’t know. Maybe it has also to do with the size of the data, and the complexity
of the regular expression used to detect the problematic strings?
Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!