Code longevity of the R programming language
I’ve been working on a way to evaluate how well old R code runs on the current version of R, and am now ready to share some results. It all started with this tweet:
The problem is that you have to find old code lying around. Some people have found old code they wrote a decade or more ago and tried to rerun it; there’s this blog post by Thomas Lumley and this other one by Colin Gillespie that I find fascinating, but ideally we’d have more than a handful of old scripts lying around. This is when Dirk Eddelbuettel suggested this:
And this is what I did. I wrote a lot of code to produce this graph:
This graph shows the following: for each version of R, starting with R version 0.6.0 (released in 1997), how well the examples that came with a standard installation of R run on the current version of R (version 4.2.2 as of writing). These are the examples from the default packages like {base}, {stats}, {stats4}, and so on. It turns out that more than 75% of the example code from version 0.6.0 still works on the current version of R. A small fraction output a message (which doesn’t mean the code doesn’t work), some 5% raise a warning (which again doesn’t necessarily mean that the code doesn’t work), and finally around 20% raise an error. As you can see, the closer we get to the current release, the fewer errors get raised.
(But something important should be noted: just because some old piece of code runs without error doesn’t mean that the result is exactly the same. There might be cases where the same function returns different results on different versions of R.)
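One well-known instance: the default sampling algorithm changed in R 3.6.0, so code that relies on sample() and a fixed seed produces different draws depending on the version of R it runs on:

# R >= 3.6.0 uses the "Rejection" sampling method by default;
# older versions used "Rounding", so the same seed gives different draws.
set.seed(1234)
sample(10, 3)

# On a recent R, the pre-3.6.0 behaviour can be reproduced explicitly
# (this raises a warning about the non-uniform "Rounding" sampler):
set.seed(1234, sample.kind = "Rounding")
sample(10, 3)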
Then, once I had this graph, I had to continue with packages. How well do old examples from any given package run on the current version of the same package?
What I came up with is a Docker image that runs this for you, and even starts a Shiny app
to let you explore the results. All you have to do is edit one line in the Dockerfile.
This Docker image uses a lot of code from other projects, and I even had to write a package for this, called {wontrun}.
The {wontrun} package
The problem I needed to solve was how to easily run examples from archived packages. So I first needed an easy way to download them, then extract the examples, and then run them. To help me with this I wrote the {wontrun} package (thanks again to Deemah for suggesting the name and making the hex logo!). To be honest, the quality of this package could be improved. Documentation is still lacking, and the package only seems to work on Linux (but that’s not an issue, since it really only makes sense to use it within Docker). In any case, this package has a function to download the archived source code for a given package, get_archived_sources(). This function takes the name of a package as input and returns a data frame with the archived sources and the download links to them. To actually download the source packages, the get_examples() function is used. This function extracts the examples from the man/ folder included in source packages and converts them into scripts. Remember that example files are in the .Rd format, which is a kind of markup language. Thankfully, there’s a function called Rd2ex() from the {tools} package which I use to convert .Rd files into .R scripts.
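To make this more concrete, here is a rough sketch of how these pieces fit together (the paths and arguments below are illustrative; the actual {wontrun} internals may differ):

library(wontrun)
library(tools)

# Data frame listing archived versions of {dplyr} with their download links
dplyr_sources <- get_archived_sources("dplyr")

# Each .Rd file from an unpacked source package then gets converted
# into a runnable R script, roughly like this:
Rd2ex("dplyr/man/mutate.Rd", out = "mutate_example.R")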
Then, all that’s left to do is to run these scripts. But that’s not as easy as one might think. This is because I first need to make sure that the latest version of the package is installed, and ideally, I don’t want to pollute my library with packages that I never use but only wanted to assess for their code longevity. I also need to make sure that I’m running all these scripts all else being equal: same version of R, same version of the current packages, and same operating system. That’s why I needed to use Docker for this. Also, all the required dependencies to run the examples should get installed as well; sometimes, some examples load data from another package. For this, I’m using the renv::dependencies() function, which scans a file for calls to library() or package::function() to list the dependencies, which then get installed. This all happens automatically.
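As an illustration, this is roughly how renv::dependencies() can be used to discover and install whatever an example script needs (a minimal sketch using the hypothetical script from above, not the exact code inside {wontrun}):

# Scan a script for library() calls and pkg::fun() usage;
# returns a data frame with one row per detected dependency
deps <- renv::dependencies("mutate_example.R")
unique(deps$Package)

# Install whatever is needed before sourcing the example
install.packages(unique(deps$Package))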
To conclude this section: I cannot stress enough how much I’m relying on work by other people for this. This is the NAMESPACE file of the {wontrun} package (I’m only showing the import statements):
importFrom(callr,r_vanilla)
importFrom(ctv,ctv)
importFrom(dplyr,filter)
importFrom(dplyr,group_by)
importFrom(dplyr,mutate)
importFrom(dplyr,rename)
importFrom(dplyr,select)
importFrom(dplyr,ungroup)
importFrom(furrr,future_map2)
importFrom(future,multisession)
importFrom(future,plan)
importFrom(janitor,clean_names)
importFrom(lubridate,year)
importFrom(lubridate,ymd)
importFrom(lubridate,ymd_hm)
importFrom(magrittr,"%>%")
importFrom(pacman,p_load)
importFrom(pkgsearch,cran_package)
importFrom(purrr,keep)
importFrom(purrr,map)
importFrom(purrr,map_chr)
importFrom(purrr,map_lgl)
importFrom(purrr,pluck)
importFrom(purrr,pmap_chr)
importFrom(purrr,possibly)
importFrom(renv,dependencies)
importFrom(rlang,`!!`)
importFrom(rlang,cnd_message)
importFrom(rlang,quo)
importFrom(rlang,try_fetch)
importFrom(rvest,html_nodes)
importFrom(rvest,html_table)
importFrom(rvest,read_html)
importFrom(stringr,str_extract)
importFrom(stringr,str_remove_all)
importFrom(stringr,str_replace)
importFrom(stringr,str_trim)
importFrom(tibble,as_tibble)
importFrom(tidyr,unnest)
importFrom(tools,Rd2ex)
importFrom(withr,with_package)
That’s a lot of packages, most of them from Posit. What can I say, these packages are great! Even if I could reduce the number of dependencies of {wontrun}, I honestly cannot be bothered; I’ve been spoilt by the quality of Posit packages.
Docker for reproducibility
The Dockerfile I wrote is based on Ubuntu 22.04, compiles R 4.2.2 from source, and sets the repositories to https://packagemanager.rstudio.com/cran/__linux__/jammy/2022-11-21. This way, the packages get downloaded exactly as they were on November 21st 2022. This ensures that if readers of this blog post want to run this to assess the code longevity of some R packages, we can compare results and be certain that any conditions raised are not specific to any difference in R or package versions. It should be noted that this Dockerfile is based on the work of the Rocker project, and more specifically their versioned images, which are recommended when reproducibility is needed. Because the code runs inside Docker, it doesn’t matter that the {wontrun} package only runs on Linux (I think this is because of the untar() function, which I use to decompress the downloaded archives from CRAN, and which seems to behave differently on Linux and Windows. No idea how this function behaves on macOS).
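In practice, pinning the repository boils down to one line in the image’s Rprofile.site (a sketch of the idea; the actual base image may set this up differently):

# Every install.packages() call inside the container then resolves
# against this dated snapshot of CRAN
options(repos = c(CRAN = "https://packagemanager.rstudio.com/cran/__linux__/jammy/2022-11-21"))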
The image defined by this Dockerfile is quite heavy, because I also installed all possible dependencies required to run R packages smoothly. This is because even though the Posit repositories install compiled packages on Linux, shared libraries are still needed for the packages to run.
Here is what the Dockerfile looks like:
FROM brodriguesco/wontrun:r4.2.2
# This gets the shiny app
RUN git clone https://github.com/b-rodrigues/longevity_app.git
# These are needed for the Shiny app
RUN R -e "install.packages(c('dplyr', 'forcats', 'ggplot2', 'shiny', 'shinyWidgets', 'DT'))"
RUN mkdir /home/intermediary_output/
RUN mkdir /home/output/
COPY wontrun.R /home/wontrun.R
# Add one line per package you want to assess
RUN Rscript '/home/wontrun.R' dplyr 6
RUN Rscript '/home/wontrun.R' haven 6
CMD mv /home/intermediary_output/* /home/output/ && R -e 'shiny::runApp("/longevity_app", port = 1506, host = "0.0.0.0")'
As you can see, it starts by pulling an image from Docker Hub called brodriguesco/wontrun:r4.2.2. This is the image based on Ubuntu 22.04 with R compiled from source and all dependencies pre-installed.
(This Dockerfile is available here.)
Then my Shiny app gets cloned, the required packages for the app to run get installed, and some needed directories get created. Now comes the interesting part: a script called wontrun.R gets copied into the image. This is what the script looks like:
#!/usr/bin/env Rscript
# args[1] is the package name, args[2] the number of cores to use
args <- commandArgs(trailingOnly = TRUE)
library(wontrun)
packages_sources <- get_archived_sources(args[1])
# commandArgs() returns character strings, so coerce the number of cores
out <- wontrun(packages_sources,
               ncpus = as.numeric(args[2]),
               setup = TRUE,
               wontrun_lib = "/usr/local/lib/R/site-library/")
saveRDS(object = out,
        file = paste0("/home/intermediary_output/", args[1], ".rds"))
This script uses the {wontrun} package to get the archived sources of a package of interest, and the examples get executed and results tallied using the wontrun() function. The results then get saved into an .rds file.
Calling this script is done with this line in the Dockerfile:
RUN Rscript '/home/wontrun.R' dplyr 6
The dplyr and 6 get passed down to the wontrun.R script in a character vector called args. So args[1] is the string "dplyr", and args[2] is the string "6" (hence the as.numeric() coercion in the script). This means that the examples from archived versions of the {dplyr} package will get assessed on the current version of {dplyr} using 6 cores. You can add as many lines as you want and thus assess as many packages as you want. Once you’re done editing the Dockerfile, you can build the image; this will actually run the code, so depending on how many packages you want to assess and the complexity of the examples, this may take some hours. To build the image, run this in a console:
docker build -t code_longevity_packages .
Now, you still need to actually run a container based on this image. Running the container will move the .rds files from the container to your machine so you can actually get to the results, and it will also start a Shiny app in which you will be able to upload the .rds files and explore the results. Run the container with (and don’t forget to replace /path/to/repository/ with the correct path on your machine):
docker run --rm --name code_longevity_packages_container -v /path/to/repository/code_longevity_packages/output:/home/output:rw -p 1506:1506 code_longevity_packages
Go over to http://localhost:1506/ to start the Shiny app and explore the results.
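If you prefer inspecting the raw results outside of the Shiny app, you can also read the .rds files directly from the mounted output/ folder (assuming the paths from the docker run command above):

# Results produced inside the container, copied to the mounted folder
results <- readRDS("/path/to/repository/code_longevity_packages/output/dplyr.rds")
str(results)  # inspect what wontrun() returned for each archived version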
A collaborative project
Now it’s your turn: are you curious about the code longevity of one particular package? Then fork the repository, edit the Dockerfile, build, run, and open a pull request! It’d be great to have an overview of the code longevity of as many packages as possible. I thought about looking at the longevity of several packages that form a group, like the tidyverse packages or packages from CRAN Task Views. Please also edit the package_list.txt file, which lists packages for which we already have results. Find the repository here:
https://github.com/b-rodrigues/code_longevity_packages
By the way, if you want the results for the language itself (so the results of running the examples of {base}, {stats}, etc.), go here and click “download”.
Looking forward to your pull requests!
Hope you enjoyed! If you found this blog post useful, you might want to follow me on Mastodon or twitter for blog post updates and buy me an espresso or paypal.me, or buy my ebook on Leanpub. You can also watch my videos on youtube. So much content for you to consoom!