Reason 9 Reproducibility

  • How to organize your project directory
  • Turning an RMarkdown notebook into a shareable document
  • Using Git to track changes
  • Using {renv} snapshots to manage package versions
  • Other ways to improve reproducibility

Workflows that are based on graphical user interfaces are difficult to reproduce! In contrast, when you code your analysis with R, all the steps are laid out in your script. Well-written code allows someone else to retrace everything you have done and to understand how you got from your data to your results.

To ensure that your results are truly reproducible, you must take additional measures. You need to make sure that your code produces the same results on another computer and at a later point in time. Moreover, you should provide enough documentation to allow others to understand the motivation behind your analysis.

You may be thinking that nobody will care to reproduce your analysis. Even if this is true, it is quite possible that you will get back to it at some later point, and your future self will certainly be grateful to you for making their job as easy as possible.

9.1 Self-contained projects with RStudio

You have already learned how to store your analysis as a self-contained RStudio project. This was the first step to reproducibility, because RStudio projects are portable and self-contained.

In addition, you should get familiar with best practices for organizing and naming the files in your project folder. As a primer, I recommend reading Good Enough Practices for Scientific Computing (Wilson et al. 2017). A few of the key points are:

  • Store your data in a data/ directory. Keep your raw data as raw as possible.
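As a rough sketch of such a layout, the following R code builds a small project skeleton in a temporary folder. Only data/ comes from the recommendation above; the other folder names (R/, results/, doc/) and the file names are assumptions for illustration, so adapt them to your own project.

```r
# Build a hypothetical project skeleton in a temporary folder
project <- file.path(tempdir(), "my-analysis")
dirs <- c("data", "R", "results", "doc")  # only data/ is prescribed above
for (d in dirs) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# Raw data lives in data/ and is only ever read, never overwritten
raw_file <- file.path(project, "data", "raw.csv")
write.csv(head(cars), raw_file, row.names = FALSE)

# Anything derived from the raw data is written somewhere else
raw <- read.csv(raw_file)
cleaned <- raw[raw$speed > 4, ]
write.csv(cleaned, file.path(project, "results", "cleaned.csv"),
          row.names = FALSE)

list.files(project, recursive = TRUE)
```

Because every path is built with file.path() relative to the project folder, the same code keeps working when the folder is moved to another computer.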

9.2 Reproducible reports with RMarkdown

Another tool for reproducible analyses that we have already introduced is RMarkdown. As you know, an RMarkdown document lets you mix text, R code, and the output of that R code.

Now here is something amazing about RMarkdown: With the click of a button, you can turn an RMarkdown document into something that you can share with others, even if they do not have R! Let’s try this with our file analysis.Rmd.

Exercise 9.1 Click the knit button.

OK, but it still looks rather … computerish. How do I make something that a normal person would want to read, or a journal would be willing to accept? As so often in R, there are packages that can help you with that, for example by producing nicely formatted APA tables for models and summary statistics.

Note that RStudio has a visual markdown editor that makes authoring longer documents more convenient. You can activate it with the button on the top right of the script panel.

And if the data changes or a journal reviewer asks you to exclude a subject? Update the code and click the button again — all the figures and stats update! No more copy and paste!

9.2.1 Other output formats

You can use RMarkdown to make a wide range of documents.

HTML is probably the most convenient and easy to use output format. If you want to use PDF or Word, you will have to worry about page sizes and page breaks. It’s best to avoid this until later in the process. For PDF in particular, you will also have to install a LaTeX distribution. I recommend {tinytex}.
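The knit button has a scriptable counterpart, rmarkdown::render(), whose output_format argument selects the type of document. The sketch below is only an illustration: it writes a tiny stand-in for analysis.Rmd to a temporary folder (the file content is made up) and renders it to HTML and Word. It assumes that {rmarkdown} and pandoc are available and silently skips the rendering otherwise.

```r
# Write a tiny stand-in for analysis.Rmd (hypothetical content)
rmd_file <- file.path(tempdir(), "analysis.Rmd")
writeLines(c(
  "---",
  'title: "Analysis"',
  "---",
  "",
  "The mean speed in the cars data set is `r mean(cars$speed)` mph."
), rmd_file)

outputs <- character(0)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  for (fmt in c("html_document", "word_document")) {
    # render() returns the path of the file it created
    outputs[fmt] <- rmarkdown::render(rmd_file, output_format = fmt,
                                      quiet = TRUE)
  }
  # PDF works the same way with output_format = "pdf_document", but needs
  # LaTeX first, e.g. by running tinytex::install_tinytex() once.
}
outputs
```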

We will leave it at this short teaser for RMarkdown. There’s much more that RMarkdown can do, for example:

  • insert references and bibliographies

  • embed videos and interactive graphics

  • parameterized reports for recurring reports.
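To give a flavour of the last point, here is a minimal sketch of a parameterized report. The file name report.Rmd, the parameter min_speed, and the output file names are all made up for this example. The report declares the parameter in its header, and rmarkdown::render() overrides it through the params argument. As before, the rendering is skipped when {rmarkdown} or pandoc is missing.

```r
# A hypothetical report with one parameter declared in the YAML header
rmd_file <- file.path(tempdir(), "report.Rmd")
writeLines(c(
  "---",
  'title: "Speed report"',
  "params:",
  "  min_speed: 10",
  "---",
  "",
  "Cars faster than `r params$min_speed` mph: `r sum(cars$speed > params$min_speed)`."
), rmd_file)

outputs <- character(0)
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  for (speed in c(10, 20)) {
    # One output file per parameter value
    outputs[as.character(speed)] <- rmarkdown::render(
      rmd_file,
      params = list(min_speed = speed),
      output_file = paste0("report-min-speed-", speed, ".html"),
      envir = new.env(),  # keep each run separate from the workspace
      quiet = TRUE
    )
  }
}
outputs
```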

To learn more, check out one of the two big books of RMarkdown (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020), or RStudio’s tutorial on RMarkdown (https://rmarkdown.rstudio.com/lesson-1.html). If you are a researcher, you might like RMarkdown for Scientists, a guide to RMarkdown aimed at researchers.

9.3 Version control with Git

Version control is an essential tool for any programming endeavour. It is like a time machine for your code. Among other things, version control allows you to

  • store the state of your project and restore a previous state

  • create parallel versions (branches) of your code, switch between them and merge them back together

  • collaborate with others on the same code, review and merge the changes that others have made

Using version control is very liberating because it gives you the freedom to experiment and make mistakes. A popular version control tool is Git. To start using Git with your project you will need to do four things:

  1. Install Git

  2. Initialize a new Git repository in your project directory.

  3. Decide which files and changes you want to save — there may be some files that you do not really need to track. For example, large intermediate data files may not be worth tracking if they can be easily recreated from the code.

  4. Commit the current state — A Git commit is to code what the 💾 is to a Word document.
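Git reads the list of files it should not track (step 3) from a plain text file called .gitignore in the project folder. As a small sketch, the following R code writes such a file into a temporary project. The entries are only examples: results/cache/ in particular is a made-up folder standing in for large intermediate files.

```r
# Hypothetical project folder; in practice this is your RStudio project
project <- file.path(tempdir(), "my-analysis")
dir.create(project, recursive = TRUE, showWarnings = FALSE)

ignore <- c(
  ".Rproj.user/",   # RStudio's own per-user settings
  ".Rhistory",      # console history
  ".RData",         # saved workspace
  "results/cache/"  # made-up folder for large, easily recreated files
)

# One pattern per line is all Git needs
writeLines(ignore, file.path(project, ".gitignore"))
readLines(file.path(project, ".gitignore"))
```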

In principle you could use Git in the terminal, but there are many graphical user interfaces for Git that make things easier, and RStudio has a Git interface built in.[15] In the terminal, steps 2 to 4 look like this:

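# Step 2: create an empty repository in the project directory
git init
# Step 3: stage the files to track (here: everything in the folder)
git add .
# Step 4: commit the staged state with a short message
git commit -m "initial commit"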

Congratulations! You have made your first Git commit! Now you can rest assured that you can always go back to the current state of the analysis, see the changes you have made since, figure out where things broke, and undo changes if necessary. Isn’t that comforting?

To use the full power of Git, you will have to learn a few more things. This is outside the scope of this book, but you can find a good introduction to Git for R users in the book Happy Git with R (https://happygitwithr.com/). Learn to use Git and use it often!

9.4 Make your development environment reproducible

9.4.1 Snapshot packages with {renv}

Unfortunately, using Git to track changes in your own code is not enough, because the packages that you use may get updated too, and sometimes the developers decide to change the way a function works. This can cause your code to break. It can be really frustrating to find that your analysis from a few months ago suddenly does not run anymore, or that it fails to run on the computer of a collaborator who uses different versions of the same packages. You can avoid this frustration by storing a snapshot of your package library with the {renv} package (Ushey 2021).

Install the {renv} package. Then, in your project, initialize {renv} and take a snapshot of the packages you use:

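install.packages("renv")  # install {renv} once per computer
renv::init()              # set up a private library for this project
renv::snapshot()          # record the package versions in renv.lock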

This creates a private, per-project library in which new packages will be installed. This library is isolated from other R libraries on your system. Later, you or your collaborators can recreate this library with the exact package versions you used by running renv::restore().
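On the receiving end, a collaborator might wrap the restore step in a small helper like the one sketched below. The function name restore_project_library() is made up for this example. It only calls renv::restore() when the project actually contains a lockfile (renv.lock) and {renv} is installed, and it reports what it did.

```r
# Hypothetical helper: restore the package library of a project, if possible
restore_project_library <- function(project = ".") {
  lockfile <- file.path(project, "renv.lock")
  if (!file.exists(lockfile)) {
    message("No renv.lock found in ", project, "; nothing to restore.")
    return(invisible(FALSE))
  }
  if (!requireNamespace("renv", quietly = TRUE)) {
    message("Please run install.packages(\"renv\") first.")
    return(invisible(FALSE))
  }
  # Install the package versions recorded in the lockfile without asking
  renv::restore(project = project, prompt = FALSE)
  invisible(TRUE)
}

# A folder without a lockfile is left untouched
restored <- restore_project_library(tempfile("no-such-project"))
```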

9.4.2 Switching between R versions

R itself receives updates, too. While the updates are often minor and breaking changes are rare, it is not guaranteed that today’s code will work on a future version of R. It is possible to install multiple versions of R in parallel on the same computer and switch between them. On Windows, you can do this in the RStudio interface under Tools ▸ Global Options ▸ General ▸ Basic ▸ R version. On macOS, it is a little more complicated, but a tool called RSwitch (https://rud.is/rswitch/) makes it a bit easier.
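Whichever way you switch, it helps if your script notices when it runs on a different R version than the one it was written for. The following is a minimal sketch of such a check. The helper check_r_version() and the recorded version "4.3.1" are assumptions for this example; record the version you actually used.

```r
# Reduce a version such as "4.3.1" to its major.minor part, "4.3"
major_minor <- function(version) {
  parts <- strsplit(as.character(version), ".", fixed = TRUE)[[1]]
  paste(parts[1:2], collapse = ".")
}

# Hypothetical helper: warn when the running R differs from the recorded one
check_r_version <- function(recorded, running = getRversion()) {
  same_series <- identical(major_minor(recorded), major_minor(running))
  if (!same_series) {
    warning("This analysis was written for R ", recorded,
            " but is running on R ", as.character(running), ".")
  }
  same_series
}

# Placed at the top of a script; "4.3.1" stands in for your own R version
ok <- suppressWarnings(check_r_version("4.3.1"))
```

Comparing only the major and minor version is a design choice in this sketch: patch releases mostly fix bugs, so they are treated as compatible here.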

9.4.3 Snapshot everything with Docker

Finally, you can also take the idea of snapshotting your environment a step further by using Docker containers. This is akin to creating an entire virtual computer just for your analysis. Instead of just your analysis files, you then track, store, and share that entire virtual computer. This is a lot more effort, but it may be the only way to guarantee that your analysis can be reproduced out of the box. However, this is beyond the scope of this book.

9.5 More

R has many more tools for improving reproducibility, and as always there’s more than one option.

9.5.1 Pipelines

Once your analysis becomes more complex, it often contains many different outputs (models, figures, intermediate data sets). Some of them can take a long time to compute, and it can be difficult to stay on top of the different steps of your analysis. There are packages that try to solve this problem:

  • {targets} - smart make-files for your analysis
  • {orderly} - makes sure your reproducible reports are reproducible
  • {rrtools}
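To illustrate the idea behind such pipeline tools, here is a small sketch with {targets}. The target names (raw_data, model, slope) and the helper run_demo_pipeline() are made up, and the built-in mtcars data stand in for a real data set. The pipeline is written to a throwaway folder, built once, and then built again to show that steps that are up to date get skipped. The sketch assumes {targets} is installed and does nothing otherwise.

```r
# Hypothetical demo: define and run a three-step pipeline in a temp folder
run_demo_pipeline <- function() {
  demo_dir <- tempfile("targets-demo")
  dir.create(demo_dir)
  old_wd <- setwd(demo_dir)
  on.exit(setwd(old_wd), add = TRUE)

  # Write the pipeline definition to the file _targets.R
  targets::tar_script({
    library(targets)
    list(
      tar_target(raw_data, mtcars),
      tar_target(model, lm(mpg ~ wt, data = raw_data)),
      tar_target(slope, coef(model)[["wt"]])
    )
  }, ask = FALSE)

  targets::tar_make()  # first run: builds all three targets
  targets::tar_make()  # second run: nothing changed, so steps are skipped
  targets::tar_read(slope)
}

slope <- if (requireNamespace("targets", quietly = TRUE)) {
  run_demo_pipeline()
} else {
  NA_real_
}
```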

9.5.2 Test-driven development

You can also protect your analysis code with automated tests, for example with {testthat}, {tinytest}, or {testrmd}.
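As a brief sketch of what such a test can look like, the code below defines a small helper function and checks it against values worked out by hand. The function standard_error() is a made-up example, and the test only runs if {testthat} is installed.

```r
# A made-up helper function that an analysis might rely on
standard_error <- function(x) {
  sd(x) / sqrt(length(x))
}

if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("standard_error() matches a hand calculation", {
    # sd(c(1, 2, 3)) is 1, so the standard error is 1 / sqrt(3)
    testthat::expect_equal(standard_error(c(1, 2, 3)), 1 / sqrt(3))
    # Missing values are passed through instead of silently dropped
    testthat::expect_true(is.na(standard_error(c(1, NA))))
  })
}
```

If a later change to standard_error() breaks one of these expectations, the test fails loudly instead of letting a wrong number slip into your results.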


  15. Personally, I prefer GitHub Desktop (https://desktop.github.com/) because it is prettier, snappier, and arguably more user friendly.