Reproducible Research
Introduction
Reproducible research (RR) is the practice of conducting and presenting research in such a way that others, and you yourself, can later re-implement your research strategy without ambiguity. In the context of statistical collaboration, this means that you or someone else can easily reproduce all of your actions relating to data management and data analysis, and reach the same result. Since the statistical collaborator’s work is mostly done using computer tools, reproducible research means documenting all of the tools and procedures (applications, data storage formats, programs/scripts) that were used.
For more detail and background information on reproducible research, see this. For an entire reproducible R workflow, see R Workflow.
The most significant barrier to adopting RR practices is the additional (perceived) cost in time. While the initial cost of RR may be greater than that of some non-RR practices, there is a strong argument that RR actually saves time over the course of a research project, because it streamlines actions that are often repeated. For example, data handling steps may be repeated many times because the data are updated, e.g., because errors are found, or new data become available. By implementing an RR framework, the statistical collaborator avoids having to remember and manually repeat the data management and analysis steps every time the data are updated. As another example, peer-reviewed manuscripts often require revision, and the revision process may be greatly simplified when RR practices are used.
When working with data from collaborators, reproducibility should be considered as early as possible. Suppose that data are received from a collaborator in Microsoft Excel format, but that it’s necessary to convert the data to CSV format. A copy of the original Excel database should be kept for posterity, and it should be documented (e.g. in a README file, or comment in an R script) that the original data were received in Excel format, but were converted to CSV format. How the conversion is accomplished (e.g. using the export feature of Microsoft Excel) should also be recorded.
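The record-keeping described above can also be scripted so that it is never forgotten. The following is a minimal sketch in Python, not a prescribed method; the same idea carries over directly to an R script. The function names `sha256_of` and `record_conversion`, the file names in the usage note, and the choice to store SHA-256 checksums are all assumptions made for illustration. The checksums make it possible to verify later that the archived original and the converted file are the ones the note refers to.

```python
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def record_conversion(original: Path, converted: Path, method: str, readme: Path) -> str:
    """Append a provenance note about a format conversion to a README file.

    Hypothetical helper: the note records the date, both file names,
    their checksums, and how the conversion was carried out.
    """
    note = (
        f"{date.today().isoformat()}: {original.name} "
        f"(sha256 {sha256_of(original)}) was converted to {converted.name} "
        f"(sha256 {sha256_of(converted)}). Method: {method}\n"
    )
    # Append rather than overwrite, so earlier notes are preserved.
    with readme.open("a", encoding="utf-8") as fh:
        fh.write(note)
    return note
```

For example, `record_conversion(Path("study.xlsx"), Path("study.csv"), "export feature of Microsoft Excel", Path("README.txt"))` would add one dated line to `README.txt` (the file names here are hypothetical).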
There are many ways to implement reproducible research, and many software tools to facilitate it. How RR is implemented is a matter of preference, but note that some strategies may impose fewer barriers to RR (e.g. time, financial, interoperability) than others. Because of widespread adoption, ease of use, and zero financial cost, R is one of the primary tools for RR. RStudio has developed a free graphical environment for R that has additional features to facilitate RR.
Literate Programming and Reproducibility
- Literate programming: Writing documentation containing computer code. Documentation (and perhaps a statistical report) and code are maintained in one file. An extractor program splits out the code to compile. HP Wolf and P Naeve have done a lot of work in this area. In Peter Wolf’s words:
Some years ago we have developed a system for reporting the steps of a data analysis. The system is based on the ideas of literate programming. Noweb and LaTeX are used to generate nice output. The result of the tangle path can be reloaded by our function revive() into the S-Plus interpreter. Then you can select and extract the elements of the old analysis, you can modify them and you can activate the statements again. Therefore, our tool can be used for teaching, demonstrations, case studies, … We have constructed a lot of papers for our statistics courses in this way.
- Reproducible electronic documents from Matt Schwab and Jon Claerbout of Stanford University. This approach is based on the make utility readily available for Unix, Linux, and Windows. Final figures and calculations are easily regenerated by running make, which senses file dependencies and creation/modification dates to re-run whatever needs to be re-run to build the final product. Quoting Schwab and Claerbout:
It takes some effort to organize your research to be reproducible. We found that although the effort seems to be directed to helping other people stand up on your shoulders, the principal beneficiary is generally the author herself. This is because time turns each one of us into another person, and by making effort to communicate with strangers, we help ourselves to communicate with our future selves.
- A Manifesto for Reproducible Science by MR Munafò et al
- Charles Geyer’s excellent page on literate programming and related areas
- Roger Peng’s examples of reproducible research
- Why should you avoid using point-and-click methods in statistical software packages by C Baum and S Sirin, Boston College
- Reproducibility in Econometrics Research by Roger Koenker. A document on that page describes many useful approaches, including a function how.created that makes it easy to attach to an object the following information: comments, user name, date, and the environment in effect (e.g., the search list) when the object was created.
- University of Michigan ICPSR Guide to Social Sciences Data Preparation and Archiving: includes information on data entry, quality control, data management, codebooks and other documentation, and archiving
- Excellent knitr examples from The Statistical Sleuth
- Baggerly and Broman course
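The timestamp-based rebuilding that make performs, described in the Schwab and Claerbout item above, can be sketched in a few lines. This is a simplified illustration in Python, not how make itself is implemented: it handles a single target and its direct sources, whereas make also follows chains of dependencies. The names `is_stale` and `build_if_stale` are hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable


def is_stale(target: Path, sources: Iterable[Path]) -> bool:
    """Return True if the target is missing or older than any of its sources."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    # The target must be rebuilt if any source was modified after it.
    return any(src.stat().st_mtime > target_mtime for src in sources)


def build_if_stale(target: Path, sources: Iterable[Path], build: Callable[[], None]) -> bool:
    """Run build() only when the target is out of date; report whether it ran."""
    if is_stale(target, list(sources)):
        build()
        return True
    return False
```

With this rule, a figure is regenerated only when the data file or script it depends on has changed since the figure was last written, which is what keeps repeated full rebuilds inexpensive.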
Statistical Reports
The recommended software environment for RR is Quarto using the R knitr package and pandoc to produce html reports, which is what produced this web page.
Templates
quarto
Quarto is a replacement for Rmarkdown that has several advantages related to formatting and being multi-lingual. A useful Quarto template for reports is here with html and pdf output.
rmdformats
See this for an R markdown html report template using rmdformats. Example output is here. The complete source script is here.
Comprehensive Example
This is a comprehensive report illustrating many features, including having the output and graphics automatically reformatted depending on whether rmdformats html output is being rendered, or a pdf file is being created. This report also illustrates parallel computing, having complete control over when time-consuming calculations need to be re-run, and the use of the data.table package.
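One common way to control when a time-consuming calculation is re-run is to store its result on disk together with a fingerprint of the inputs it depends on, and to recompute only when that fingerprint changes. The sketch below illustrates the idea in Python under stated assumptions; it is not the mechanism used in the report above, and the name `run_cached` and the cache file layout are hypothetical. It assumes the inputs are simple values (numbers, strings, tuples) that pickle serializes consistently, and that the cache file is trusted, since unpickling untrusted files is unsafe.

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable


def run_cached(compute: Callable[[], Any], cache_file: Path, *inputs: Any) -> Any:
    """Return the cached result unless `inputs` differ from those stored with it."""
    # Fingerprint of everything the calculation depends on.
    key = hashlib.sha256(pickle.dumps(inputs)).hexdigest()
    if cache_file.exists():
        with cache_file.open("rb") as fh:
            saved_key, result = pickle.load(fh)
        if saved_key == key:
            return result  # inputs unchanged: skip the slow step
    result = compute()
    # Store the fingerprint alongside the result for the next run.
    with cache_file.open("wb") as fh:
        pickle.dump((key, result), fh)
    return result
```

Passing a version string or the relevant model settings as `inputs` gives explicit control: changing any of them forces a re-run, while re-rendering the report with unchanged inputs reuses the stored result.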
See R Flow for an article with suggested R workflows including analysis file creation and data manipulation.