Reproducible Research

May 26, 2024

Introduction

Reproducible research (RR) is the practice of conducting and presenting research in such a way that others, and you yourself, can later re-implement your research strategy without ambiguity. In the context of statistical collaboration, this means that you or someone else can easily reproduce all of your actions relating to data management and data analysis, and reach the same result. Since the statistical collaborator’s work is mostly done using computer tools, reproducible research means documenting all of the tools and procedures (applications, data storage formats, programs/scripts) that were used.

For more detail and background information on reproducible research see this. For an entire reproducible R workflow see R Workflow.

The most significant barrier to adopting RR practices is the perceived additional cost in time. While the initial cost of RR may be greater than that of some non-RR practices, there is a strong argument that RR actually saves time over the course of a research project, because it streamlines actions that are often repeated. For example, data handling steps may be repeated many times because the data are updated, e.g., because errors are found or new data become available. By implementing an RR framework, the statistical collaborator avoids having to remember and manually repeat the data management and analysis steps every time the data are updated. As another example, peer-reviewed manuscripts often require revision, and the revision process may be greatly simplified when RR practices are used.

When working with data from collaborators, reproducibility should be considered as early as possible. Suppose that data are received from a collaborator in Microsoft Excel format, but that it’s necessary to convert the data to CSV format. A copy of the original Excel database should be kept for posterity, and it should be documented (e.g. in a README file, or comment in an R script) that the original data were received in Excel format, but were converted to CSV format. How the conversion is accomplished (e.g. using the export feature of Microsoft Excel) should also be recorded.
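The documentation step described above can itself be scripted. The following is a minimal sketch in base R, not a prescribed procedure: the function name `record_provenance`, the file names, and the layout of the README entry are all assumptions made for illustration. It records the original and converted file names, their MD5 checksums (via `tools::md5sum`), and a free-text description of how the conversion was done.

```r
# Append a provenance entry for a converted data file to a plain-text README.
# `record_provenance` and the entry layout are hypothetical, for illustration only.
record_provenance <- function(original, converted, method, readme = "README.txt") {
  entry <- c(
    paste("Date recorded:", format(Sys.Date())),
    paste("Original file:", basename(original)),
    paste("Original MD5:", unname(tools::md5sum(original))),
    paste("Converted file:", basename(converted)),
    paste("Converted MD5:", unname(tools::md5sum(converted))),
    paste("Conversion method:", method),
    ""  # blank line separating this entry from the next
  )
  # Open in append mode so that earlier entries are preserved
  con <- file(readme, open = "a")
  on.exit(close(con))
  writeLines(entry, con)
  invisible(entry)
}

# Demonstration with stand-in files in a temporary directory
demo_dir <- tempfile("provenance-demo-")
dir.create(demo_dir)
original <- file.path(demo_dir, "baseline.xlsx")  # stand-in for the received Excel file
csv      <- file.path(demo_dir, "baseline.csv")   # stand-in for the converted file
readme   <- file.path(demo_dir, "README.txt")
writeLines("placeholder for the received Excel file", original)
writeLines(c("id,age", "1,54", "2,61"), csv)

entry <- record_provenance(original, csv,
                           method = "export feature of Microsoft Excel",
                           readme = readme)
```

Storing checksums alongside the description makes it possible to verify later that the archived original and the working CSV file are still the ones that were documented.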

There are many ways to implement reproducible research, and many software tools to facilitate it. How RR is implemented is a matter of preference. Note, however, that some strategies may impose fewer barriers to RR (e.g. time, financial, interoperability) than others. Because of widespread adoption, ease of use, and zero financial cost, R is one of the primary tools for RR. RStudio has developed a free graphical environment for R that has additional features to facilitate RR.

Literate Programming and Reproducibility

Literate programming: writing documentation that contains computer code. Documentation (and perhaps a statistical report) and code are maintained in one file, and an extractor program splits out the code to compile. HP Wolf and P Naeve have done a lot of work in this area. In Peter Wolf’s words:

“Some years ago we have developed a system for reporting the steps of a data analysis. The system is based on the ideas of literate programming. Noweb and LaTeX are used to generate nice output. The result of the tangle path can be reloaded by our function revive() into the S-Plus interpreter. Then you can select and extract the elements of the old analysis, you can modify them and you can activate the statements again. Therefore, our tool can be used for teaching, demonstrations, case studies, … We have constructed a lot of papers for our statistics courses in this way.”

Reproducible electronic documents from Matt Schwab and Jon Claerbout of Stanford University. This approach is based on the make utility readily available for Unix, Linux, and Windows. Final figures and calculations are easily regenerated by running make, which senses file dependencies and creation/modification dates to re-run whatever needs to be re-run to build the final product. Quoting Schwab and Claerbout,

“It takes some effort to organize your research to be reproducible. We found that although the effort seems to be directed to helping other people stand up on your shoulders, the principal beneficiary is generally the author herself. This is because time turns each one of us into another person, and by making effort to communicate with strangers, we help ourselves to communicate with our future selves.”

A Manifesto for Reproducible Science by MR Munafò et al
Charles Geyer’s excellent page on literate programming and related areas
Roger Peng’s examples of reproducible research
Why should you avoid using point-and-click methods in statistical software packages by C Baum and S Sirin, Boston College
Reproducibility in Econometrics Research by Roger Koenker. A document on that page describes many useful approaches, including a function how.created that makes it easy to attach to an object the following information: comments, user name, date, and the environment in effect (e.g., the search list) when the object was created.
University of Michigan ICPSR Guide to Social Sciences Data Preparation and Archiving - includes information on data entry, quality control, data management, codebooks and other documentation, and archiving
Excellent knitr examples from The Statistical Sleuth
Baggerly and Broman course
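The make-based approach mentioned above rests on one idea: a product is rebuilt only when it is missing or older than a file it depends on. The following is a minimal sketch of that idea in base R, not a reimplementation of `make` and not Schwab and Claerbout's system. The function names `is_stale` and `build_if_stale` and the demonstration files are assumptions made for illustration.

```r
# TRUE when `target` is missing or older than any of its dependencies
is_stale <- function(target, deps) {
  stopifnot(all(file.exists(deps)))
  !file.exists(target) || any(file.mtime(deps) > file.mtime(target))
}

# Run `build` only when `target` is stale, and report what was done
build_if_stale <- function(target, deps, build) {
  if (is_stale(target, deps)) {
    build()
    "rebuilt"
  } else {
    "up to date"
  }
}

# Demonstration in a temporary directory (all file names are hypothetical)
demo_dir     <- tempfile("make-demo-")
dir.create(demo_dir)
raw          <- file.path(demo_dir, "raw.csv")
summary_file <- file.path(demo_dir, "summary.txt")
writeLines(c("x", "1", "2", "3"), raw)

make_summary <- function() {
  d <- read.csv(raw)
  writeLines(paste("mean of x:", mean(d$x)), summary_file)
}

first  <- build_if_stale(summary_file, raw, make_summary)  # target does not exist yet
second <- build_if_stale(summary_file, raw, make_summary)  # nothing has changed

# Simulate an update to the raw data by moving its modification time forward
Sys.setFileTime(raw, Sys.time() + 60)
third  <- build_if_stale(summary_file, raw, make_summary)  # dependency is now newer
```

A real `make` setup additionally chains such rules, so that updating one raw file triggers every downstream product that depends on it.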

Statistical Reports

The recommended software environment for RR is Quarto, using the R knitr package and pandoc to produce html reports. This web page was produced with that combination.

Templates

`quarto`

Quarto is a replacement for R Markdown that has several advantages related to formatting and to supporting multiple programming languages. A useful Quarto template for reports is here with html and pdf output.

`rmdformats`

See this for an R Markdown html report template using rmdformats. Example output is here. The complete source script is here.

Comprehensive Example

This is a comprehensive report illustrating many features, including having the output and graphics automatically reformatted depending on whether rmdformats html output is being rendered or a pdf file is being created. This report also illustrates parallel computing, having complete control over when time-consuming calculations need to be re-run, and the use of the data.table package.
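One way to control when a time-consuming calculation is re-run is to store its result together with the inputs that produced it, and to recompute only when the inputs differ. The following is a minimal sketch in base R under that assumption. It is not the mechanism used in the report above, and the names `cached_run` and `slow_mean` are hypothetical.

```r
# Re-run `fun` only when `inputs` differ from those saved with the cached result.
# `cached_run` is a hypothetical helper, for illustration only.
cached_run <- function(fun, inputs, cache_file) {
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    if (identical(cached$inputs, inputs)) {
      return(list(value = cached$value, recomputed = FALSE))
    }
  }
  value <- do.call(fun, inputs)
  saveRDS(list(inputs = inputs, value = value), cache_file)
  list(value = value, recomputed = TRUE)
}

slow_mean <- function(x, trim) {
  Sys.sleep(0.1)  # stand-in for a time-consuming calculation
  mean(x, trim = trim)
}

cache_file <- tempfile(fileext = ".rds")
x <- c(1, 2, 3, 4, 100)

r1 <- cached_run(slow_mean, list(x = x, trim = 0),   cache_file)  # no cache yet
r2 <- cached_run(slow_mean, list(x = x, trim = 0),   cache_file)  # same inputs
r3 <- cached_run(slow_mean, list(x = x, trim = 0.2), cache_file)  # inputs changed
```

Comparing the full inputs is simple but stores a copy of the data in the cache file; for large data sets, comparing a checksum of the input file would be a lighter alternative.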

See R Flow for an article with suggested R workflows including analysis file creation and data manipulation.
