UNC CSCC 2017-11-16

Reproducible Statistical Reports

Purpose of Reproducible Analysis/Reports

  • The code is the ultimate documentation of how data analysis was done
  • Need to be able to regenerate an entire analysis and report with a single command
  • Allows others to reproduce your work
  • Allows you to easily re-run analyses upon data corrections/updates or changes in statistical analysis
  • Supports teamwork and continuity through personnel changes
  • Journals starting to require code
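
A minimal sketch of the single-command idea, assuming the report is an R Markdown file and that the rmarkdown package and pandoc are installed. The file name and report contents are hypothetical, and the built-in cars data stands in for study data.

```r
# Hypothetical example: write a tiny R Markdown report to a temp file,
# then rebuild analysis and document with one call to rmarkdown::render().
fence <- strrep("`", 3)                      # chunk delimiter, built programmatically
rmd <- file.path(tempdir(), "report.Rmd")    # hypothetical file name
writeLines(c(
  "---",
  "title: \"Example report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)   # built-in data set, stands in for study data",
  fence
), rmd)

# The single command: re-runs every code chunk and regenerates the HTML.
# Guarded so the sketch degrades gracefully when rmarkdown or pandoc is missing.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()
out <- if (can_render) rmarkdown::render(rmd, quiet = TRUE) else NA_character_
```

After a data correction, re-running the last call is all that is needed to refresh every table and figure in the report.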

Problems with Current Model

  • R + knitr + LaTeX + pdflatex + Acrobat Reader
  • Exquisite control of formatting; beautiful printing
  • Only Adobe Acrobat Reader supports JavaScript in PDF files, for pop-ups etc.
  • Acrobat Reader is poorly supported and bloated
  • R greport function for clinical trial reports
    • Pop-ups to show detail
    • Minor update to Acrobat Reader on Macs disabled pop-ups

Problems, continued

  • Copying and pasting advanced tables from PDF into Word doesn’t work well
  • Graphics are static, without drill-down
  • Code is either present or absent in the report; the reader cannot choose
  • Requires extensive LaTeX styling

New HTML Model

  • RStudio HTML documents and HTML notebooks
    • Notebooks allow interactive graphics
  • HTML5, self-contained JavaScript
  • Viewable in any browser
  • R functions write HTML
    • Regular tabular output, hyperlinks, navigation bars, etc.
    • Advanced tables
      • htmlTable package and Hmisc summaryM
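
To illustrate what "R functions write HTML" can mean, here is a minimal base R sketch. The helper html_table() is hypothetical and is not the htmlTable or Hmisc implementation; it does no HTML escaping, so it assumes cell values contain no markup characters.

```r
# Hypothetical helper (not from Hmisc or htmlTable): turn a data frame
# into an HTML <table> string that a report chunk could emit.
# Assumes cell values need no HTML escaping.
html_table <- function(d) {
  # Wrap each element of x in <tag>...</tag> and glue the cells together
  cell <- function(x, tag) paste0("<", tag, ">", x, "</", tag, ">", collapse = "")
  header <- paste0("<tr>", cell(names(d), "th"), "</tr>")
  rows <- vapply(seq_len(nrow(d)), function(i) {
    paste0("<tr>", cell(vapply(d[i, ], format, character(1)), "td"), "</tr>")
  }, character(1))
  paste(c("<table>", header, rows, "</table>"), collapse = "\n")
}

d <- data.frame(group = c("A", "B"), n = c(10L, 12L))  # made-up example data
html <- html_table(d)
cat(html, "\n")
```

Inside an R Markdown document, a chunk with the knitr option results='asis' would pass this string straight through to the HTML output.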

New HTML Model, continued

  • R programming key: abstract markup, store translations in a central place
    • plain text, HTML, LaTeX
    • Go through the pain of figuring out markup for χ²₇ (chi-square with 7 d.f.) once
    • R Hmisc package markupSpecs list: large number of translations and helper functions
    • Special LaTeX/HTML translation tables for functions
    • Fine tuning: edit one file, markup used by many functions
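
The idea of storing translations in one place can be sketched as follows. The markup list and chisq_label() function are hypothetical illustrations in the spirit of markupSpecs; they do not reproduce the actual Hmisc structure.

```r
# Hypothetical central translation table (in the spirit of, but not copied
# from, Hmisc's markupSpecs): one set of translations per output format.
markup <- list(
  plain = list(chisq = function(df) paste0("chi-square(", df, ")")),
  html  = list(chisq = function(df) paste0("&chi;<sup>2</sup><sub>", df, "</sub>")),
  latex = list(chisq = function(df) paste0("$\\chi^{2}_{", df, "}$"))
)

# Reporting functions request a symbol by name and never hard-code markup,
# so fine tuning happens in the table above and nowhere else.
chisq_label <- function(df, format = c("plain", "html", "latex")) {
  format <- match.arg(format)
  markup[[format]]$chisq(df)
}

chisq_label(7, "html")
chisq_label(7, "latex")
```

Changing how the chi-square symbol is rendered in HTML then means editing a single entry, and every function that calls chisq_label() picks up the change.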

New HTML Model: Drawbacks

  • HTML file can contain real data, not just relative coordinates of points
  • Self-contained HTML files can be large
  • No concept of pagination and other special control for pretty printing
  • But: Nice format on any device (dynamic resizing)

Interactive Graphics

Full Interactivity

  • Requires statistical software to be running, i.e., the report is not self-contained and not usable offline
  • E.g. change the bandwidth and re-run a nonparametric smoother for trend; selection of variables to include in a model

Partial Interactivity

  • Zoom, pan
  • Rescale axes
  • Extra information pop-up (hover text)
  • Select which traces to show
  • Instead of having legends and explanations (e.g., for box plots), show extra information as hover text
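
A minimal sketch of these partial-interactivity features, assuming the plotly package is installed. The built-in mtcars data and the choice of hover fields are stand-ins for real study data.

```r
library(plotly)

# Built-in mtcars data stands in for study data (an assumption)
d <- mtcars
d$car <- rownames(mtcars)

p <- plot_ly(
  d, x = ~wt, y = ~mpg,
  type = "scatter", mode = "markers",
  color = ~factor(cyl),                     # one trace per cylinder count
  text = ~paste0(car, "<br>hp: ", hp),      # extra detail shown on hover
  hoverinfo = "text"                        # show only the custom hover text
)
p
```

In the rendered widget, zoom and pan come from the default toolbar, and clicking a legend entry shows or hides the corresponding trace.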

R Software: plotly Package

  • R interface to plotly, a JavaScript graphics library built on D3
  • Best-developed partially interactive scientific graphics for R
  • Has its own model, or:
  • ggplotly function: pass any ggplot2 graphics object through it to get interactivity
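
A minimal sketch of the ggplotly route, assuming the ggplot2 and plotly packages are installed. The mtcars data and axis labels are illustrative only.

```r
library(ggplot2)
library(plotly)

# An ordinary static ggplot2 object
g <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lb)", y = "Miles per gallon", colour = "Cylinders")

# Passing it through ggplotly() returns an interactive plotly widget
p <- ggplotly(g)
p
```

This keeps existing ggplot2 code unchanged, although some ggplot2 features may not convert exactly, so the result is worth checking for complex plots.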

Examples

R / RStudio

Abstract

Using R, Rmarkdown, RStudio, knitr, plotly, and HTML for the Next Generation of Reproducible Statistical Reports

Frank E Harrell Jr
Professor
Department of Biostatistics
Vanderbilt University School of Medicine

The Department of Biostatistics has two policies currently in effect:

  1. All statistical reports will be reproducible
  2. All reports should include all the code used to produce the report, in some fashion

We have succeeded with 1. (mainly using knitr in R) and to a large extent with 2. Some biostatisticians have been concerned about interspersing code with the contents of the report. It has also been challenging to copy some PDF report components (e.g., advanced tables) into word processing documents.

Fortunately R and RStudio have recently added a number of new features that allow for easy creation of HTML notebooks that are viewed with any web browser. This solves the problems listed above and adds new possibilities such as interactive graphics that appear in a self-contained HTML file to post on a collaboration web server or send to a collaborator. Interactive graphics allow the analyst to create more detail (e.g., confidence bands for multiple confidence levels; confidence bands for group differences as well as those for each group individually) with the collaborator able to easily select which details to view.

I have made major revisions in the R Hmisc and rms packages to provide new capabilities that fit into the R/RStudio Rmarkdown HTML notebook framework. Interactive plotly graphics (based on JavaScript and D3) and customized HTML output are the main new ingredients. In this talk the rationale for this approach is discussed, and the new features are demonstrated with two statistical reports. A few miscellaneous topics will also be covered, e.g., how to cite bibliographic references in Rmarkdown and how to interface R to citeulike.org for viewing or extracting bibliographic references.

For more information see