R Workflow

R Workflow for Reproducible Data Analysis and Reporting

Author
Affiliation

Department of Biostatistics
School of Medicine
Vanderbilt University

Published

September 16, 2023

flowchart LR
R[R Workflow] --> Rformat[Report formatting]
Rformat --> Quarto[Quarto setup<br><br>Using metadata in<br>report output<br><br>Table and graph formatting]
R --> DI[Data import] --> Annot[Annotate data<br><br>View data dictionary<br>to assist coding]
R --> Do[Data overview] --> F[Observation filtration<br>Missing data patterns<br>Data about data]
R --> P[Data processing] --> DP[Recode<br>Transform<br>Reshape<br>Merge<br>Aggregate<br>Manipulate]
R --> Des[Descriptive statistics<br>Univariate or simple<br>stratification]
R --> An[Analysis<br>Stay close to data] --> DA[Descriptive<br><br>Avoid tables by using<br>nonparametric smoothers] & FA[Formal]
R --> CP[Caching<br>Parallel computing<br>Simulation]

Preface

This work is intended to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting. It will enable the reader to efficiently produce attractive, readable, and reproducible research reports while keeping code concise and clear. Readers are also guided in choosing statistically efficient descriptive analyses that are consonant with the type of data being analyzed. The Statistical Thinking article R Workflow provides an overview of this book and includes some more motivation from the standpoint of doing good scientific research.

Anyone who claims to be able to do good data science without coding is misleading you. Coding is one of the most valuable skills for data preparation and analysis, and it leads to personal efficiency, reproducibility, and maintainability. Learning how to write concise, elegant, debuggable code that generalizes to handle more complex tasks is not an insurmountable goal for anyone dealing with data, and R Workflow is intended to assist you in this regard.

The methods in R Workflow will be helpful to anyone who analyzes data, whether they work in business, marketing, manufacturing, journalism, finance, science, observational research, experimental research, or virtually any other field that needs to understand data. The book is best suited for those having at least rudimentary experience in running R commands, but Chapter 3 points readers to excellent resources for learning R from scratch. R can also be learned by starting with some standard analysis templates such as those in this GitHub repository.

The work also showcases RStudio’s Quarto, which is a new standard for making beautiful and reproducible reports with R and other languages. This book also captures what I’ve learned in using R (and its precursor S) heavily in biomedical research and clinical trials since 1991. See my Statistical Thinking blog at fharrell.com and resources at hbiostat.org for more.

The term “workflow” can connote a rigid step-by-step process of data processing and reporting. In one’s day-to-day usage of R, however, myriad needs arise, and much creativity is needed to get the most insight from data while writing reliable code that generates reproducible results. R Workflow will equip R users/analysts with a variety of powerful and flexible tools that will assist them in attacking a wide range of problems and producing elegant reports while reducing the amount of coding required.

A video covering many parts of the first 13 chapters may be found here.

The general statistical analysis/inference companion to this book is Biostatistics for Biomedical Research, which is a reproducible book with numerous examples of R code. For an in-depth text and course notes on reproducible regression modeling with R, including extensive case studies, see RMS.

Resources for Learning Quarto

The author wishes to thank the R Core team and R package developers along with RStudio for the free software they have developed that has revolutionized statistical computing, reporting, and reproducible research. Thanks to Titus von der Malsburg for careful reading of the text and for reporting numerous typographical and grammatical errors and a few programming errors. Thanks to Norm Matloff, University of California, Davis, who provided big ideas to improve the preface and motivation for the book.

Date Sections Changes
2023-09-16 18.2 More array-style simulation examples
2023-07-30 4.2 Mention tabsets, collapsible text, and tricks
2023-07-10 10.2 Added data.table::setcolorder
2023-07-19 3.9 Listed data.table set functions
2023-07-12 5.7 New section on protecting sensitive files
2023-05-11 3.12 Added info about learning by running scripts from Github
2023-05-06 11.3 New section on customizing summary statistic tables using gt
2023-04-28 14.2, 14.3 New section on graphical devices, added ggplot2 themes and fonts
2023-04-20 13.2.1 New subsection on adding summary statistics to a longitudinal dataset
2023-04-09 4.5.1 New subsection on gt package
2023-04-08 2.10, 9 Switched to new describe function output
2023-04-02 4.8 New section on mixing graphics and tables
2023-04-02 1.2 New small subsection with links to installing R and RStudio
2023-03-29 3 Several new language features covered
2023-03-29 13.2.2 New subsection on interpolating longitudinal data to a target time
2023-03-26 18.1 New subsection showing simulation using lapply and rbindlist
2023-03-26 14.3 Example of plotting in a for-loop, and math expressions in caption
2023-03-24 3.11 New section on interactive code writing
2023-03-16 5.2 Added description of new features of cleanupREDCap
2023-03-13 10 New example of data.table by-reference using a list of data tables
2023-03-05 14.3 Added ggplot2 ECDF example, with math rendering; added plotting of ECDFs with different transformations, and labeling with math notation
2023-02-28 10.3 Examples added for combine.levels
2023-02-26 5.2 New csv.get example, expanded Excel, added General tab which discusses the rio package
2023-02-25 10.6 New section on computing total scores with simple imputation
2023-02-24 10.5.1 New reshaping example
2023-02-23 5.2 Added description of new features in importREDCap
2023-02-18 Many Updated chapters to use Hmisc 5.0 and the pre-release of the new qreport package, and dropped use of reptools and movStats from Github. Made use of new Hmisc easy labeling functions hlab, hlabs, vlab.
2023-02-08 2 Changed rendering of html for contents and describe in anticipation of Hmisc 4.8-0
2023-01-19 14.3 Added how to plot transformed axes
2023-01-16 14.3 Added simpler way to pull labels and units for plotting
2022-12-17 3.9.1 New subsection on character manipulation functions
2022-12-15 10.7 New subsection on text analysis
2022-12-11 4.7 New section on graphviz for diagrams
2022-12-04 3.3 New section on dates and date/times
2022-12-03 14, 14.1 Linked to hex binning example and added new section
2022-12-03 Replaced length(unique(x)) with uniqueN(x) everywhere
2022-11-29 12.2 New rolling join (closest match) example
2022-11-22 10.2 Added let alias for := in data.table
2022-11-09 5.2 Discussed multDataOverview function to summarize a list of datasets
2022-11-07 10.3.1 New section showing how to specify derived variable formulas in a separate file
2022-11-05 5.6 New section for qs package for object storage
2022-11-05 5.2 Much new material on REDCap
2022-11-05 10.2 Examples of in-place data.table changes of variables named in a separate vector
2022-11-01 12.1 New subsection with example on looking up participant disposition for multiple clinical trials
2022-10-24 12.2 New subsection on merging with closest matches
2022-10-22 3.9.3 New subsection on conditional function definitions
2022-10-22 3.4 New section on logical operators
2022-10-22 3.6 Added more subscripting examples
2022-10-17 10.8 Added direct retrieval fst example where row numbers are looked up
2022-10-16 5.5 Added fst package as alternative to saveRDS
2022-09-24 Preface Link to YouTube video
2022-09-21 14, 3.12 New links to Irizarry book
2022-09-09 3.12 New resources for learning R
2022-08-28 4.6 New section on styling html with css
2022-08-24 3.10 New section on stat model formula language
2022-08-19 14 Added how to use group= in ggplot2
2022-08-15 3.2 Added material about R object naming
2022-08-15 Preface Added links to resources for learning Quarto
2022-08-14 2 Introduce the packages used (thanks to Tom Philips)
2022-07-17 6, 8 Moved overall missing data summary to missChk
2022-07-11 14 New introductory text and references copied from BBR Chapter 4
2022-07-10 2 New chapter with a case study of methods used in the book (thanks to Norm Matloff)
2022-07-08 11.1 New section on using data.table with summarization functions that return two-dimensional results
2022-07-07 15, 15.7 New chart about 1st, 2nd, 3rd order analysis; new section with example of 3rd order
2022-07-06 13, 10 Re-wrote intro to chapter, added LOCF example, added data table examples using %like%
2022-07-05 10.2, 10 Renamed section and added more about removing columns; added link to data.table vignettes
2022-07-04 10.4 New section on operations on multiple data tables
2022-07-03 10 New diagram to explain data tables
2022-06-30 Preface Better wording (thanks to Norm Matloff)
2022-06-28 Added Flow diagrams at the start of chapters (thanks to Norm Matloff)
2022-06-27 4.5 New section on making html tables
2022-06-27 3.9, 3.2, 3.5, 4 Added more basic R functions, arrays, NAs, and how to make knitr use plain-text printing of objects such as data frames/tables
2022-06-26 Preface Clarified goals and audience (thanks to Norm Matloff)
2022-06-26 Fixed various typographical errors (thanks to Titus von der Malsburg)
2022-06-15 Published