R Workflow for Reproducible Data Analysis and Reporting
Preface
This work is intended to foster best practices in reproducible data documentation and manipulation, statistical analysis, graphics, and reporting. It will enable the reader to efficiently produce attractive, readable, and reproducible research reports while keeping code concise and clear. Readers are also guided in choosing statistically efficient descriptive analyses that are consonant with the type of data being analyzed. The Statistical Thinking article R Workflow provides an overview of this book and includes some more motivation from the standpoint of doing good scientific research.
Anyone who claims to be able to do good data science without coding is misleading you. Coding is one of the most valuable skills for data preparation and analysis, and it leads to personal efficiency, reproducibility, and maintainability. Learning how to write concise, elegant, debug-able code that generalizes to handle more complex tasks is not an insurmountable goal for anyone dealing with data, and R Workflow is intended to assist you in this regard.
The methods in R Workflow will be helpful to anyone who analyzes data, whether they work in business, marketing, manufacturing, journalism, finance, science, observational research, experimental research, or virtually any other field that needs to understand data. The book is best suited for those having at least rudimentary experience in running R commands, but 3 R Basics points readers to excellent resources for learning R from scratch. R can also be learned by starting with some standard analysis templates such as those in this GitHub repository.
The work also showcases RStudio’s Quarto, which is a new standard for making beautiful and reproducible reports with R and other languages. This book also captures what I’ve learned in using R (and its precursor S) heavily in biomedical research and clinical trials since 1991. See my Statistical Thinking blog fharrell.com and resources at hbiostat.org for more.
The term “workflow” connotes a rigid step-by-step process of data processing and reporting. In one’s day-to-day usage of R, myriad needs arise, and much creativity is needed to get the most insights from data while writing reliable code that generates reproducible results. R Workflow will equip R users/analysts with a variety of powerful and flexible tools that will assist them in attacking a huge variety of problems and producing elegant reports while reducing the amount of coding required.
A video covering many parts of the first 13 chapters may be found here.
The general statistical analysis/inference companion to this book is Biostatistics for Biomedical Research, which is a reproducible book with numerous examples of R code. For an in-depth text and course notes on reproducible regression modeling with R, including extensive case studies, see RMS.
Resources for Learning Quarto
- Welcome to Quarto 2-hour workshop by Tom Mock
- Awesome Quarto list by Mickaël Canouil
The author wishes to thank the R Core team and R package developers along with RStudio for the free software they have developed that has revolutionized statistical computing, reporting, and reproducible research. Thanks to Titus von der Malsburg for careful reading of the text and for reporting numerous typographical and grammatical errors and a few programming errors. Thanks to Norm Matloff, University of California Davis, who provided big ideas to improve the preface and motivation for the book.
Date | Sections | Changes |
---|---|---|
2024-08-13 | ?sec-manip-gap | New section for computing interval gaps |
2024-08-11 | 10.1.1 Special Symbols in data.table Expressions | New subsection defining special data.table variables |
2024-07-27 | 10.2 Analyzing Selected Variables and Subsets | Added two ways to run describe over subsets, getting around a knitr bug in rendering the original approach |
2024-07-15 | 14.3 ggplot2 | Added ggiraph for tooltips on ggplot2 plots |
2024-07-06 | 20 Other Resources and Computing Environment | Changed environment pretty-printing to use the grateful package |
2024-05-26 | 4 Report Formatting | Added link to updated recommended general report template |
2024-05-04 | 4.1.1 Annotating Simple Output | New subsection for printing calculations in context when code is folded |
2024-04-30 | 10.7.3 Turning a Frequency Table Into Raw Data | New subsection on expanding a frequency table into raw data rows |
2024-04-21 | 4.5 Multi-Output Format Reports | New collapsed tab with considerations for collaborating with a Word user |
2024-04-07 | 4.8 Advanced Tables That Render to Both HTML and Word | New section of advanced tables that work with Word |
2024-04-04 | 4.7 CSS | More examples of colorizing text |
2024-04-03 | 10.7.1 Directly Creating a Melted Data Table | New data.table example: creating melted aggregate statistics |
2024-02-24 | 9.1.1 Descriptive Graphics for Continuous Variables | New examples using ggplot2 for spike histograms and ECDFs |
2024-02-18 | 18.1 Data Table Approach | New way to use data.table for simulations |
2024-02-11 | 10.4.1 Adding Variables Depending on Other Variables Being Added | New subsection showing how to add multiple new variables to a data table when the new variables depend on each other |
2024-02-10 | 9.4 Multiple Longitudinal Continuous Variables | New subsection on exploratory analysis of multiple longitudinal variables |
2024-01-07 | 5.7.1 Depositing Files on REDCap | New subsection with R code for automatic file deposit into REDCap file repository |
2024-01-07 | 5.2 Importing and Creating Annotated Analysis Files | Re-wrote REDCap API section for latest REDCap R API |
2024-01-02 | 5.7 Secure File Storage and Transmission | Showed how to mix interactive and batch processing so passwords will work |
2023-10-28 | 10.3 Adding Aggregate Statistics to Raw Data | New subsection showing how to add aggregate summaries to raw data |
2023-10-23 | 10 Data Manipulation and Aggregation | Added examples of data table containing lists |
2023-10-20 | 13.6 Linear Interpolation to a Vector of Times | New longitudinal example on linear interpolation/extrapolation on regularized measurement times |
2023-09-16 | 18.2 Array Approach | More array-style simulation examples |
2023-07-30 | 4.3 Quarto Built-in Syntax for Enhancing R Output | Mention tabsets, collapsible text, and tricks |
2023-07-10 | 10.4 Adding, Changing, and Removing Variables | Added data.table::setcolorder |
2023-07-19 | 3.9 Functions | Listed data.table set functions |
2023-07-12 | 5.7 Secure File Storage and Transmission | New section on protecting sensitive files |
2023-05-11 | 3.12 Resources for Learning R | Added info about learning by running scripts from GitHub |
2023-05-06 | 11.3 Customizing Tables of Summary Statistics | New section on customizing summary statistic tables using gt |
2023-04-28 | 14.2 R Graphics Devices, 14.3 ggplot2 | New section on graphical devices, added ggplot2 themes and fonts |
2023-04-20 | 13.4 Summarizing Multiple Baseline Measurements | New subsection on adding summary statistics to a longitudinal dataset |
2023-04-09 | 4.6.1 gt Package | New subsection on gt package |
2023-04-08 | 2.10 Univariate and Bivariate Descriptions, 9 Descriptive Statistics | Switched to new describe function output |
2023-04-02 | 4.10 Mixing Graphics and Tables | New section on mixing graphics and tables |
2023-04-02 | 1.2 Installing R and RStudio | New small subsection with links to installing R and RStudio |
2023-03-29 | 3 R Basics | Several new language features covered |
2023-03-29 | 13.5 Interpolation/Extrapolation to a Specific Time | New subsection on interpolating longitudinal data to a target time |
2023-03-26 | 18.1 Data Table Approach | New subsection showing simulation using lapply and rbindlist |
2023-03-26 | 14.3 ggplot2 | Example of plotting in a for-loop, and math expressions in caption |
2023-03-24 | 3.11 Interactively Writing and Debugging R Code | New section on interactive code writing |
2023-03-16 | 5.2 Importing and Creating Annotated Analysis Files | Added description of new features of cleanupREDCap |
2023-03-13 | 10 Data Manipulation and Aggregation | New example of data.table by-reference using a list of data tables |
2023-03-05 | 14.3 ggplot2 | Added ggplot2 ECDF example, with math rendering; added plotting of ECDFs with different transformations, and labeling with math notation |
2023-02-28 | 10.5 Recoding Variables | Examples added for combine.levels |
2023-02-26 | 5.2 Importing and Creating Annotated Analysis Files | New csv.get example, expanded Excel, added General tab which discusses the rio package |
2023-02-25 | 10.8 Computing Total Scale Scores in Presence of NAs | New section on computing total scores with simple imputation |
2023-02-24 | 10.7.2 Restructuring Multiple Independent Variables | New reshaping example |
2023-02-23 | 5.2 Importing and Creating Annotated Analysis Files | Added description of new features in importREDCap |
2023-02-18 | Many | Updated chapters to use Hmisc 5.0 and the pre-release of the new qreport package, dropping use of reptools and movStats from GitHub. Made use of new Hmisc easy labeling functions hlab, hlabs, vlab. |
2023-02-08 | 2 Case Study: The Titanic | Changed rendering of html for contents and describe in anticipation of Hmisc 4.8-0 |
2023-01-19 | 14.3 ggplot2 | Added how to plot transformed axes |
2023-01-16 | 14.3 ggplot2 | Added simpler way to pull labels and units for plotting |
2022-12-17 | 3.9.1 Character Manipulation Functions | New subsection on character manipulation functions |
2022-12-15 | 10.9 Text Analysis | New subsection on text analysis |
2022-12-11 | 4.9 Diagrams | New section on graphviz for diagrams |
2022-12-04 | 3.3 Dates and Time | New section on dates and date/times |
2022-12-03 | 14 Graphics, 14.1 Recommended Graphics by Data Types | Linked to hex binning example and added new section |
2022-12-03 | Replaced length(unique(x)) with uniqueN(x) everywhere | |
2022-11-29 | 12.2 Non-equi Joins: Closest Matches | New rolling join (closest match) example |
2022-11-22 | 10.4 Adding, Changing, and Removing Variables | Added let alias for := in data.table |
2022-11-09 | 5.2 Importing and Creating Annotated Analysis Files | Discussed multDataOverview function to summarize a list of datasets |
2022-11-07 | 10.5.1 Recoding From an Expression File | New section showing how to specify derived variable formulas in a separate file |
2022-11-05 | 5.6 Efficient Storage and Retrieval With qs | New section for qs package for object storage |
2022-11-05 | 5.2 Importing and Creating Annotated Analysis Files | Much new material on REDCap |
2022-11-05 | 10.4 Adding, Changing, and Removing Variables | Examples of in-place data.table changes of variables named in a separate vector |
2022-11-01 | 12.1 Lookup Participant Disposition | New subsection with example on looking up participant disposition for multiple clinical trials |
2022-10-24 | 12.2 Non-equi Joins: Closest Matches | New subsection on merging with closest matches |
2022-10-22 | 3.9.3 Conditional Function Definition Trick | New subsection on conditional function definitions |
2022-10-22 | 3.4 Logical Operators | New section on logical operators |
2022-10-22 | 3.6 Subscripting | Added more subscripting examples |
2022-10-17 | 10.10 Fast Lookup from Disk | Added direct retrieval fst example where row numbers are looked up |
2022-10-16 | 5.5 Efficient Storage and Retrieval With fst | Added fst package as alternative to saveRDS |
2022-09-24 | Preface | Link to YouTube video |
2022-09-21 | 14 Graphics, 3.12 Resources for Learning R | New links to Irizarry book |
2022-09-09 | 3.12 Resources for Learning R | New resources for learning R |
2022-08-28 | 4.7 CSS | New section on styling html with css |
2022-08-24 | 3.10 R Formula Language | New section on stat model formula language |
2022-08-19 | 14 Graphics | Added how to use group= in ggplot2 |
2022-08-15 | 3.2 Object Types | Added material about R object naming |
2022-08-15 | Preface | Added links to resources for learning Quarto |
2022-08-14 | 2 Case Study: The Titanic | Introduce the packages used (thanks to Tom Philips) |
2022-07-17 | 6 Missing Data, 8 Data Overview | Moved overall missing data summary to missChk |
2022-07-11 | 14 Graphics | New introductory text and references copied from BBR Chapter 4 |
2022-07-10 | 2 Case Study: The Titanic | New chapter with a case study of methods used in the book (thanks to Norm Matloff) |
2022-07-08 | 11.1 Summary Statistics Using Functions Returning Two-Dimensional Results | New section on using data.table with summarization functions that return two-dimensional results |
2022-07-07 | 15 Analysis, 15.7 Third-Order Descriptive Analysis | New chart about 1st, 2nd, 3rd order analysis; new section with example of 3rd order |
2022-07-06 | 13 Manipulation of Longitudinal Data, 10 Data Manipulation and Aggregation | Re-wrote intro to chapter, added LOCF example, added data table examples using %like% |
2022-07-05 | 10.4 Adding, Changing, and Removing Variables, 10 Data Manipulation and Aggregation | Renamed section and added more about removing columns; added link to data.table vignettes |
2022-07-04 | 10.6 Operations on Multiple Data Tables | New section on operations on multiple data tables |
2022-07-03 | 10 Data Manipulation and Aggregation | New diagram to explain data tables |
2022-06-30 | Preface | Better wording (thanks to Norm Matloff) |
2022-06-28 | Added Flow diagrams at the start of chapters (thanks to Norm Matloff) | |
2022-06-27 | 4.6 HTML Tables | New section on making html tables |
2022-06-27 | 3.9 Functions, 3.2 Object Types, 3.5 Missing Values, 4 Report Formatting | Added more basic R functions, arrays, NAs, and how to make knitr use plain text printing of objects such as data frames/tables |
2022-06-26 | Preface | Clarified goals and audience (thanks to Norm Matloff) |
2022-06-26 | Fixed various typographical errors (thanks to Titus von der Malsburg) | |
2022-06-15 | Published |