1 Introduction
This book describes a workflow that I’ve found to be efficient in making reproducible research reports using R with Rmarkdown and now Quarto in data analysis projects. I start with a fairly complete case study of survival patterns of passengers on the Titanic that exemplifies many of the methods presented in the book. This is followed by chapters covering importing data, creating annotated analysis files, examining extent and patterns of missing data, and running descriptive statistics on them with goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and to produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process, several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs.
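As a small taste of the annotation and manipulation steps mentioned above, here is a minimal sketch, assuming the Hmisc and data.table packages are installed. The toy dataset and its variable names are hypothetical and invented for illustration; they are not taken from the book's case studies.

```r
library(Hmisc)       # label(), units(), contents()
library(data.table)  # data.table(), [ ... , by = ] aggregation

# Hypothetical toy data, invented for this illustration
d <- data.table(
  id  = 1:6,
  sex = c("female", "male", "female", "male", "female", "male"),
  age = c(34, 51, 28, 62, 45, 39)
)

# Attach a label and units of measurement to a variable
label(d$age) <- "Age"
units(d$age) <- "years"

# Show the data dictionary (variable names, labels, units, storage)
contents(d)

# Summarize by group with data.table; as.numeric() returns a plain
# numeric vector without the Hmisc attributes
s <- d[, .(n = .N, mean_age = mean(as.numeric(age))), by = sex]
s
```

The labels and units travel with the variable as attributes, so later summaries and plots that understand them can use "Age" and "years" instead of the bare name age.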
1.1 R Code Repositories Used in This Book
This book makes heavy use of the following R packages and GitHub repository:
- Hmisc package, which contains functions for importing data, data annotation, summary statistics, statistical graphics, advanced table making, etc. Some new Hmisc functions are used, especially
  - addggLayers for adding extended box plots and spike histograms to ggplot2 plots, especially when run on the output of meltData
  - meltData to melt a data table according to a formula, with optional substitution of variable labels for variable names
  - seqFreq for creating a factor variable with categories in descending order of sequential frequencies of conditions (as used in computing study exclusion counts)
  - hashCheck for checking if parent objects have changed so a slow analysis has to be re-run (i.e., taking control of caching)
  - runifChanged which uses hashCheck to automatically re-run an analysis if needed, otherwise to retrieve previous results efficiently
  - movStats for computing summary statistics by moving overlapping windows of a continuous variable, or simply stratified by a categorical variable
- qreport package, a new R package available on CRAN for facilitating composition of Quarto reports, books, and web sites. Some of the qreport functions used here are
  - addCap, printCap for adding captions to a list of figures and for printing the list
  - dataChk for data checking
  - dataOverview for a dataset overview
  - htmlList to easily print vectors in a named list using kable
  - htmlView, htmlViewx for viewing data dictionaries/metadata in browser windows
  - kabl to make it easy to use kable and kables for making html tables
  - maketabs to automatically make multiple tabs in Quarto reports, each tab holding the output of one or more R commands
  - makecolmarg to print an object in the right margin in Quarto reports
  - makecnote to print an object in a collapsible Quarto note
  - makecallout, a generic Quarto callout maker called by makecolmarg, makecnote
  - makecodechunk
  - makemermaid to make Quarto mermaid diagrams with insertion of variable values
  - makegraphviz does likewise for graphviz diagrams
  - scplot for putting graphs in separate chunks with captions in TOC
  - vClus for variable clustering
  - aePlot for making an interactive plotly dot chart of adverse event proportions by treatment
- data.table package for data storage, retrieval, manipulation, munging, aggregation, merging, and reshaping
- haven package for importing datasets from statistical packages
- rio package for one-stop importing of a wide variety of file types
- ggplot2 package for static graphics
- gt package for a comprehensive and flexible approach to making tables
- consort package for consort diagrams showing observation filtering
- plotly package for interactive graphics
- rms package for statistical modeling, validation, and presentation
- knitr package for running reproducible reports, and also providing kable and kables functions for simple html table printing
- grid and gridExtra packages for converting tables to graphs (Section 4.8)
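Before working through the examples, it may help to check which of the packages listed above are already installed. The following is a minimal sketch, not a prescribed setup procedure: the vector name needed is my own, grid is left out because it ships with R, and the install.packages call is left commented out so that nothing is installed without your say-so.

```r
# Add-on packages named in the list above (grid ships with R itself)
needed <- c("Hmisc", "qreport", "data.table", "haven", "rio", "ggplot2",
            "gt", "consort", "plotly", "rms", "knitr", "gridExtra")

# requireNamespace() returns FALSE for packages that cannot be loaded
is_available <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_available]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install the missing packages from CRAN:
  # install.packages(missing_pkgs)
} else {
  message("All listed packages are available.")
}
```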
1.2 Installing R and RStudio
- Watch this video from Negoita’s R for Ecology Course
- Installing R and RStudio
- Installing R on Your Machine