1 Introduction
This book describes a workflow that I’ve found to be efficient in making reproducible research reports and books using R with Rmarkdown and now Quarto in data analysis projects. I start with a fairly complete case study of survival patterns of passengers on the Titanic that exemplifies many of the methods presented in the book. This is followed by chapters covering importing data, creating annotated analysis files, examining the extent and patterns of missing data, and running descriptive statistics on the data with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, to show metadata/data dictionaries, and to produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed and some flexible bivariate and 3-variable analysis methods are presented with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented. In the process several useful report writing methods are exemplified, including program-controlled creation of multiple report tabs.
1.1 R Code Repositories Used in This Book
This report makes heavy use of the following R packages and GitHub repository:
- Hmisc package which contains functions for importing data, data annotation, summary statistics, statistical graphics, advanced table making, etc. Some new Hmisc functions are used, especially
  - addggLayers for adding extended box plots and spike histograms to ggplot2 plots, especially when run on the output of meltData
  - meltData to melt a data table according to a formula, with optional substitution of variable labels for variable names
  - seqFreq for creating a factor variable with categories in descending order of sequential frequencies of conditions (as used in computing study exclusion counts)
  - hashCheck for checking if parent objects have changed so a slow analysis has to be re-run (i.e., taking control of caching)
  - runifChanged which uses hashCheck to automatically re-run an analysis if needed, otherwise to retrieve previous results efficiently
  - movStats for computing summary statistics by moving overlapping windows of a continuous variable, or simply stratified by a categorical variable
- qreport package, a new R package available on CRAN for facilitating composition of Quarto reports, books, and web sites. Some of the qreport functions used here are
  - addCap, printCap for adding captions to a list of figures and for printing the list
  - dataChk for data checking
  - dataOverview for a dataset overview
  - htmlList to easily print vectors in a named list using kable
  - htmlView, htmlViewx for viewing data dictionaries/metadata in browser windows
  - kabl to make it easy to use kable and kables for making html tables
  - maketabs to automatically make multiple tabs in Quarto reports, each tab holding the output of one or more R commands
  - makecolmarg to print an object in the right margin in Quarto reports
  - makecnote to print an object in a collapsible Quarto note
  - makecallout, a generic Quarto callout maker called by makecolmarg, makecnote
  - makecodechunk
  - makemermaid to make Quarto mermaid diagrams with insertion of variable values
  - makegraphviz does likewise for graphviz diagrams
  - scplot for putting graphs in separate chunks with captions in TOC
  - vClus for variable clustering
  - aePlot for making an interactive plotly dot chart of adverse event proportions by treatment
- data.table package for data storage, retrieval, manipulation, munging, aggregation, merging, and reshaping
- haven package for importing datasets from statistical packages
- rio package for one-stop importing of a wide variety of file types
- ggplot2 package for static graphics
- gt package for a comprehensive and flexible approach to making tables
- consort package for consort diagrams showing observation filtering
- plotly package for interactive graphics
- rms package for statistical modeling, validation, and presentation
- knitr package for running reproducible reports, and also providing kable and kables functions for simple html table printing
- grid and gridExtra packages for converting tables to graphs (Section 4.10)
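
The caching idea behind hashCheck and runifChanged in the list above (re-run a slow analysis only when its inputs have changed, otherwise retrieve the stored result) can be illustrated with a minimal base-R sketch. The helper names `fingerprint` and `run_if_changed` are hypothetical and written only for this illustration. They are not the Hmisc functions and do not reproduce their arguments or behavior.

```r
# Hypothetical helpers for illustration only; these are NOT part of Hmisc.

# Fingerprint a set of objects by serializing them to a temporary file
# and taking the file's MD5 checksum (base R only).
fingerprint <- function(...) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  saveRDS(list(...), tmp, compress = FALSE)
  unname(tools::md5sum(tmp))
}

# Run fun(...) only if the function or its inputs differ from the cached run;
# otherwise return the result stored in cache_file.
run_if_changed <- function(fun, ..., cache_file) {
  key <- fingerprint(deparse(fun), ...)
  if (file.exists(cache_file)) {
    cached <- readRDS(cache_file)
    if (identical(cached$key, key)) return(cached$result)  # inputs unchanged
  }
  result <- fun(...)
  saveRDS(list(key = key, result = result), cache_file)
  result
}

# Usage: count how many times the "slow" function is actually executed
calls <- 0
slow_mean <- function(x) { calls <<- calls + 1; mean(x) }
cache <- tempfile(fileext = ".rds")

m1 <- run_if_changed(slow_mean, c(1, 2, 3), cache_file = cache)  # computed
m2 <- run_if_changed(slow_mean, c(1, 2, 3), cache_file = cache)  # read from cache
m3 <- run_if_changed(slow_mean, c(1, 2, 4), cache_file = cache)  # input changed, recomputed
```

The function's source is included in the fingerprint, so editing the analysis code also invalidates the cache. The sketch keeps one cached result per file, which is enough to show the principle.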
1.2 Installing R and RStudio
- Watch this video from Negoita’s R for Ecology Course
- Installing R and RStudio
- Installing R on Your Machine