1 R

1.1 Background

Computer code shown throughout these notes is R.1 R is free and is the most widely used statistical software in the world. It has the best graphics, statistical modeling, nonparametric methods, survival analysis, clinical trials methods, and data manipulation capabilities. R has the most comprehensive genomics analysis packages and has advanced capabilities for reproducible analysis and reporting. R also has an excellent graphical front-end RStudio (rstudio.org) that has the identical look and feel on all operating systems and via a web browser. Part of R’s appeal is the thousands of add-on packages available (at cran.r-project.org/web/packages), which exist because it is easy to add to R. Many of the add-on packages are specialty packages for biomedical research including packages for such widely diverse areas as

  • interfacing R to REDCap (2 packages)
  • interactive design of adaptive clinical trials
  • analyzing accelerometer data
  • flow cytometry
  • genomics
  • analyzing ICD9 codes and computing comorbidity indexes
  • downloading all annotated NHANES datasets
  • interfacing to clinicaltrials.gov
  • detecting whether a dataset contains personal identifiers of human subjects
  • analysis of early phase cardiovascular drug safety studies The main R web site is www.r-project.org.

1.2 Learning R

Start with R Tutorials at www.r-bloggers.com/how-to-learn-r-2, R Programming Tutorials from Mike Marin at www.youtube.com/user/marinstatlectures, or Fast Lane to Learning R at github.com//matloff/fasteR by Norm Matloff. Or look at swirlstats.com for an interactive way to learn R. Those who have used SPSS or SAS before will profit from R for SAS and SPSS Users by Robert Muenchen. A current list of R books on amazon.com may be found at amzn.to/15URiF6. stats.idre.ucla.edu/r/ and www.introductoryr.co.uk/R_Resources_for_Beginners.html are useful web sites. An excellent resource is R for Data Science by Grolemund and Wickham. See also R in Action, second ed. by Robert Kabacoff. The online open source book on statistical modeling by Legler and Roback at bookdown.org/roback/bookdown-bysh contains a lot of R code. Jenny Bryan’s STAT 545 Data Wrangling, Exploration, and Analysis with R course at stat545.com is an excellent resource. stackoverflow.com/tags/r is the best place for asking questions about the language and for learning from answers to past questions asked (see also the R-help email list).

Three of the best ways to learn how to analyze data in R quickly are

  1. Avoid importing and manipulating data, instead using the R readRDS function to load datasets that are already annotated and analysis-ready (see here for information about importing your own datasets)
  2. Use example R scripts as analysis templates
  3. Use RStudio (rstudio.org) to run R On the first approach, the R Hmisc packages’s getHdata function finds datasets on the Vanderbilt Biostatistics DataSets wiki, downloads them, and load()s them in your R session. These notes use only datasets available via this mechanism. These datasets are fully annotated with variable labels and units of measurements for many of the continuous variables. Concerning analysis scripts, Vanderbilt Biostatistics has collected template analysis scripts on github.com/harrelfe/rscripts and the R Hmisc package has a function getRs to download these scripts and to automatically populate an RStudio script editor window with the script. Many of the scripts are in RMarkdown format for use with the R knitr package to allow mixing of text and R code to make reproducible reports. knitr is described (TBD).

The RMarkdown scripts accessed through getRs use a template that makes the result part of a reproducible research process by documenting the versions of R and attached packages at the end of the report. Some of the scripts make use of the knitrSet function in the Hmisc package. When running Rmarkdown, call knitrSet(lang='markdown'). knitrSet gets rid of ## at the start of R output lines, and makes it easy to specify things like figure sizes in knitr chunk headers. It also causes annoying messages such as those generated from attaching R packages to be put in a separate file messages.txt rather than in the report.

1.3 Setting up R

Before running examples in these notes and R markdown example scripts, you need to do the following:

  1. Make sure your operating system is up to date enough to run the most current version of R at www.r-project.org. For Mac you must have OS X Maverick or later.
  2. Install R from www.r-project.org or upgrade your installation of R to the latest version.
  3. Install RStudio from rstudio.org or update your RStudio to the latest version.
  4. Run RStudio and get it to install the packages that allow Rmarkdown to run, by clicking on . Make sure that the knitr package is installed.
  5. Have RStudio install the Hmisc and rms packages (which will make RStudio install several other packages). For packages you had installed previously, make sure you update them to have the latest versions.
  6. Configure RStudio Tools ... Global Options to match the images below

Here are some examples of how getRs is used once you load the Hmisc package using a menu or by typing require(Hmisc) or library(Hmisc) in the console.

require(Hmisc)      # do this once per session (or library(Hmisc))
options(url.method='libcurl')    # sometimes needed if using Windows
getRs()             # list available scripts
getRs(browse='browser')  # open scripts contents in your web browser
scripts <- getRs()  # store directory of scripts in an object that can easily
                    # be viewed on demand in RStudio (right upper pane)
getRs('introda.r')  # download introda.r and open in script editor
getRs(cats=TRUE)    # list available major and minor categories
categories <- getRs(cats=TRUE)  # store results in a list for later viewing
getRs(cats='reg')   # list all scripts in a major category containing 'reg'
getRs('importREDCap.r', put='source')   # source() to define a function

You can also point your browser to github.com/harrelfe/rscripts/blob/master/contents.md to see the available scripts and categories, and to be able to click on links to see html report output.

To get started using R in RStudio to create reproducible annotated reports, finish the above configuration instructions and type the following in the RStudio console: getRs('descriptives.Rmd'). The above the script editor window click on Knit HTML.

1.4 Using R Markdown

See kbroman.org/knitr_knutshell/pages/Rmarkdown.html and print the R Markdown cheat sheet from www.rstudio.com/resources/cheatsheets.

To make the code listing pretty, put this chunk at the top of your report. echo=FALSE suppresses this setup chunk from printing in the report.

```{r setup,echo=FALSE}
knitrSet('myreport', lang='markdown')

The argument 'myreport' is replaced with a string to use as a prefix to all the graphics file names, making each report in your working directory use graphics file names that do not collide with each other. For example if your report is called acidity_analysis.Rmd you might specify knitrSet('acidity_analysis.Rmd', lang='markdown'). There are many other options to knitrSet. A commonly used one is width=n to specify a line width for printing code of n letters. The default is 61. You can also specify echo,results, and other options. Type ?knitrSet for help.

The R knitr package is used to run the markdown report and insert graphics and text output into the report at appropriate slots. It is best to specify a name for each chunk, and you must use unique names. Each R code chunk must begin exactly with ```{r …} and the chunk name is the first set of characters that appear after the space after r. Here are some example chunk headers. Chunk names must not contain a space.

```{r descriptives} ```{r anova} ```{r anova-y1} ```{r anova_y1} ```{r acidity_plot} ```{r plot_residuals,top=1} ```{r plot_residuals,mfrow=c(2,2),left=1,top=1,rt=1,bot=1} ```{r plot-residuals,w=5,h=4}

Chunk options that were used above are:

Options Description
top=1 Leave an extra line of space at top of graph for title
mfrow=c(2,2) Use base graphics and put the next 4 plots into a single figure with 2 rows, 2 columns
left=1,rt=1,bot=1 Leave one extra line for margin for left, right, bottom of figure
w=5,h=4 Make the figure larger than the default that knitrSet uses (4 inch width by 3 inch height)

Always having a chunk name also allows easy navigation of chunks by clicking to the right of the green C at the bottom of your script. This will show the names of all chunks and you can click on one to go there.

1.5 Debugging R Code

When using RStudio and knitr as with RMarkdown, it is best to debug your commands a piece at a time. The fastest way to do this is to go to some line inside your first chunk and click the green C just above and to the right of your script. Click on then on Run Next Chunk. Shortcut keys for these are Ctrl+Alt+C and Ctrl+Alt+N (Command+Option+C and Command+Option+N for Mac). You can also click on a single line of code and run it by clicking on Run.

Whenever you get a strange execution error it is sometimes helpful to show the history of all the function calls leading to that error. This is done by typing traceback() at the command prompt.

1.6 Importing Other Datasets

Most of the work of getting some data sources ready for analysis involves reshaping datasets from wide to tall and thin, recoding variables, and merging multiple datasets. R has first-class capabilities for all of these tasks but this part of R is harder to learn, partly because there are so many ways to accomplish these tasks in R. Getting good variable names, variable labels, and value labels, can also be tedious but is highly worth the time investment.

1.6.1 Stata and SPSS

If you have Stata or SPSS files that are already shaped correctly and have variable labels and value labels, the R Hmisc package’s stata.get and spss.get functions will produce fully annotated ready-to-analyze R data frames.

1.6.2 REDCap

REDCap exports data to R, and Biostatistics has an R function to make the import process much easier. Here is an example showing how to fetch and use the function. In this example, the user did not provide the name of the file to import but rather let the function find the last created REDCap export files in the current working directory.

getRs('importREDCap.r', put='source')  # source() code to define function
mydata <- importREDCap()  # by default operates on last downloaded export
saveRDS(mydata, 'mydata.rds', compress='xz')

Advanced users can hook into REDCap dynamically with R to avoid the need to export/import.

1.6.3 Spreadsheets

If you have a properly formatted csv file (e.g., exported from a spreadsheet), the Hmisc csv.get function will read it, facilitate handling of date variables, convert column names to legal R names, and save the original column names as variable labels.

Here is an example of importing a csv file into R. First of all make sure your spreadsheet is a “spreadsheet from heaven” and not a “spreadsheet from hell” by reading biostat.app.vumc.org/DataTransmissionProcedures. Then use your spreadsheet software to export a single worksheet to create a csv file. Small csv files may be pasted into your R script as is done in the following example, but in most cases you will call csv.get with an external file name as the first argument.

# What is between data <- .. and ') is exactly like an external .csv file
data <- textConnection('
Age in Years,sex,race,visit date,m/s
d <- csv.get(data, lowernames=TRUE, datevars='visit.date',
# lowernames=TRUE: convert variable names to lower case
# Omit dateformat if dates are in YYYY-MM-DD format

Data frame:d    3 observations and 5 variables    Maximum # NAs:1

                   Labels     Class   Storage NAs
age.in.years Age in Years   integer   integer   1
sex                   sex character character   0
race                 race character character   0
visit.date     visit date      Date    double   0
m.s                   m/s   numeric    double   0
  age.in.years sex race visit.date m.s
1           23   m    w 2014-10-21 1.1
2           14   f    b 2014-10-22 1.3
3           NA   f    w 2014-10-15 1.7

In the contents output above you can see that the original column names have been placed in the variable labels, and the new names have periods in place of blanks or a slash, since these characters are illegal in R names.

You can have as the first argument to csv.get not only a file name but a URL to a file on the web. You can also specify delimiters other than commas.

Also see the excellent tutorial on importing from Excel found at www.r-bloggers.com/r-tutorial-on-reading-and-importing-excel-files-into-r.

The Hmisc upData function may be used to rename variables and provide variable and value labels and units of measurement. Here is another example where there is a junk variable to delete after importing, and a categorical variable is coded as integers and need to have value labels defined after importing. We show how csv.get automatically renamed one illegal (to R) variable name, how to redefine a variable label, and how to define the value labels. Suppose that file test.csv exists in our project directory and has the following contents.
age,sys bp,sex,junk,state

Now import and modify the file.

d <- csv.get('test.csv')
names(d)   # show names after modification by csv.get
[1] "age"    "sys.bp" "sex"    "junk"   "state" 
contents(d)  # show labels created by csv.get

Data frame:d    4 observations and 5 variables    Maximum # NAs:0

       Labels     Class   Storage
age       age   integer   integer
sys.bp sys bp   integer   integer
sex       sex character character
junk     junk   integer   integer
state   state   integer   integer
d <- upData(d,
            state=factor(state, 1:2, c('Alabama','Alaska')),
            labels=c(age = 'Age',
                     sbp = 'Systolic Blood Pressure'),
            drop='junk',   # for > 1: drop=c('junk1','junk2',...)
Input object size:   4040 bytes;     5 variables     4 observations
Renamed variable     sys.bp     to sbp 
Modified variable   state
Dropped variable    junk
New object size:    3608 bytes; 4 variables 4 observations

Data frame:d    4 observations and 4 variables    Maximum # NAs:0

                       Labels Units Levels     Class   Storage
age                       Age                integer   integer
sbp   Systolic Blood Pressure  mmHg          integer   integer
sex                       sex              character character
state                                    2             integer

|Variable|Levels        |
|  state |Alabama,Alaska|

 4  Variables      4  Observations
age : Age 
       n  missing distinct     Info     Mean      Gmd 
       4        0        4        1    36.75    11.83 
Value        23   37   42   45
Frequency     1    1    1    1
Proportion 0.25 0.25 0.25 0.25
sbp : Systolic Blood Pressure [mmHg] 
       n  missing distinct     Info     Mean      Gmd 
       4        0        4        1    134.8      8.5 
Value       127  131  140  141
Frequency     1    1    1    1
Proportion 0.25 0.25 0.25 0.25
       n  missing distinct 
       4        0        2 
Value      female   male
Frequency       2      2
Proportion    0.5    0.5
       n  missing distinct 
       4        0        2 
Value      Alabama  Alaska
Frequency        2       2
Proportion     0.5     0.5
dim(d); nrow(d); ncol(d); length(d)  # length is no. of variables
[1] 4 4
[1] 4
[1] 4
[1] 4

1.6.4 Defining Small Datasets Inline

For tiny datasets it is easiest to define them as follows:

d <- data.frame(age=c(10,20,30), sex=c('male','female','male'),

Large files may be stored in R binary format using saveRDS(..., compress='xz'), which creates an incredibly compact representation of the data in a file usually suffixed with .rds. This allows extremely fast loading of the data frame in your next R session using readRDS(...).

1.7 Suggestions for Initial Data Look

The datadensity function in the Hmisc package gives an overall univariable graphical summary of all variables in the imported dataset. The contents and describe functions are handy for describing the variables, labels, number of NAs, extreme values, and other values.

1.8 Operating on Data Frames

One of the most common operations is subsetting. In the following example we subset on males older than 26.

young.males <- subset(d, sex == 'male' & age > 26)
# If you want to exclude rows that are missing on sex or age:
young.males <- subset(d, sex == 'male' & age > 26 & ! is.na(sex) &
                        ! is.na(age))
# f <- lrm(y ~ sex + age, data=subset(d, sex == 'male' & ...))
# f <- lrm(y ~ sex + age, data=d, subset=sex == 'male' & age > 26 ...)
R Development Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; www.r-project.org; 2020. http://www.R-project.org.