1 R
1.1 Background
Computer code shown throughout these notes is R
.1 R
is free
and is the
most widely used statistical software in the world. It has the best graphics,
statistical modeling, nonparametric methods, survival analysis,
clinical trials methods, and data manipulation capabilities. R
has
the most comprehensive genomics analysis packages and has advanced
capabilities for reproducible analysis and reporting. R
also has an
excellent graphical front-end RStudio
(rstudio.org) that
has the identical look and feel on all operating systems and via a web
browser. Part of
R
’s appeal is the thousands of add-on packages available (at
cran.r-project.org/web/packages), which
exist because it is easy to add to R
. Many of the add-on packages
are specialty packages for biomedical research including packages for
such widely diverse areas as
- interfacing
R
toREDCap
(2 packages) - interactive design of adaptive clinical trials
- analyzing accelerometer data
- flow cytometry
- genomics
- analyzing ICD9 codes and computing comorbidity indexes
- downloading all annotated NHANES datasets
- interfacing to clinicaltrials.gov
- detecting whether a dataset contains personal identifiers of human subjects
- analysis of early phase cardiovascular drug safety studies
The main
R
web site is www.r-project.org.
1.2 Learning R
Start with R Tutorials at www.r-bloggers.com/how-to-learn-r-2, R Programming Tutorials from Mike Marin at
www.youtube.com/user/marinstatlectures, or Fast Lane to Learning R at github.com//matloff/fasteR by
Norm Matloff. Or look at
swirlstats.com for an interactive way to learn R
. Those
who have used SPSS or SAS before will profit from R
for SAS and SPSS Users by Robert Muenchen. A current list of R
books on
amazon.com may be found at amzn.to/15URiF6.
stats.idre.ucla.edu/r/ and
www.introductoryr.co.uk/R_Resources_for_Beginners.html
are useful web sites.
An excellent resource is R
for Data Science by Grolemund and Wickham.
See also R
in Action, second ed. by Robert
Kabacoff. The online open source book on statistical modeling by Legler and Roback at
bookdown.org/roback/bookdown-bysh contains a lot of R
code.
Jenny Bryan’s STAT 545 Data Wrangling, Exploration, and Analysis with R
course at
stat545.com is an excellent resource.
stackoverflow.com/tags/r is the best place for asking
questions about the language and for learning from answers to past questions
asked (see also the R-help
email list).
Three of the best ways to learn how to analyze data in R
quickly are
- Avoid importing and manipulating data, instead using the
R
readRDS
function to load datasets that are already annotated and analysis-ready (see here for information about importing your own datasets) - Use example
R
scripts as analysis templates - Use
RStudio
(rstudio.org) to runR
On the first approach, theR
Hmisc
packages’sgetHdata
function finds datasets on the Vanderbilt BiostatisticsDataSets
wiki, downloads them, andload()
s them in yourR
session. These notes use only datasets available via this mechanism. These datasets are fully annotated with variable labels and units of measurements for many of the continuous variables. Concerning analysis scripts, Vanderbilt Biostatistics has collected template analysis scripts on github.com/harrelfe/rscripts and theR
Hmisc
package has a functiongetRs
to download these scripts and to automatically populate anRStudio
script editor window with the script. Many of the scripts are in RMarkdown format for use with theR
knitr
package to allow mixing of text andR
code to make reproducible reports.knitr
is described (TBD).
The RMarkdown scripts accessed through getRs
use a template that makes the
result part of
a reproducible research process by documenting the versions of R
and
attached packages at the end of the report. Some of the scripts make use of the
knitrSet
function in the Hmisc
package. When running
Rmarkdown, call knitrSet(lang='markdown')
. knitrSet
gets
rid of ## at
the start of R output lines, and makes it easy to specify things like
figure sizes in knitr chunk headers. It also causes annoying messages
such as those generated from attaching R
packages to be put in a
separate file messages.txt
rather than in the report.
1.3 Setting up R
Before running examples in these notes and R
markdown example
scripts, you need to do the following:
- Make sure your operating system is up to date enough to run the
most current version of
R
at www.r-project.org. For Mac you must have OS X Maverick or later. - Install
R
from www.r-project.org or upgrade your installation ofR
to the latest version. - Install
RStudio
from rstudio.org or update yourRStudio
to the latest version. - Run
RStudio
and get it to install the packages that allowRmarkdown
to run, by clicking on . Make sure that theknitr
package is installed. - Have
RStudio
install theHmisc
andrms
packages (which will makeRStudio
install several other packages). For packages you had installed previously, make sure you update them to have the latest versions. - Configure
RStudio
Tools ... Global Options
to match the images below
Here are some examples of how getRs
is used once you
load the Hmisc
package using a menu or by typing
require(Hmisc)
or library(Hmisc)
in the console.
require(Hmisc) # do this once per session (or library(Hmisc))
options(url.method='libcurl') # sometimes needed if using Windows
getRs() # list available scripts
getRs(browse='browser') # open scripts contents in your web browser
<- getRs() # store directory of scripts in an object that can easily
scripts # be viewed on demand in RStudio (right upper pane)
getRs('introda.r') # download introda.r and open in script editor
getRs(cats=TRUE) # list available major and minor categories
<- getRs(cats=TRUE) # store results in a list for later viewing
categories getRs(cats='reg') # list all scripts in a major category containing 'reg'
getRs('importREDCap.r', put='source') # source() to define a function
You can also point your browser to
github.com/harrelfe/rscripts/blob/master/contents.md to
see the available scripts and categories, and to be able to click on
links to see html
report output.
To get started using R
in RStudio
to create reproducible
annotated reports, finish the above configuration instructions and
type the following in the RStudio
console:
getRs('descriptives.Rmd')
. The above the script editor window
click on Knit HTML
.
1.4 Using R Markdown
See kbroman.org/knitr_knutshell/pages/Rmarkdown.html and
print the R Markdown
cheat sheet from
www.rstudio.com/resources/cheatsheets.
To make the code listing pretty, put this chunk at the top of your
report. echo=FALSE
suppresses this setup chunk from printing in
the report.
```{r setup,echo=FALSE}
require(Hmisc)
knitrSet('myreport', lang='markdown')
```
The argument 'myreport'
is replaced with a string to use as a
prefix to all the graphics file names, making each report in your
working directory use graphics file names that do not collide with
each other. For example if your report is called
acidity_analysis.Rmd
you might specify
knitrSet('acidity_analysis.Rmd', lang='markdown')
. There are
many other options to knitrSet
. A commonly used one is
width=n
to specify a line width for printing code of n
letters. The
default is 61. You can also specify echo,results
, and other
options. Type ?knitrSet
for help.
The R
knitr
package is used to run the markdown report and
insert graphics and text output into the report at appropriate slots.
It is best to specify a name for each chunk, and you must use unique
names. Each R
code chunk must begin exactly
with ```{r …} and the chunk name is the first set of
characters that appear after the space after r. Here are some
example chunk headers. Chunk names must not contain a space.
```{r descriptives}
```{r anova}
```{r anova-y1}
```{r anova_y1}
```{r acidity_plot}
```{r plot_residuals,top=1}
```{r plot_residuals,mfrow=c(2,2),left=1,top=1,rt=1,bot=1}
```{r plot-residuals,w=5,h=4}
Chunk options that were used above are:
Options | Description |
---|---|
top=1 |
Leave an extra line of space at top of graph for title |
mfrow=c(2,2) |
Use base graphics and put the next 4 plots into a single figure with 2 rows, 2 columns |
left=1,rt=1,bot=1 |
Leave one extra line for margin for left, right, bottom of figure |
w=5,h=4 |
Make the figure larger than the default that knitrSet uses (4 inch width by 3 inch height) |
Always having a chunk name also allows easy navigation of chunks by
clicking to the right of the green C
at the bottom of your
script. This will show the names of all chunks and you can click on
one to go there.
1.5 Debugging R
Code
When using RStudio
and knitr
as with RMarkdown
, it is
best to debug your commands a piece at a time. The fastest way to do
this is to go to some line inside your first chunk and click the green
C
just above and to the right of your script. Click on then on Run Next Chunk
. Shortcut keys for these
are Ctrl+Alt+C
and Ctrl+Alt+N
(Command+Option+C
and Command+Option+N
for Mac). You can also click on a
single line of code and run it by clicking on Run
.
Whenever you get a strange execution error it is sometimes helpful to
show the history of all the function calls leading to that error.
This is done by typing traceback()
at the command prompt.
1.6 Importing Other Datasets
Most of the work of getting some data sources ready for analysis involves
reshaping datasets from wide to tall and thin, recoding variables,
and merging multiple datasets. R
has first-class capabilities for
all of these tasks but this part of R
is harder to learn, partly
because there are so many ways to accomplish these tasks in R
.
Getting good variable names, variable labels, and value labels, can
also be tedious but is highly worth the time investment.
1.6.1 Stata and SPSS
If you have Stata or SPSS files that are
already shaped correctly and have variable labels and value
labels, the R
Hmisc
package’s stata.get
and
spss.get
functions will produce fully annotated
ready-to-analyze R
data frames.
1.6.2 REDCap
REDCap
exports data to R
, and Biostatistics has an R
function
to make the import process much easier. Here is an example showing
how to fetch and use the function. In this example, the user did not
provide the name of the file to import but rather let the function
find the last created REDCap export files in the current working directory.
require(Hmisc)
getRs('importREDCap.r', put='source') # source() code to define function
<- importREDCap() # by default operates on last downloaded export
mydata saveRDS(mydata, 'mydata.rds', compress='xz')
Advanced users can hook into
REDCap
dynamically with R
to avoid the need to export/import.
1.6.3 Spreadsheets
If you have a properly
formatted csv
file (e.g., exported from a spreadsheet), the
Hmisc
csv.get
function will read it, facilitate handling
of date variables, convert column names to legal R
names, and
save the original column names as variable labels.
Here is an example of importing a csv
file into R
. First of all
make sure your spreadsheet is a “spreadsheet from heaven” and not a “spreadsheet from hell” by reading
biostat.app.vumc.org/DataTransmissionProcedures.
Then use your spreadsheet software to export a single worksheet to
create a csv
file. Small csv
files may be pasted into your
R
script as is done in the following example, but in most cases you
will call csv.get
with an external file name as the first
argument.
# What is between data <- .. and ') is exactly like an external .csv file
<- textConnection('
data Age in Years,sex,race,visit date,m/s
23,m,w,10/21/2014,1.1
14,f,b,10/22/2014,1.3
,f,w,10/15/2014,1.7
')
require(Hmisc)
<- csv.get(data, lowernames=TRUE, datevars='visit.date',
d dateformat='%m/%d/%Y')
close(data)
# lowernames=TRUE: convert variable names to lower case
# Omit dateformat if dates are in YYYY-MM-DD format
contents(d)
Data frame:d 3 observations and 5 variables Maximum # NAs:1
Labels Class Storage NAs
age.in.years Age in Years integer integer 1
sex sex character character 0
race race character character 0
visit.date visit date Date double 0
m.s m/s numeric double 0
d
age.in.years sex race visit.date m.s
1 23 m w 2014-10-21 1.1
2 14 f b 2014-10-22 1.3
3 NA f w 2014-10-15 1.7
In the contents
output above you can see that the original column
names have been placed in the variable labels, and the new names have
periods in place of blanks or a slash, since these characters are
illegal in R
names.
You can have as the first argument to csv.get
not only a file
name but a URL to a file on the web. You can also specify delimiters
other than commas.
Also see the excellent tutorial on importing from Excel found at www.r-bloggers.com/r-tutorial-on-reading-and-importing-excel-files-into-r.
TheHmisc
upData
function may be used to rename variables
and provide variable and value labels and units of measurement.
Here is another example where there is a junk
variable to
delete after importing, and a categorical variable is coded as
integers and need to have value labels defined after
importing. We show how csv.get
automatically renamed one illegal
(to R
) variable name, how to redefine a variable label, and how to
define the value labels. Suppose that file test.csv
exists in
our project directory and has the following contents.
age,sys bp,sex,junk,state 23,140,male,1,1 42,131,female,2,1 45,127,female,3,2 37,141,male,4,2
Now import and modify the file.
require(Hmisc)
<- csv.get('test.csv')
d names(d) # show names after modification by csv.get
[1] "age" "sys.bp" "sex" "junk" "state"
contents(d) # show labels created by csv.get
Data frame:d 4 observations and 5 variables Maximum # NAs:0
Labels Class Storage
age age integer integer
sys.bp sys bp integer integer
sex sex character character
junk junk integer integer
state state integer integer
<- upData(d,
d state=factor(state, 1:2, c('Alabama','Alaska')),
rename=c(sys.bp='sbp'),
labels=c(age = 'Age',
sbp = 'Systolic Blood Pressure'),
drop='junk', # for > 1: drop=c('junk1','junk2',...)
units=c(sbp='mmHg'))
Input object size: 4040 bytes; 5 variables 4 observations
Renamed variable sys.bp to sbp
Modified variable state
Dropped variable junk
New object size: 3608 bytes; 4 variables 4 observations
contents(d)
Data frame:d 4 observations and 4 variables Maximum # NAs:0
Labels Units Levels Class Storage
age Age integer integer
sbp Systolic Blood Pressure mmHg integer integer
sex sex character character
state 2 integer
+--------+--------------+
|Variable|Levels |
+--------+--------------+
| state |Alabama,Alaska|
+--------+--------------+
describe(d)
d
4 Variables 4 Observations
--------------------------------------------------------------------------------
age : Age
n missing distinct Info Mean Gmd
4 0 4 1 36.75 11.83
Value 23 37 42 45
Frequency 1 1 1 1
Proportion 0.25 0.25 0.25 0.25
--------------------------------------------------------------------------------
sbp : Systolic Blood Pressure [mmHg]
n missing distinct Info Mean Gmd
4 0 4 1 134.8 8.5
Value 127 131 140 141
Frequency 1 1 1 1
Proportion 0.25 0.25 0.25 0.25
--------------------------------------------------------------------------------
sex
n missing distinct
4 0 2
Value female male
Frequency 2 2
Proportion 0.5 0.5
--------------------------------------------------------------------------------
state
n missing distinct
4 0 2
Value Alabama Alaska
Frequency 2 2
Proportion 0.5 0.5
--------------------------------------------------------------------------------
dim(d); nrow(d); ncol(d); length(d) # length is no. of variables
[1] 4 4
[1] 4
[1] 4
[1] 4
1.6.4 Defining Small Datasets Inline
For tiny datasets it is easiest to define them as follows:
<- data.frame(age=c(10,20,30), sex=c('male','female','male'),
d sbp=c(120,125,NA))
Large files may be stored in R
binary format using saveRDS(..., compress='xz')
, which creates an incredibly compact representation of the data in a file usually suffixed with .rds
. This allows extremely fast loading of the data frame in your next R
session using readRDS(...)
.
1.7 Suggestions for Initial Data Look
The datadensity
function in the Hmisc
package gives an
overall univariable graphical summary of all variables in the imported
dataset. The contents
and describe
functions are handy for
describing the variables, labels, number of NA
s, extreme values,
and other values.
1.8 Operating on Data Frames
One of the most common operations is subsetting. In the following example we subset on males older than 26.
<- subset(d, sex == 'male' & age > 26)
young.males # If you want to exclude rows that are missing on sex or age:
<- subset(d, sex == 'male' & age > 26 & ! is.na(sex) &
young.males ! is.na(age))
# f <- lrm(y ~ sex + age, data=subset(d, sex == 'male' & ...))
# f <- lrm(y ~ sex + age, data=d, subset=sex == 'male' & age > 26 ...)