9 Introduction to the R
rms
Package: The Linear Model
Some of the purposes of the rms
package are to
- make everyday statistical modeling easier to do
- make modern statistical methods easy to incorporate into everyday work
- make it easy to use the bootstrap to validate models
- provide “model presentation graphics”
9.1 Formula Language and Fitting Function
- Statistical formula in
R
:B
~ x1 + x2 + x3 y
y
is modeled as \(\alpha + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3} x_{3}\)
y
is the dependent variable/response/outcome,x
’s are predictors (independent variables)- The formula is the first argument to a fitting
function (just as it is the first argument to a
trellis
graphics function) rms
(regression modeling strategies) package Harrell (2020) makes many aspects of regression modeling and graphical display of model results easier to dorms
does a lot of bookkeeping to remember details about the design matrix for the model and to use these details in making automatic hypothesis tests, estimates, and plots. The design matrix is the matrix of independent variables after coding them numerically and adding nonlinear and product terms if needed.rms
package fitting function for ordinary least squares regression (what is often called the linear model or multiple linear regression):ols
- Example: C
<- ols(y ~ age + sys.bp, data=mydata) f
age
andsys.bp
are the two predictors (independent variables) assumed to have linear and additive effects (do not interact or have synergism)mydata
is anR
data frame containing at least three columns for the model’s variablesf
(the fit object) is anR
list object, containing coefficients, variances, and many other quantities- Below, the fit object will be
f
throughout. In practice, use any legalR
name, e.g.fit.full.model
9.2 Operating on the Fit Object
- Regression coefficient estimates may be obtained by any of the methods listed below
$coefficients
f$coef # abbreviation
fcoef(f) # use the coef extractor function
coef(f)[1] # get intercept
$coef[2] # get 2nd coefficient (1st slope)
f$coef['age'] # get coefficient of age
fcoef(f)['age'] # ditto
- But often we use methods which do something more interesting with the model fit.
print(f)
: print coefficients, standard errors, \(t\)-test, other statistics; can also just typef
to printfitted(f)
: compute \(\hat{y}\)predict(f, newdata)
: get predicted values, for subjects described in data framenewdata
1- r <- resid(f): compute the vector of \(n\) residuals
(here, store it in
r
) formula(f)
: print the regression formula fittedanova(f)
: print ANOVA table for all total and partial effectssummary(f)
: print estimates partial effects using meaningful changes in predictorsPredict(f)
: compute predicted values varying a few predictors at a time (convenient for plotting)ggplot(p)
: plot partial effects, with predictor ranging over the \(x\)-axis, wherep
is the result ofPredict
g <- Function(f)
: create anR
function that evaluates the analytic form of the fitted functionnomogram(f)
: draw a nomogram of the model
Note: Some of othe functions can output html
if options(prType='html')
is set, as is done in this chapter. When using that facility you often need to add results='asis'
in the knitr
chunk header.
9.3 The rms
datadist
Function
To use Predict, summary
, or nomogram
in the rms
package, you need to let rms
first compute summaries of the
distributional characteristics of the predictors:
<- datadist(x1,x2,x3,...) # generic form
dd <- datadist(age, sys.bp, sex)
dd <- datadist(mydataframe) # for a whole data frame
dd options(datadist='dd') # let rms know where to find
Note that the name dd
can be any name you choose as long as you
use the same name in quotes to options
that you specify
(unquoted) to the left of <- datadist(...)
. It is best to
invoke datadist
early in your program before fitting any
models. That way the datadist
information is stored in the fit
object so the model is self-contained. That allows you to make
plots in later sessions without worrying about datadist
. datadist
must be re-run if you add a new predictor or recode an
old one. You can update it using for example
<- datadist(dd, cholesterol, height)
dd # Adds or replaces cholesterol, height summary stats in dd
9.4 Short Example
Consider the lead exposure dataset from B. Rosner Fundamentals of Biostatistics and originally from Landrigan PJ et al, Lancet 1:708-715, March 29, 1975. The study was of psychological and neurologic well-being of children who lived near a lead smelting plant. The primary outcome measures are the Wechsler full-scale IQ score (
iqf
) and the finger-wrist tapping score maxfwt
. The
dataset is available at hbiostat.org/data
and can be automatically downloaded and load()
’d into R
using
the Hmisc
package getHdata
function. For now we just
analyze lead exposure levels in 1972 and 1973, age, and
maxfwt
2.
8-rmsintro.r
contained in the .zip
file using RStudio
. Commands listed in previous sections were not actually executed so they are marked with the R
comment symbol
(#) and can be ignored.
# For an Rmarkdown version of similar analyses see
# https://github.com/harrelfe/rscripts/raw/master/lead-ols.md
require(rms) # also loads the Hmisc package
getHdata(lead)
# Subset variables just so contents() and describe() output is short
# Override units of measurement to make them legal R expressions
<- upData(lead,
lead keep=c('ld72', 'ld73', 'age', 'maxfwt'),
labels=c(age='Age'),
units=c(age='years', ld72='mg/100*ml', ld73='mg/100*ml'))
Input object size: 53928 bytes; 39 variables 124 observations
Kept variables ld72,ld73,age,maxfwt
New object size: 14096 bytes; 4 variables 124 observations
html(contents(lead))
Data frame:lead
124 observations and 4 variables, maximum # NAs:25Name | Labels | Units | Storage | NAs |
---|---|---|---|---|
age | Age | years | double | 0 |
ld72 | Blood Lead Levels, 1972 | mg/100*ml | integer | 0 |
ld73 | Blood Lead Levels, 1973 | mg/100*ml | integer | 0 |
maxfwt | Maximum mean finger-wrist tapping score | integer | 25 |
html(describe(lead))
4 Variables 124 Observations
age: Age years
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 73 | 1 | 8.935 | 4.074 | 3.929 | 4.333 | 6.167 | 8.375 | 12.021 | 14.000 | 15.000 |
lowest : | 3.750000 | 3.833333 | 3.916667 | 4.000000 | 4.166667 |
highest: | 14.250000 | 15.000000 | 15.250000 | 15.416667 | 15.916667 |
ld72: Blood Lead Levels, 1972 mg/100*ml
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 47 | 0.999 | 36.16 | 17.23 | 18.00 | 21.00 | 27.00 | 34.00 | 43.00 | 57.00 | 61.85 |
ld73: Blood Lead Levels, 1973 mg/100*ml
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
124 | 0 | 37 | 0.998 | 31.71 | 11.06 | 18.15 | 21.00 | 24.00 | 30.50 | 37.00 | 47.00 | 50.85 |
maxfwt: Maximum mean finger-wrist tapping score
n | missing | distinct | Info | Mean | Gmd | .05 | .10 | .25 | .50 | .75 | .90 | .95 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
99 | 25 | 40 | 0.998 | 51.96 | 13.8 | 33.2 | 38.0 | 46.0 | 52.0 | 59.0 | 65.0 | 72.2 |
<- datadist(lead); options(datadist='dd')
dd # show what datadist computed dd
age ld72 ld73 maxfwt
Low:effect 6.166667 27.00 24.00 46.0
Adjust to 8.375000 34.00 30.50 52.0
High:effect 12.020833 43.00 37.00 59.0
Low:prediction 3.929167 18.00 18.15 33.2
High:prediction 15.000000 61.85 50.85 72.2
Low 3.750000 1.00 15.00 13.0
High 15.916667 99.00 58.00 84.0
# Fit an ordinary linear regression model with 3 predictors assumed linear
<- ols(maxfwt ~ age + ld72 + ld73, data=lead) f
# same as print(f) f
Linear Regression Model
ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)Frequencies of Missing Values Due to Each Variable
maxfwt age ld72 ld73 25 0 0 0
Model Likelihood Ratio Test |
Discrimination Indexes |
|
---|---|---|
Obs 99 | LR χ2 62.25 | R2 0.467 |
σ 9.5221 | d.f. 3 | R2adj 0.450 |
d.f. 95 | Pr(>χ2) 0.0000 | g 10.104 |
Residuals
Min 1Q Median 3Q Max -33.9958 -4.9214 0.7596 5.1106 33.2590
β | S.E. | t | Pr(>|t|) | |
---|---|---|---|---|
Intercept | 34.1059 | 4.8438 | 7.04 | <0.0001 |
age | 2.6078 | 0.3231 | 8.07 | <0.0001 |
ld72 | -0.0246 | 0.0782 | -0.31 | 0.7538 |
ld73 | -0.2390 | 0.1325 | -1.80 | 0.0744 |
coef(f) # retrieve coefficients
Intercept age ld72 ld73
34.1058551 2.6078450 -0.0245978 -0.2389695
specs(f, long=TRUE) # show how parameters are assigned to predictors,
ols(formula = maxfwt ~ age + ld72 + ld73, data = lead)
Units Label Assumption Parameters d.f.
age years Age asis 1
ld72 mg/100*ml Blood Lead Levels, 1972 asis 1
ld73 mg/100*ml Blood Lead Levels, 1973 asis 1
age ld72 ld73
Low:effect 6.166667 27.00 24.00
Adjust to 8.375000 34.00 30.50
High:effect 12.020833 43.00 37.00
Low:prediction 3.929167 18.00 18.15
High:prediction 15.000000 61.85 50.85
Low 3.750000 1.00 15.00
High 15.916667 99.00 58.00
# and predictor distribution summaries driving plots
<- Function(f) # create an R function that represents the fitted model
g # Note that the default values for g's arguments are medians
g
function (age = 8.375, ld72 = 34, ld73 = 30.5)
{
34.105855 + 2.607845 * age - 0.024597799 * ld72 - 0.23896951 *
ld73
}
<environment: 0x55d32b7af780>
# Estimate mean maxfwt at age 10, .1 quantiles of ld72, ld73 and .9 quantile of ld73
# keeping ld72 at .1 quantile
g(age=10, ld72=21, ld73=c(21, 47)) # more exposure in 1973 decreased y by 6
[1] 54.64939 48.43618
# Get the same estimates another way but also get std. errors
predict(f, data.frame(age=10, ld72=21, ld73=c(21, 47)), se.fit=TRUE)
$linear.predictors
1 2
54.64939 48.43618
$se.fit
1 2
1.391858 3.140361
9.5 Operating on Residuals
Residuals may be summarized and plotted just like any raw data variable.
- To plot residuals vs. each predictor, and to make a q-q plot to check normality of residuals, use these examples:
<- resid(f)
r par(mfrow=c(2,2)) # 2x2 matrix of plots
plot(fitted(f), r); abline(h=0) # yhat vs. r
with(lead, plot(age, r)); abline(h=0)
with(lead, plot(ld73, r)); abline(h=0)
qqnorm(r) # linearity indicates normality
qqline(as.numeric(r))
9.6 Plotting Partial Effects
Predict
andggplot
makes one plot for each predictor- Predictor is on \(x\)-axis, \(\hat{y}\) on the \(y\)-axis
- Predictors not shown in plot are set to constants
- median for continuous predictors
- mode for categorical ones
- For categorical predictor, estimates are shown only at data values
- 0.95 pointwise confidence limits for \(\hat{E}(y|x)\) are shown
(add
conf.int=FALSE
toPredict()
to suppress CLs) - Example: P
ggplot(Predict(f))
- To take control of which predictors are plotted, or to specify
customized options: Q
ggplot(Predict(f, age)) # plot age effect, using default range,
# 10th smallest to 10th largest age
ggplot(Predict(f, age=3:15)) # plot age=3,4,...,15