R allows interaction spline functions, wide variety of predictor parameterizations, wide variety of models, unifying model formula language, model validation by resampling.

R is comprehensive:

6.1 The R Modeling Language

R statistical modeling language:

Code
response ~ terms

y ~ age + sex            # age + sex main effects
y ~ age + sex + age:sex  # add second-order interaction
y ~ age*sex              # second-order interaction +
                         # all main effects
y ~ (age + sex + pressure)^2
                         # age+sex+pressure+age:sex+age:pressure...
y ~ (age + sex + pressure)^2 - sex:pressure
                         # all main effects and all 2nd order
                         # interactions except sex:pressure
y ~ (age + race)*sex     # age+race+sex+age:sex+race:sex
y ~ treatment*(age*race + age*sex) # no interact. with race,sex
sqrt(y) ~ sex*sqrt(age) + race
# functions, with dummy variables generated if
# race is an R factor (classification) variable
y ~ sex + poly(age,2)    # poly generates orthogonal polynomials
race.sex <- interaction(race,sex)
y ~ age + race.sex       # for when you want dummy variables for
                         # all combinations of the factors

The formula for a regression model is given to a modeling function, e.g.

Code
lrm(y ~ rcs(x,4))

is read “use a logistic regression model to model y as a function of x, representing x by a restricted cubic spline with 4 default knots”1.

  • 1 lrm andrcs are in the rms package.

  • update function: re-fit model with changes in terms or data:

    Code
    f  <- lrm(y ~ rcs(x,4) + x2 + x3)
    f2 <- update(f, subset=sex=="male")
    f3 <- update(f, .~.-x2)         # remove x2 from model
    f4 <- update(f, .~. + rcs(x5,5))# add rcs(x5,5) to model
    f5 <- update(f, y2 ~ .)         # same terms, new response var.

    6.2 User-Contributed Functions

    • R is high-level object-oriented language.
    • R runs on all platforms, and can be run on chromebooks using RStudio Server
    • Multitude of user-contributed functions freely available
    • International community of users

    Some R functions:

    • See Venables and Ripley
    • Hierarchical clustering: hclust
    • Principal components: princomp, prcomp
    • Canonical correlation: cancor
    • Nonparametric transform-both-sides additive models:
      ace, avas
    • Parametric transform-both-sides additive models:
      areg, areg.boot (Hmisc package in R))
    • Rank correlation methods:
      rcorr, hoeffd, spearman2 (Hmisc)
    • Variable clustering: varclus (Hmisc)
    • Single imputation: transcan (Hmisc)
    • Multiple imputation: aregImpute (Hmisc)
    • Restricted cubic splines:
      rcspline.eval (Hmisc)
    • Re-state restricted spline in simpler form:
      rcspline.restate (Hmisc)

    6.3 The rms Package

    • datadist function to compute predictor distribution summaries
    Code
    y ~ sex + lsp(age,c(20,30,40,50,60)) +
        sex %ia% lsp(age,c(20,30,40,50,60))

    E.g. restrict age \(\times\) cholesterol interaction to be of form \(AF(B) + BG(A)\):

    Code
    y ~ lsp(age,30) + rcs(cholesterol,4) +
        lsp(age,30) %ia% rcs(cholesterol,4)

    Special fitting functions by Harrell to simplify procedures described in these notes:

    rms Fitting Functions
    Function Purpose Related R Functions
    ols Ordinary least squares linear model lm
    lrm Binary and ordinal logistic regression model with options for penalize MLE glm
    orm Ordinal semi-parametric regression model for continuous \(Y\) and several link functions polr,lrm
    psm Accelerated failure time parametric survival models survreg
    cph Cox proportional hazards regression coxph
    bj Buckley-James censored least squares model survreg,lm
    Glm rms version of glm glm
    Gls rms version of gls gls (nlme package)
    Rq rms version of rq rq (quantreg package)
    rms Transformation Functions
    Function Purpose
    asis No post-transformation (seldom used explicitly)
    rcs Restricted cubic splines
    pol Polynomial using standard notation
    lsp Linear spline
    catg Categorical predictor (seldom)
    scored Ordinal categorical variables
    matrx Keep variables as group for anova and fastbw
    strat Non-modeled stratification factors (used for cph only)

    The transformation functions work also with regular R functions, e.g. when predict() is called the predicted values are computed by looking up the knot locations for rcs.

    Below notice that there are three graphic models implemented for depicting the effects of predictors in the fitted model: lattice graphics, a ggplot method using the ggplot2 package (which has an option to convert the result to plotly), and a direct plotly method. plotly is used to create somewhat interactive graphics with drill-down capability, and the rms package takes advantage of this capability. plotly graphics are best used with RStudio Rmarkdown html output.

    rms After-fit Functions
    Function Purpose Related Functions
    print Print parameters and statistics of fit
    coef Fitted regression coefficients
    formula Formula used in the fit
    specs Detailed specifications of fit
    vcov Fetch covariance matrix
    logLik Fetch maximized log-likelihood
    AIC Fetch AIC with option to put on chi-square basis
    lrtest Likelihood ratio test for two nested models
    univarLR Compute all univariable LR \(\chi^{2}\)
    robcov Robust covariance matrix estimates
    bootcov Bootstrap covariance matrix estimates and bootstrap distribution of estimates
    pentrace Find optimum penalty factors by tracing effective AIC for a grid of penalties
    effective.df Print effective d.f. for each type of variable in model, for penalized fit or pentrace results
    summary Summary of effects of predictors
    plot.summary Plot continuously shaded confidence bars for results of summary
    anova Wald and LR tests of most meaningful hypotheses
    plot.anova Graphical depiction of anova
    contrast General contrasts, C.L., tests
    gendata Easily generate predictor combinations
    predict Obtain predicted values or design matrix
    Predict Obtain predicted values and confidence limits easily varying a subset of predictors and others set at default values
    plot.Predict Plot the result of Predict using lattice
    ggplot.Predict Plot the result of Predict using ggplot2
    plotp.Predict Plot the result of Predict using plotly
    fastbw Fast backward step-down variable selection step
    residuals (or resid) Residuals, influence stats from fit
    sensuc Sensitivity analysis for unmeasured confounder
    which.influence Which observations are overly influential residuals
    latex \(\LaTeX\) representation of fitted model Function
    Function R function analytic representation of \(X\hat{\beta}\) from a fitted regression model latex
    Hazard R function analytic representation of a fitted hazard function (for psm)
    Survival R function analytic representation of fitted survival function (for psm, cph)
    Quantile R function analytic representation of fitted function for quantiles of survival time (for psm, cph)
    Mean R function analytic representation of fitted function for mean survival time or for ordinal logistic
    nomogram Draws a nomogram for the fitted model latex, plot
    survest Estimate survival probabilities (psm, cph) survfit
    survplot Plot survival curves (psm, cph) plot.survfit
    survplotp Plot survival curves with plotly features survplot
    validate Validate indexes of model fit using resampling
    val.prob External validation of a probability model lrm
    val.surv External validation of a survival model calibrate
    calibrate Estimate calibration curve using resampling val.prob
    vif Variance inflation factors for fitted model
    naresid Bring elements corresponding to missing data back into predictions and residuals
    naprint Print summary of missing values
    impute Impute missing values aregImpute
    rmsb: Bayesian Regression Modeling Strategies Package, Focusing on Semiparametric Univariate and Longitudinal Models
    Function Purpose
    blrm Bayesian binary and ordinal logistic model
    stackMI Bayesian posterior stacking for multiple imputation
    stanDx Stan diagnostics on fit
    stanDxplot Trace plots to check posterior sampling convergence
    PostF Creates R function for computing posterior probabilities
    plot.rmsb Plot posterior densities, intervals, point summaries
    compareBmods Compare two models using LOO-cv
    HPDint Compute highest posterior density interval
    distSym Compute meaure of symmetry of posterior distribution

    An extensive overview of Bayesian capabilities of the rmsb package may be found at hbiostat.org/R/rmsb/blrm.html.

    Global options prType and grType control printed and some graphical output, respectively as shown in example code below. The default is plain output and static graphics. If using plotly interactive graphics through ggplot or plotp or with anova or summary functions it is best to do so with RStudio html output or html notebooks. If using html output you must be producing an html document or notebook. When setting grType to use \(\LaTeX\) or html it is highly recommended that you use the knitr package.

    Example:

    • treat: categorical variable with levels "a","b","c"
    • num.diseases: ordinal variable, 0-4
    • age: continuous
      Restricted cubic spline
    • cholesterol: continuous
      (3 missings; use median)
      log(cholesterol+10)
    • Allow treat \(\times\) cholesterol interaction
    • Program to fit logistic model, test all effects in design, estimate effects (e.g. inter-quartile range odds ratios), plot estimated transformations
    Code
    require(rms)                     # make new functions available
    options(prType='html')           # print, summary, anova, validate: html output
                                     # others: 'latex', 'plain'
    options(grType='plotly')         # plotly graphics for ggplot, anova, summary
                                     # default is 'base' for static graphics
    ddist <- datadist(cholesterol, treat, num.diseases, age)
    # Could have used ddist <- datadist(data.frame.name)
    options(datadist="ddist")        # defines data dist. to rms
    cholesterol <- impute(cholesterol)
    fit <- lrm(y ~ treat + scored(num.diseases) + rcs(age) +
                     log(cholesterol+10) + treat:log(cholesterol+10),
                         x=TRUE, y=TRUE)  # needed for robcov, anova test='LR'
    fit    # outputs plain, LaTeX, or html markup
    describe(y ~ treat + scored(num.diseases) + rcs(age))
    # or use describe(formula(fit)) for all variables used in fit
    # describe function (in Hmisc) gets simple statistics on variables
    # fit <- robcov(fit)             # Would make all statistics that follow
                                     # use a robust covariance matrix
                                     # would need x=T, y=T in lrm()
    specs(fit)                       # Describe the design characteristics
    anova(fit)                       # Wald tests, plain, LaTex, or html
    anova(fit, test='LR')            # Likelihood ratio tests
    anova(fit, treat, cholesterol)   # Test these 2 by themselves
    plot(anova(fit))                 # Summarize anova graphically
    summary(fit)                     # Estimate effects using default ranges
                                     # prints plain, LaTeX, or html
    plot(summary(fit))               # Graphical display of effects with C.I.
    summary(fit, treat="b", age=60)  # Specify reference cell and adjustment val
    summary(fit, age=c(50,70))       # Estimate effect of increasing age from
                                     # 50 to 70
    summary(fit, age=c(50,60,70))    # Increase age from 50 to 70, adjust to
                                     # 60 when estimating effects of other
                                     # factors
    # If had not defined datadist, would have to define ranges for all var.
    
    # Estimate and test treatment (b-a) effect averaged over 3 cholesterols
    contrast(fit, list(treat='b', cholesterol=c(150,200,250)),
                  list(treat='a', cholesterol=c(150,200,250)),
             type='average')
    # See the help file for contrast.rms for several examples of
    # how to obtain joint tests of multiple contrasts and how to get
    # double differences (interaction contrasts)
    
    p <- Predict(fit, age=seq(20,80,length=100), treat, conf.int=FALSE)
    plot(p)                          # Plot relationship between age and log
    # or ggplot(p), plotp(p)         # odds, separate curve for each treat,
                                     # no C.I.
    plot(p, ~ age | treat)           # Same but 2 panels
    ggplot(p, groups=FALSE)
    bplot(Predict(fit, age, cholesterol, np=50))
                                     # 3-dimensional perspective plot for age,
                                     # cholesterol, and log odds using default
                                     # ranges for both variables
    plot(Predict(fit, num.diseases, fun=function(x) 1/(1+exp(-x)), conf.int=.9),
         ylab="Prob")                # Plot estimated probabilities instead of
                                     # log odds (or use ggplot())
                                     # can also use plotp() for plotly
    # Again, if no datadist were defined, would have to tell plot all limits
    logit <- predict(fit, expand.grid(treat="b",num.dis=1:3,age=c(20,40,60),
                     cholesterol=seq(100,300,length=10)))
    # Could also obtain list of predictor settings interactively}
    logit <- predict(fit, gendata(fit, nobs=12))
    
    # Since age doesn't interact with anything, we can quickly and
    # interactively try various transformations of age, taking the spline
    # function of age as the gold standard. We are seeking a linearizing
    # transformation.
    
    ag <- 10:80
    logit <- predict(fit, expand.grid(treat="a", num.dis=0, age=ag,
                     cholesterol=median(cholesterol)), type="terms")[,"age"]
    # Note: if age interacted with anything, this would be the age
    #        "main effect" ignoring interaction terms
    # Could also use
    #    logit <- Predict(f, age=ag, ...)$yhat,
    # which allows evaluation of the shape for any level of interacting
    #  factors.  When age does not interact with anything, the result from
    # predict(f, ..., type="terms") would equal the result from
    # Predict if all other terms were ignored
    
    # Could also specify
    #    logit <- predict(fit, gendata(fit, age=ag, cholesterol=...))
    # Un-mentioned variables set to reference values
    
    plot(ag^.5, logit)               # try square root vs. spline transform.
    plot(ag^1.5, logit)              # try 1.5 power
    
    latex(fit)                       # fit in math notation
    # Draw a nomogram for the model fit
    plot(nomogram(fit))
    
    # Compose R function to evaluate linear predictors analytically
    g <- Function(fit)
    g(treat='b', cholesterol=260, age=50)
    # Letting num.diseases default to reference value

    To examine interactions in a simpler way, you may want to group age into tertiles:

    Code
    age.tertile <- cut2(age, g=3)
    # For automatic ranges later, add age.tertile to datadist input
    fit <- lrm(y ~ age.tertile * rcs(cholesterol))

    6.4 Other Functions

    • processMI: works with Hmisc::fit.mult.impute to process resampling validation statistics and likelihood ratio \(\chi^2\) statistics accounting for multiple imputation
    • supsmu: Friedman’s “super smoother”
    • lowess: Cleveland’s scatterplot smoother
    • glm: generalized linear models (see Glm)
    • gam: Generalized additive models
    • rpart: Like original CART with surrogate splits for missings, censored data extension (Atkinson & Therneau)
    • validate.rpart: in rms; validates recursive partitioning with respect to certain accuracy indexes
    • loess: multi-dimensional scatterplot smoother
    Code
    f <- loess(y ~ age * pressure)
    plot(f)                           # cross-sectional plots
    ages <- seq(20,70,length=40)
    pressures <- seq(80,200,length=40)
    pred <- predict(f, expand.grid(age=ages, pressure=pressures))
    persp(ages, pressures, pred)      # 3-d plot