Regression Modeling Strategies
Course Overview
All standard regression models have assumptions that must be verified
for the model to have power to test hypotheses and for it to be able to
predict accurately. Of the principal assumptions (linearity, additivity,
distributional), this course will emphasize methods for assessing and
satisfying the first two. For the last, emphasis is placed on
semiparametric ordinal regression models that do not assume a
distribution.
Practical but powerful tools are presented for validating model
assumptions and presenting model results. This course provides methods
for estimating the shape of the relationship between predictors and
response using the widely applicable method of augmenting the design
matrix with restricted cubic splines.
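As a rough illustration of what "augmenting the design matrix" means, the sketch below expands a single continuous predictor into a linear column plus k-2 nonlinear restricted cubic spline columns for k knots. The course itself works in R with the rms package; this is plain Python for illustration only. The function names `rcs_basis` and `augment_design_matrix` are hypothetical, the knot locations are arbitrary, and dividing by the squared knot range is one common scaling convention assumed here.

```python
def rcs_basis(x, knots):
    """Return [x, nonlinear term 1, ..., nonlinear term k-2] for one value x.

    Truncated-power form of a restricted cubic spline: the fitted curve is
    cubic between knots and constrained to be linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("need at least 3 knots")

    def pos3(u):
        # (u)_+^3: cube of the positive part
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # assumed scaling: squared range of the knots
    gap = t[-1] - t[-2]           # distance between the last two knots
    row = [float(x)]              # the linear term comes first
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        row.append(term / scale)
    return row


def augment_design_matrix(xs, knots):
    """Replace one predictor column by its spline expansion (k-1 columns)."""
    return [rcs_basis(x, knots) for x in xs]


# Example with 5 arbitrary knots: each row has 1 linear + 3 nonlinear columns.
design = augment_design_matrix([0.5, 3.0, 8.0, 12.0], knots=[1, 2, 4, 7, 9])
```

Regressing the response on these columns with any standard fitting routine then estimates a smooth curve, and a joint test of the nonlinear columns serves as a test of linearity.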
Even when assumptions are satisfied, overfitting can ruin a model’s
predictive ability for future observations. Methods for data reduction
will be introduced to deal with the common case where the number of
potential predictors is large in comparison with the number of
observations. Methods of model validation (bootstrap and
cross-validation) will be covered, as will auxiliary topics such as
modeling interaction surfaces, variable selection, overly influential
observations, collinearity, predictive accuracy, variable importance,
shrinkage, model interpretation, and chunk tests. The course also gives
a brief introduction to the rms package in R for handling these
problems, and it introduces the Bayesian approach to modeling.
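To make the idea of bootstrap model validation concrete, here is a minimal sketch of an optimism-corrected performance estimate. Each bootstrap resample is used to refit the model, the refit is scored both on the resample and on the original data, and the average difference (the optimism) is subtracted from the apparent performance. This is plain Python for illustration rather than the rms workflow used in the course. The helper names are hypothetical, the data are simulated, and the model is a one-predictor least-squares line chosen only to keep the example short.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """R^2 of a fixed model (a, b) evaluated on the data (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=1):
    """Apparent R^2 minus the bootstrap estimate of optimism."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:
            continue                                 # skip degenerate resamples
        model = fit_line(bx, by)
        # performance in the resample minus performance on the original data
        diffs.append(r_squared(model, bx, by) - r_squared(model, xs, ys))
    optimism = sum(diffs) / len(diffs)
    return {"apparent": apparent, "optimism": optimism,
            "corrected": apparent - optimism, "n_boot_used": len(diffs)}


# Simulated data (an assumption for illustration only).
_rng = random.Random(0)
x_demo = [_rng.uniform(0, 10) for _ in range(30)]
y_demo = [2.0 * x + _rng.gauss(0, 3) for x in x_demo]
result = optimism_corrected_r2(x_demo, y_demo)
```

With a single predictor the optimism is typically small. The correction matters most when the number of candidate predictors is large relative to the sample size, which is the situation the paragraph above describes.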
The methods covered will apply to almost any regression model,
including:
- Ordinary least squares
- Longitudinal models
- Logistic regression models
- Ordinal regression for discrete and continuous Y
- Quantile regression
- Longitudinal data analysis
- Survival analysis
Statistical models will be contrasted with machine learning so that
the student can make an informed choice of predictive tools.
The 4-day course also has a session introducing causal inference with
special attention to how causal inference should inform model
specification.
Course Outline
Numbers in brackets refer to section numbers in the course notes.
Day 1
- Orientation to RMS book and RMS resources
- Philosophy of Multivariable Modeling
- [1.1.1] Quality of estimates
- [1.3] Prediction vs. classification
- [1.6] Model uncertainty / Data-driven Model Specification
- [2.4] Relaxing Linearity Assumption for Continuous Predictors
- [2.5] Recursive Partitioning: Tree-Based Models (2.5.2 covered on
Day 4)
- [2.6] Multiple Degree of Freedom Tests of Association
- [2.7] Assessment of Model Fit
- [2.8] Complex Curve Fitting Example
- [2.9] Contrasts and Model Reparameterization
- [4.1] Prespecification of Predictor Complexity Without Later
Simplification
- [4.3] Variable Selection
- [4.4] Overfitting and Limits on Number of Predictors
- [4.5] Shrinkage
- [4.6] Collinearity
- [4.7] Data Reduction
- [4.8] Other Approaches to Predictive Modeling
- [4.9] Overly Influential Observations
- [4.10] Comparing Two Models
- [4.11] Improving the Practice of Multivariable Prediction
- [4.12] Summary: Possible Modeling Strategies
Day 2
- [5.1] Describing the Fitted Model
- [5.2] The Bootstrap
- [5.3] Model Validation
- [5.4] Bootstrapping Ranks of Predictors
- [5.5] Simplifying the Final Model by Approximating It
- [5.6] How Do We Break Bad Habits?
- [10.1] Model Assumptions and Interpretation of Parameters
- [10.2] Estimation
- [10.3] Test Statistics
- [10.5] Assessment of Model Fit
- [10.8] Quantifying Predictive Ability
- [10.9] Validating the Fitted Model
- [10.10] Describing the Fitted Model
- [12] Case Study: Survival of Titanic Passengers
- [13.1] OLR Background
- [13.3] Proportional Odds Model
- [13.4] Assumptions and Interpretation of Parameters
- [BBR 7.6, 7.8, 7.9] Ordinal Regression Examples
Note: The order of Sessions 7-8 was reversed for May 2026.
- [15.2] The Linear Model
- [15.3] Quantile Regression
- [15.4] Ordinal Regression for Continuous Y
- [15.5] Case Study
- [7.2] Model Specification for Effects on E(Y)
- [7.3] Modeling Within-Subject Dependence
- [7.4] Parameter Estimation Procedure
- [7.5] Common Correlation Structures
- [7.6] Checking Model Fit
- [7.8.1,7.8.2] Longitudinal Model Case Study
Day 3
Session 9: Causal Models for Variable Selection 9:00-10:15
- Overall Process of Evidence Generation
- The Data Generating Process
- Structural Causal Models / Directed Acyclic Graphs (DAGs)
- Toxic Adjustments in Regression Models
- Biasing Structures
- Collider Example
- How to DAG
- Limitations
- [22.1] Longitudinal Ordinal Models as Unifying Concepts
- [22.1.1] General Outcome Attributes
- [22.1.2] What is a Fundamental Outcome Assessment?
- [22.1.3] Examples of Longitudinal Ordinal Outcomes
- [22.1.4] Statistical Model for Ordinal Longitudinal Outcome
- [22.2] Case Studies for Ordinal Longitudinal Outcomes (until
11:30)
- General discussion and catch-up
Session 11: Bayesian Modeling 1:00-2:30
- [2.10] Advantages of Bayesian Modeling
- [2.10.2] Constraining regression models using Bayesian priors
- [10.11] Binary logistic model example
- [7.8.3] Longitudinal ordinal model with random effects
- [BBR 7.11] Bayesian proportional odds model
- [BBR 7.9.1] Ordinal regression with random effects for paired
data
- See also examples using the rmsb package
Day 4
- [20.1] CPH Model Preliminaries
- [20.2] Estimation of Survival Probability and Secondary
Parameters
- [20.3] Sample Size Considerations
- [20.4] CPH Test Statistics
- [20.5] CPH Residuals
- [20.6] Assessment of CPH Model Fit
- [20.7] What to Do When PH Fails
- [20.10] Quantifying Predictive Ability
- [20.11] Validating the Fitted Model
- [20.12] Describing the Fitted Model
- [21] Case Study in Cox Regression
- [25.1] Background and Rationale
- [25.2] Cumulative Probability Models
- [25.3] Special Cases of Ordinal Cumulative Probability Models
- [25.4] The Continuum of Links
- [25.5] Effective Sample Sizes
- [25.6] The rms Package for Survival Analysis Using Ordinal
Regression
- [25.7] Goodness of Fit Overview
- [25.8] Simple Examples With Goodness of Fit Assessments
- [25.9] Accelerated Failure Time Example
Session 15: General Likelihood Ratio Test and Profile Confidence Limits 1:00-1:45
- [2.9] Contrasts and Model Reparameterization
Session 16: Wrap-up: RMS Summary and Discussion 1:55-4:00
- Review of RMS Philosophy & Principles
- Discussion
- [2.5.2] How to think about ML vs. regression modeling
- rms package vs. more general frameworks, e.g. marginaleffects
- Roles for AI
- General Q&A
Target Audience
Statisticians and persons from other quantitative disciplines who are
interested in multivariable regression analysis of univariate responses,
in developing, validating, and graphically describing multivariable
predictive models, and in covariable adjustment in clinical trials. The
course will be of particular interest to:
- Applied statisticians
- Developers of applied statistics methodology
- Graduate students
- Clinical and pre-clinical biostatisticians
- Health services and outcomes researchers
- Econometricians
- Psychometricians
- Quantitative epidemiologists
A good command of ordinary multiple regression is a prerequisite. The
one-day pre-RMS course provides this prerequisite.
Learning Outcomes
Students will:
- Be able to fit multivariable regression models accurately without
overfitting.
- Uncover complex non-linear or non-additive relationships.
- Test for and quantify associations between predictors and response,
adjusting for other factors.
- Validate models for predictive accuracy and detect overfitting.
- Learn techniques of “safe data mining.”
- Learn how to interpret fitted models using parameter estimates and
graphics.
- Understand the advantages of semiparametric ordinal models for
continuous and censored Y.
- Compare frequentist and Bayesian approaches to statistical
modeling.
- Distinguish between machine learning and statistical models and make
informed decisions about their application.
- Understand how Markov ordinal state transition (MOST) models
generalize survival analysis, recurrent events analysis, the Wilcoxon
test, and longitudinal analysis of continuous Y.
- Understand how ordinal semiparametric regression models generalize
single-event survival analysis.
- Gain an appreciation for how study design and causal inference need
to drive model formulation.
Instructional Methods
Extensive and tested handouts will be given to students. The course
will be informal enough for students to be able to ask questions
throughout the day. The style will be a mixture of lecture and
presentation of moderately comprehensive case studies. Handouts make
heavy use of graphics to facilitate learning.
The presentation and handouts show output from R
functions, but software use is not covered in detail in the course.
Students who are interested in later using the free R software
to run examples presented in the case studies may do so by installing
the rms package available at www.r-project.org.
Presenters
Prof. Frank E. Harrell Jr.
Dr. Harrell is Professor of Biostatistics and Founding Chair of the
Department of Biostatistics at Vanderbilt University School of
Medicine.
He is author of the book Regression Modeling Strategies, Second
Edition (Springer, 2015) and teaches courses in biostatistical
modeling. He is a Fellow of the American Statistical Association and was
the recipient of the ASA’s W. J. Dixon Award for Excellence in Statistical
Consulting in 2014. He is active on BlueSky and Twitter under
@f2harrell and leads datamethods.org for
in-depth discussion of data-related methodologies.
Drew G. Levy PhD
Dr. Levy has a PhD in Epidemiology from the University of Washington
(Seattle) and heads Good Science, Inc. He is the moderator for the 4-day
course and the instructor for its causal inference part.
Textbook
Harrell, F.E. (2015). Regression Modeling Strategies with
Applications to Linear Models, Logistic and Ordinal Regression, and
Survival Analysis, Second Edition. New York: Springer.
Handouts
Handouts are here.
Software
- R (not used “live” in the course)