Regression Modeling Strategies
All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. For the last, emphasis is placed on semiparametric ordinal regression models that do not assume a distribution.
Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response by augmenting the design matrix with restricted cubic splines, a widely applicable technique.
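To make the idea of augmenting the design matrix concrete, the following is a minimal sketch of a restricted cubic spline basis in plain Python. It is an illustration only and is not the course software, which is R and the rms package. The function names `rcs_basis` and `augment_design_matrix` and the example knots are hypothetical, and the truncated-power form scaled by the squared knot range is assumed here as one common parameterization.

```python
def rcs_basis(x, knots):
    """Return [x, s_1(x), ..., s_{k-2}(x)] for a restricted cubic spline.

    The nonlinear terms are zero below the first knot and combine so that
    the fitted curve is linear above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: u^3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # keeps the columns on a scale similar to x
    gap = t[-1] - t[-2]           # spacing between the last two knots
    row = [float(x)]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        row.append(term / scale)
    return row


def augment_design_matrix(xs, knots):
    """Replace a single predictor column by its k - 1 spline columns."""
    return [rcs_basis(x, knots) for x in xs]


# Hypothetical example: one predictor, four knots, three observations.
design = augment_design_matrix([0.5, 2.5, 5.0], knots=[1.0, 2.0, 3.0, 4.0])
```

Each row of `design` holds k - 1 columns in place of the single column x, so an ordinary linear fit on those columns can describe a smooth nonlinear curve that remains linear beyond the outer knots.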
Even when assumptions are satisfied, overfitting can ruin a model’s
predictive ability for future observations. Methods for data reduction
will be introduced to deal with the common case where the number of
potential predictors is large in comparison with the number of
observations. Methods of model validation (bootstrap and
cross-validation) will be covered, as will auxiliary topics such as
modeling interaction surfaces, efficiently utilizing partial covariable
data by using multiple imputation, variable selection, overly
influential observations, collinearity, predictive accuracy, variable
importance, and shrinkage. A brief introduction to the rms package in R
for handling these problems will also be given.
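As a rough illustration of how bootstrap validation can estimate overfitting, the sketch below applies the optimism bootstrap to a one-predictor least-squares fit in plain Python. It is a simplified stand-in under stated assumptions and not the course's implementation: the helper names (`fit_line`, `r_squared`, `optimism_corrected_r2`), the use of R-squared as the accuracy index, and the default of 200 resamples are all hypothetical choices made for this example.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """R-squared of a fixed (intercept, slope) model on the given data."""
    a, b = model
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=1):
    """Return (apparent R^2, optimism-corrected R^2)."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = 0.0
    for _ in range(n_boot):
        # Resample observations with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: accuracy on the resample minus accuracy on the original data.
        optimism += r_squared(model, bx, by) - r_squared(model, xs, ys)
    optimism /= n_boot
    return apparent, apparent - optimism
```

The corrected value subtracts the average optimism from the apparent R-squared, giving an estimate of how the model is likely to perform on future observations.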
The methods covered will apply to almost any regression model, including:
Statistical models will be contrasted with machine learning so that the student can make an informed choice of predictive tools.
The 4-day course also has a session introducing causal inference with special attention to how causal inference should inform model specification.
The course is intended for statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses, in developing, validating, and graphically describing multivariable predictive models, and in covariable adjustment in clinical trials. The course will be of particular interest to:
A good command of ordinary multiple regression is a prerequisite.
Students will:
Extensive and tested handouts will be given to students. The course will be informal enough for students to be able to ask questions throughout the day. The style will be a mixture of lecture and presentation of moderately comprehensive case studies. Handouts make heavy use of graphics to facilitate learning.
The presentation and handouts show output from R functions, but software
use is not covered in detail in the course. Students who are interested
in later using free R software to run examples presented in the case
studies may do so by installing the rms package available at
www.r-project.org.
Prof. Frank E. Harrell Jr.
Dr. Harrell is Professor of Biostatistics, Founding Chair of the Department of Biostatistics of Vanderbilt University School of Medicine, and is Expert Biostatistics Advisor, Center for Drug Evaluation and Research, US FDA.
He is author of the book Regression Modeling Strategies, Second
Edition (Springer, 2015) and teaches courses in biostatistical
modeling. He is a Fellow of the American Statistical Association and was
the recipient of the ASA’s WJ Dixon award for excellence in statistical
consulting in 2014. He is active on Twitter under @f2harrell and leads
datamethods.org for in-depth discussion of data-related methodologies.
Drew G. Levy PhD
Dr. Levy has a PhD in Epidemiology from the University of Washington (Seattle) and heads Good Science, Inc. He is moderator for the 4-day course and instructor for the causal inference part of the course.
Harrell, F.E. (2015). Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Second Edition. New York: Springer.
Handouts are here.