Introduction

Course Overview

Regression Modeling Strategies

All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. For the last, emphasis is placed on semiparametric ordinal regression models that do not assume a distribution.

Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of augmenting the design matrix using restricted cubic splines.
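As a rough illustration of what augmenting the design matrix means, the sketch below builds a restricted cubic spline basis in Python with NumPy. It is not the rms implementation: the function name `rcs_basis`, the knot placement, and the simulated data are assumptions made for this example. The basis uses the truncated power form with both tails constrained to be linear, so k knots contribute one linear column plus k - 2 nonlinear columns.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (hypothetical helper, truncated power form).

    Returns an (n, k-1) array: the linear term x plus k-2 nonlinear terms,
    each constructed so the fitted curve is linear beyond the outer knots.
    """
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return np.maximum(u, 0.0) ** 3

    scale = (t[-1] - t[0]) ** 2      # keeps columns on roughly the scale of x
    gap = t[-1] - t[-2]              # distance between the last two knots
    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[-2]) * (t[-1] - t[j]) / gap
                + pos3(x - t[-1]) * (t[-2] - t[j]) / gap)
        cols.append(term / scale)
    return np.column_stack(cols)

# Simulated data with a nonlinear relationship (assumption for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.sin(x / 2) + rng.normal(0, 0.2, 200)

# Five knots placed at fixed quantiles of x (an assumed choice)
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])

# Design matrix: intercept, linear term, and three nonlinear spline terms
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
```

Because the spline terms are ordinary columns of the design matrix, the same construction could in principle be used with other regression models, not only least squares.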

Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be covered, as will auxiliary topics such as modeling interaction surfaces, efficiently utilizing partial covariable data by using multiple imputation, variable selection, overly influential observations, collinearity, predictive accuracy, variable importance, and shrinkage. The course also gives a brief introduction to the rms package in R for handling these problems.
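One way bootstrap validation can work is as an optimism correction: refit the model in each bootstrap resample, compare its performance on the resample with its performance on the original data, and subtract the average difference from the apparent performance. The Python sketch below does this for R-squared from an ordinary least squares fit to simulated pure-noise predictors. The function names, sample size, and number of resamples are assumptions for illustration, and the code is not taken from the course materials or the rms package.

```python
import numpy as np

def fit_ols(X, y):
    # Least squares coefficients; X is assumed to include an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def r_squared(X, y, beta):
    # R^2 of the predictions X @ beta evaluated on the data (X, y)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def optimism_corrected_r2(X, y, n_boot=200, seed=0):
    """Return (apparent R^2, optimism-corrected R^2). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = r_squared(X, y, fit_ols(X, y))
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample rows with replacement
        beta_b = fit_ols(X[idx], y[idx])          # refit on the bootstrap sample
        r2_boot = r_squared(X[idx], y[idx], beta_b)   # performance where fitted
        r2_orig = r_squared(X, y, beta_b)             # performance on original data
        optimism.append(r2_boot - r2_orig)
    return apparent, apparent - float(np.mean(optimism))

# Simulated example: 15 predictors unrelated to the response, 40 observations
rng = np.random.default_rng(42)
n, p = 40, 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

apparent, corrected = optimism_corrected_r2(X, y)
print(f"apparent R^2 = {apparent:.2f}, optimism-corrected R^2 = {corrected:.2f}")
```

With many noise predictors and few observations, the apparent R-squared should sit well above the corrected value, which is the overfitting problem described above.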

The methods covered will apply to almost any regression model, including:

Statistical models will be contrasted with machine learning so that the student can make an informed choice of predictive tools.

The 4-day course also has a session introducing causal inference with special attention to how causal inference should inform model specification.

Course Outline

  1. Introduction; Advantages of prediction over classification
  2. Hypothesis Testing vs. Estimation vs. Prediction vs. Classification
  3. How Many Degrees of Freedom does a Data Mining Procedure Actually Have?
  4. Advantages of regression models and contrasts with machine learning
  5. Regression Model Notation
  6. Model Formulations
  7. Interpreting Model Parameters
  8. Relaxing Linearity Assumption for Continuous Predictors
  9. New Directions in Predictive Modeling
  10. How to Make the Choice of Statistical Models vs. Machine Learning
  11. Multiple Degree of Freedom Tests of Association
  12. Assessment of Model Fit
  13. Missing Data
  14. Multivariable Modeling Strategy
  15. Overview of the Bootstrap
  16. Model Validation
  17. Graphical Methods for Interpreting Complex Regression Fits
  18. Detailed Case Studies
  19. Causal Inference

Target Audience

Statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses, in developing, validating, and graphically describing multivariable predictive models, and in covariable adjustment in clinical trials. The course will be of particular interest to:

A good command of ordinary multiple regression is a prerequisite.

Learning Outcomes

Students will:

Instructional Methods

Extensive and tested handouts will be given to students. The course will be informal enough for students to be able to ask questions throughout the day. The style will be a mixture of lecture and presentation of moderately comprehensive case studies. Handouts make heavy use of graphics to facilitate learning.

The presentation and handouts show output from R functions, but software use is not covered in detail in the course. Students who are interested in later using free R software to run examples presented in the case studies may do so by installing the rms package available at www.r-project.org.

Presenters

Prof. Frank E. Harrell Jr.

Dr. Harrell is Professor of Biostatistics, Founding Chair of the Department of Biostatistics of Vanderbilt University School of Medicine, and is Expert Biostatistics Advisor, Center for Drug Evaluation and Research, US FDA.

He is the author of the book Regression Modeling Strategies, Second Edition (Springer, 2015) and teaches courses in biostatistical modeling. He is a Fellow of the American Statistical Association and received the ASA’s W.J. Dixon Award for Excellence in Statistical Consulting in 2014. He is active on Twitter under @f2harrell and leads datamethods.org for in-depth discussion of data-related methodologies.

Drew G. Levy PhD

Dr. Levy has a PhD in Epidemiology from the University of Washington (Seattle) and heads Good Science, Inc. He is moderator for the 4-day course and instructor for the causal inference part of the course.

Textbook

Harrell, F.E. (2015). Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Second Edition. New York: Springer.

Handouts

Handouts are here.

Software