Introduction

Course Overview

Regression Modeling Strategies

All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and for it to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two. For the last, emphasis is placed on semiparametric ordinal regression models that do not assume a distribution.

Practical but powerful tools are presented for validating model assumptions and presenting model results. This course provides methods for estimating the shape of the relationship between predictors and response by augmenting the design matrix with restricted cubic splines, a widely applicable technique.
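As a rough illustration of what augmenting the design matrix with restricted cubic splines means, the sketch below expands one continuous predictor into k-1 columns for k knots, using the truncated power basis with the tails constrained to be linear. It is a minimal sketch in plain Python, not the course's own code (the course materials use the rms package in R). The function names rcs_basis and design_matrix, the knot locations, and the scaling by the squared knot range are assumptions made for this example.

```python
def rcs_basis(x, knots):
    """Return the k-1 restricted cubic spline columns for a single value x.

    Column 0 is x itself; the remaining k-2 columns are nonlinear terms
    built so that the fitted curve is linear before the first knot and
    after the last knot.
    """
    t = sorted(float(v) for v in knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: (u)_+^3
        return max(u, 0.0) ** 3

    last, second_last = t[k - 1], t[k - 2]
    gap = last - second_last
    scale = (t[k - 1] - t[0]) ** 2  # assumed scaling to keep columns on the scale of x

    cols = [float(x)]
    for j in range(k - 2):
        term = (
            pos3(x - t[j])
            - pos3(x - second_last) * (last - t[j]) / gap
            + pos3(x - last) * (second_last - t[j]) / gap
        )
        cols.append(term / scale)
    return cols


def design_matrix(xs, knots):
    """Expand each predictor value into its spline columns (no intercept)."""
    return [rcs_basis(x, knots) for x in xs]


# Hypothetical predictor values and knots, for illustration only.
for row in design_matrix([0.5, 2.0, 4.0, 6.0, 9.0], knots=[1, 3, 5, 7]):
    print([round(v, 4) for v in row])
```

The expanded columns can then be passed to any regression fitter in place of the single raw predictor, and the nonlinear columns can be tested jointly to assess linearity.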

Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be covered, as will auxiliary topics such as modeling interaction surfaces, efficiently utilizing partial covariable data by using multiple imputation, variable selection, overly influential observations, collinearity, predictive accuracy, variable importance, and shrinkage. The course also includes a brief introduction to the rms package in R for handling these problems.
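To make the idea of bootstrap model validation concrete, the sketch below estimates how much an apparent R-squared overstates future performance. It refits a model on each bootstrap resample, compares the performance on the resample with the performance of that same fit on the original data, and subtracts the average difference (the optimism) from the apparent value. This is a minimal sketch in plain Python under simplifying assumptions: a one-predictor least squares model, R-squared as the accuracy index, and simulated data. The function names and defaults are hypothetical and are not taken from the course materials.

```python
import random


def fit_ols(xs, ys):
    """Fit y = a + b*x by least squares; return (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """R-squared of a fitted (a, b) model evaluated on the data (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst


def optimism_corrected_r2(xs, ys, n_boot=200, seed=1):
    """Return (apparent, optimism, corrected) R-squared."""
    rng = random.Random(seed)
    n = len(xs)
    apparent = r_squared(fit_ols(xs, ys), xs, ys)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2 or len(set(by)) < 2:
            continue  # skip degenerate resamples that cannot be fitted
        model = fit_ols(bx, by)
        # Performance on the resample minus performance on the original data
        diffs.append(r_squared(model, bx, by) - r_squared(model, xs, ys))
    optimism = sum(diffs) / len(diffs)
    return apparent, optimism, apparent - optimism


# Simulated data, for illustration only.
data_rng = random.Random(0)
x_demo = [i / 2 for i in range(30)]
y_demo = [1 + 0.5 * x + data_rng.gauss(0, 1) for x in x_demo]
print(optimism_corrected_r2(x_demo, y_demo))
```

In a real analysis, every data-driven modeling step (variable selection, transformation choices, and so on) would need to be repeated inside each resample for the optimism estimate to be honest.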

The methods covered will apply to almost any regression model, including:

Ordinary least squares
Logistic regression models
Ordinal regression for discrete and continuous Y
Quantile regression
Longitudinal data analysis
Survival analysis

Statistical models will be contrasted with machine learning so that the student can make an informed choice of predictive tools.

The 4-day course also has a session introducing causal inference with special attention to how causal inference should inform model specification.

Course Outline

Introduction; Advantages of prediction over classification
Hypothesis Testing vs. Estimation vs. Prediction vs. Classification
How Many Degrees of Freedom does a Data Mining Procedure Actually Have?
Advantages of regression models and contrasts with machine learning
Regression Model Notation
Model Formulations
Interpreting Model Parameters
- Nominal Predictors
- Interactions
Relaxing Linearity Assumption for Continuous Predictors
- Categorization is not an alternative
- Simple Nonlinear Terms
- Splines for Estimating Shape of Regression Function and Determining Predictor Transformations
- Cubic Spline Functions
- Restricted Cubic Splines
- Choosing Number and Position of Knots
- Nonparametric smoothers and regression trees
- Advantages of Splines over Other Methods
New Directions in Predictive Modeling
How to Make the Choice of Statistical Models vs. Machine Learning
Multiple Degree of Freedom Tests of Association
Assessment of Model Fit

Regression Assumptions
Modeling and Testing Interactions

Missing Data

Types of Missing Data
Problems Caused by Simple Solutions
Multiple Imputation

Multivariable Modeling Strategy

Why and How To Pre-specify Model Complexity
Problems Caused by Ordinary Stepwise Variable Selection
Collinearity
Shrinkage
Data Reduction
Overly Influential Observations
Modeling Strategies for Prediction, Estimation, Hypothesis Testing

Overview of the Bootstrap

Using the bootstrap to check feature selection stability and to get confidence intervals for variable importance

Model Validation

Cross-validation
Bootstrap

Graphical Methods for Interpreting Complex Regression Fits
Detailed Case Studies

Generalized Least Squares and Bayesian Semiparametric Proportional Odds Models for Longitudinal Data
Ordinal Regression for Continuous Y: Predicting glycohemoglobin (and pre-diabetes) from body size characteristics using NHANES data
Binary Logistic Regression: Survival Patterns of Passengers on the Titanic
Survival Modeling: Using semiparametric ordinal models with allowance for left, right, and interval censoring

Causal Inference

Included in the 4-day course
Emphasis is on how causal inferential thinking should inform statistical model specification

Target Audience

Statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses, in developing, validating, and graphically describing multivariable predictive models, and in covariable adjustment in clinical trials. The course will be of particular interest to:

Applied statisticians
Developers of applied statistics methodology
Graduate students
Clinical and pre-clinical biostatisticians
Health services and outcomes researchers
Econometricians
Psychometricians
Quantitative epidemiologists

A good command of ordinary multiple regression is a prerequisite.

Learning Outcomes

Students will:

Be able to fit multivariable regression models accurately without overfitting.
Uncover complex non-linear or non-additive relationships.
Test for and quantify associations between predictors and response, adjusting for other factors.
Make maximum use of partial data rather than deleting observations containing missing values.
Validate models for predictive accuracy and detect overfitting.
Learn techniques of “safe data mining.”
Learn how to interpret fitted models using parameter estimates and graphics.
Understand the advantages of semiparametric ordinal models for continuous and censored Y.
Compare frequentist and Bayesian approaches to statistical modeling.
Distinguish between machine learning and statistical models and make informed decisions about their application.
Gain an appreciation for how study design and causal inference need to drive model formulation.

Instructional Methods

Extensive and tested handouts will be given to students. The course will be informal enough for students to be able to ask questions throughout the day. The style will be a mixture of lecture and presentation of moderately comprehensive case studies. Handouts make heavy use of graphics to facilitate learning.

The presentation and handouts show output from R functions, but software use is not covered in detail in the course. Students who are interested in later using free R software to run examples presented in the case studies may do so by installing the rms package available at www.r-project.org.

Presenters

Prof. Frank E. Harrell Jr.

Dr. Harrell is Professor of Biostatistics and Founding Chair of the Department of Biostatistics at Vanderbilt University School of Medicine, and Expert Biostatistics Advisor at the Center for Drug Evaluation and Research, US FDA.

He is author of the book Regression Modeling Strategies, Second Edition (Springer, 2015) and teaches courses in biostatistical modeling. He is a Fellow of the American Statistical Association and was the recipient of the ASA’s WJ Dixon award for excellence in statistical consulting in 2014. He is active on Twitter under @f2harrell and leads datamethods.org for in-depth discussion of data-related methodologies.

Drew G. Levy PhD

Dr. Levy has a PhD in Epidemiology from the University of Washington (Seattle) and heads Good Science, Inc. He is moderator for the 4-day course and instructor for the causal inference part of the course.

Textbook

Harrell, F.E. (2015). Regression Modeling Strategies with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis, Second Edition. New York: Springer.

Handouts

Handouts are here.

Software

R (not used “live” in the course)