Regression Modeling Strategies

Author affiliation: Department of Biostatistics, School of Medicine, Vanderbilt University

Published: March 3, 2024

Flowchart: Multivariable Model Development → Estimation → Prediction → Validation

Preface

All standard regression models have assumptions that must be verified for the model to have power to test hypotheses and to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course emphasizes methods for assessing and satisfying the first two. Practical but powerful tools are presented for validating model assumptions and presenting model results. The course provides methods for estimating the shape of the relationship between predictors and response by augmenting the design matrix with restricted cubic splines, a widely applicable approach.

Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction are introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) are covered, as are auxiliary topics such as modeling interaction surfaces, efficiently using partial covariable data through multiple imputation, variable selection, overly influential observations, collinearity, and shrinkage. A brief introduction to the R rms package for handling these problems is included.

The methods covered apply to almost any regression model, including ordinary least squares, logistic regression models, ordinal regression, quantile regression, longitudinal data analysis, and survival models. Statistical models are contrasted with machine learning so that the student can make an informed choice of predictive tools.
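As a minimal sketch of what augmenting the design matrix with a restricted cubic spline looks like in practice, the following R code uses `ols` and `rcs` from the rms package. The simulated data, the variable names, and the choice of 4 knots are illustrative assumptions, not material from the course.

```r
library(rms)   # provides ols() and rcs()

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x / 2) + rnorm(n, sd = 0.3)   # response is truly nonlinear in x
d <- data.frame(x, y)

# rcs(x, 4) expands x into 3 design-matrix columns:
# x itself plus 2 nonlinear spline terms (knots placed at default quantiles)
f <- ols(y ~ rcs(x, 4), data = d)

coef(f)    # intercept plus 3 spline coefficients
anova(f)   # overall association test and a test of nonlinearity
```

The model stays linear in its parameters, so the usual fitting and testing machinery applies; only the right-hand side gains columns.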

Target Audience

Statisticians and persons from other quantitative disciplines who are interested in multivariable regression analysis of univariate responses; in developing, validating, and graphically describing multivariable predictive models; and in covariable adjustment in clinical trials. The course will be of particular interest to applied statisticians and developers of applied statistics methodology, graduate students, clinical and pre-clinical biostatisticians, health services and outcomes researchers, econometricians, psychometricians, and quantitative epidemiologists. A good command of ordinary multiple regression is a prerequisite.

Learning Goals

Students will

  • be able to fit multivariable regression models:
    • accurately
    • in a way the sample size will allow, without overfitting
    • uncovering complex nonlinear or nonadditive relationships
    • testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors
    • making maximum use of partial data rather than deleting observations with missing values
  • be able to validate models for predictive accuracy, detect overfitting, and understand the problems overfitting causes
  • learn techniques of “safe data mining” in which significance levels, confidence limits, and measures such as \(R^2\) have the claimed properties
  • learn how to interpret fitted models using both parameter estimates and graphics
  • learn about the advantages of semiparametric ordinal models for continuous \(Y\)
  • learn about some of the differences between frequentist and Bayesian approaches to statistical modeling
  • learn differences between machine learning and statistical models, and how to determine the better approach depending on the nature of the problem

Course Philosophy

The audio narration starts with the third bullet point listed here.
  • Modeling is the endeavor to transform data into information and information into either prediction or evidence about the data generating mechanism¹
  • Models are usually the best descriptive statistics
    • adjust for one variable while displaying the association with \(Y\) and another variable
    • descriptive statistics do not work in higher dimensions
  • Satisfaction of model assumptions improves precision and increases statistical power
    • Be aware of assumptions, especially those mattering the most
  • It is more productive to make a model fit step by step (e.g., transformation estimation) than to postulate a simple model and find out what went wrong
    • Model diagnostics are often not actionable
    • Changing the model in reaction to observed patterns \(\uparrow\) uncertainty but is reflected by an apparent \(\downarrow\) in uncertainty
  • Graphical methods should be married to formal inference
  • Overfitting occurs frequently, so data reduction and model validation are important
  • Software without multiple facilities for assessing and fixing model fit may only seem to be user-friendly
  • Carefully fitting an improper model is better than badly fitting (and overfitting) a well-chosen one
    • E.g. small \(N\) and overfitting vs. carefully formulated right hand side of model
  • Methods which work for all types of regression models are the most valuable.
  • In most research projects the cost of data collection far outweighs the cost of data analysis, so it is important to use the most efficient and accurate modeling techniques, to avoid categorizing continuous variables, and to not remove data from the estimation sample just to be able to validate the model.
    • A $100 analysis can make a $1,000,000 study worthless.
  • The bootstrap is a breakthrough for statistical modeling and model validation.
  • Bayesian modeling is ready for prime time.
    • Can incorporate non-data knowledge
    • Provides full exact inferential tools even when penalizing \(\beta\)
    • Rational way to account for model uncertainty
    • Direct inference: evidence for all possible values of \(\beta\)
    • More accurate way of dealing with missing data
  • Using the data to guide the data analysis is almost as dangerous as not doing so.
  • A good overall strategy is to decide how many degrees of freedom (i.e., number of regression parameters) can be "spent" and where they should be spent, and then to spend them with no regrets. See the excellent text Clinical Prediction Models (Steyerberg, 2009).

¹ Thanks to Drew Levy for ideas that greatly improved this section.
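The points above about overfitting, model validation, and the bootstrap can be made concrete with a small base R sketch of bootstrap optimism correction for \(R^2\). Everything here is an illustrative assumption: the simulated data, the 10 candidate predictors (only the first is related to the response), and B = 200 resamples. For rms fits, the `validate` function in the rms package automates this kind of calculation.

```r
set.seed(2)
n <- 60    # small sample
p <- 10    # many candidate predictors relative to n
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)      # only the first predictor matters
d <- data.frame(y, X)       # columns: y, X1, ..., X10

# R^2 of a fitted model evaluated on an arbitrary data set
r2 <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  1 - sum((data$y - pred)^2) / sum((data$y - mean(data$y))^2)
}

full     <- lm(y ~ ., data = d)
apparent <- r2(full, d)     # optimistic: evaluated on the fitting data

B <- 200
optimism <- replicate(B, {
  b  <- d[sample(n, replace = TRUE), ]   # bootstrap resample of rows
  fb <- lm(y ~ ., data = b)              # refit the whole model
  r2(fb, b) - r2(fb, d)                  # resample R^2 minus R^2 on original data
})

corrected <- apparent - mean(optimism)   # optimism-corrected R^2
c(apparent = apparent, corrected = corrected)
```

No observations are held out of the fit, which is in keeping with the advice not to remove data from the estimation sample just to validate the model.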


Symbols Used in the Right Margin of the Text

  • One symbol in the right margin is a hyperlink to a YouTube video related to the subject.
  • Another symbol is a hyperlink to the discussion topic on datamethods.org devoted to the specific topic. You can go directly to the discussion about chapter n by going to datamethods.org/rmsn, with n replaced by the chapter number.
  • An audio player symbol indicates that narration elaborating on the notes is available for the section. Red letters and numbers in the right margin are cues referred to within the audio recordings.
  • blog in the right margin is a link to a blog entry that further discusses the topic.

Other Information

R Packages

To be able to run all the examples in the book, install current versions of the following CRAN packages:

Hmisc, rms, data.table, nlme, rmsb, ggplot2
kableExtra, pcaPP, VGAM, MASS, leaps, rpart
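A minimal way to do this from an R session is sketched below. It assumes a CRAN mirror is already configured and simply installs the current CRAN version of every package in the list, whether or not an older copy is present.

```r
pkgs <- c("Hmisc", "rms", "data.table", "nlme", "rmsb", "ggplot2",
          "kableExtra", "pcaPP", "VGAM", "MASS", "leaps", "rpart")

# Installs (or reinstalls) the current CRAN version of each package
install.packages(pkgs)
```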

License

Regression Modeling Strategies Course is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at https://hbiostat.org/rmsc.

Date | Sections | Changes
2024-03-03 | 5.3.7 Validation of Data Reduction | New section on validation of data reduction
2024-02-18 | Contrasts | New section on contrasts
2023-09-17 | 5.1.3 Relative Explained Variation | New subsection on relative explained variation
2023-08-02 | 3.7.1 Predictive Mean Matching With Constraints | New subsection on constraints for imputed values
2023-08-01 | 24 Bacteremia: Case Study in Nonlinear Data Reduction with Imputation | New chapter for bacteremia case study
2023-07-28 | 1.1 Hypothesis Testing, Estimation, and Prediction | Added paired tests
2023-07-21 | 23 Body Fat: Case Study in Linear Modeling | New chapter: linear model case study
2023-07-14 | 9.7 AIC & BIC | New material and links for AIC/BIC
2023-05-30 | 4.12 Summary: Possible Modeling Strategies | Added consideration of confounding
2023-05-24 | 4.4 Overfitting and Limits on Number of Predictors | Better effective sample size for binary \(Y\)
2023-05-20 | 11.4 Regression on Original Variables, Principal Components and Pretransformations; 8.6 Data Reduction Using Principal Components | Added graphical display of PC loadings
2023-04-30 | 10 Binary Logistic Regression | Many improvements in graphics, and code using data.table
2023-04-29 | 2.8 Complex Curve Fitting Example | Added likelihood ratio tests
2023-04-22 | 9.4 The Hauck-Donner Effect | New section on Hauck-Donner effect ruining Wald statistics
2023-04-22 | 10 Binary Logistic Regression; 12.3 Binary Logistic Model with Casewise Deletion of Missing Values | Added new anova(..., test='LR')
2023-03-06 | 15 Regression Models for Continuous Y and Case Study in Ordinal Regression | Several changes; replaced lattice graphics with ggplot2 and added validation with simultaneous multiple imputation
2023-03-01 | 5.3.6 Multiple Imputation and Resampling-Based Model Validation | New section on simultaneous validation and imputation
2023-02-20 | 8 Case Study in Data Reduction; 11 Binary Logistic Regression Case Study 1 | Used new Hmisc 5.0-0 function princmp for principal components
2023-02-12 | | Used rms 6.5-0 to improve code, removing results='asis' from chunk headers
2023-02-07 | | Started moving study questions to end of chapters
2022-10-28 | 1 Introduction | 3 new flowcharts
2022-10-28 | 1.6.1 Model Uncertainty and Model Checking | New subsection on model uncertainty and GOF
2022-09-16 | 9.5 Confidence Intervals | Link to nice profile likelihood CI example

Review Questions

  1. Consider inference from comparing central tendencies of two groups on a continuous response variable \(Y\). What assumptions are you willing to make when selecting a statistical test? Why are you willing to make those assumptions?
  2. Consider the comparison of 5 groups on a continuous \(Y\). Suppose you observe that two of the groups have similar sample means and the other three also have similar sample means. What is wrong with combining the two samples and combining the three samples, then comparing two means? How does this compare to stepwise variable selection?
  3. Name a specific statistical test for which we don’t have a corresponding statistical model.
  4. Concerning a multi-group problem or a sequential testing problem, what is the frequentist approach to multiplicity correction? The Bayesian approach?