Bios 7330: Regression Modeling Strategies
Frank E. Harrell, Jr.
f.harrell@vumc.org
Professor of Biostatistics
Department of Biostatistics
Vanderbilt University School of Medicine
Teaching Assistants: Chiara di Gravio and Michael Williams (contact through Zulip)
Office Hours: Send direct message on Zulip to instructor to arrange a Zoom conference
11 January - 21 April 2022, Final Project Due 2022-05-02
Grades are due by 11:59pm on 2022-05-07.
Tuesday, Thursday 3:30-5:00
8102, 8th Floor, 2525 West End
This course covers many aspects of multivariable regression modeling as it is commonly used in prognostic, diagnostic, and epidemiologic modeling, clinical trials, and prediction in general.
Course schedule
Resources
- Syllabus
- Handouts
- Study questions
- R scripts for RMS 2nd edition
- Papers to read
- 🆕 Self-reported participation
- Concepts to master
- Supplemental Material on Biostatistical Modeling including interactive R demonstrations
- Document updates
- Biostatistics in Biomedical Research course
Note: If you use Google Chrome or Chromium to view the handouts, the first time you click on a sound file the browser will download the playlist file (.m3u) for that .mp3 sound file. Click on the down arrow next to the name of the downloaded file on the bottom left of the browser window, and select “Always open files of this type”.
Text
The instructor’s book Regression Modeling Strategies, 2nd edition, 2015 is available from Amazon and other book sellers in addition to the Vanderbilt bookstore.
Motivation
Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons.
- Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
- A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
- Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
- A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
- Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., lower change in survival probability).
To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.
Description
Many types of regression models are increasingly being used in developing clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general. Popular models include logistic models for binary and ordinal responses, survival models, quantile regression, and models for longitudinal data analysis, many of which are covered in this course.
All regression models have assumptions that must be verified for them to have power to test hypotheses and to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., odds ratio charts) and nomograms.
Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.
In the course much attention is paid to dealing with missing data using multiple imputation, the use of bootstrapping, enhancements to ordinary maximum likelihood estimation and inference, and testing general or complex hypotheses using general contrasts and likelihood ratio tests. Quantifying the predictive discrimination and calibration accuracy of models is also a key area of emphasis.
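The workflow described above (relaxing linearity, then checking for overfitting with the bootstrap) might look roughly like the following. This is only a sketch under stated assumptions: the data are simulated, the variable names are hypothetical, and a restricted cubic spline with 4 knots stands in for the "piecewise polynomials" mentioned in the description. It is not taken from the course materials.

```r
library(rms)   # loads Hmisc as well

set.seed(1)
n   <- 500
age <- rnorm(n, 50, 10)
sex <- factor(sample(c("female", "male"), n, replace = TRUE))

# Simulated truth (an assumption for illustration): the log odds are
# quadratic in age, so forcing a straight-line age effect would be wrong
logit <- -1 + 0.002 * (age - 50)^2 + 0.5 * (sex == "male")
y     <- rbinom(n, 1, plogis(logit))
d     <- data.frame(age, sex, y)

# rms uses a datadist object to store predictor ranges for plots and effects
dd <- datadist(d)
options(datadist = "dd")

# Binary logistic model; rcs(age, 4) is a restricted cubic spline with 4 knots.
# x = TRUE, y = TRUE keep the design matrix and response for resampling.
f <- lrm(y ~ rcs(age, 4) + sex, data = d, x = TRUE, y = TRUE)

anova(f)               # Wald tests, including a test of nonlinearity in age
validate(f, B = 200)   # bootstrap estimates of optimism in accuracy indexes
```

The number of knots and the number of bootstrap repetitions here are arbitrary choices for the sketch; in practice both would depend on the sample size and the question being asked.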
Prerequisites
Students must have mastered ordinary linear regression and have had an introduction to maximum likelihood estimation. Mastery of regular algebra is assumed, and students must have been introduced to linear algebra. Good working knowledge of R is required.
Learning Objectives
To become familiar with modern methods for fitting multivariable regression models:
- accurately
- in a way the sample size will allow, without overfitting
- uncovering complex non-linear or non-additive relationships
- testing for and quantifying the association between one or more predictors and the response, with possible adjustment for other factors
Students will be introduced to the bootstrap and will learn how to deal with missing data, how to validate models for predictive accuracy and detect overfitting, how to interpret fitted models using both parameter estimates and graphics, and how to critique the literature to determine when models are likely to be unreliable.
Reading Assignments
- Papers may be obtained below, along with a schedule of reading assignments
- Simulation study of logistic model validation methods
- Model uncertainty, penalization, and parsimony with examples using AIC to select penalties
Recommended Supplemental Reading
- Steyerberg EW. Clinical Prediction Models. New York: Springer; 2019, 2nd edition
Datasets
- From here, accessed by Hmisc::getHdata()
- Students are encouraged to find their own datasets for the final project
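As a minimal sketch of the access pattern in the list above, assuming the Hmisc package is installed and an internet connection is available. The dataset name titanic3 is used only for illustration and is an assumption about what the repository currently offers:

```r
library(Hmisc)

# Download a dataset from the Hmisc data repository into the workspace.
# 'titanic3' is an illustrative choice, not a dataset assigned in this syllabus.
getHdata(titanic3)

# Summarize every variable (counts, missing values, distribution summaries)
describe(titanic3)
```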
Communication
Communications Highly Specific to Course
For class announcements, logistics, private individual and group messaging, and questions about assignments or their solutions, use the bios7330 channel on vandystats.zulipchat.com.
Q&A and Discussions About Concepts and Methods
Use the RMS Discussions topic on datamethods.org to navigate to the appropriate discussion in datamethods.org.
For very general questions about statistics use stats.stackexchange.com. This is the world’s best statistics Q&A site and is the best place to ask questions that are not particular to the course. But tag questions related to course topics as regression-strategies.
Be sure to check existing topics before posting your message, to avoid creating unnecessary new topics that make it more difficult for others to navigate the discussion board.
Software
R and the rms and Hmisc packages, plus several other R packages to be listed here as the class progresses. Students are expected to turn in their assignments in html format created using R Markdown. Examples may be found here. See also this.
Class Format
The majority of the course is flipped so that class time can be devoted to clarifying concepts, methods, and strategies, and problem solving. Students must read assigned sections in the primary textbook or accompanying course handouts, and listen to audio narration and watch videos linked from the handouts when helpful, in advance of the class for which those topics are to be discussed. Study questions are provided before class and students should attempt to answer them by themselves or in small groups. These questions form the basis for in-class discussions.
Help Sessions
The instructor has an open office hour after each Thursday’s class. Other meetings can be scheduled as needed through Zulip.
Assignments and Grading
Assignments are due by 5pm on the date listed. Projects must be done independently unless marked as group assignments. Work turned in must be as concise as clarity will allow. Students should pay attention to interpreting results, not just obtaining them. knitr must be used (see above). Assignments must list those who actively participated. html files that include code should be sent to the teaching assistant via Zulip personal message.
For the final project you will do an in-depth analysis of a dataset you are interested in that contains many predictors of various types (at least one being continuous unless you receive special instructor permission) and has a binary, continuous, ordinal, or possibly a right-censored response variable. The dataset may not be one used in the course or any of the texts. The dataset should have a sufficient number of observations and the meaning of the data should be such that development of a predictive model makes sense. The analyses you perform on the dataset should use several of the methods we learned in the course. Extra weight is given to selection of appropriate methods when grading the project. The analysis must include at least one simulation studying the properties of one of the procedures used in developing the model.
Homework Assignments
Cumulative assignments are here. After the due date, solution sets will be distributed for approximately 2/3 of the assignments (including assignment #s 1 and 3). For other assignments, individuals or groups submitting the best solution in LaTeX/knitr/Markdown will receive extra credit and will have their solution (with attribution) added to the solution set for future students.
Assignments 2-3 and 8 are group assignments. Constitution of groups is shown at the top of the assignment. Group members are randomized separately for each group assignment.
Assignment 0 is a reproducible R report that you should run during the first week of class to make sure that you have all software properly installed.
Turn in your solutions by sending a direct message on Zulip to the instructor and TAs and attaching the html or pdf file.
Cumulative solutions to selected problems are here
Weights Used for Final Grade
- Individual projects (n=5): 3
- Group projects (n=4): 1
- Final project (n=1): 8
- Quizzes (n=6): 1/3
- Class participation: 5
All components are graded on a 0-1 scale before weighting, so group and individual projects get an effective weight of 15 vs. final project of 8.
Reading Assignments
Papers are here. See also this excellent resource on splines.
- By 2022-01-14: relaxLinear: smi79spl, gia14opt, col16qua
- By 2022-01-20: multivar: gra91eff
- By 2022-01-29: missingData: pen15mul, don06rev, hei06imp (skim), hip07reg (skim), jan10mis (skim), muchado
- By 2022-02-21: multivar: giu11spe, gre00whe, smi92pro, ril18min, ril18mina
- By 2022-03-01: datasetsCaseStudies: nic99reg, spa89dif
- By 2022-03-03: accuracy: alr07cas Figure 2, hua20tut (skim), kor91exp, lin89ass (skim), nag91not, hou90pre (skim)
- By 2022-03-05: validation: arc20min (skim), aus14gra, aus19int (skim), bun84boo (skim), efr14est (abstract), mil91val (skim), Molinaro (abstract), nom20con (abstract), Peek (abstract), plo14mod (abstract), ste01int (abstract)
- By 2022-03-05: mle: bus82lik, jen86jud (abstract and intro), zha21reg (abstract)
- modelUncertainty: bor15vie
- By 2022-04-12: Survival analysis: validation/aus20gra, cro16ass
Document Updates
Document | Last Revision | What |
---|---|---|
🆕 Syllabus | 2022-04-17 | |
Handouts | 2022-04-01 | rmsb basic logistic regression analysis |
R Scripts in Book | | |
🆕 Assignments | 2022-04-17 | Assignment 10 |
🆕 Solutions | 2022-04-21 | Assignment 9 |
Solution knitr source | | |
Study Questions | 2022-01-12 | |
🆕 Readings | 2022-04-09 | |
Schedule | 2022-01-12 | |
Final due date | | |
Bibliographic Databases
Useful Material From Courses at Other Universities
- Cosma Shalizi’s Undergraduate Advanced Data Analysis course
Other Links
- https://hbiostat.org
- TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): Explanation and Elaboration