MSCI 5015: Biostatistics II
February 2025


Key Persons          Name               Contact                      Zulip ID
Instructor           Frank Harrell      f.harrell@vumc.org           @Frank Harrell
Teaching Assistant   Heather Prigmore   heather.prigmore@vumc.org    @Heather

Important Items 

Course Handouts 

Course Format 

The Biostatistics II course is designed for students to do concentrated, intensive study before each class so that class time can be devoted to clarification, reviewing key concepts, answering student questions, and especially to problem solving. This design allows students to do the vast majority of “homework” assignments during class.

Pre-class: Intensive study of statistical methods and ideas 

In-class

Post-class

Texts 

Class Announcements & Discussion Board 

High-Level Overview 

Multivariable regression models are the fundamental tools used for prediction, effect estimation, and hypothesis testing. This course covers the most commonly used regression models plus general methods applicable to all regression models. There is an emphasis on aspects related to clinical and translational study design.

Motivation 

Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons. 

  1. Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
  2. A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
  3. Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
  4. A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
  5. Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. It is also important to adjust for prognostic factors in randomized studies to achieve optimum power and precision. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., a smaller change in survival probability).

To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.
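The remark in item 5 that low-risk patients must have less absolute benefit can be illustrated with a little arithmetic. The sketch below assumes, purely for illustration, that treatment multiplies every patient's odds of a bad outcome by the same odds ratio; the odds ratio of 0.5 and the baseline risks are made-up values, not figures from the course.

```python
def risk_with_treatment(baseline_risk, odds_ratio):
    """Risk under treatment, assuming treatment multiplies the odds by odds_ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Difference between untreated and treated risk for one patient."""
    return baseline_risk - risk_with_treatment(baseline_risk, odds_ratio)


if __name__ == "__main__":
    # Hypothetical baseline risks and a hypothetical constant odds ratio of 0.5.
    for p in (0.02, 0.10, 0.30, 0.50):
        arr = absolute_risk_reduction(p, odds_ratio=0.5)
        print(f"baseline risk {p:.2f}: absolute risk reduction {arr:.4f}")
```

With the same relative effect for everyone, the patient at 2% baseline risk gains about one percentage point while the patient at 50% gains about seventeen, which is why an accurate estimate of baseline risk matters when estimating individual benefit.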

Description 

Many types of regression models are increasingly being used in developing clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general. The course first covers the basics of multivariable regression models, starting with the ordinary multiple linear regression model (ordinary least squares). Early topics include interpretation of regression coefficients, coding of categorical predictors, meaning of linearity assumptions, estimating the relationships between two variables nonparametrically, and coding and interpretation of interaction terms. Popular models include logistic models for binary and ordinal responses, survival models, ordinal regression, and models for longitudinal data analysis, many of which are covered in this course.

All regression models make assumptions that must be verified if the models are to have power to test hypotheses and to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two, as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., continuous partial effect plots and odds ratio charts) and nomograms.

Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations.
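As a rough illustration of how piecewise polynomials relax the linearity assumption, the sketch below fits the simplest case, a piecewise linear function with a single knot, by ordinary least squares. It is only a sketch under stated assumptions: the course may well use smoother piecewise polynomials, and the helper names, the knot at 5, and the noise-free data are all made up for this example.

```python
def linear_spline_basis(x, knots):
    """Expand one predictor value into [x, (x - k1)+, (x - k2)+, ...]."""
    return [x] + [max(x - k, 0.0) for k in knots]


def solve(a, b):
    """Solve the linear system a * coef = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - s) / m[r][r]
    return coef


def fit_least_squares(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + row for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: the slope is 1.0 below x = 5 and -0.5 above it.
xs = [float(v) for v in range(11)]
ys = [2.0 + 1.0 * x - 1.5 * max(x - 5.0, 0.0) for x in xs]

coef = fit_least_squares([linear_spline_basis(x, [5.0]) for x in xs], ys)
slope_below_knot = coef[1]
slope_above_knot = coef[1] + coef[2]
print(f"slope below knot: {slope_below_knot:.3f}, above knot: {slope_above_knot:.3f}")
```

The model is still linear in its coefficients, so the same basis expansion can be plugged into any regression model; only the relationship between the predictor and the response is allowed to bend.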
Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.
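To make the idea of bootstrap validation concrete, here is a minimal sketch of one common approach, the optimism correction: refit the model on each bootstrap resample, compare its performance on the resample with its performance on the original data, and subtract the average difference from the apparent performance. The one-predictor linear model, R-squared as the performance index, the simulated data, and all function names are assumptions made for this illustration, not the course's own code.

```python
import random


def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(intercept, slope, x, y):
    """R-squared of a given line evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst


def optimism_corrected_r2(x, y, n_boot=200, seed=1):
    """Return (apparent R-squared, optimism-corrected R-squared)."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(*fit_line(x, y), x, y)
    optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx, by = [x[i] for i in idx], [y[i] for i in idx]
        b0, b1 = fit_line(bx, by)
        # Performance on the resample minus performance on the original data.
        optimism += r_squared(b0, b1, bx, by) - r_squared(b0, b1, x, y)
    return apparent, apparent - optimism / n_boot


# Simulated data (hypothetical): y depends weakly on x, with 30 observations.
data_rng = random.Random(0)
x = [data_rng.gauss(0.0, 1.0) for _ in range(30)]
y = [0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

apparent, corrected = optimism_corrected_r2(x, y)
print(f"apparent R^2: {apparent:.3f}, optimism-corrected R^2: {corrected:.3f}")
```

In a real analysis every data-driven modeling step (for example, variable selection) would be repeated inside the bootstrap loop, so that the estimated optimism reflects the whole model-building process.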

Additional Material for the Curious Student