MSCI 5015: Biostatistics II
February 2025

Key Persons	Name	Contact	Zulip ID
Instructor	Frank Harrell	`f.harrell@vumc.org`	`@Frank Harrell`
Teaching Assistant	Heather Prigmore	`heather.prigmore@vumc.org`	`@Heather`

Important Items

Detailed Course, Reading, and Assignment Schedule
About Grades & Assignments and Regular Homework Assignments
Whitlock & Schluter Datasets (ABD Datasets or Analysis of Biological Data Datasets)
Department collection of Datasets
Zulip discussion board for class
Study Questions
- BBR: at end of chapters of handouts
- RMS (to be moved to end of handout chapters)

Course Handouts

Syllabus
RMS Handouts (AKA Lecture Notes)
BBR Handouts
Clinical Prediction Model Development and Validation statistical analysis plan template
Key Course Concepts
Glossary
Introduction to Prediction Modelling Part I and Part II by Maarten van Smeden

Course Format

The Biostatistics II course is designed for students to do concentrated, intensive study before each class so that class time can be devoted to clarification, reviewing key concepts, answering student questions, and especially to problem solving. This design allows students to do the vast majority of “homework” assignments during class.

Pre-class: Intensive study of statistical methods and ideas

Read assigned sections of books and/or course notes, listen to the audio narrative, and watch the short movies demonstrating statistical methods that are linked from the notes
Read assigned supplemental articles

In-class:

Review key elements of the assigned material
Ample time for students' questions about the material and the concepts
Interactive demonstrations of the methods using datasets from ABD
In-class assignments using R

Post-class:

Write interpretations of selected analyses done during class
Take self-quizzes to gauge understanding of key concepts

Texts

(Required) The Analysis of Biological Data, 3rd Edition by MC Whitlock and Dolph Schluter | Supplemental Material, Data, and R Code from ABD
(Required) Harrell FE: Regression Modeling Strategies, 2nd edition, 2015 (available at the VU bookstore at 2525 West End Ave. and at Amazon)

Class Announcements & Discussion Board

Class announcements and homework assignments will appear on the course Zulip stream. It is the way to keep in touch with the class and, even more, to ask and answer questions. We hope that all students will use it to:
- ask or answer any question whatsoever related to group assignments
- ask or answer any logistical or purely technical questions related to individual work assignments
- ask or answer any questions about modeling or statistical computing concepts that are not directly related to a pending individual work assignment
- ask statistical or study design questions related to what's in the course notes
Please also take advantage of the general regression modeling strategies discussion board: stats.stackexchange
Use datamethods.org for questions and discussion about study design, measurement, clinical trials, epidemiology, machine learning, and medical applications of statistics

High-Level Overview

Multivariable regression models are the fundamental tools used for prediction, effect estimation, and hypothesis testing. This course covers the most commonly used regression models plus general methods applicable to all regression models. There is an emphasis on aspects related to clinical and translational study design.

Motivation

Accurate estimation of patient prognosis or of the probability of a disease or other outcomes is important for many reasons.

Prognostic estimates can be used to inform the patient about likely outcomes of her disease.
A physician can use estimates of diagnosis or prognosis as a guide for ordering additional tests and selecting appropriate therapies.
Outcome assessments are useful in the evaluation of technologies; for example, diagnostic estimates derived both with and without using the results of a given test can be compared to measure the incremental diagnostic information provided by that test over what is provided by prior information.
A researcher may want to estimate the effect of a single factor (e.g., treatment given) on outcomes in an observational study in which many uncontrolled confounding factors are also measured. Here the simultaneous effects of the uncontrolled variables must be controlled (held constant mathematically if using a regression model) so that the effect of the factor of interest can be more purely estimated. An analysis of how variables (especially continuous ones) affect the patient outcomes of interest is necessary to ascertain how to control their effects.
Predictive modeling is useful in designing randomized clinical trials. Both the decision concerning which patients to randomize and the design of the randomization process (e.g., stratified randomization using prognostic factors) are aided by the availability of accurate prognostic estimates before randomization. It is also important to adjust for prognostic factors in randomized studies to achieve optimum power and precision. Lastly, accurate prognostic models can be used to test for differential therapeutic benefit or to estimate the clinical benefit for an individual patient in a clinical trial, taking into account the fact that low-risk patients must have less absolute benefit (e.g., lower change in survival probability).

To accomplish these objectives, researchers must create multivariable models that accurately reflect the patterns existing in the underlying data and that are valid when applied to comparable data in other settings or institutions. Models may be inaccurate due to violation of assumptions, omission of important predictors, high frequency of missing data and/or improper imputation methods, and, especially with small datasets, overfitting.

Description

Regression models of many types are increasingly used to develop clinical prediction models for diagnosis, prognosis, and other applications in epidemiology, health services research, health economics, clinical trials, business, finance, and prediction in general. The course first covers the basics of multivariable regression, starting with the ordinary multiple linear regression model (ordinary least squares). Early topics include interpretation of regression coefficients, coding of categorical predictors, the meaning of linearity assumptions, estimating the relationship between two variables nonparametrically, and coding and interpretation of interaction terms. Popular models include logistic models for binary and ordinal responses, survival models, ordinal regression, and models for longitudinal data analysis, many of which are covered in this course.

All regression models have assumptions that must be verified for them to have power to test hypotheses and to be able to predict accurately. Of the principal assumptions (linearity, additivity, distributional), this course will emphasize methods for assessing and satisfying the first two, as these methods apply to all regression models. To deal with the linearity assumption, this course provides methods for estimating the shape of the relationship between predictors and response using the widely applicable method of piecewise polynomials. Emphasis will be given to interpreting fitted models using effect plots (e.g., continuous partial effect plots and odds ratio charts) and nomograms.

Even when assumptions are satisfied, overfitting can ruin a model’s predictive ability for future observations. Methods for data reduction will be introduced to deal with the common case where the number of potential predictors is large in comparison with the number of observations. Methods of model validation (bootstrap and cross-validation) will be introduced, as will auxiliary topics such as modeling interaction surfaces, dealing with missing data, variable selection, collinearity, and shrinkage. All methods covered will apply to almost any regression model. The course will include detailed case studies in developing, validating, and interpreting clinical prediction and epidemiologic models.
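
As a rough illustration of the piecewise-polynomial idea mentioned in the description, the sketch below fits a straight line and a linear spline (a piecewise linear function with one knot) to the same data by ordinary least squares and compares their residual sums of squares. It is not course material: class work is done in R, and this example uses plain Python only to stay self-contained. The data are synthetic and noise-free, the knot location `KNOT` is assumed known, and all function and variable names are hypothetical.

```python
# Illustrative sketch only: synthetic, noise-free data and an assumed knot location.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(design, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)

def sse(design, y, beta):
    """Residual sum of squares for coefficients beta."""
    return sum(
        (yi - sum(b * v for b, v in zip(beta, row))) ** 2
        for row, yi in zip(design, y)
    )

KNOT = 5.0                                   # assumed knot location
x = [i * 0.5 for i in range(21)]             # 0, 0.5, ..., 10
# True relationship: slope 0.5 below the knot, slope 0.5 - 1.5 = -1.0 above it.
y = [2 + 0.5 * xi - 1.5 * max(xi - KNOT, 0.0) for xi in x]

# Straight-line model: intercept + x
linear_design = [[1.0, xi] for xi in x]
# Linear spline model: intercept + x + (x - KNOT)+, where (u)+ = max(u, 0)
spline_design = [[1.0, xi, max(xi - KNOT, 0.0)] for xi in x]

beta_linear = ols(linear_design, y)
beta_spline = ols(spline_design, y)
sse_linear = sse(linear_design, y, beta_linear)
sse_spline = sse(spline_design, y, beta_spline)

print("straight line SSE:", sse_linear)
print("linear spline SSE:", sse_spline)
print("estimated change in slope at the knot:", beta_spline[2])
```

The extra column `(x - KNOT)+` lets the slope change at the knot, so the spline can follow a relationship that a single straight line cannot. With real, noisy data the knot locations would not be known in advance, and smoother bases with more than one knot would typically be preferred over this single-knot version.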

Additional Material for the Curious Student

Steyerberg EW. Clinical Prediction Models, 2nd ed. New York: Springer; 2019.
Cosma Shalizi's Undergraduate Advanced Data Analysis course
TRIPOD: Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis
