---
pagetitle: Biostat2
---

<!-- To create html: mdm2htmlctoc index -->

<style>
    #TOC {
    position: fixed;
    width: 15em;
    left: 0;
    top: 0;
    height: 100%;
    font-size: 85%;
    background-color: cornsilk;
    overflow-y: scroll;
    padding: 1em;
}
body {
    padding-left: 16em;
}
</style>

# MSCI 5015: Biostatistics II<br><small>February 2025</small>

---

Key Persons | Name | Contact | Zulip ID
----|----|------|------
Instructor  | [Frank Harrell](https://hbiostat.org/fh) | `f.harrell@vumc.org` | `@Frank Harrell`
Teaching Assistant | [Heather Prigmore](https://www.vumc.org/biostatistics/person/heather-l-prigmore) | `heather.prigmore@vumc.org` | `@Heather`

---

## Important Items 

-   [Detailed Course, Reading, and Assignment Schedule](schedule.html)
-   [About Grades & Assignments](assign.html) and [Regular Homework Assignments](hw.html)
-   [Whitlock & Schluter Datasets (ABD Datasets or Analysis of
    Biological Data Datasets)](datasets.html)
-   [Department collection of Datasets](https://hbiostat.org/data)
-   [Zulip discussion board](https://vandystats.zulipchat.com/) for class
-   Study Questions 
    -   BBR: at end of chapters of handouts
    -   [RMS](http://hbiostat.org/doc/rms/qstudy.html) (to be moved to
        end of handout chapters)

## Course Handouts 

-   [Syllabus](syllabus.html)
-   [RMS Handouts (AKA Lecture Notes)](https://hbiostat.org/rmsc)
-   [BBR Handouts](https://hbiostat.org/bbr)
-   [Clinical Prediction Model Development and
    Validation](https://fharrell.com/post/modplan) statistical analysis plan template
-   [Key Course Concepts](concepts.html)
-   [Glossary](http://hbiostat.org/glossary)
-   Introduction to Prediction Modelling [Part I](https://www.slideshare.net/MaartenvanSmeden/introduction-to-prediction-modelling-berlin-2018-part-i) and [Part II](https://www.slideshare.net/MaartenvanSmeden/introduction-to-prediction-modelling-berlin-2018-part-ii) by Maarten van Smeden

## Course Format 

The Biostatistics II course is designed for students to do concentrated,
intensive study before each class so that class time can be devoted to
clarification, reviewing key concepts, answering student questions, and
especially to problem solving. This design allows students to do the
vast majority of "homework" assignments during class.

**Pre-class**: Intensive study of statistical methods and ideas 

-   Read assigned sections of books and/or course notes, listen to the
    audio narrative, and watch the short movies demonstrating
    statistical methods that are linked from the notes
-   Read assigned supplemental articles

**In-class**: 

-   Review key elements of the assigned material
-   Ample time for students' questions about the material and the
    concepts
-   Interactive demonstrations of the methods using datasets from ABD
-   In-class assignments using Stata

**Post-class**: 

-   Write interpretations of selected analyses done during class
-   Take self-quizzes to gauge understanding of key concepts

## Texts 

-   (Required) [The Analysis of Biological
    Data](https://www.amazon.com/Analysis-Biological-Data-Michael-Whitlock/dp/131922623X),
    3rd Edition by MC Whitlock and Dolph Schluter \| [Supplemental
    Material, Data, and R Code from
    ABD](http://whitlockschluter.zoology.ubc.ca/)
-   (Required) Harrell FE: *Regression Modeling Strategies*, 2nd
    edition, 2015 (available at the VU bookstore at 2525 West End Ave.
    and at Amazon)

## Class Announcements & Discussion Board 

-   Class announcements and homework assignments will appear on
    the [course Zulip stream](https://vandystats.zulipchat.com/). It is
    the way to keep in touch with the class and, even more importantly,
    to ask and answer questions. We hope that all students will use it to:
    -   ask or answer any question whatsoever related to group
        assignments
    -   ask or answer any logistical or purely technical questions
        related to individual work assignments
    -   ask or answer any questions about modeling or statistical
        computing concepts that are not directly related to a pending
        individual work assignment
    -   ask statistical or study design questions related to what's in
        the course notes
-   Please also take advantage of the general regression modeling
    strategies discussion
    board: [stats.stackexchange](http://stats.stackexchange.com/questions/tagged/regression-strategies)
-   Use [datamethods.org](http://datamethods.org/) for questions and
    discussion about study design, measurement, clinical trials,
    epidemiology, machine learning, and medical applications of
    statistics

## High-Level Overview 

Multivariable regression models are the fundamental tools used for
prediction, effect estimation, and hypothesis testing. This course
covers the most commonly used regression models plus general methods
applicable to all regression models. There is an emphasis on aspects
related to clinical and translational study design.

## Motivation 

Accurate estimation of patient prognosis or of the probability of a
disease or other outcomes is important for many reasons. 

1.  Prognostic estimates can be used to inform the patient about likely
    outcomes of her disease.
2.  A physician can use estimates of diagnosis or prognosis as a guide
    for ordering additional tests and selecting appropriate therapies.
3.  Outcome assessments are useful in the evaluation of technologies;
    for example, diagnostic estimates derived both with and without
    using the results of a given test can be compared to measure the
    incremental diagnostic information provided by that test over what
    is provided by prior information.
4.  A researcher may want to estimate the effect of a single factor
    (e.g., treatment given) on outcomes in an observational study in
    which many uncontrolled confounding factors are also measured. Here
    the simultaneous effects of the uncontrolled variables must be
    controlled (held constant mathematically if using a regression
    model) so that the effect of the factor of interest can be more
    purely estimated. An analysis of how variables (especially
    continuous ones) affect the patient outcomes of interest is
    necessary to ascertain how to control their effects.
5.  Predictive modeling is useful in designing randomized clinical
    trials. Both the decision concerning which patients to randomize and
    the design of the randomization process (e.g., stratified
    randomization using prognostic factors) are aided by the
    availability of accurate prognostic estimates before randomization.
    It is also important to adjust for prognostic factors in randomized
    studies to achieve optimum power and precision. Lastly, accurate
    prognostic models can be used to test for differential therapeutic
    benefit or to estimate the clinical benefit for an individual
    patient in a clinical trial, taking into account the fact that
    low-risk patients must have less absolute benefit (e.g., lower
    change in survival probability).

To accomplish these objectives, researchers must create multivariable
models that accurately reflect the patterns existing in the underlying
data and that are valid when applied to comparable data in other
settings or institutions. Models may be inaccurate due to violation of
assumptions, omission of important predictors, high frequency of missing
data and/or improper imputation methods, and, especially with small
datasets, overfitting.

## Description 

Many types of regression models are increasingly being used in
developing clinical prediction models for diagnosis, prognosis, and
other applications in epidemiology, health services research, health
economics, clinical trials, business, finance, and prediction in
general. The course begins with the basics of multivariable regression
models, starting with the ordinary multiple linear regression model
(ordinary least squares). Early topics include interpretation of
regression coefficients, coding of categorical predictors, meaning of
linearity assumptions, estimating the relationship between two
variables nonparametrically, and coding and interpretation of
interaction terms. Popular models include logistic models for binary and
ordinal responses, survival models, ordinal regression, and models for
longitudinal data analysis, many of which are covered in this course.

All regression models have assumptions that must be verified for them to
have power to test hypotheses and to be able to predict accurately. Of
the principal assumptions (linearity, additivity, distributional), this
course will emphasize methods for assessing and satisfying the first
two, as these methods apply to all regression models. To deal with the
linearity assumption, this course provides methods for estimating the
shape of the relationship between predictors and response using the
widely applicable method of piecewise polynomials. Emphasis will be
given to interpreting fitted models using effect plots (e.g., continuous
partial effect plots and odds ratio charts) and nomograms.

Even when assumptions are satisfied, overfitting can ruin a model's
predictive ability for future observations. Methods for data reduction
will be introduced to deal with the common case where the number of
potential predictors is large in comparison with the number of
observations. Methods of model validation (bootstrap and
cross-validation) will be introduced, as will auxiliary topics such as
modeling interaction surfaces, dealing with missing data, variable
selection, collinearity, and shrinkage. All methods covered will apply
to almost any regression model. The course will include detailed case
studies in developing, validating, and interpreting clinical prediction
and epidemiologic models.

## Additional Material for the Curious Student 

-   Steyerberg EW. *Clinical Prediction Models*, 2nd ed. New York: Springer; 2019.
-   Cosma Shalizi's [Undergraduate Advanced Data
    Analysis](http://www.stat.cmu.edu/~cshalizi/uADA/13) course
-   [TRIPOD](http://annals.org/article.aspx?articleid=2088542):
    Transparent Reporting of a multivariable prediction model for
    Individual Prognosis Or Diagnosis (TRIPOD):