Introduction to Prediction Modelling, Part I and Part II, by Maarten van Smeden
Course Format
The Biostatistics II course is designed for students to do
concentrated, intensive study before each class so that class time can
be devoted to clarification, reviewing key concepts, answering student
questions, and especially to problem solving. This design allows
students to do the vast majority of “homework” assignments during
class.
Pre-class: Intensive study of statistical methods and ideas
Read assigned sections of books and/or course notes, listen to
audio narratives, and watch short movies demonstrating statistical
methods that are linked from the notes
Read assigned supplemental articles
In-class:
Review key elements of the assigned material
Ample time for students' questions about the material and the
concepts
Interactive demonstrations of the methods using datasets from
ABD
In-class assignments using Stata
Post-class:
Write interpretations of selected analyses done during class
Take self-quizzes to gauge understanding of key concepts
(Required) Harrell FE: Regression Modeling Strategies, 2nd
edition, 2015 (available at the VU bookstore at 2525 West End Ave. and
at Amazon)
Class Announcements & Discussion Board
Class announcements and homework assignments will appear on the course Zulip stream. It is
the main way to keep in touch with the class and, even more, to ask and answer
questions. We hope that all students will use it to:
ask or answer any question whatsoever related to group
assignments
ask or answer any logistical or purely technical questions related
to individual work assignments
ask or answer any questions about modeling or statistical computing
concepts that are not directly related to a pending individual work
assignment
Use the Zulip stream for statistical or study design questions
related to what's in the course notes
Please also take advantage of the general regression modeling
strategies discussion board: stats.stackexchange
Use datamethods.org for
questions and discussion about study design, measurement, clinical
trials, epidemiology, machine learning, and medical applications of
statistics
High-Level Overview
Multivariable regression models are the fundamental tools used for
prediction, effect estimation, and hypothesis testing. This course
covers the most commonly used regression models plus general methods
applicable to all regression models. There is an emphasis on aspects
related to clinical and translational study design.
Motivation
Accurate estimation of patient prognosis or of the probability of a
disease or other outcomes is important for many reasons.
Prognostic estimates can be used to inform the patient about likely
outcomes of her disease.
A physician can use estimates of diagnosis or prognosis as a guide
for ordering additional tests and selecting appropriate therapies.
Outcome assessments are useful in the evaluation of technologies;
for example, diagnostic estimates derived both with and without using
the results of a given test can be compared to measure the incremental
diagnostic information provided by that test over what is provided by
prior information.
A researcher may want to estimate the effect of a single factor
(e.g., treatment given) on outcomes in an observational study in which
many uncontrolled confounding factors are also measured. Here the
simultaneous effects of the uncontrolled variables must be controlled
(held constant mathematically if using a regression model) so that the
effect of the factor of interest can be more purely estimated. An
analysis of how variables (especially continuous ones) affect the
patient outcomes of interest is necessary to ascertain how to control
their effects.
Predictive modeling is useful in designing randomized clinical
trials. Both the decision concerning which patients to randomize and the
design of the randomization process (e.g., stratified randomization
using prognostic factors) are aided by the availability of accurate
prognostic estimates before randomization. It is also important to
adjust for prognostic factors in randomized studies to achieve optimum
power and precision. Lastly, accurate prognostic models can be used to
test for differential therapeutic benefit or to estimate the clinical
benefit for an individual patient in a clinical trial, taking into
account the fact that low-risk patients must have less absolute benefit
(e.g., lower change in survival probability).
To accomplish these objectives, researchers must create multivariable
models that accurately reflect the patterns existing in the underlying
data and that are valid when applied to comparable data in other
settings or institutions. Models may be inaccurate due to violation of
assumptions, omission of important predictors, high frequency of missing
data and/or improper imputation methods, and, especially with small
datasets, overfitting.
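The point above about low-risk patients having less absolute benefit can be illustrated with a small worked calculation. This is only a sketch under a stated assumption: the treatment multiplies every patient's odds of the outcome by the same odds ratio (0.5 here, a made-up value). The function names and baseline risks are hypothetical and chosen only for illustration.

```python
def risk_with_treatment(baseline_risk: float, odds_ratio: float) -> float:
    """Risk under treatment, assuming treatment multiplies the odds by odds_ratio."""
    odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk: float, odds_ratio: float) -> float:
    """Difference between untreated and treated risk for one patient."""
    return baseline_risk - risk_with_treatment(baseline_risk, odds_ratio)


if __name__ == "__main__":
    ODDS_RATIO = 0.5  # hypothetical constant relative effect
    for baseline in (0.02, 0.10, 0.30):  # hypothetical baseline risks
        arr = absolute_risk_reduction(baseline, ODDS_RATIO)
        print(f"baseline risk {baseline:.2f} -> absolute risk reduction {arr:.4f}")
```

With the same relative effect, the patient at 2% baseline risk gains far less in absolute terms than the patient at 30% baseline risk, which is why an accurate prognostic estimate is needed to estimate individual benefit.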
Description
Many types of regression models are increasingly being used in
developing clinical prediction models for diagnosis, prognosis, and
other applications in epidemiology, health services research, health
economics, clinical trials, business, finance, and prediction in
general. The course first introduces the basics of
multivariable regression models, starting with the
ordinary multiple linear regression model (ordinary least squares).
Early topics include interpretation of regression coefficients, coding
of categorical predictors, meaning of linearity assumptions, estimating
the relationships between two variables nonparametrically, and coding
and interpretation of interaction terms. Popular models include logistic
models for binary and ordinal responses, survival models, ordinal
regression, and models for longitudinal data analysis, many of which are
covered in this course. All regression models have assumptions that must
be verified for the models to have power to test hypotheses and to
predict accurately. Of the principal assumptions (linearity, additivity,
distributional), this course will emphasize methods for assessing and
satisfying the first two as these methods apply to all regression
models. To deal with the linearity assumption, this course provides
methods for estimating the shape of the relationship between predictors
and response using the widely applicable method of piecewise
polynomials. Emphasis will be given to interpreting fitted models using
effect plots (e.g., continuous partial effect plots and odds ratio
charts) and nomograms. Even when assumptions are satisfied, overfitting
can ruin a model’s predictive ability for future observations. Methods
for data reduction will be introduced to deal with the common case where
the number of potential predictors is large in comparison with the
number of observations. Methods of model validation (bootstrap and
cross-validation) will be introduced, as will auxiliary topics such as
modeling interaction surfaces, dealing with missing data, variable
selection, collinearity, and shrinkage. All methods covered will apply
to almost any regression model. The course will include detailed case
studies in developing, validating, and interpreting clinical prediction
and epidemiologic models.
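As a rough illustration of the piecewise polynomial idea mentioned above, the sketch below builds the simplest such basis, a piecewise linear spline, in which the slope of the predictor's effect is allowed to change at each knot. It is not the course's own implementation: the knot locations, coefficients, and function names are hypothetical, and in practice the coefficients would be estimated by fitting a regression model to the expanded predictor columns.

```python
from typing import List, Sequence


def linear_spline_basis(x: float, knots: Sequence[float]) -> List[float]:
    """Return [x, (x - k1)+, (x - k2)+, ...], where (u)+ means max(u, 0)."""
    return [x] + [max(x - k, 0.0) for k in knots]


def predict(x: float, intercept: float, coefs: Sequence[float],
            knots: Sequence[float]) -> float:
    """Linear predictor: intercept plus coefficients times the spline basis."""
    basis = linear_spline_basis(x, knots)
    return intercept + sum(b * z for b, z in zip(coefs, basis))


if __name__ == "__main__":
    knots = (40.0, 60.0)          # hypothetical knots for an age variable
    intercept = 1.0               # hypothetical intercept
    # Slope below the first knot, then the change in slope at each knot.
    coefs = (0.02, 0.05, -0.10)   # hypothetical coefficients
    for age in (30.0, 40.0, 50.0, 60.0, 70.0):
        print(age, round(predict(age, intercept, coefs, knots), 3))
```

Because each added column is zero below its knot, the fitted curve stays continuous while its slope changes from one interval to the next, and a coefficient near zero on a knot term suggests that no change in slope is needed there.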