1 Introduction

1.1 Hypothesis Testing, Estimation, and Prediction

flowchart LR
uses[Uses of models] --> test[Hypothesis testing]
uses --> estimat[Estimation]
uses --> pred[Prediction]
test --> ftest["Formal tests<br>Formal model<br>comparison<br>(e.g. AIC)"]
estimat --> festimat[Point and interval<br>estimation of one<br>predictor's effect]
pred --> fpred[Estimated outcome<br>or outcome<br>tendency for<br>a subject]

Even when only testing \(H_{0}\), a model-based approach has advantages:

Permutation and rank tests not as useful for estimation
Cannot readily be extended to cluster sampling or repeated measurements
Models generalize tests
- 2-sample \(t\)-test, ANOVA \(\rightarrow\)
  multiple linear regression
- paired \(t\)-test \(\rightarrow\)
  linear regression with fixed effects for subjects (block on subjects); linear mixed model with random per-subject intercepts
- Wilcoxon, Kruskal-Wallis, Spearman \(\rightarrow\)
  proportional odds (PO) ordinal logistic model
- Wilcoxon signed-rank test \(\rightarrow\)
  replace with rank-difference test \(\rightarrow\)
  PO model blocking on subject; ordinal mixed model
- log-rank \(\rightarrow\) Cox
Models not only allow for multiplicity adjustment but for shrinkage of estimates
- Statisticians comfortable with \(P\)-value adjustment but fail to recognize that the difference between the most different treatments is badly biased
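A minimal sketch of "models generalize tests", using made-up data (the numbers and function names are illustrative assumptions, not from the source): the pooled two-sample \(t\) statistic is reproduced exactly by the slope \(t\) statistic in a linear regression of the response on a 0/1 group indicator.

```python
import math

def pooled_t(a, b):
    """Classical two-sample t statistic (equal variances assumed)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    s2 = (ssa + ssb) / (na + nb - 2)              # pooled variance
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))

def ols_slope_t(x, y):
    """t statistic for the slope in the regression y = b0 + b1 * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(rss / (n - 2) / sxx)        # residual variance / Sxx
    return b1 / se_b1

# Hypothetical data for two groups
group_a = [4.1, 5.0, 3.8, 4.6, 5.2]
group_b = [5.9, 6.3, 5.1, 6.8, 5.5, 6.0]

x = [0] * len(group_a) + [1] * len(group_b)       # group indicator
y = group_a + group_b

t_test = pooled_t(group_a, group_b)
t_reg = ols_slope_t(x, y)
print(t_test, t_reg)                              # identical up to rounding error
```

Once the comparison is written as a regression, adding covariates, more groups, or subject effects is a matter of adding terms to the model rather than switching tests.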

Statistical estimation is usually model-based

Relative effect of increasing cholesterol from 200 to 250 mg/dl on hazard of death, holding other risk factors constant
Adjustment depends on how other risk factors relate to hazard
Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
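
A worked sketch of the cholesterol example, assuming (hypothetically; the source does not specify the model) a Cox model in which cholesterol enters linearly with coefficient \(\beta_{1}\) and does not interact with the other risk factors \(Z\):

\[
h(t \mid \text{chol}, Z) = h_{0}(t)\exp(\beta_{1}\,\text{chol} + Z\gamma)
\]

\[
\frac{h(t \mid \text{chol}=250, Z)}{h(t \mid \text{chol}=200, Z)} = \exp\{\beta_{1}(250 - 200)\} = \exp(50\,\beta_{1})
\]

Under these assumptions the hazard ratio is the same at every value of \(Z\), but the estimate of \(\beta_{1}\) still depends on which risk factors are in \(Z\) and on how they are modeled.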

1.2 Examples of Uses of Predictive Multivariable Modeling

Financial performance, consumer purchasing, loan pay-back
Ecology
Product life
Employment discrimination
Medicine, epidemiology, health services research
Probability of diagnosis, time course of a disease
Checking that a previously developed summary index (e.g., BMI) adequately summarizes its component variables
Developing new summary indexes by how variables predict an outcome
Comparing non-randomized treatments
Getting the correct estimate of relative effects in randomized studies requires covariable adjustment if model is nonlinear
- Crude odds ratios biased towards 1.0 if sample heterogeneous
Estimating absolute treatment effect (e.g., risk difference)
- Use e.g. difference in two predicted probabilities
Cost-effectiveness ratios
- incremental cost / incremental ABSOLUTE benefit
- most studies use avg. cost difference / avg. benefit, which may apply to no one
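
A small numeric sketch of why crude odds ratios are pulled toward 1.0 in a heterogeneous sample, and why absolute benefit varies by subject. All numbers are hypothetical: two equal-sized risk strata, randomized treatment, and the same conditional (adjusted) odds ratio of 3 in each stratum.

```python
def odds(p):
    return p / (1.0 - p)

def apply_odds_ratio(p0, odds_ratio):
    """Risk after multiplying the odds of p0 by odds_ratio."""
    o = odds(p0) * odds_ratio
    return o / (1.0 + o)

conditional_or = 3.0              # assumed identical within each stratum
baseline_risks = [0.10, 0.70]     # hypothetical low- and high-risk strata

p_control = baseline_risks
p_treated = [apply_odds_ratio(p, conditional_or) for p in baseline_risks]

# Crude (marginal) analysis: pool the strata, ignoring the risk factor
marg_control = sum(p_control) / len(p_control)
marg_treated = sum(p_treated) / len(p_treated)
crude_or = odds(marg_treated) / odds(marg_control)

# Absolute effects: difference in two predicted probabilities, per stratum
risk_differences = [pt - pc for pt, pc in zip(p_treated, p_control)]

print(round(crude_or, 3))                         # smaller than 3.0
print([round(d, 3) for d in risk_differences])    # differs by stratum
```

There is no confounding here, yet the crude odds ratio is well below the within-stratum value of 3. The stratum-specific risk differences also show why a cost-effectiveness ratio built on an average benefit may describe no actual subject.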

1.3 Misunderstandings about Prediction vs. Classification

Classification vs. Prediction

flowchart LR
goal[Goal] --> predest[Estimation or Prediction]
goal --> classif[Classification]
predest --> whatpre[Continuous output<br><br>Handles close<br>calls and<br>gray zones<br><br>Provides input to<br>decision maker]
classif --> whatclass[Categorical output<br><br>Hides close calls<br><br>Makes premature<br>decisions<br><br>Does not provide<br>sufficient input<br>to decision maker<br><br>Useful for quick<br>easy decisions or<br>when outcome<br>probabilities are<br>near 0 and 1]

Many analysts desire to develop “classifiers” instead of predictions
Outside of, for example, visual or sound pattern recognition, classification represents a premature decision
See this blog for details
Suppose that

response variable is binary
the two levels represent a sharp dichotomy with no gray zone (e.g., complete success vs. total failure with no possibility of a partial success)
one is forced to assign (classify) future observations to only these two choices
the cost of misclassification is the same for every future observation, and the ratio of the cost of a false positive to the cost of a false negative equals the (often hidden) ratio implied by the analyst’s classification rule

Then classification is still sub-optimal for driving the development of a predictive instrument as well as for hypothesis testing and estimation
Classification and its associated classification accuracy measure—the proportion classified “correctly”—are very sensitive to the relative frequencies of the outcome variable. If a classifier is applied to another dataset with a different outcome prevalence, the classifier may no longer apply.
Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities
- \(\uparrow\) power, \(\uparrow\) precision, \(\uparrow\) decision making
Classification is more problematic if response variable is ordinal or continuous or the groups are not truly distinct (e.g., disease or no disease when severity of disease is actually on a continuum); dichotomizing it up front for the analysis is not appropriate
- minimum loss of information (when dichotomization is at the median) is large
- may require the sample size to increase many-fold to compensate for loss of information (Fedorov et al., 2009)
Two-group classification represents artificial forced choice
- best option may be “no choice, get more data”
Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of false positive or false negative) functions
Hidden problems:
- Utility function depends on variables not collected (subjects’ preferences) that are available only at the decision point
- Assumes every subject has the same utility function
- Assumes this function coincides with the analyst’s
Formal decision analysis uses
- optimum predictions using all available data
- subject-specific utilities, which are often based on variables not predictive of the outcome
ROC analysis is misleading except for the special case of mass one-time group decision making with unknowable utilities¹
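
A minimal sketch of how the proportion classified "correctly" moves with outcome prevalence. The sensitivity and specificity are hypothetical and are assumed, for illustration only, to stay fixed across settings.

```python
def accuracy(prevalence, sensitivity, specificity):
    """Proportion classified correctly for a rule with the given error rates."""
    return prevalence * sensitivity + (1 - prevalence) * specificity

sens, spec = 0.90, 0.70           # hypothetical, assumed constant across settings

for prevalence in (0.50, 0.20, 0.05):
    rule = accuracy(prevalence, sens, spec)
    # A rule that ignores the data and always says "negative"
    always_negative = accuracy(prevalence, 0.0, 1.0)
    print(f"prevalence={prevalence:.2f}  rule={rule:.3f}  "
          f"always negative={always_negative:.3f}")
```

At low prevalence the rule that ignores all predictors scores higher on accuracy than the rule that uses them, which is one reason accuracy is a poor target for model development.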

¹ To make an optimal decision you need to know all relevant data about an individual (used to estimate the probability of an outcome), and the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated that the probability of a disease given age, sex, and symptoms is 0.1 and the “cost” of a false positive equaled the “cost” of a false negative, one would act as if the person does not have the disease. Given other utilities, one would make different decisions. If the utilities are unknown, one gives the best estimate of the probability of the outcome to the decision maker and lets them incorporate their own unspoken utilities in making the decision that is optimal for them.
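
The footnote's example can be sketched as an expected-cost comparison. The function names and the 20:1 cost ratio are illustrative assumptions; only the probability 0.1 with equal costs comes from the text.

```python
def optimal_action(p_disease, cost_fp, cost_fn):
    """Choose the action with the smaller expected cost."""
    # Acting as if diseased is wrong with probability 1 - p (a false positive)
    expected_cost_act = (1 - p_disease) * cost_fp
    # Acting as if not diseased is wrong with probability p (a false negative)
    expected_cost_wait = p_disease * cost_fn
    if expected_cost_act < expected_cost_wait:
        return "act as if diseased"
    return "act as if not diseased"

def probability_threshold(cost_fp, cost_fn):
    """Probability above which acting as if diseased has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)

# Equal costs, as in the footnote: P = 0.1 is below the 0.5 threshold
print(optimal_action(0.1, cost_fp=1.0, cost_fn=1.0))

# Hypothetical subject for whom a missed diagnosis is 20 times worse
print(optimal_action(0.1, cost_fp=1.0, cost_fn=20.0))
print(round(probability_threshold(1.0, 20.0), 4))
```

The same estimated probability leads to opposite actions under the two sets of costs, so the cutoff belongs to the decision maker and not to the model.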

Besides the fact that cutoffs do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity. For an individual we can compute \(\textrm{Prob}(Y=1 | X=x)\); we don’t care about \(\textrm{Prob}(Y=1 | X>c)\), and an individual having \(X=x\) would be quite puzzled if they were given \(\textrm{Prob}(X>c | \textrm{future unknown Y})\) when they already know \(X=x\), so \(X\) is no longer a random variable.

Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank order individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the \(k\) most likely buyers where \(k\) is chosen to meet total program cost constraints.
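
A minimal sketch of the mass-marketing example, assuming estimated purchase probabilities are already available from some model. The probabilities, budget, and per-contact cost are hypothetical.

```python
def target_top_k(probabilities, k):
    """Indices of the k subjects with the highest estimated probabilities."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return ranked[:k]

def lift_curve(probabilities):
    """Cumulative expected number of buyers when targeting the top 1, 2, ... subjects."""
    total, curve = 0.0, []
    for p in sorted(probabilities, reverse=True):
        total += p
        curve.append(total)
    return curve

p_buy = [0.02, 0.40, 0.15, 0.70, 0.05, 0.33]   # hypothetical estimated probabilities
budget, cost_per_contact = 30.0, 10.0          # hypothetical program constraints

k = int(budget // cost_per_contact)            # number of contacts the budget allows
chosen = target_top_k(p_buy, k)
expected_buyers = lift_curve(p_buy)[k - 1]
print(chosen, round(expected_buyers, 2))
```

Here \(k\) comes from the budget and not from a probability cutoff, and no sensitivity or specificity is computed at any point.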

See Vickers (2008), Briggs & Zaretzki (2008), Gail & Pfeiffer (2005), Bordley (2007), Fan & Levine (2007), Gneiting & Raftery (2007).

Accuracy score used to drive model building should be a continuous score that utilizes all of the information in the data.
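
A sketch contrasting the proportion classified correctly with two continuous scores, the Brier score and the logarithmic score (both proper scoring rules in the sense of Gneiting & Raftery, 2007). The outcomes and the two sets of predictions are made up for illustration.

```python
import math

def proportion_correct(p, y, cutoff=0.5):
    """Fraction of observations on the right side of the cutoff."""
    return sum((pi > cutoff) == bool(yi) for pi, yi in zip(p, y)) / len(y)

def brier_score(p, y):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

def log_score(p, y):
    """Mean negative log-likelihood of the outcomes (lower is better)."""
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for pi, yi in zip(p, y)) / len(y)

y = [1, 1, 0, 0, 1, 0]                           # hypothetical binary outcomes
sharp = [0.90, 0.80, 0.10, 0.20, 0.70, 0.30]     # informative predictions
vague = [0.55, 0.60, 0.45, 0.40, 0.51, 0.49]     # barely on the right side of 0.5

for name, p in (("sharp", sharp), ("vague", vague)):
    print(name, proportion_correct(p, y),
          round(brier_score(p, y), 4), round(log_score(p, y), 4))
```

Both sets of predictions are 100% "correct" at a 0.5 cutoff, so accuracy cannot tell them apart; the continuous scores separate them clearly.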

In summary:

Classification is a forced choice — a decision.
Decisions require knowledge of the cost or utility of making an incorrect decision.
Predictions are made without knowledge of utilities.
A prediction can lead to better decisions than classification. For example, suppose that one has an estimate of the risk of an event, \(\hat{P}\). One might make a decision if \(\hat{P} < 0.10\) or \(\hat{P} > 0.90\) in some situations, even without knowledge of utilities. If on the other hand \(\hat{P} = 0.6\) or the confidence interval for \(P\) is wide, one might
- make no decision and instead opt to collect more data
- make a tentative decision that is revisited later
- make a decision using other considerations such as the infusion of new resources that allow targeting a larger number of potential customers in a marketing campaign

The Dichotomizing Motorist

The speed limit is 60.
I am going faster than the speed limit.
Will I be caught?

An answer by a dichotomizer:

Are you going faster than 70?

An answer from a better dichotomizer:

If you are among other cars, are you going faster than 73?
If you are exposed, are you going faster than 67?

Better:

How fast are you going and are you exposed?

Analogy to most medical diagnosis research in which +/- diagnosis is a false dichotomy of an underlying disease severity:

The speed limit is moderately high.
I am going fairly fast.
Will I be caught?

1.4 Planning for Modeling

Chance that predictive model will be used (Reilly & Evans, 2006)
Response definition, follow-up
Variable definitions
Observer variability
Missing data
Preference for continuous variables
Subjects
Sites

What can keep a sample of data from being appropriate for modeling:

Most important predictor or response variables not collected
Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
Data collection sites do not represent the population of sites
Key variables missing in large numbers of subjects
Data not missing at random
No operational definitions for key variables and/or measurement errors severe
No observer variability studies done

What else can go wrong in modeling?

The process generating the data is not stable.
The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
The model is misspecified in terms of the transformation of the response variable or the model’s distributional assumptions.
The model contains discontinuities (e.g., by categorizing continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

Iezzoni (1994) lists these dimensions to capture, for patient outcome studies:

age
sex
acute clinical stability
principal diagnosis
severity of principal diagnosis
extent and severity of comorbidities
physical functional status
psychological, cognitive, and psychosocial functioning
cultural, ethnic, and socioeconomic attributes and behaviors
health status and quality of life
patient attitudes and preferences for outcomes

General aspects to capture in the predictors:

baseline measurement of response variable
current status
trajectory as of time zero, or past levels of a key variable
variables explaining much of the variation in the response
more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational study

Efthimiou et al. (2024) have excellent information for planning clinical prediction studies.

1.5 Choice of the Model

In biostatistics and epidemiology and most other areas we usually choose model empirically
Model must use data efficiently
Should model overall structure (e.g., acute vs. chronic)
Robust models are better
- The most general robust models not requiring machine learning-level sample sizes are semiparametric ordinal models
Should have correct mathematical structure (e.g., constraints on probabilities)
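
A small sketch of what "correct mathematical structure" means for probabilities. The coefficients are arbitrary illustrations, not fitted values: a model that is linear on the probability scale can return impossible values, while a model that is linear on the logit scale cannot.

```python
import math

def linear_probability(age, intercept=-0.5, slope=0.02):
    """Probability modeled as a straight line in age (no constraint)."""
    return intercept + slope * age

def logistic_probability(age, intercept=-5.0, slope=0.08):
    """Probability modeled through the inverse logit, so it stays in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * age)))

ages = [10, 40, 70, 90]
linear = [linear_probability(a) for a in ages]
logistic = [logistic_probability(a) for a in ages]

print([round(p, 2) for p in linear])      # first value is below 0, last is above 1
print([round(p, 3) for p in logistic])    # all strictly between 0 and 1
```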

1.6 Model uncertainty / Data-driven Model Specification

flowchart LR
ms[Model Selection] --> pre[Pre-specified] --> eps[Try to specify<br>a model flexible<br>enough to fit<br><br>Fit assumed<br>to be adequate<br><br>Need not be perfect<br>but as good as<br>any model not<br>requiring larger N] --> nomu[No model<br>uncertainty,<br>accurate statistical<br>inference]
ms --> bayes[Pre-specified<br>Bayesian model<br>with parameters<br>capturing departures<br>from simplicity] --> bac[No binary model<br>choices required] --> api[Accurate posterior<br>inference<br><br>Robust<br><br>Insights about<br>non-normality etc.]
ms --> cont[Contest between<br>desired and<br>more general model] --> pair[Check if more<br>general model is<br>better for the money] --> mmu[Better way to<br>check goodness<br>of fit<br><br>Minimal model<br>uncertainty]
ms --> emp[Empirical] --> gof[Goodness-of-fit<br>checking if<br>involves >2<br> pre-specified<br>models] --> dist
emp --> empus[May be highly<br>unstable if<br>entertain many<br>models or do<br>feature<br>selection] --> dist[Distorted statistical<br>inference]
ms --> ml[Machine learning] --> mluns[May be highly<br>unstable<br>unless N huge] --> noinf[No statistical inference]

Standard errors, C.L., \(P\)-values, \(R^2\) wrong if computed as if the model pre-specified
Stepwise variable selection is widely used and abused
Bootstrap can be used to repeat all analysis steps to properly penalize variances, etc.
Ye (1998): “generalized degrees of freedom” (GDF) for any “data mining” or model selection procedure based on least squares
- Example: 20 candidate predictors, \(n=22\), forward stepwise, best 5-variable model: GDF=14.1
- Example: CART, 10 candidate predictors, \(n=100\), 19 nodes: GDF=76
See Luo et al. (2006) for an approach involving adding noise to \(Y\) to improve variable selection
Another example: \(t\)-test to compare two means
- Basic test assumes equal variance and normal data distribution
- Typically examine the two sample distributions to decide whether to transform \(Y\) or switch to a different test
- Examine the two SDs to decide whether to use the standard test or switch to a Welch \(t\)-test
- Final confidence interval for mean difference is conditional on the final choices being correct
- Ignores model uncertainty
- Confidence interval will not have the claimed coverage
- Get proper coverage by adding parameters for what you don’t know
  - Bayesian \(t\)-test: parameters for variance ratio and for d.f. of a \(t\)-distribution for the raw data (allows heavy tails)
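
A simulation sketch of why summary statistics are wrong when computed as if a data-selected model had been pre-specified. It borrows the dimensions of the first example above (20 candidate predictors, \(n=22\)) but is not Ye's GDF calculation: every candidate is pure noise, the single best one is selected, and its \(R^2\) is reported naively. The seed and number of replications are arbitrary choices.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

def select_best_noise_predictor(n, n_candidates, rng):
    """Return (R^2 of the selected predictor, average R^2 over all candidates)."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    r2 = [r_squared([rng.gauss(0, 1) for _ in range(n)], y)
          for _ in range(n_candidates)]
    return max(r2), sum(r2) / len(r2)

rng = random.Random(1)            # fixed seed so the sketch is reproducible
reps = 200
results = [select_best_noise_predictor(22, 20, rng) for _ in range(reps)]

avg_selected = sum(best for best, _ in results) / reps
avg_typical = sum(mean for _, mean in results) / reps
print(round(avg_selected, 3), round(avg_typical, 3))
```

The true \(R^2\) of every candidate is zero, yet the selected predictor looks predictive on average. A bootstrap that repeats the selection step in every resample would expose this optimism; one that resamples only the final model would not.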

1.6.1 Model Uncertainty and Model Checking

As the Bayesian \(t\)-test exemplifies, there are advantages of a continuous approach to modeling instead of engaging in dichotomous goodness-of-fit (GOF) assessments. Some general comments:

In a frequentist setting, GOF checking can inflate type I assertion probability \(\alpha\) and make confidence intervals falsely narrow. In a Bayesian setting, posterior distributions and resulting uncertainty intervals can be too narrow.
Rather than accepting or not accepting a proposed model on the basis of a GOF assessment, embed the proposed model inside a more general model that relaxes the assumptions, and use AIC or a formal test to decide between the two. Comparing only two pre-specified models will result in minimal model uncertainty. It is often more useful to think of GOF as a contest between the proposed model and a more general model. If the more general model is the most general one that the effective sample size will support, it doesn’t do any good to worry about the adequacy of the more general model.
- More general model could include nonlinear terms and interactions
- It could also relax distributional assumptions, as done with the non-normality parameter in the Bayesian \(t\)-test
- Often the sample size is not large enough to allow model assumptions to be relaxed without overfitting; AIC assesses whether additional complexities are “good for the money”. If a more complex model results in worse predictions due to overfitting, it is doubtful that such a model should be used for inference.
Instead of focusing on model assumption checking, focus on the impact of making those assumptions, using for example comparison of adjusted \(R^2\) measures and bootstrap confidence intervals for differences in predicted values from two models.
In many situations you can use a semiparametric model that makes many fewer assumptions than a parametric model
See this for more in-depth discussion
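
A minimal sketch of GOF as a contest between a proposed model and one pre-specified, more general model. The example assumes the proposed model is a straight line and the general model adds a quadratic term, with Gaussian errors; the data, noise values, and function names are made up for illustration.

```python
import math

def solve(a, b):
    """Solve a linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

def gaussian_aic(x, y, degree):
    """AIC, up to an additive constant, of a least-squares polynomial fit."""
    n, k = len(x), degree + 1
    design = [[xi ** d for d in range(k)] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(design, y))
    return n * math.log(rss / n) + 2 * (k + 1)    # k coefficients plus sigma^2

x = list(range(12))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1, -0.2, 0.1]
y_straight = [2 + 0.5 * xi + e for xi, e in zip(x, noise)]
y_curved = [2 + 0.5 * xi + 0.3 * xi ** 2 + e for xi, e in zip(x, noise)]

for label, y in (("straight", y_straight), ("curved", y_curved)):
    aic_linear = gaussian_aic(x, y, degree=1)
    aic_quadratic = gaussian_aic(x, y, degree=2)
    winner = "linear" if aic_linear < aic_quadratic else "quadratic"
    print(label, round(aic_linear, 2), round(aic_quadratic, 2), winner)
```

Only two pre-specified models are compared, so little model uncertainty is introduced. The extra term is retained only when it is "good for the money".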

1.7 Study Questions

Can you estimate the effect of increasing age from 21 to 30 without a statistical model?
What is an example where machine learning users have used “classification” in the wrong sense?
When is classification (in the proper sense) an appropriate goal?
Why are so many decisions non-binary?
How do we normally choose statistical models—from subject matter theory or empirically?
What is model uncertainty?
An investigator feels that there are too many variables to analyze, so they use significance testing to select which variables to analyze further. What is wrong with that?

Bordley, R. (2007). Statistical decisionmaking without math. Chance, 20(3), 39–44.

Briggs, W. M., & Zaretzki, R. (2008). The skill plot: A graphical technique for evaluating continuous diagnostic tests (with discussion). Biometrics, 64, 250–261.

"statistics such as the AUC are not especially relevant to someone who must make a decision about a particular \(x_c\). ... ROC curves lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio receivers (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some \(x_c\), and is not especially interested in how well he would have done had he used some different cutoff."; in the discussion David Hand states "when integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassifications."; see Lin, Kvam, Lu Stat in Med 28:798-813;2009

Efthimiou, O., Seo, M., Chalkou, K., Debray, T., Egger, M., & Salanti, G. (2024). Developing clinical prediction models: A step-by-step guide. BMJ, e078276. https://doi.org/10.1136/bmj-2023-078276

Fan, J., & Levine, R. A. (2007). To amnio or not to amnio: That is the decision for Bayes. Chance, 20(3), 26–32.

Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of dichotomization. Pharm Stat, 8, 50–61. https://doi.org/10.1002/pst.331

optimal cutpoint depends on unknown parameters; should only entertain dichotomization when "estimating a value of the cumulative distribution and when the assumed model is very different from the true model"; nice graphics

Gail, M. H., & Pfeiffer, R. M. (2005). On criteria for evaluating models of absolute risk. Biostatistics, 6(2), 227–239.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc, 102, 359–378.

wonderful review article except missing references from Scandinavian and German medical decision making literature

Iezzoni, L. I. (1994). Dimensions of Risk. In L. I. Iezzoni (Ed.), Risk Adjustment for Measuring Health Outcomes (pp. 29–118). Foundation of the American College of Healthcare Executives.

dimensions of risk factors to include in models

Luo, X., Stefanski, L. A., & Boos, D. D. (2006). Tuning variable selection procedures by adding noise. Technometrics, 48, 165–175.

adding a known amount of noise to the response and studying σ² to tune the stopping rule to avoid overfitting or underfitting; simulation setup

Reilly, B. M., & Evans, A. T. (2006). Translating clinical research into clinical practice: Impact of using prediction rules to make decisions. Ann Int Med, 144, 201–209.

impact analysis; example of decision aid being ignored or overruled making MD decisions worse; assumed utilities are constant across subjects by concluding that directives have more impact than predictions; Goldman-Cook clinical prediction rule in AMI

Vickers, A. J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4), 314–320.

limitations of accuracy metrics; incorporating clinical consequences; nice example of calculation of expected outcome; drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis; use of a threshold on the probability of disease for taking some action; decision curve; has other good references to decision analysis

Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. J Am Stat Assoc, 93, 120–131.