1 Introduction

1.1 Hypothesis Testing, Estimation, and Prediction

flowchart LR
uses[Uses of models] --> test[Hypothesis testing]
uses --> estimat[Estimation]
uses --> pred[Prediction]
test --> ftest["Formal tests<br>Formal model<br>comparison<br>(e.g. AIC)"]
estimat --> festimat[Point and interval<br>estimation of one<br>predictor's effect]
pred --> fpred[Estimated outcome<br>or outcome<br>tendency for<br>a subject]

Even when only testing \(H_{0}\), a model-based approach has advantages:

  • Permutation and rank tests are not as useful for estimation
  • They cannot readily be extended to cluster sampling or repeated measurements
  • Models generalize tests
    • 2-sample \(t\)-test, ANOVA \(\rightarrow\)
      multiple linear regression
    • paired \(t\)-test \(\rightarrow\)
      linear regression with fixed effects for subjects (block on subjects); linear mixed model with random per-subject intercepts
    • Wilcoxon, Kruskal-Wallis, Spearman \(\rightarrow\)
      proportional odds (PO) ordinal logistic model
    • Wilcoxon signed-rank test \(\rightarrow\)
      replace with rank-difference test \(\rightarrow\)
      PO model blocking on subject; ordinal mixed model
    • log-rank \(\rightarrow\) Cox
  • Models allow not only for multiplicity adjustment but also for shrinkage of estimates
    • Statisticians are comfortable with \(P\)-value adjustment but fail to recognize that the estimated difference between the most different treatments is badly biased
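
To make the "models generalize tests" point concrete, the following minimal sketch checks numerically that the equal-variance 2-sample \(t\)-test is the same thing as a linear regression of the response on a 0/1 group indicator. The data values and the function names `pooled_t` and `regression_t` are made up for illustration; nothing here comes from a particular package.

```python
import math

def pooled_t(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((x - ma) ** 2 for x in a)
    ssb = sum((x - mb) ** 2 for x in b)
    s2 = (ssa + ssb) / (na + nb - 2)  # pooled variance estimate
    return (mb - ma) / math.sqrt(s2 * (1 / na + 1 / nb))

def regression_t(a, b):
    """t statistic for the slope when y is regressed on a 0/1 group indicator."""
    x = [0.0] * len(a) + [1.0] * len(b)
    y = list(a) + list(b)
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # equals the difference in group means
    intercept = ybar - slope * xbar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)    # standard error of the slope
    return slope / se

# Hypothetical data, for illustration only
group_a = [4.1, 5.0, 3.8, 4.6, 5.2]
group_b = [5.9, 6.3, 5.1, 6.8, 5.5, 6.1]

print(pooled_t(group_a, group_b), regression_t(group_a, group_b))
```

The two statistics agree up to floating-point error. The regression form is the one that extends directly to covariate adjustment and to more than two groups.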

Statistical estimation is usually model-based

  • Relative effect of increasing cholesterol from 200 to 250 mg/dl on hazard of death, holding other risk factors constant
  • Adjustment depends on how other risk factors relate to hazard
  • Usually interested in adjusted (partial) effects, not unadjusted (marginal or crude) effects
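
As a sketch of what the cholesterol example means in a model, assume a Cox proportional hazards model in which cholesterol enters linearly with coefficient \(\beta_{\text{chol}}\) and the other risk factors are \(X_{2}, \ldots, X_{p}\). Linearity and this notation are assumptions made here for illustration. Holding the other factors constant, the relative effect is the hazard ratio

\[
\frac{h(t \mid \text{chol}=250, X_{2}, \ldots, X_{p})}{h(t \mid \text{chol}=200, X_{2}, \ldots, X_{p})}
= \exp\{\beta_{\text{chol}}(250 - 200)\} = \exp(50\,\beta_{\text{chol}}).
\]

Here \(\beta_{\text{chol}}\) is a partial coefficient, so its estimate, and hence the adjusted effect, depends on which other risk factors are in the model and how they are modeled.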

1.2 Examples of Uses of Predictive Multivariable Modeling

  • Financial performance, consumer purchasing, loan pay-back
  • Ecology
  • Product life
  • Employment discrimination
  • Medicine, epidemiology, health services research
  • Probability of diagnosis, time course of a disease
  • Checking that a previously developed summary index (e.g., BMI) adequately summarizes its component variables
  • Developing new summary indexes by how variables predict an outcome
  • Comparing non-randomized treatments
  • Getting the correct estimate of relative effects in randomized studies requires covariable adjustment if model is nonlinear
    • Crude odds ratios biased towards 1.0 if sample heterogeneous
  • Estimating absolute treatment effect (e.g., risk difference)
    • Use e.g. difference in two predicted probabilities
  • Cost-effectiveness ratios
    • incremental cost / incremental ABSOLUTE benefit
    • most studies use avg. cost difference / avg. benefit, which may apply to no one
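
The bias of crude odds ratios towards 1.0 can be seen with simple arithmetic. The sketch below assumes a hypothetical randomized study with two equally sized risk strata, control risks of 0.1 and 0.8, and the same conditional odds ratio of 0.5 in each stratum. All numbers and function names are invented for illustration. It also computes the absolute effect as a difference between two probabilities.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

def prob_from_odds(o):
    """Convert odds back to a probability."""
    return o / (1 + o)

def treated_risk(control_risk, odds_ratio):
    """Risk under treatment implied by a control risk and an odds ratio."""
    return prob_from_odds(odds(control_risk) * odds_ratio)

CONDITIONAL_OR = 0.5        # hypothetical treatment odds ratio within every stratum
control_risks = [0.1, 0.8]  # hypothetical low-risk and high-risk strata of equal size

treated_risks = [treated_risk(p, CONDITIONAL_OR) for p in control_risks]

# Randomization gives both arms the same 50/50 mix of strata,
# so the marginal risk in each arm is a simple average.
marginal_control = sum(control_risks) / len(control_risks)
marginal_treated = sum(treated_risks) / len(treated_risks)
crude_or = odds(marginal_treated) / odds(marginal_control)

# Absolute effects differ by stratum even though the odds ratio is constant.
risk_differences = [t - c for t, c in zip(treated_risks, control_risks)]

print("crude OR:", round(crude_or, 3))
print("risk differences by stratum:", [round(d, 3) for d in risk_differences])
```

With these inputs the crude odds ratio is about 0.69 rather than 0.5, and the risk difference is about -0.05 in the low-risk stratum and -0.13 in the high-risk stratum, so a single average benefit would describe neither group.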

1.3 Misunderstandings about Prediction vs. Classification

flowchart LR
goal[Goal] --> predest[Estimation or Prediction]
goal --> classif[Classification]
predest --> whatpre[Continuous output<br><br>Handles close<br>calls and<br>gray zones<br><br>Provides input to<br>decision maker]
classif --> whatclass[Categorical output<br><br>Hides close calls<br><br>Makes premature<br>decisions<br><br>Does not provide<br>sufficient input<br>to decision maker<br><br>Useful for quick<br>easy decisions or<br>when outcome<br>probabilities are<br>near 0 and 1]

  • Many analysts desire to develop “classifiers” instead of predictions
  • Outside of, for example, visual or sound pattern recognition, classification represents a premature decision
  • See this blog for details
  • Suppose that
  1. response variable is binary
  2. the two levels represent a sharp dichotomy with no gray zone (e.g., complete success vs. total failure with no possibility of a partial success)
  3. one is forced to assign (classify) future observations to only these two choices
  4. the cost of misclassification is the same for every future observation, and the ratio of the cost of a false positive to the cost of a false negative equals the (often hidden) ratio implied by the analyst’s classification rule
  • Then classification is still sub-optimal for driving the development of a predictive instrument as well as for hypothesis testing and estimation
  • Classification and its associated classification accuracy measure—the proportion classified “correctly”—are very sensitive to the relative frequencies of the outcome variable. If a classifier is applied to another dataset with a different outcome prevalence, the classifier may no longer apply.
  • Far better is to use the full information in the data to develop a probability model, then develop classification rules on the basis of estimated probabilities
    • \(\uparrow\) power, \(\uparrow\) precision, \(\uparrow\) decision making
  • Classification is more problematic if the response variable is ordinal or continuous, or if the groups are not truly distinct (e.g., disease or no disease when severity of disease is on a continuum); dichotomizing the response up front for the analysis is not appropriate
    • minimum loss of information (when dichotomization is at the median) is large
    • may require the sample size to increase many-fold to compensate for loss of information (Fedorov et al., 2009)
  • Two-group classification represents artificial forced choice
    • best option may be “no choice, get more data”
  • Unlike prediction (e.g., of absolute risk), classification implicitly uses utility (loss; cost of false positive or false negative) functions
  • Hidden problems:
    • Utility function depends on variables not collected (subjects’ preferences) that are available only at the decision point
    • Assumes every subject has the same utility function
    • Assumes this function coincides with the analyst’s
  • Formal decision analysis uses
    • optimum predictions using all available data
    • subject-specific utilities, which are often based on variables not predictive of the outcome
  • ROC analysis is misleading except for the special case of mass one-time group decision making with unknowable utilities¹
  • ¹ To make an optimal decision you need to know all relevant data about an individual (used to estimate the probability of an outcome), and the utility (cost, loss function) of making each decision. Sensitivity and specificity do not provide this information. For example, if one estimated that the probability of a disease given age, sex, and symptoms is 0.1 and the “cost” of a false positive equaled the “cost” of a false negative, one would act as if the person does not have the disease. Given other utilities, one would make different decisions. If the utilities are unknown, one gives the best estimate of the probability of the outcome to the decision maker and lets her incorporate her own unspoken utilities in making an optimum decision for her.

    Besides the fact that cutoffs do not apply to individuals, only to groups, individual decision making does not utilize sensitivity and specificity. For an individual we can compute \(\textrm{Prob}(Y=1 | X=x)\); we don’t care about \(\textrm{Prob}(Y=1 | X>c)\), and an individual having \(X=x\) would be quite puzzled if she were given \(\textrm{Prob}(X>c | \textrm{future unknown Y})\) when she already knows \(X=x\) so \(X\) is no longer a random variable.

    Even when group decision making is needed, sensitivity and specificity can be bypassed. For mass marketing, for example, one can rank order individuals by the estimated probability of buying the product, to create a lift curve. This is then used to target the \(k\) most likely buyers where \(k\) is chosen to meet total program cost constraints.

  • See Vickers (2008), Briggs & Zaretzki (2008), Gail & Pfeiffer (2005), Bordley (2007), Fan & Levine (2007), Gneiting & Raftery (2007).
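
The footnote's example (probability of disease 0.1, equal costs) can be written as a small expected-loss calculation. This is a minimal sketch that assumes correct decisions cost nothing and that the two error costs are known. The function names and the 20:1 cost ratio in the second call are hypothetical.

```python
def expected_losses(p_disease, cost_false_pos, cost_false_neg):
    """Expected loss of each action; correct actions are assumed to cost 0."""
    return {
        "treat": (1 - p_disease) * cost_false_pos,   # wrong only if no disease
        "do not treat": p_disease * cost_false_neg,  # wrong only if disease
    }

def best_action(p_disease, cost_false_pos, cost_false_neg):
    """Action with the smaller expected loss for this individual."""
    losses = expected_losses(p_disease, cost_false_pos, cost_false_neg)
    return min(losses, key=losses.get)

def probability_threshold(cost_false_pos, cost_false_neg):
    """Risk above which treating has the smaller expected loss."""
    return cost_false_pos / (cost_false_pos + cost_false_neg)

# Equal costs, as in the footnote: act as if the person does not have the disease.
print(best_action(0.1, cost_false_pos=1, cost_false_neg=1))

# Hypothetical person for whom a missed diagnosis is 20 times worse.
print(best_action(0.1, cost_false_pos=1, cost_false_neg=20))
```

The same estimated probability leads to different actions once the costs change, which is why the probability, not a classification, is what should be handed to the decision maker.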

    The accuracy score used to drive model building should be a continuous score that uses all of the information in the data.
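
A small sketch of why a continuous score is preferable: the proportion classified "correctly" cannot tell timid predictions from sharp ones, whereas a proper scoring rule can. The Brier score is used here as one example of a continuous score; the choice of score, the 0.5 cutoff, and the four made-up observations are assumptions for illustration.

```python
def accuracy(probs, outcomes, cutoff=0.5):
    """Proportion classified 'correctly' after dichotomizing at the cutoff."""
    hits = sum((p > cutoff) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)

def brier(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

outcomes = [1, 1, 0, 0]                   # hypothetical binary outcomes
timid_probs = [0.55, 0.55, 0.45, 0.45]    # barely on the right side of 0.5
sharp_probs = [0.95, 0.90, 0.10, 0.05]    # confident and correct

for name, probs in [("timid", timid_probs), ("sharp", sharp_probs)]:
    print(name, "accuracy:", accuracy(probs, outcomes),
          "Brier:", round(brier(probs, outcomes), 5))
```

Both sets of predictions have the same classification accuracy, but the Brier score (lower is better) separates them clearly.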

    In summary:

    • Classification is a forced choice — a decision.
    • Decisions require knowledge of the cost or utility of making an incorrect decision.
    • Predictions are made without knowledge of utilities.
    • A prediction can lead to better decisions than classification. For example, suppose that one has an estimate of the risk of an event, \(\hat{P}\). One might make a decision if \(\hat{P} < 0.10\) or \(\hat{P} > 0.90\) in some situations, even without knowledge of utilities. If, on the other hand, \(\hat{P} = 0.6\) or the confidence interval for \(P\) is wide, one might
      • make no decision and instead opt to collect more data
      • make a tentative decision that is revisited later
      • make a decision using other considerations such as the infusion of new resources that allow targeting a larger number of potential customers in a marketing campaign
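
The last summary point can be sketched as a rule with a gray zone instead of a forced two-way choice. The 0.10 and 0.90 limits come from the example above; the function name `triage`, the action labels, and the `max_ci_width` limit of 0.20 are hypothetical choices made only for illustration.

```python
def triage(p_hat, ci_low, ci_high, lower=0.10, upper=0.90, max_ci_width=0.20):
    """Suggest an action from a risk estimate and its confidence interval."""
    if ci_high - ci_low > max_ci_width:
        # The estimate is too imprecise to act on (hypothetical width limit).
        return "collect more data"
    if p_hat < lower:
        return "act as if the event will not occur"
    if p_hat > upper:
        return "act as if the event will occur"
    # Gray zone: no forced choice.
    return "defer: collect more data or make a tentative decision"

print(triage(0.05, 0.02, 0.09))
print(triage(0.60, 0.55, 0.65))
print(triage(0.95, 0.50, 0.99))
```

A classifier would return one of two labels in all three cases; keeping the estimate and its uncertainty leaves the "no choice yet" options open.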

    The Dichotomizing Motorist

    • The speed limit is 60.
    • I am going faster than the speed limit.
    • Will I be caught?

    An answer by a dichotomizer:

    • Are you going faster than 70?

    An answer from a better dichotomizer:

    • If you are among other cars, are you going faster than 73?
    • If you are exposed, are you going faster than 67?


    An answer from a non-dichotomizer:

    • How fast are you going and are you exposed?

    Analogy to most medical diagnosis research in which +/- diagnosis is a false dichotomy of an underlying disease severity:

    • The speed limit is moderately high.
    • I am going fairly fast.
    • Will I be caught?

1.4 Planning for Modeling

    • Chance that predictive model will be used (Reilly & Evans, 2006)
    • Response definition, follow-up
    • Variable definitions
    • Observer variability
    • Missing data
    • Preference for continuous variables
    • Subjects
    • Sites

    What can keep a sample of data from being appropriate for modeling:

    1. Most important predictor or response variables not collected
    2. Subjects in the dataset are ill-defined or not representative of the population to which inferences are needed
    3. Data collection sites do not represent the population of sites
    4. Key variables missing in large numbers of subjects
    5. Data not missing at random
    6. No operational definitions for key variables and/or measurement errors severe
    7. No observer variability studies done

    What else can go wrong in modeling?

    1. The process generating the data is not stable.
    2. The model is misspecified with regard to nonlinearities or interactions, or there are predictors missing.
    3. The model is misspecified in terms of the transformation of the response variable or the model’s distributional assumptions.
    4. The model contains discontinuities (e.g., by categorizing continuous predictors or fitting regression shapes with sudden changes) that can be gamed by users.
    5. Correlations among subjects are not specified, or the correlation structure is misspecified, resulting in inefficient parameter estimates and overconfident inference.
    6. The model is overfitted, resulting in predictions that are too extreme or positive associations that are false.
    7. The user of the model relies on predictions obtained by extrapolating to combinations of predictor values well outside the range of the dataset used to develop the model.
    8. Accurate and discriminating predictions can lead to behavior changes that make future predictions inaccurate.

    Iezzoni (1994) lists these dimensions to capture, for patient outcome studies:

    1. age
    2. sex
    3. acute clinical stability
    4. principal diagnosis
    5. severity of principal diagnosis
    6. extent and severity of comorbidities
    7. physical functional status
    8. psychological, cognitive, and psychosocial functioning
    9. cultural, ethnic, and socioeconomic attributes and behaviors
    10. health status and quality of life
    11. patient attitudes and preferences for outcomes

    General aspects to capture in the predictors:

    1. baseline measurement of response variable
    2. current status
    3. trajectory as of time zero, or past levels of a key variable
    4. variables explaining much of the variation in the response
    5. more subtle predictors whose distributions strongly differ between levels of the key variable of interest in an observational study

1.5 Choice of the Model

    • In biostatistics, epidemiology, and most other areas we usually choose the model empirically
    • Model must use data efficiently
    • Should model overall structure (e.g., acute vs. chronic)
    • Robust models are better
    • Should have correct mathematical structure (e.g., constraints on probabilities)

1.6 Model Uncertainty / Data-Driven Model Specification

    flowchart LR
    ms[Model Selection] --> pre[Pre-specified] --> eps[Try to specify<br>a model flexible<br>enough to fit<br><br>Fit assumed<br>to be adequate<br><br>Need not be perfect<br>but as good as<br>any model not<br>requiring larger N] --> nomu[No model<br>uncertainty,<br>accurate statistical<br>inference]
    ms --> bayes[Pre-specified<br>Bayesian model<br>with parameters<br>capturing departures<br>from simplicity] --> bac[No binary model<br>choices required] --> api[Accurate posterior<br>inference<br><br>Robust<br><br>Insights about<br>non-normality etc.]
    ms --> cont[Contest between<br>desired and<br>more general model] --> pair[Check if more<br>general model is<br>better for the money] --> mmu[Better way to<br>check goodness<br>of fit<br><br>Minimal model<br>uncertainty]
    ms --> emp[Empirical] --> gof[Goodness-of-fit<br>checking if<br>involves >2<br> pre-specified<br>models] --> dist
    emp --> empus[May be highly<br>unstable if<br>entertain many<br>models or do<br>feature<br>selection] --> dist[Distorted statistical<br>inference]
    ms --> ml[Machine learning] --> mluns[May be highly<br>unstable<br>unless N huge] --> noinf[No statistical inference]

    • Standard errors, C.L., \(P\)-values, \(R^2\) wrong if computed as if the model were pre-specified

    • Stepwise variable selection is widely used and abused

    • Bootstrap can be used to repeat all analysis steps to properly penalize variances, etc.

    • Ye (1998): “generalized degrees of freedom” (GDF) for any “data mining” or model selection procedure based on least squares

      • Example: 20 candidate predictors, \(n=22\), forward stepwise, best 5-variable model: GDF=14.1
      • Example: CART, 10 candidate predictors, \(n=100\), 19 nodes: GDF=76
    • See Luo et al. (2006) for an approach involving adding noise to \(Y\) to improve variable selection

    • Another example: \(t\)-test to compare two means

      • Basic test assumes equal variance and normal data distribution

      • Typically examine the two sample distributions to decide whether to transform \(Y\) or switch to a different test

      • Examine the two SDs to decide whether to use the standard test or switch to a Welch \(t\)-test

      • Final confidence interval for mean difference is conditional on the final choices being correct

      • Ignores model uncertainty

      • Confidence interval will not have the claimed coverage

      • Get proper coverage by adding parameters for what you don’t know

        • Bayesian \(t\)-test: parameters for variance ratio and for d.f. of a \(t\)-distribution for the raw data (allows heavy tails)
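
To illustrate why statistics computed as if the model were pre-specified are wrong after data-driven selection, the following simulation sketch uses pure noise: 20 candidate predictors and \(n=22\), echoing the sizes in the first GDF example above. It does not compute Ye's GDF. It only compares the apparent strength of a pre-specified predictor with that of the predictor picked because it looked best. The function names, seed, and number of replications are arbitrary assumptions.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n_subjects=22, n_candidates=20, n_reps=500, seed=1):
    """Average |r| for a pre-specified noise predictor vs. the best-looking one."""
    rng = random.Random(seed)
    sum_first = sum_best = 0.0
    for _ in range(n_reps):
        y = [rng.gauss(0, 1) for _ in range(n_subjects)]
        abs_r = [
            abs(correlation([rng.gauss(0, 1) for _ in range(n_subjects)], y))
            for _ in range(n_candidates)
        ]
        sum_first += abs_r[0]    # predictor chosen before seeing the data
        sum_best += max(abs_r)   # predictor chosen because it looked best
    return sum_first / n_reps, sum_best / n_reps

prespecified, selected = simulate()
print("mean |r|, pre-specified predictor:", round(prespecified, 3))
print("mean |r|, selected predictor:     ", round(selected, 3))
```

Every true correlation here is zero, yet the selected predictor looks much more strongly associated with \(Y\) than a pre-specified one. Repeating the selection inside each bootstrap resample, as suggested above, is one way to let that extra variability show up in the final uncertainty estimates.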

1.6.1 Model Uncertainty and Model Checking

    As the Bayesian \(t\)-test exemplifies, a continuous approach to modeling has advantages over dichotomous goodness-of-fit (GOF) assessments. Some general comments:

    • In a frequentist setting, GOF checking can inflate type I assertion probability \(\alpha\) and make confidence intervals falsely narrow. In a Bayesian setting, posterior distributions and resulting uncertainty intervals can be too narrow.
    • Rather than accepting or not accepting a proposed model on the basis of a GOF assessment, embed the proposed model inside a more general model that relaxes the assumptions, and use AIC or a formal test to decide between the two. Comparing only two pre-specified models will result in minimal model uncertainty.
      • More general model could include nonlinear terms and interactions
      • It could also relax distributional assumptions, as done with the non-normality parameter in the Bayesian \(t\)-test
      • Often the sample size is not large enough to allow model assumptions to be relaxed without overfitting; AIC assesses whether additional complexities are “good for the money”. If a more complex model results in worse predictions due to overfitting, it is doubtful that such a model should be used for inference.
    • Instead of focusing on model assumption checking, focus on the impact of making those assumptions, using for example comparison of adjusted \(R^2\) measures and bootstrap confidence intervals for differences in predicted values from two models.
    • In many situations you can use a semiparametric model that makes many fewer assumptions than a parametric model
    • See this for more in-depth discussion
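
The "good for the money" comparison above can be written out, assuming the standard definition of AIC and two nested models fitted by maximum likelihood. The symbols \(p_{0}\), \(p_{1}\), \(\hat{L}_{0}\), and \(\hat{L}_{1}\) (numbers of parameters and maximized likelihoods of the proposed and more general models) are notation introduced here.

\[
\text{AIC} = -2 \log \hat{L} + 2p,
\qquad
\text{AIC}_{0} - \text{AIC}_{1} = \underbrace{2(\log \hat{L}_{1} - \log \hat{L}_{0})}_{\text{LR } \chi^{2}} - 2(p_{1} - p_{0}).
\]

So AIC favors the more general model only when the likelihood ratio \(\chi^{2}\) statistic exceeds twice the number of added parameters; otherwise the extra complexity is not paying for itself.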

1.7 Study Questions

    1. Can you estimate the effect of increasing age from 21 to 30 without a statistical model?
    2. What is an example where machine learning users have used “classification” in the wrong sense?
    3. When is classification (in the proper sense) an appropriate goal?
    4. Why are so many decisions non-binary?
    5. How do we normally choose statistical models—from subject matter theory or empirically?
    6. What is model uncertainty?
    7. An investigator feels that there are too many variables to analyze so she uses significance testing to select which variables to analyze further. What is wrong with that?