18 Information Loss

Information allergy is defined as (1) refusing to obtain key information needed to make a sound decision, or (2) ignoring important available information. The latter problem is epidemic in biomedical and epidemiologic research and in clinical practice. Examples include

ignoring some of the information in confounding variables that would explain away the effect of characteristics such as dietary habits
ignoring probabilities and “gray zones” in genomics and proteomics research, making arbitrary classifications of patients in such a way that leads to poor validation of gene and protein patterns
failure to grasp probabilistic diagnosis and patient-specific costs of incorrect decisions, thus making arbitrary diagnoses and placing the analyst in the role of the bedside decision maker
classifying patient risk factors and biomarkers into arbitrary “high/low” groups, ignoring the full spectrum of values
touting the prognostic value of a new biomarker, ignoring basic clinical information that may be even more predictive
using weak and somewhat arbitrary clinical staging systems resulting from a fear of continuous measurements
ignoring patient spectrum in estimating the benefit of a treatment

Examples of such problems will be discussed, concluding with an examination of how information-losing cardiac arrhythmia research may have contributed to the deaths of thousands of patients.

… wherever nature draws unclear boundaries, humans are happy to curate
— Alice Dreger, Galileo’s Middle Finger

18.1 Information & Decision Making

What is information?

Messages used as the basis for decision-making
Result of processing, manipulating and organizing data in a way that adds to the receiver’s knowledge
Meaning, knowledge, instruction, communication, representation, and mental stimulus¹

¹ pbs.org/weta, wikipedia.org/wiki/Information

Information resolves uncertainty.

Some types of information may be quantified in bits. A binary variable is represented by 0/1 in base 2, and it has 1 bit of information. This is the minimum amount of information other than no information. Systolic blood pressure measured accurately to the nearest 4 mmHg has 6 binary digits (bits) of information ($\log_{2}\frac{256}{4} = 6$). Dichotomizing blood pressure reduces its information content to 1 bit, resulting in enormous loss of precision and power.
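The bit counts can be checked with a line of arithmetic. The sketch below assumes a measurable range of 256 mmHg, as implied by the formula in the text; the helper name `bits` is introduced here only for illustration.

```r
# Information content, in bits, of a measurement that can take
# range / resolution equally spaced distinct values
bits <- function(range, resolution) log2(range / resolution)

bits_continuous <- bits(256, 4)   # systolic BP to the nearest 4 mmHg
bits_binary     <- log2(2)        # dichotomized BP: only two possible values

c(continuous = bits_continuous, dichotomized = bits_binary)
```

Dichotomizing keeps one of the six bits.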

Value of information: Judged by the variety of outcomes to which it leads.

Optimum decision making requires the maximum and most current information the decision maker is capable of handling

Some important decisions in biomedical and epidemiologic research and clinical practice:

Pathways, mechanisms of action
Best way to use gene and protein expressions to diagnose or treat
Which biomarkers are most predictive and how should they be summarized?
What is the best way to diagnose a disease or form a prognosis?
Is a risk factor causative or merely a reflection of confounding?
How should patient outcomes be measured?
Is a drug effective for an outcome?
Who should get a drug?

18.1.1 Information Allergy

Failing to obtain key information needed to make a sound decision

Not collecting important baseline data on subjects

Ignoring Available Information

Touting the value of a new biomarker that provides less information than basic clinical data
Ignoring confounders (alternate explanations)
Ignoring subject heterogeneity
Categorizing continuous variables or subject responses
Categorizing predictions as “right” or “wrong”
Letting fear of probabilities and costs/utilities lead an author to make decisions for individual patients

18.2 Ignoring Readily Measured Variables

Prognostic markers in acute myocardial infarction

$c$-index: concordance probability $\equiv$ receiver operating characteristic curve or ROC area
Measure of ability to discriminate death within 30d

Markers	$c$-index
CK–MB	0.63
Troponin T	0.69
Troponin T $> 0.1$	0.64
CK–MB + Troponin T	0.69
CK–MB + Troponin T + ECG	0.73
Age + sex	0.80
All	0.83

Ohman et al. (1996)

Though not discussed in the paper, age and sex easily trump troponin T. One can also see from the $c$-indexes that the common dichotomization of troponin results in an immediate loss of information.
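For readers who want to see what the $c$-index measures, here is a minimal base R sketch using the rank (Wilcoxon) formulation. The function name `c_index` and the four predictions are made up for illustration and are not data from Ohman et al.

```r
# Concordance probability for a binary outcome: the proportion of all
# (event, non-event) pairs in which the event has the higher prediction.
# Tied predictions count as 1/2, which midranks from rank() handle.
c_index <- function(pred, y) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  r  <- rank(pred)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

pred <- c(0.10, 0.40, 0.35, 0.80)  # hypothetical predicted risks
died <- c(0,    0,    1,    1)     # hypothetical 30-day deaths
c_index(pred, died)                # 3 of the 4 pairs are concordant
```

A value of 0.5 corresponds to random predictions and 1.0 to perfect discrimination.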

Inadequate adjustment for confounders: Greenland (2000)

Case-control study of diet, food constituents, breast cancer
140 cases, 222 controls
35 food constituent intakes and 5 confounders
Food intakes are correlated
Traditional stepwise analysis not adjusting simultaneously for all foods consumed $\rightarrow$ 11 foods had $P < 0.05$
Full model with all 35 foods competing $\rightarrow$ 2 had $P < 0.05$
Rigorous simultaneous analysis (hierarchical random slopes model) penalizing estimates for the number of associations examined $\rightarrow$ no foods associated with breast cancer

Ignoring subject variability in randomized experiments

Randomization tends to balance measured and unmeasured subject characteristics across treatment groups
Subjects vary widely within a treatment group
Subject heterogeneity usually ignored
False belief that balance from randomization makes this irrelevant
Alternative: analysis of covariance
If any of the baseline variables are predictive of the outcome, there is a gain in power for every type of outcome (binary, time-to-event, continuous, ordinal)
Example for a binary outcome in Section 13.2.2.1
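The power gain from covariate adjustment is easy to demonstrate by simulation. The following is a rough sketch for a continuous outcome only, under assumptions chosen here for illustration: 100 subjects with fixed 1:1 allocation, a treatment effect of 0.5, and a single baseline covariate that explains half of the outcome variance.

```r
set.seed(1)
nsim <- 500   # number of simulated trials
n    <- 100   # subjects per trial

reject <- replicate(nsim, {
  trt <- rep(0:1, each = n / 2)        # treatment indicator
  x   <- rnorm(n)                      # prognostic baseline covariate
  y   <- 0.5 * trt + x + rnorm(n)      # continuous outcome
  p_unadj <- summary(lm(y ~ trt))$coefficients['trt', 'Pr(>|t|)']
  p_adj   <- summary(lm(y ~ trt + x))$coefficients['trt', 'Pr(>|t|)']
  c(unadjusted = p_unadj < 0.05, adjusted = p_adj < 0.05)
})

power <- rowMeans(reject)   # proportion of trials rejecting at 0.05
power
```

Adjustment removes the outcome variation explained by `x` from the residual variance, so the same treatment effect is estimated with a smaller standard error.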

18.3 Categorization: Partial Use of Information

Patient: What was my systolic BP this time?
MD: It was $> 120$
Patient: How is my diabetes doing?
MD: Your Hb$_{\textrm A1c}$ was $> 6.5$
Patient: What about the prostate screen?
MD: If you have average prostate cancer, the chance that PSA $> 5$ in this report is $0.6$

Problem: Improper conditioning ($X > c$ instead of $X = x$) $\rightarrow$ information loss; reversing time flow
Sensitivity: $P(\mathrm{observed~} X > c \mathrm{~given~unobserved~} Y=y)$

18.3.1 Categorizing Continuous Predictors

Many physicians attempt to find cutpoints in continuous predictor variables
Mathematically such cutpoints cannot exist unless relationship with outcome is discontinuous
Even if the cutpoint existed, it must vary with other patient characteristics, as optimal decisions are based on risk
A simple 2-predictor example related to diagnosis of pneumonia will suffice
It is never appropriate to dichotomize an input variable other than time. Dichotomization, if it must be done, should only be done on $\hat{Y}$. In other words, dichotomization is done as late as possible in decision making. When more than one continuous predictor variable is relevant to outcomes, the example below shows that it is mathematically incorrect to do a one-time dichotomization of a predictor. As an analogy, suppose that one is using body mass index (BMI) by itself to make a decision. One would never categorize height and categorize weight to make the decision based on BMI. One could categorize BMI, if no other outcome predictors existed for the problem.

require(rms)
getHdata(ari)
r <- ari[ari$age >= 42, Cs(age, rr, pneu, coh, s2)]
abn.xray <- r$s2==0
r$coh <- factor(r$coh, 0:1, c('no cough','cough'))
f <- lrm(abn.xray ~ rcs(rr,4)*coh, data=r)
anova(f)

                Wald Statistics          Response: abn.xray 

 Factor                                   Chi-Square d.f. P     
 rr  (Factor+Higher Order Factors)        37.45      6    <.0001
  All Interactions                         0.35      3    0.9507
  Nonlinear (Factor+Higher Order Factors)  3.27      4    0.5144
 coh  (Factor+Higher Order Factors)       28.91      4    <.0001
  All Interactions                         0.35      3    0.9507
 rr * coh  (Factor+Higher Order Factors)   0.35      3    0.9507
  Nonlinear                                0.31      2    0.8549
  Nonlinear Interaction : f(A,B) vs. AB    0.31      2    0.8549
 TOTAL NONLINEAR                           3.27      4    0.5144
 TOTAL NONLINEAR + INTERACTION             3.37      5    0.6431
 TOTAL                                    66.06      7    <.0001

dd <- datadist(r); options(datadist='dd')
p <- Predict(f, rr, coh, fun=plogis, conf.int=FALSE)
ggplot(p, rdata=r,
       ylab='Probability of Pneumonia',
       xlab='Adjusted Respiratory Rate/min.',
       ylim=c(0,.7), legend.label='')

Figure 18.1: Estimated risk of pneumonia with respect to two predictors in WHO ARI study from Harrell et al. (1998). Tick marks show data density of respiratory rate stratified by cough. Any cutpoint for the rate **must** depend on cough to be consistent with optimum decision making, which must be risk-based.
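The dependence of any respiratory rate cutpoint on cough can be made explicit with a simplified model. As an assumption for illustration only, take the relationship to be additive and linear on the logit scale (the fitted model uses a restricted cubic spline with interaction, although the Wald tests give little evidence of nonlinearity or interaction), with $\beta_1 > 0$:

$$
\operatorname{logit} P(\text{abnormal x-ray} \mid \text{rr}, \text{cough}) = \beta_0 + \beta_1\,\text{rr} + \beta_2\,\text{cough}
$$

If action is taken when risk exceeds a chosen probability $p^{*}$, the corresponding threshold on respiratory rate is

$$
\text{rr} > \frac{\operatorname{logit}(p^{*}) - \beta_0 - \beta_2\,\text{cough}}{\beta_1}
$$

so the thresholds for children with and without cough differ by $\beta_2 / \beta_1$ breaths per minute. A single cutpoint on rr is consistent with risk-based decisions only if $\beta_2 = 0$.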

18.3.2 What Kinds of True Thresholds Exist?

Natura non facit saltus
(Nature does not make jumps)
— Gottfried Wilhelm Leibniz

Figure 18.2: Two kinds of thresholds. The pattern on the left represents a discontinuity in the first derivative (slope) of the function relating a marker to outcome. On the right there is a lowest-order discontinuity.

What Do Cutpoints Really Assume?
Cutpoints assume discontinuous relationships of the type in the right plot of Figure 18.2, and they assume that the true cutpoint is known. Beyond the molecular level, such patterns do not exist unless $X=$time and the discontinuity is caused by an event. Cutpoints assume homogeneity of outcome on either side of the cutpoint.

18.3.3 Cutpoints are Disasters

Prognostic relevance of S-phase fraction in breast cancer: 19 different cutpoints used in literature
Cathepsin-D content and disease-free survival in node-negative breast cancer: 12 studies, 12 cutpoints
ASCO guidelines: neither cathepsin-D nor S-phase fraction recommended as prognostic markers (Holländer et al. (2004))

Cutpoints may be found that result in both increasing and decreasing relationships with any dataset with zero correlation

Delay	Mean Score
0-11	210
11-20	215
21-30	217
31-40	218
41-	220

Wainer (2006); See “Dichotomania” (S. J. Senn (2005)) and Royston et al. (2006)

Wainer (2006)

In fact, virtually all published cutpoints are analysis artifacts caused by finding a threshold that minimizes $P$-values when comparing outcomes of subjects below with those above the “threshold”. Two-sample statistical tests suffer the least loss of power when cutting at the median because this balances the sample sizes. That this method has nothing to do with biology can be readily seen by adding observations on either tail of the marker, resulting in a shift of the median toward that tail even though the relationship between the continuous marker and the outcome remains unchanged.
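The dependence of a median-based "threshold" on who happens to be sampled can be seen in a few lines. This sketch uses a hypothetical log-normally distributed marker; the distributions and sample sizes are arbitrary choices for illustration.

```r
set.seed(2)
marker <- rlnorm(500, meanlog = 5, sdlog = 0.5)   # hypothetical marker values
median_original <- median(marker)

# Enrich the sample with subjects from the upper tail. Nothing about the
# marker-outcome relationship changes; only the sampling does.
extra <- rlnorm(300, meanlog = 6, sdlog = 0.3)
median_enriched <- median(c(marker, extra))

c(original = median_original, enriched = median_enriched)
```

A cutpoint that moves with the sampling scheme is describing the study sample, not the biology.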

knitr::include_graphics('images/gia14opt-fig2c.png')

Figure 18.3: Thresholds in cardiac biomarkers

In “positive” studies: threshold 132–800 ng/L, correlation with study median $r=0.86$ (Giannoni et al. (2014))

Lack of Meaning of Effects Based on Cutpoints

Researchers often use cutpoints to estimate the high:low effects of risk factors (e.g., BMI vs. asthma)
Results in inaccurate predictions, residual confounding, impossible to interpret
high:low represents unknown mixtures of highs and lows
Effects (e.g., odds ratios) will vary with population
If the true effect is monotonic, adding subjects in the low range or high range or both will increase odds ratios (and all other effect measures) arbitrarily

Royston et al. (2006), Naggara et al. (2011), Giannoni et al. (2014)

Does a physician ask the nurse “Is this patient’s bilirubin $>$ 45?” or does she ask “What is this patient’s bilirubin level?” Imagine how a decision support system would trigger vastly different decisions just because bilirubin was 46 instead of 44.

As an example of how a hazard ratio for a dichotomized continuous predictor is an arbitrary function of the entire distribution of the predictor within the two categories, consider a Cox model analysis of simulated age where the true effect of age is linear. First compute the $\geq 50:< 50$ hazard ratio in all subjects, then in just the subjects having age $< 60$, then in those with age $< 55$. Then repeat including all older subjects but excluding subjects with age $\leq 40$. Finally, compute the hazard ratio when only those age 40 to 60 are included. Simulated times to events have an exponential distribution, and proportional hazards holds.

require(survival)
set.seed(1)
n <- 1000
age <- rnorm(n, mean=50, sd=12)
# describe(age)
cens <- 15 * runif(n)
h  <- 0.02 * exp(0.04 * (age - 50))
dt <- -log(runif(n))/h
e  <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
S  <- Surv(dt, e)
d  <- data.frame(age, S)
# coef(cph(S ~ age))   # close to true value of 0.04 used in simulation
g <- function(sub=1 : n)
       exp(coef(cph(S ~ age >= 50, data=d, subset=sub)))
d <- data.frame(Sample=c('All', 'age < 60', 'age < 55', 'age > 40',
                         'age 40-60'),
                'Hazard Ratio'=c(g(), g(age < 60), g(age < 55),
                                 g(age > 40), g(age > 40 & age < 60)))
d

     Sample Hazard.Ratio
1       All     2.148554
2  age < 60     1.645141
3  age < 55     1.461928
4  age > 40     1.760201
5 age 40-60     1.354001

See this for excellent graphical examples of the harm of categorizing predictors, especially when using quantile groups.

18.3.4 Categorizing Outcomes

Arbitrary, low power, can be difficult to interpret
Example: “The treatment is called successful if either the patient has gone down from a baseline diastolic blood pressure of $\geq 95$ mmHg to $\leq 90$ mmHg or has achieved a 10% reduction in blood pressure from baseline.”
Senn derived the response probability function for this discontinuous concocted endpoint

S. J. Senn (2005) after Goetghebeur [1998]

Is a mean difference of 5.4 mmHg more difficult to interpret than A:17% vs. B:22% hit clinical target?

“Responder” analysis in clinical trials results in huge information loss and arbitrariness. Some issues:

Responder analyses use cutpoints on continuous or ordinal variables and cite earlier data supporting their choice of cutpoints. No example has been produced where the earlier data actually support the cutpoint.
Many responder analyses are based on change scores when they should be based solely on the follow-up outcome variable, adjusted for baseline as a covariate.
The cutpoints are always arbitrary.
There is a huge power loss (see Section 18.3.4).
The responder probability is often a function of variables that one does not want it to be a function of (see graph above).

Fedorov et al. (2009) is one of the best papers quantifying the information and power loss from categorizing continuous outcomes. One of their examples is that a clinical trial of 100 subjects with continuous $Y$ is statistically equivalent to a trial of 158 dichotomized observations, assuming that the dichotomization is at the optimum point (the population median). They show that it is very easy for dichotomization of $Y$ to raise the needed sample size by a factor of 5.
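The power loss can be reproduced approximately by simulation. The sketch below is not taken from Fedorov et al.; it assumes normal outcomes with unit variance, a true difference of 0.5 SD, 50 subjects per arm, and dichotomization at the pooled sample median (standing in for the population median).

```r
set.seed(3)
nsim <- 500   # number of simulated trials
n    <- 50    # subjects per treatment group

reject <- replicate(nsim, {
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = 0.5)
  p_t  <- t.test(a, b)$p.value              # analysis of continuous Y
  high <- c(a, b) > median(c(a, b))         # dichotomize at pooled median
  grp  <- rep(c('A', 'B'), each = n)
  p_d  <- chisq.test(table(grp, high), correct = FALSE)$p.value
  c(continuous = p_t < 0.05, dichotomized = p_d < 0.05)
})

power <- rowMeans(reject)   # proportion of trials rejecting at 0.05
power
```

Even with the most favorable cutpoint, the dichotomized analysis rejects noticeably less often than the $t$-test on the same data.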

knitr::include_graphics('images/fed09conFig1.png')

Figure 18.4: Power loss from dichotomizing the response variable

Fedorov et al. (2009)

18.3.5 Classification vs. Probabilistic Thinking

Number needed to treat. The only way, we are told, that physicians can understand probabilities: odds being a difficult concept only comprehensible to statisticians, bookies, punters and readers of the sports pages of popular newspapers.
— S. Senn (2008)

Many studies attempt to classify patients as diseased/normal
Given a reliable estimate of the probability of disease and the consequences of +/- one can make an optimal decision
Consequences are known at the point of care, not by the authors; categorization only at point of care
Continuous probabilities are self-contained, with their own “error rates”
Middle probs. allow for “gray zone”, deferred decision

Patient	Prob[disease]	Decision	Prob[error]
1	0.03	normal	0.03
2	0.40	normal	0.40
3	0.75	disease	0.25
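The table can be reproduced from the probabilities alone. The sketch assumes a decision threshold of 0.5, which is consistent with the decisions shown but would in practice depend on the costs of the two kinds of error for the individual patient.

```r
prob_disease <- c(0.03, 0.40, 0.75)

# Decide "disease" when the probability exceeds the (assumed) 0.5 threshold
decision <- ifelse(prob_disease > 0.5, 'disease', 'normal')

# The probability that the decision is wrong comes straight from the
# predicted probability; no separate error rate is needed
prob_error <- ifelse(decision == 'disease', 1 - prob_disease, prob_disease)

data.frame(patient = 1:3, prob_disease, decision, prob_error)
```

Patient 2 shows why a gray zone is useful: a forced decision carries a 0.40 chance of being wrong.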

Note that much of diagnostic research seems to be aimed at making optimum decisions for groups of patients. The optimum decision for a group (if such a concept even has meaning) is not optimum for individuals in the group.

18.3.6 Components of Optimal Decisions

Statistical models reduce the dimensionality of the problem but not to unity

18.4 Problems with Classification of Predictions

Feature selection / predictive model building requires choice of a scoring rule, e.g. correlation coefficient or proportion of correct classifications
Prop. classified correctly is a discontinuous improper scoring rule
- Maximized by bogus model
Minimum information
- low statistical power
- high standard errors of regression coefficients
- arbitrary to choice of cutoff on predicted risk
- forces binary decision, does not yield a “gray zone” $\rightarrow$ more data needed
Takes analyst to be provider of utility function and not the treating physician
Sensitivity and specificity are also improper scoring rules

See bit.ly/risk-thresholds: Three Myths About Risk Thresholds for Prediction Models by Wynants et al.

18.4.1 Example: Damage Caused by Improper Scoring Rule

Predicting probability of an event, e.g., Prob[disease]
$N=400$, 0.57 of subjects have disease
Classify as diseased if prob. $>0.5$

Model	$c$-index	$\chi^{2}$	Proportion correct
age	.592	10.5	.622
sex	.589	12.4	.588
age+sex	.639	22.8	.600
constant	.500	0.0	.573

Adjusted Odds Ratios:

age (IQR 58y:42y)	1.6 (0.95CL 1.2-2.0)
sex (f:m)	0.5 (0.95CL 0.3-0.7)

Test of sex effect adjusted for age $(22.8-10.5)$:
$P=0.0005$
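The $P$-value for sex can be verified from the tabulated statistics, assuming they are likelihood ratio $\chi^2$ values from nested models so that their difference has a $\chi^2$ distribution with 1 d.f.

```r
lr_age     <- 10.5                    # model with age only
lr_age_sex <- 22.8                    # model with age + sex
lr_sex     <- lr_age_sex - lr_age     # 1 d.f. test of sex adjusted for age

p_sex <- pchisq(lr_sex, df = 1, lower.tail = FALSE)
round(p_sex, 4)
```

Sex is strongly associated with disease after adjusting for age, yet adding it to the age model lowers the proportion classified correctly from 0.622 to 0.600. This is the sense in which proportion correct can reward the wrong model.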

Example where an improper accuracy score resulted in incorrect original analyses and incorrect re-analysis

Michiels et al. (2005) used an improper accuracy score (proportion classified “correctly”) and claimed there was really no signal in all the published gene microarray studies they could analyze. This is true from the standpoint of repeating the original analyses (which also used improper accuracy scores) using multiple splits of the data, exposing the fallacy of using single data-splitting for validation. Aliferis et al. (2009) used a semi-proper accuracy score ($c$-index) and they repeated 10-fold cross-validation 100 times instead of using highly volatile data splitting. They showed that the gene microarrays did indeed have predictive signals.²

² Aliferis et al. (2009) also used correct statistical models for time-to-event data that properly accounted for variable follow-up/censoring.

Michiels et al. (2005)	Aliferis et al. (2009)
% classified correctly	$c$-index
Single split-sample validation	Multiple repeats of 10-fold CV
Wrong tests (censoring, failure times)	Correct tests
5 of 7 published microarray studies had no signal	6 of 7 have signals

18.5 Value of Continuous Markers

Avoid arbitrary cutpoints
Better risk spectrum
Provides gray zone
Increases power/precision

18.5.1 Prognosis in Prostate Cancer

load('~/doc/Talks/infoAllergy/kattan.rda')
attach(kattan)
t   <- t.stg
gs  <- bx.glsn
psa <- preop.psa
t12 <- t.stg %in% Cs(T1C,T2A,T2B,T2C)

s <- score.binary(t12 & gs<=6 & psa<10,
                  t12 & gs<=6 & psa >=10 & psa < 20,
                  t12 & gs==7 & psa < 20,
                  (t12 & gs<=6 & psa>=20) |
                  (t12 & gs>=8 & psa<20),
                  t12 & gs>=7 & psa>=20,
                  t.stg=='T3')
levels(s) <- c('none','I', 'IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC')
u <- is.na(psa + gs) | is.na(t.stg)
s[s=='none'] <- NA
s <- s[drop=TRUE]
s3 <- s
levels(s3) <- c('I','II','II','III','III','III')
# table(s3)
units(time.event) <- 'month'
dd <- datadist(data.frame(psa, gs)); options(datadist='dd')
S <- Surv(time.event, event=='YES')
label(psa) <- 'PSA'; label(gs) <- 'Gleason Score'
f <- cph(S ~ rcs(sqrt(psa), 4), surv=TRUE, x=TRUE, y=TRUE)
p <- Predict(f, psa, time=24, fun=function(x) 1 - x)
h <- cph(S ~ s3, surv=TRUE)
z <- 1 - survest(h, times=24)$surv
ggplot(p, rdata=data.frame(psa), xlim=c(0,60),
       ylab='2-year Disease Recurrence Risk') +
  geom_hline(yintercept=unique(z), col='red', size=0.2)

Figure 18.5: Relationship between pre-op PSA level and 2-year recurrence risk. Horizontal lines represent the only prognoses provided by the new staging system. Data are courtesy of M Kattan from JNCI 98:715; 2006. Modification of AJCC staging by Roach *et al.* 2006.

Now examine the entire spectrum of estimated prognoses from models using continuous variables and from discontinuous staging systems.

d <- data.frame(S, psa, s3, s, gs, t.stg, time.event, event, u)
f <- cph(S ~ rcs(sqrt(psa),4) + pol(gs,2), surv=TRUE, data=d)
g <- function(form, lab) {
  f <- cph(form, surv=TRUE, data=subset(d, ! u))
  cat(lab,'\n'); print(coef(f))
  s <- f$stats
  cat('N:', s['Obs'],'\tL.R.:', round(s['Model L.R.'],1),
      '\td.f.:',s['d.f.'],'\n\n')
  prob24 <- 1 - survest(f, times=24)$surv
  prn(sum(!is.na(prob24)))
  p2 <<- c(p2, prob24[2])  # save est. prognosis for one subject
  p1936 <<- c(p1936, prob24[1936])
  C <- rcorr.cens(1-prob24, S[!u,])['C Index']
  data.frame(model=lab, chisq=s['Model L.R.'], d.f.=s['d.f.'],
             C=C, prognosis=prob24)
}
p2 <- p1936 <- NULL
w <-          g(S ~ t.stg, 'Old Stage')

Old Stage 
t.stg=T2A t.stg=T2B t.stg=T2C  t.stg=T3 
0.2791987 1.2377218 1.0626197 1.7681393 
N: 1978     L.R.: 70.5  d.f.: 4


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ s3, 'New Stage'))

New Stage 
   s3=II   s3=III 
1.225296 1.990355 
N: 1978     L.R.: 135.8     d.f.: 2


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ s, 'New Stage, 6 Levels'))

New Stage, 6 Levels 
   s=IIA    s=IIB   s=IIIA   s=IIIB   s=IIIC 
1.181824 1.248864 1.829265 2.410810 1.954420 
N: 1978     L.R.: 140.3     d.f.: 5


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ pol(gs,2),        'Gleason'))

Gleason 
         gs        gs^2 
-0.42563792  0.07857747 
N: 1978     L.R.: 90.3  d.f.: 2


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ rcs(sqrt(psa),4), 'PSA'))

PSA 
         psa         psa'        psa'' 
 -0.03605674   4.34054135 -14.63415302 
N: 1978     L.R.: 95.3  d.f.: 3


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ rcs(sqrt(psa),4) + pol(gs,2), 'PSA+Gleason'))

PSA+Gleason 
         psa         psa'        psa''           gs         gs^2 
 -0.06275304   3.57869793 -11.81685711  -0.20430862   0.05458591 
N: 1978     L.R.: 160.1     d.f.: 5


sum(!is.na(prob24))

[1] 1978

w <- rbind(w, g(S ~ rcs(sqrt(psa),4) + pol(gs,2) + t.stg,
                'PSA+Gleason+Old Stage'))

PSA+Gleason+Old Stage 
        psa        psa'       psa''          gs        gs^2   t.stg=T2A 
 0.16859186  2.36244764 -8.31695008 -0.01536731  0.03516561  0.27360400 
  t.stg=T2B   t.stg=T2C    t.stg=T3 
 0.93982804  0.69117036  1.07549638 
N: 1978     L.R.: 186.9     d.f.: 9


sum(!is.na(prob24))

[1] 1978

w$z <- paste(w$model, '\n',
             'X2-d.f.=',round(w$chisq-w$d.f.),
             '  C=', sprintf("%.2f", w$C), sep='')
w$z <- with(w, factor(z, unique(z)))
require(lattice)
stripplot(z ~ prognosis, data=w, lwd=1.5,
          panel=function(x, y, ...) {
            llines(p2, 1:7, col=gray(.6))
            ## llines(p1936, 1:7, col=gray(.8), lwd=2)
            ## panel.stripplot(x, y, ..., jitter.data=TRUE, cex=.5)
            for(iy in unique(unclass(y))) {
              s <- unclass(y)==iy
              histSpike(x[s], y=rep(iy,sum(s)), add=TRUE, grid=TRUE)
            }
            panel.abline(v=0, col=gray(.7))
          },
          xlab='Predicted 2-year\nDisease Recurrence Probability')

Figure 18.6: Prognostic spectrum from various models with model $\chi^2$ - d.f., and generalized $c$-index. The mostly vertical segmented line connects different prognostic estimates for the same man.

18.6 Harm from Ignoring Information

18.6.1 Case Study: Cardiac Anti-arrhythmic Drugs

Premature ventricular contractions were observed in patients surviving acute myocardial infarction
Frequent PVCs $\uparrow$ incidence of sudden death

Moore (1995), p. 46

Arrhythmia Suppression Hypothesis

Any prophylactic program against sudden death must involve the use of anti-arrhythmic drugs to subdue ventricular premature complexes.
— Bernard Lown
Widely accepted by 1978

Moore (1995), p. 49; Multicenter Postinfarction Research Group (1983)

Are PVCs independent risk factors for sudden cardiac death?

Researchers developed a 4-variable model for prognosis after acute MI

left ventricular ejection fraction (EF) $< 0.4$
PVCs $>$ 10/hr
Lung rales
Heart failure class II,III,IV

Multicenter Postinfarction Research Group (1983)

Dichotomania Caused Severe Problems

EF alone provides same prognostic spectrum as the researchers’ model
Did not adjust for EF!; PVCs $\uparrow$ when EF$<0.2$
Arrhythmias prognostic in isolation, not after adjustment for continuous EF and anatomic variables
Arrhythmias predicted by local contraction abnorm., then global function (EF)

Multicenter Postinfarction Research Group (1983); Califf et al. (1982)

18.6.2 CAST: Cardiac Arrhythmia Suppression Trial

Randomized placebo, moricizine, and Class IC anti-arrhythmic drugs flecainide and encainide
Cardiologists: unethical to randomize to placebo
Placebo group included after vigorous argument
Tests designed as one-tailed; did not entertain possibility of harm
Data and Safety Monitoring Board recommended early termination of flecainide and encainide arms
Deaths $\frac{56}{730}$ drug, $\frac{22}{725}$ placebo, RR 2.5

CAST Investigators (1989)

Conclusions: Class I Anti-Arrhythmics

Estimate of excess deaths from Class I anti-arrhythmic drugs: 24,000–69,000
Estimate of excess deaths from Vioxx: 27,000–55,000

Arrhythmia suppression hypothesis refuted; PVCs merely indicators of underlying, permanent damage

Moore (1995), pp. 289,49; D Graham, FDA

18.7 Case Study in Faulty Dichotomization of a Clinical Outcome: Statistical and Ethical Concerns in Clinical Trials for Crohn’s Disease

18.7.1 Background

Many clinical trials are underway for studying treatments for Crohn’s disease. The primary endpoint for these studies is a discontinuous, information-losing transformation of the Crohn’s Disease Activity Index (CDAI; Best et al. (1976)), which was developed in 1976 by using an exploratory stepwise regression method to predict four levels of clinicians’ impressions of patients’ current status³. The first level (“very well”) was assumed to indicate the patient was in remission. The model was overfitted and was not validated. The model’s coefficients were scaled and rounded, resulting in the following scoring system (see www.ibdjohn.com/cdai).

³ Ordinary least squares regression was used for the ordinal response variable. The levels of the response were assumed to be equally spaced in severity on a numerical scale of 1, 3, 5, 7 with no justification.

The original authors plotted the predicted scores against the four clinical categories as shown below.

The authors arbitrarily assigned a cutoff of 150, below which indicates “remission.”⁴ It can be seen that “remission” includes a good number of patients actually classified as “fair to good” or “poor.” A cutoff only exists when there is a break in the distribution of scores. As an example, data were simulated from a population in which every patient having a score below 100 had a probability of response of 0.2 and every patient having a score above 100 had a probability of response of 0.8. Histograms showing the distributions of non-responders (just above the $x$-axis) and responders (at the top of the graph) appear in the figure below. A flexibly fitted logistic regression model relating observed scores to actual response status is shown, along with 0.95 confidence intervals for the fit.

⁴ However, the authors intended for CDAI to be used on a continuum: “… a numerical index was needed, the numerical value of which would be proportional to degree of illness … it could be used as the principal measure of response to the therapy under trial … the CDAI appears to meet those needs. … The data presented … is an accurate numerical expression of the physician’s over-all assessment of degree of illness in a large group of patients … we believe that it should be useful to all physicians who treat Crohn’s disease as a method of assessing patient progress.”

require(rms)
set.seed(4)
n <- 900
X <- rnorm(n, 100, 20)
dd <- datadist(X); options(datadist='dd')

p <- ifelse(X < 100, .2, .8)
y <- ifelse(runif(n) <= p, 1, 0)

f <- lrm(y ~ rcs(X, c(90,95,100,105,110)))
hs <- function(yval, side)
  histSpikeg(yhat ~ X, data=subset(data.frame(X, y), y == yval),
             side = side, ylim = c(0, 1),
             frac = function(f) .03 * f / max(f))
ggplot(Predict(f, fun=plogis), ylab='Probability of Response') +
  hs(0, 1) + hs(1, 3) + geom_vline(xintercept=100, col=gray(.7))

One can see that the fitted curves justify the use of a cut-point of 100. However, the original scores from the development of CDAI do not justify the existence of a cutoff. The fitted logistic model used to relate “very well” to the other three categories is shown below.

# Points from published graph were defined in code not printed
g <- trunc(d$x)
g <- factor(g, 0:3, c('very well', 'fair to good', 'poor', 'very poor'))
remiss <- 1 * (g == 'very well')
CDAI <- d$y
label(CDAI) <- "Crohn's Disease Activity Index"
label(remiss) <- 'Remission'
dd <- datadist(CDAI,remiss); options(datadist='dd')
f <- lrm(remiss ~ rcs(CDAI,4))
ggplot(Predict(f, fun=plogis), ylab='Probability of Remission')

It is readily seen that no cutoff exists, and one would have to be below CDAI of 100 for the probability of remission to fall below even 0.5. The probability does not exceed 0.9 until the score falls below 25. Thus there is no clinical justification for the 150 cut-point.

18.7.2 Loss of Information from Using Cut-points

The statistical analysis plans in the Crohn’s disease protocols specify that efficacy will be judged by comparing two proportions after classifying patients’ CDAIs as above or below the cutoff of 150. Even if one could justify a certain cutoff from the data, the use of the cutoff is usually not warranted. This is because of the huge loss of statistical efficiency, precision, and power from dichotomizing continuous variables as discussed in more detail in Section 18.3.4. If one were forced to dichotomize a continuous response $Y$, the cut-point that loses the least efficiency is the population median of $Y$ combining treatment groups. That implies a statistical efficiency of $\frac{2}{\pi}$ or 0.637 when compared to the efficient two-sample $t$-test if the data are normally distributed⁵. In other words, the optimum cut-point would require studying 158 patients after dichotomizing the response variable to get the same power as analyzing the continuous response variable in 100 patients.

⁵ Note that the efficiency of the Wilcoxon test compared to the $t$-test is $\frac{3}{\pi}$ and the efficiency of the sign test compared to the $t$-test is $\frac{2}{\pi}$. Had analysis of covariance been used instead of a simple two-group comparison, the baseline level of CDAI could have been adjusted for as a covariate. This would have increased the power of the continuous scale approach to even higher levels.
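The sample size figures follow directly from the relative efficiencies. In this sketch the variable names are ours; the required sample size is the continuous-analysis sample size divided by the efficiency, rounded up.

```r
eff_dichot   <- 2 / pi   # median dichotomization (sign test) vs. t-test
eff_wilcoxon <- 3 / pi   # Wilcoxon test vs. t-test, normal data

n_continuous <- 100
n_dichot   <- ceiling(n_continuous / eff_dichot)     # patients needed after dichotomizing
n_wilcoxon <- ceiling(n_continuous / eff_wilcoxon)   # patients needed for Wilcoxon

c(efficiency = round(eff_dichot, 3), n_dichot = n_dichot, n_wilcoxon = n_wilcoxon)
```

Switching to a rank test costs about 5 extra patients per 100, whereas dichotomizing at the best possible point costs 58.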

18.7.3 Ethical Concerns and Summary

The CDAI was based on a sloppily-fit regression model predicting a subjective clinical impression. Then a cutoff of 150 was used to classify patients as in remission or not. The choice of this cutoff is in opposition to the data used to support it. The data show that one must have CDAI below 100 to have a chance of remission of only 0.5. Hence the use of CDAI$<150$ as a clinical endpoint was based on a faulty premise that apparently has never been investigated in the Crohn’s disease research community. CDAI can easily be analyzed as a continuous variable, preserving all of the power of the statistical test for efficacy (e.g., two-sample $t$-test). The results of the $t$-test can readily be translated to estimate any clinical “success probability” of interest, using efficient maximum likelihood estimators.⁶

⁶ Given $\bar{x}$ and $s$ as estimates of $\mu$ and $\sigma$, the estimate of the probability that CDAI $< 150$ is simply $\Phi(\frac{150-\bar{x}}{s})$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. For example, if the observed mean were 150, we would estimate the probability of remission to be 0.5.
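The calculation in this footnote is a single call to `pnorm`. In the sketch below the function name `prob_below` and the standard deviation of 60 are hypothetical values chosen only to show the arithmetic, and normality of CDAI is assumed as in the footnote.

```r
# Estimated probability that CDAI falls below a cutoff, given the sample
# mean and standard deviation and assuming a normal distribution
prob_below <- function(xbar, s, cutoff = 150) pnorm((cutoff - xbar) / s)

p_at_mean  <- prob_below(xbar = 150, s = 60)   # mean equal to the cutoff
p_high     <- prob_below(xbar = 200, s = 60)   # hypothetical higher mean

c(mean_150 = p_at_mean, mean_200 = p_high)
```

Because the estimate uses the full continuous data through $\bar{x}$ and $s$, it is more precise than the observed proportion of patients below 150.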

There are substantial ethical questions that ought to be addressed when statistical power is wasted:

Patients are not consenting to be put at risk for a trial that doesn’t yield valid results.
A rigorous scientific approach is necessary in order to allow enrollment of individuals as subjects in research.
Investigators are obligated to reduce the number of subjects exposed to harm and the amount of harm to which each subject is exposed. It is not known whether investigators are receiving per-patient payments for studies in which sample size is inflated by dichotomizing CDAI.

18.8 Information May Sometimes Be Costly

When the Missionaries arrived, the Africans had the Land and the Missionaries had the Bible. They taught us how to pray with our eyes closed. When we opened them, they had the land and we had the Bible.
— Jomo Kenyatta, founding father of Kenya; also attributed to Desmond Tutu

Information itself has a liberal bias.
— The Colbert Report 2006-11-28

18.9 Other Reading

Bordley (2007), Briggs & Zaretzki (2008), Vickers (2008)

Notes

This material is from “Information Allergy” by FE Harrell, presented as the Vanderbilt Discovery Lecture 2007-09-13 and presented as invited talks at Erasmus University, Rotterdam, The Netherlands, University of Glasgow (Mitchell Lecture), Ohio State University, Medical College of Wisconsin, Moffitt Cancer Center, U. Pennsylvania, Washington U., NIEHS, Duke, Harvard, NYU, Michigan, Abbott Labs, Becton Dickinson, NIAID, Albert Einstein, Mayo Clinic, U. Washington, MBSW, U. Miami, Novartis, VCU, FDA. Material is added from “How to Do Bad Biomarker Research” by FE Harrell, presented at the NIH NIDDK Conference Towards Building Better Biomarkers—Statistical Methodology, 2014-12-02.

Aliferis, C. F., Statnikov, A., Tsamardinos, I., Schildcrout, J. S., Shepherd, B. E., & Harrell, F. E. (2009). Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS ONE, 4(3).

refutation of mic05pre

Best, W. R., Becktel, J. M., Singleton, J. W., & Kern, F. (1976). Development of a Crohn’s disease activity index. Gastroent, 70, 439–444.

development of CDAI

Bordley, R. (2007). Statistical decisionmaking without math. Chance, 20(3), 39–44.

Briggs, W. M., & Zaretzki, R. (2008). The skill plot: A graphical technique for evaluating continuous diagnostic tests (with discussion). Biometrics, 64, 250–261.

"statistics such as the AUC are not especially relevant to someone who must make a decision about a particular $x_c$. ... ROC curves lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio receivers (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some $x_c$, and is not especially interested in how well he would have done had he used some different cutoff."; in the discussion David Hand states "when integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassifications."; see Lin, Kvam, Lu Stat in Med 28:798-813;2009

Califf, R. M., McKinnis, R. A., Burks, J., Lee, K. L., Harrell, F. E., Behar, V. S., Pryor, D. B., Wagner, G. S., & Rosati, R. A. (1982). Prognostic implications of ventricular arrhythmias during 24 hour ambulatory monitoring in patients undergoing cardiac catheterization for coronary artery disease. Am J Card, 50, 23–31.

Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of dichotomization. Pharm Stat, 8, 50–61. https://doi.org/10.1002/pst.331

optimal cutpoint depends on unknown parameters;should only entertain dichotomization when "estimating a value of the cumulative distribution and when the assumed model is very different from the true model";nice graphics

Giannoni, A., Baruah, R., Leong, T., Rehman, M. B., Pastormerlo, L. E., Harrell, F. E., Coats, A. J., & Francis, D. P. (2014). Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. PLoS ONE, 9(1). https://doi.org/10.1371/journal.pone.0081699

Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56, 915–921. https://doi.org/10.1111/j.0006-341X.2000.00915.x

use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important confounders uncontrolled;composition matrix;example with far too many significant predictors with many regression coefficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage instead of variable selection;larger effect on confidence interval width than on point estimates with variable selection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of variance is pointless;instead the analysis should be repeated using different values;"if one feels compelled to estimate $\tau^{2}$, I would recommend giving it a proper prior concentrated around contextually reasonable values";claim about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained;shrinkage towards compositional model;"models need to be complex to capture uncertainty about the relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an elephant (see Draper, 1995)’". See also gus06per.

Harrell, F. E., Margolis, P. A., Gove, S., Mason, K. E., Mulholland, E. K., Lehmann, D., Muhe, L., Gatchalian, S., & Eichenwald, H. F. (1998). Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat Med, 17, 909–944. http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract

Holländer, N., Sauerbrei, W., & Schumacher, M. (2004). Confidence intervals for the effect of a prognostic factor after selection of an “optimal” cutpoint. Stat Med, 23, 1701–1713. https://doi.org/10.1002/sim.1611

true type I error can be much greater than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;"It should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the “optimal” cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub

CAST Investigators. (1989). Preliminary report: Effect of Encainide and Flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. NEJM, 321(6), 406–412.

Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet, 365, 488–492.

comment on p. 454;validation;microarray;bioinformatics;machine learning;nearest centroid;severe problems with data splitting;high variability of list of genes;problems with published studies;nice results for effect of training sample size on misclassification error;nice use of confidence intervals on accuracy estimates;unstable molecular signatures;high instability due to dependence on selection of training sample

Moore, T. J. (1995). Deadly Medicine: Why Tens of Thousands of Patients Died in America’s Worst Drug Disaster. Simon & Schuster.

Multicenter Postinfarction Research Group. (1983). Risk stratification and survival after myocardial infarction. NEJM, 309, 331–336.

terrible example of dichotomizing continuous variables;figure in Papers/modelingPredictors

Naggara, O., Raymond, J., Guilbert, F., Roy, D., Weill, A., & Altman, D. G. (2011). Analysis by categorizing or dichotomizing continuous variables is inadvisable: An example from the natural history of unruptured aneurysms. Am J Neuroradiol, 32(3), 437–440. https://doi.org/10.3174/ajnr.A2425

Ohman, E. M., Armstrong, P. W., Christenson, R. H., Granger, C. B., Katus, H. A., Hamm, C. W., O’Hannesian, M. A., Wagner, G. S., Kleiman, N. S., Harrell, F. E., Califf, R. M., Topol, E. J., Lee, K. L., & the GUSTO-IIa Investigators. (1996). Cardiac troponin T levels for risk stratification in acute myocardial ischemia. NEJM, 335, 1333–1341.

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med, 25, 127–141. https://doi.org/10.1002/sim.2331

destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when dichotomize;nice plot showing effect of categorization; PBC data

Senn, S. (2008). Statistical Issues in Drug Development (Second). Wiley.

Senn, S. J. (2005). Dichotomania: An obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. Proceedings of the International Statistical Institute, 55th Session. http://hbiostat.org/papers/Senn/dichotomania.pdf

Vickers, A. J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4), 314–320.

limitations of accuracy metrics;incorporating clinical consequences;nice example of calculation of expected outcome;drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis;use of a threshold on the probability of disease for taking some action;decision curve;has other good references to decision analysis

Wainer, H. (2006). Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1), 49–56.

can find bins that yield either positive or negative association;especially pertinent when effects are small;"With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." - John von Neumann