18  Information Loss

Information allergy is defined as (1) refusing to obtain key information needed to make a sound decision, or (2) ignoring important available information. The latter problem is epidemic in biomedical and epidemiologic research and in clinical practice.

Examples of such problems will be discussed, concluding with an examination of how information-losing cardiac arrhythmia research may have contributed to the deaths of thousands of patients.

… wherever nature draws unclear boundaries, humans are happy to curate
— Alice Dreger, Galileo’s Middle Finger

18.1 Information & Decision Making

What is information?

  • Messages used as the basis for decision-making
  • Result of processing, manipulating and organizing data in a way that adds to the receiver’s knowledge
  • Meaning, knowledge, instruction, communication, representation, and mental stimulus1

1 pbs.org/weta, wikipedia.org/wiki/Information

Information resolves uncertainty.

Some types of information may be quantified in bits. A binary variable is represented by 0/1 in base 2, and it has 1 bit of information. This is the minimum amount of information other than no information. Systolic blood pressure measured accurately to the nearest 4 mmHg has 6 binary digits (bits) of information (\(\log_{2}\frac{256}{4} = 6\)). Dichotomizing blood pressure reduces its information content to 1 bit, resulting in enormous loss of precision and power.
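The arithmetic above can be checked directly. This is only a minimal sketch: the helper name `bits` is hypothetical, and the 256 mmHg span is the one implied by the formula in the text.

```r
# Bits of information for a measurement that can take (span / resolution)
# distinguishable values. `bits` is a hypothetical helper, not a package function.
bits <- function(span, resolution) log2(span / resolution)

bits_sbp    <- bits(256, 4)   # systolic BP over a 256 mmHg span, to the nearest 4 mmHg
bits_dichot <- log2(2)        # any dichotomized variable has two possible values

c(continuous = bits_sbp, dichotomized = bits_dichot)
```

Under these assumptions the dichotomized reading carries one sixth of the bits of the measured value.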

Value of information: Judged by the variety of outcomes to which it leads.

Optimum decision making requires the maximum and most current information the decision maker is capable of handling

Some important decisions in biomedical and epidemiologic research and clinical practice:

  • Pathways, mechanisms of action
  • Best way to use gene and protein expressions to diagnose or treat
  • Which biomarkers are most predictive and how should they be summarized?
  • What is the best way to diagnose a disease or form a prognosis?
  • Is a risk factor causative or merely a reflection of confounding?
  • How should patient outcomes be measured?
  • Is a drug effective for an outcome?
  • Who should get a drug?

18.1.1 Information Allergy

Failing to obtain key information needed to make a sound decision

  • Not collecting important baseline data on subjects

Ignoring Available Information

  • Touting the value of a new biomarker that provides less information than basic clinical data

  • Ignoring confounders (alternate explanations)

  • Ignoring subject heterogeneity

  • Categorizing continuous variables or subject responses

  • Categorizing predictions as “right” or “wrong”

  • Letting fear of probabilities and costs/utilities lead an author to make decisions for individual patients

18.2 Ignoring Readily Measured Variables

Prognostic markers in acute myocardial infarction

\(c\)-index: concordance probability \(\equiv\) receiver operating characteristic curve or ROC area
Measure of ability to discriminate death within 30d

Markers                      \(c\)-index
CK–MB                        0.63
Troponin T                   0.69
Troponin T \(> 0.1\)         0.64
CK–MB + Troponin T           0.69
CK–MB + Troponin T + ECG     0.73
Age + sex                    0.80
All                          0.83

Ohman et al. (1996)

Though not discussed in the paper, age and sex easily trump troponin T. One can also see from the \(c\)-indexes that the common dichotomization of troponin results in an immediate loss of information.
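To make the \(c\)-index concrete, the following sketch computes the concordance probability from ranks and compares a simulated continuous marker with its dichotomized version. The data are simulated, not the Ohman et al. data; the helper `c_index`, the sample size, and the cutpoint of 0 are all assumptions made for illustration.

```r
# Concordance probability (c-index) for a binary outcome, computed from ranks.
# This equals the Wilcoxon-Mann-Whitney statistic, i.e., the ROC area.
# `c_index` is a hypothetical helper written for this sketch.
c_index <- function(pred, y) {
  r  <- rank(pred)                  # midranks, so tied pairs count 1/2
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n      <- 2000
marker <- rnorm(n)                        # simulated continuous marker
y      <- rbinom(n, 1, plogis(marker))    # risk rises smoothly with the marker

c_cont <- c_index(marker, y)                   # marker used as measured
c_dich <- c_index(as.numeric(marker > 0), y)   # marker cut at an arbitrary threshold

round(c(continuous = c_cont, dichotomized = c_dich), 2)
```

In this simulation the dichotomized marker discriminates less well than the same marker left continuous, the pattern seen for troponin T in the table.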

Inadequate adjustment for confounders: Greenland (2000)

  • Case-control study of diet, food constituents, breast cancer
  • 140 cases, 222 controls
  • 35 food constituent intakes and 5 confounders
  • Food intakes are correlated
  • Traditional stepwise analysis not adjusting simultaneously for all foods consumed \(\rightarrow\) 11 foods had \(P < 0.05\)
  • Full model with all 35 foods competing \(\rightarrow\) 2 had \(P < 0.05\)
  • Rigorous simultaneous analysis (hierarchical random slopes model) penalizing estimates for the number of associations examined \(\rightarrow\) no foods associated with breast cancer

Ignoring subject variability in randomized experiments

  • Randomization tends to balance measured and unmeasured subject characteristics across treatment groups
  • Subjects vary widely within a treatment group
  • Subject heterogeneity usually ignored
  • False belief that balance from randomization makes this irrelevant
  • Alternative: analysis of covariance
  • If any of the baseline variables are predictive of the outcome, there is a gain in power for every type of outcome (binary, time-to-event, continuous, ordinal)
  • Example for a binary outcome in Section 13.2.2.1
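The power gain from covariate adjustment can be sketched with a small simulation for a continuous outcome. Everything here is an assumption chosen for illustration (sample size, effect size, strength of the baseline covariate, number of simulated trials); it is not the binary-outcome example referenced above.

```r
set.seed(2)
nsim <- 500    # number of simulated trials (assumption)
n    <- 100    # subjects per trial, randomized 1:1 (assumption)

one_trial <- function() {
  trt <- rep(0:1, each = n / 2)
  x   <- rnorm(n)                                 # prognostic baseline covariate
  y   <- 0.4 * trt + 0.8 * x + rnorm(n, sd = 0.6) # outcome depends on both
  p_unadj <- coef(summary(lm(y ~ trt)))['trt', 'Pr(>|t|)']       # ignores x
  p_adj   <- coef(summary(lm(y ~ trt + x)))['trt', 'Pr(>|t|)']   # analysis of covariance
  c(unadjusted = p_unadj, ancova = p_adj)
}

pvals <- replicate(nsim, one_trial())   # 2 x nsim matrix of P-values
power <- rowMeans(pvals < 0.05)         # proportion of trials rejecting at 0.05
power
```

Randomization makes both analyses valid, but only the adjusted analysis removes the outcome variation explained by the baseline covariate, which is where the extra power comes from.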

18.3 Categorization: Partial Use of Information

  • Patient: What was my systolic BP this time?
  • MD: It was \(> 120\)
  • Patient: How is my diabetes doing?
  • MD: Your Hb\(_{\textrm A1c}\) was \(> 6.5\)
  • Patient: What about the prostate screen?
  • MD: If you have average prostate cancer, the chance that PSA \(> 5\) in this report is \(0.6\)

Problem: Improper conditioning (\(X > c\) instead of \(X = x\)) \(\rightarrow\) information loss; reversing time flow
Sensitivity: \(P(\mathrm{observed~} X > c \mathrm{~given~unobserved~} Y=y)\)

18.3.1 Categorizing Continuous Predictors

  • Many physicians attempt to find cutpoints in continuous predictor variables
  • Mathematically such cutpoints cannot exist unless relationship with outcome is discontinuous
  • Even if the cutpoint existed, it must vary with other patient characteristics, as optimal decisions are based on risk
  • A simple 2-predictor example related to diagnosis of pneumonia will suffice
  • It is never appropriate to dichotomize an input variable other than time. Dichotomization, if it must be done, should only be done on \(\hat{Y}\). In other words, dichotomization is done as late as possible in decision making. When more than one continuous predictor variable is relevant to outcomes, the example below shows that it is mathematically incorrect to do a one-time dichotomization of a predictor. As an analogy, suppose that one is using body mass index (BMI) by itself to make a decision. One would never categorize height and categorize weight to make the decision based on BMI. One could categorize BMI, if no other outcome predictors existed for the problem.
Code
require(rms)
getHdata(ari)
r <- ari[ari$age >= 42, Cs(age, rr, pneu, coh, s2)]
abn.xray <- r$s2==0
r$coh <- factor(r$coh, 0:1, c('no cough','cough'))
f <- lrm(abn.xray ~ rcs(rr,4)*coh, data=r)
anova(f)
                Wald Statistics          Response: abn.xray 

 Factor                                   Chi-Square d.f. P     
 rr  (Factor+Higher Order Factors)        37.45      6    <.0001
  All Interactions                         0.35      3    0.9507
  Nonlinear (Factor+Higher Order Factors)  3.27      4    0.5144
 coh  (Factor+Higher Order Factors)       28.91      4    <.0001
  All Interactions                         0.35      3    0.9507
 rr * coh  (Factor+Higher Order Factors)   0.35      3    0.9507
  Nonlinear                                0.31      2    0.8549
  Nonlinear Interaction : f(A,B) vs. AB    0.31      2    0.8549
 TOTAL NONLINEAR                           3.27      4    0.5144
 TOTAL NONLINEAR + INTERACTION             3.37      5    0.6431
 TOTAL                                    66.06      7    <.0001
Code
dd <- datadist(r); options(datadist='dd')
p <- Predict(f, rr, coh, fun=plogis, conf.int=FALSE)
ggplot(p, rdata=r,
       ylab='Probability of Pneumonia',
       xlab='Adjusted Respiratory Rate/min.',
       ylim=c(0,.7), legend.label='')
Figure 18.1: Estimated risk of pneumonia with respect to two predictors in WHO ARI study from Harrell et al. (1998). Tick marks show data density of respiratory rate stratified by cough. Any cutpoint for the rate must depend on cough to be consistent with optimum decision making, which must be risk-based.

18.3.2 What Kinds of True Thresholds Exist?

Natura non facit saltus
(Nature does not make jumps)
— Gottfried Wilhelm Leibniz

Figure 18.2: Two kinds of thresholds. The pattern on the left represents a discontinuity in the first derivative (slope) of the function relating a marker to outcome. On the right there is a lowest-order discontinuity.

What Do Cutpoints Really Assume?
Cutpoints assume discontinuous relationships of the type in the right plot of Figure 18.2, and they assume that the true cutpoint is known. Beyond the molecular level, such patterns do not exist unless \(X=\)time and the discontinuity is caused by an event. Cutpoints assume homogeneity of outcome on either side of the cutpoint.

18.3.3 Cutpoints are Disasters

  • Prognostic relevance of S-phase fraction in breast cancer: 19 different cutpoints used in literature
  • Cathepsin-D content and disease-free survival in node-negative breast cancer: 12 studies, 12 cutpoints
  • ASCO guidelines: neither cathepsin-D nor S-phase fraction recommended as prognostic markers (Holländer et al. (2004))

Cutpoints may be found that result in both increasing and decreasing relationships with any dataset with zero correlation

Delay    Mean Score
0-11     210
11-20    215
21-30    217
31-40    218
41-      220

Wainer (2006); See “Dichotomania” (S. J. Senn (2005)) and Royston et al. (2006)

Wainer (2006)

In fact, virtually all published cutpoints are analysis artifacts caused by finding a threshold that minimizes \(P\)-values when comparing outcomes of subjects below with those above the “threshold”. Two-sample statistical tests suffer the least loss of power when cutting at the median because this balances the sample sizes. That this method has nothing to do with biology can be readily seen by adding observations on either tail of the marker, resulting in a shift of the median toward that tail even though the relationship between the continuous marker and the outcome remains unchanged.
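The effect of searching for the cutpoint that minimizes the \(P\)-value can be sketched by simulation. Here the marker and the outcome are generated independently, so any "significant" threshold is an artifact. The sample size, the grid of candidate cutpoints, and the number of simulations are assumptions made for illustration.

```r
set.seed(3)
nsim  <- 300                          # simulated studies (assumption)
n     <- 100                          # subjects per study (assumption)
probs <- seq(0.10, 0.90, by = 0.05)   # candidate cutpoints, as marker quantiles

min_p <- replicate(nsim, {
  marker  <- rnorm(n)
  outcome <- rnorm(n)                 # independent of marker: no true relationship
  cuts    <- quantile(marker, probs)
  # smallest two-sample t-test P-value over all candidate cutpoints
  min(sapply(cuts, function(ct)
    t.test(outcome[marker > ct], outcome[marker <= ct])$p.value))
})

false_pos <- mean(min_p < 0.05)   # would be about 0.05 for a single prespecified cutpoint
false_pos
```

The proportion of null studies declaring a "threshold" is well above the nominal 0.05, even though no relationship exists at all.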

Code
knitr::include_graphics('gia14opt-fig2c.png')
Figure 18.3: Thresholds in cardiac biomarkers

In “positive” studies: threshold 132–800 ng/L, correlation with study median \(r=0.86\) (Giannoni et al. (2014))

Lack of Meaning of Effects Based on Cutpoints

  • Researchers often use cutpoints to estimate the high:low effects of risk factors (e.g., BMI vs. asthma)
  • Results in inaccurate predictions, residual confounding, impossible to interpret
  • high:low represents unknown mixtures of highs and lows
  • Effects (e.g., odds ratios) will vary with population
  • If the true effect is monotonic, adding subjects in the low range or high range or both will increase odds ratios (and all other effect measures) arbitrarily

Royston et al. (2006), Naggara et al. (2011), Giannoni et al. (2014)

Does a physician ask the nurse “Is this patient’s bilirubin \(>\) 45?” or does she ask “What is this patient’s bilirubin level?” Imagine how a decision support system would trigger vastly different decisions just because bilirubin was 46 instead of 44.

As an example of how a hazard ratio for a dichotomized continuous predictor is an arbitrary function of the entire distribution of the predictor within the two categories, consider a Cox model analysis of simulated age where the true effect of age is linear. First compute the \(\geq 50:< 50\) hazard ratio in all subjects, then in just the subjects having age \(< 60\), then in those with age \(< 55\). Then repeat including all older subjects but excluding subjects with age \(\leq 40\). Finally, compute the hazard ratio when only those age 40 to 60 are included. Simulated times to events have an exponential distribution, and proportional hazards holds.

Code
require(rms)        # cph() comes from rms, which also loads survival
set.seed(1)
n <- 1000
age <- rnorm(n, mean=50, sd=12)
# describe(age)
cens <- 15 * runif(n)
h  <- 0.02 * exp(0.04 * (age - 50))
dt <- -log(runif(n))/h
e  <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
S  <- Surv(dt, e)
d  <- data.frame(age, S)
# coef(cph(S ~ age))   # close to true value of 0.04 used in simulation
g <- function(sub=1 : n)
       exp(coef(cph(S ~ age >= 50, data=d, subset=sub)))
d <- data.frame(Sample=c('All', 'age < 60', 'age < 55', 'age > 40',
                         'age 40-60'),
                'Hazard Ratio'=c(g(), g(age < 60), g(age < 55),
                                 g(age > 40), g(age > 40 & age < 60)))
d
     Sample Hazard.Ratio
1       All     2.148554
2  age < 60     1.645141
3  age < 55     1.461928
4  age > 40     1.760201
5 age 40-60     1.354001

See this for excellent graphical examples of the harm of categorizing predictors, especially when using quantile groups.

18.3.4 Categorizing Outcomes

  • Arbitrary, low power, can be difficult to interpret
  • Example: “The treatment is called successful if either the patient has gone down from a baseline diastolic blood pressure of \(\geq 95\) mmHg to \(\leq 90\) mmHg or has achieved a 10% reduction in blood pressure from baseline.”
  • Senn derived the response probability function for this discontinuous concocted endpoint

S. J. Senn (2005) after Goetghebeur [1998]

Is a mean difference of 5.4mmHg more difficult to interpret than A:17% vs. B:22% hit clinical target?

“Responder” analysis in clinical trials results in huge information loss and arbitrariness. Some issues:

  • Responder analyses use cutpoints on continuous or ordinal variables and cite earlier data supporting their choice of cutpoints. No example has been produced where the earlier data actually support the cutpoint.
  • Many responder analyses are based on change scores when they should be based solely on the follow-up outcome variable, adjusted for baseline as a covariate.
  • The cutpoints are always arbitrary.
  • There is a huge power loss (see Section 18.3.4).
  • The responder probability is often a function of variables that one does not want it to be a function of (see graph above).

Fedorov et al. (2009) is one of the best papers quantifying the information and power loss from categorizing continuous outcomes. One of their examples is that a clinical trial of 100 subjects with continuous \(Y\) is statistically equivalent to a trial of 158 dichotomized observations, assuming that the dichotomization is at the optimum point (the population median). They show that it is very easy for dichotomization of \(Y\) to raise the needed sample size by a factor of 5.
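The kind of power loss quantified above can be sketched by simulation. This is not a reproduction of the calculations in Fedorov et al. (2009): the group size, effect size, and the use of the pooled sample median (rather than the population median) as the cutpoint are assumptions made for illustration.

```r
set.seed(4)
nsim  <- 1000    # simulated trials (assumption)
n     <- 64      # subjects per group (assumption)
delta <- 0.5     # true mean difference in SD units (assumption)

one_trial <- function() {
  grp <- rep(c('A', 'B'), each = n)
  y   <- rnorm(2 * n, mean = ifelse(grp == 'B', delta, 0))
  p_cont <- t.test(y ~ grp)$p.value                     # continuous Y
  high   <- factor(y > median(y), levels = c(FALSE, TRUE))
  p_dich <- chisq.test(table(grp, high),                # Y cut at the pooled median
                       correct = FALSE)$p.value
  c(continuous = p_cont, dichotomized = p_dich)
}

power <- rowMeans(replicate(nsim, one_trial()) < 0.05)
power
```

Even with the most favorable cutpoint, the dichotomized analysis rejects noticeably less often than the \(t\)-test on the same subjects.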

Code
knitr::include_graphics('fed09conFig1.png')
Figure 18.4: Power loss from dichotomizing the response variable

Fedorov et al. (2009)

18.3.5 Classification vs. Probabilistic Thinking

Number needed to treat. The only way, we are told, that physicians can understand probabilities: odds being a difficult concept only comprehensible to statisticians, bookies, punters and readers of the sports pages of popular newspapers.
S. Senn (2008)

  • Many studies attempt to classify patients as diseased/normal
  • Given a reliable estimate of the probability of disease and the consequences of +/- one can make an optimal decision
  • Consequences are known at the point of care, not by the authors; categorization only at point of care
  • Continuous probabilities are self-contained, with their own “error rates”
  • Middle probs. allow for “gray zone”, deferred decision

Patient   Prob[disease]   Decision   Prob[error]
1         0.03            normal     0.03
2         0.40            normal     0.40
3         0.75            disease    0.25
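The table can be reproduced in a few lines, which shows that the error probability comes for free once a probability is available. The decision threshold of 0.5 is a hypothetical placeholder; as noted above, the actual threshold depends on consequences known only at the point of care.

```r
p_disease <- c(0.03, 0.40, 0.75)   # the three patients in the table
threshold <- 0.5                   # hypothetical; in practice set by utilities at the point of care

decision <- ifelse(p_disease > threshold, 'disease', 'normal')
# Probability that the decision is wrong, given the estimated probability
p_error  <- ifelse(decision == 'disease', 1 - p_disease, p_disease)

data.frame(patient = 1:3, p_disease, decision, p_error)
```

Patient 2 is classified "normal" with a 0.40 chance of error, which is exactly the sort of case where a gray zone or a deferred decision would be preferable to a forced classification.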

Note that much of diagnostic research seems to be aimed at making optimum decisions for groups of patients. The optimum decision for a group (if such a concept even has meaning) is not optimum for individuals in the group.

18.3.6 Components of Optimal Decisions

Statistical models reduce the dimensionality of the problem but not to unity

18.4 Problems with Classification of Predictions

  • Feature selection / predictive model building requires choice of a scoring rule, e.g. correlation coefficient or proportion of correct classifications
  • Prop. classified correctly is a discontinuous improper scoring rule
    • Maximized by bogus model
  • Minimum information
    • low statistical power
    • high standard errors of regression coefficients
    • arbitrary to choice of cutoff on predicted risk
    • forces binary decision, does not yield a “gray zone” \(\rightarrow\) more data needed
  • Takes analyst to be provider of utility function and not the treating physician
  • Sensitivity and specificity are also improper scoring rules

See bit.ly/risk-thresholds: Three Myths About Risk Thresholds for Prediction Models by Wynants
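A small calculation illustrates why the proportion classified correctly is improper. The Brier score is used as the comparison; it is not discussed in the text and is brought in here only as a familiar example of a proper scoring rule. The true risk of 0.7 and the grid of predictions are hypothetical.

```r
p_true <- 0.7                           # hypothetical true risk for a group of like patients
q      <- seq(0.51, 0.99, by = 0.01)    # candidate predicted risks, all above 0.5

# Expected proportion classified "correctly" when classifying as diseased if q > 0.5.
# Every q in this range gives the same classification, hence the same score.
exp_correct <- ifelse(q > 0.5, p_true, 1 - p_true)

# Expected Brier score (lower is better), which does depend on q
exp_brier <- p_true * (1 - q)^2 + (1 - p_true) * q^2

best_q <- q[which.min(exp_brier)]       # prediction favored by the proper rule
c(range_correct = range(exp_correct), best_q = best_q)
```

The improper rule cannot distinguish a prediction of 0.51 from one of 0.99, whereas the proper rule is optimized by predicting the true risk.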

18.4.1 Example: Damage Caused by Improper Scoring Rule

  • Predicting probability of an event, e.g., Prob[disease]
  • \(N=400\), 0.57 of subjects have disease
  • Classify as diseased if prob. \(>0.5\)

Model      \(c\) Index   \(\chi^{2}\)   Proportion Correct
age        .592          10.5           .622
sex        .589          12.4           .588
age+sex    .639          22.8           .600
constant   .500           0.0           .573

Adjusted Odds Ratios:

age (IQR 58y:42y) 1.6 (0.95CL 1.2-2.0)
sex (f:m) 0.5 (0.95CL 0.3-0.7)

Test of sex effect adjusted for age \((22.8-10.5)\):
\(P=0.0005\)
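The adjusted test for sex can be checked from the table. This sketch assumes the \(\chi^{2}\) column holds likelihood ratio statistics for nested models and that sex contributes 1 d.f.

```r
lr_age_sex <- 22.8    # model chi-square, age + sex
lr_age     <- 10.5    # model chi-square, age alone

lr_sex_adj <- lr_age_sex - lr_age   # chi-square for sex adjusted for age
p_sex_adj  <- pchisq(lr_sex_adj, df = 1, lower.tail = FALSE)   # assumes 1 d.f.

signif(p_sex_adj, 2)
```

Sex is strongly associated with the outcome after adjusting for age, yet adding it to the model lowers the proportion classified correctly from .622 to .600.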

Example where an improper accuracy score resulted in incorrect original analyses and incorrect re-analysis

Michiels et al. (2005) used an improper accuracy score (proportion classified “correctly”) and claimed there was really no signal in all the published gene microarray studies they could analyze. This is true from the standpoint of repeating the original analyses (which also used improper accuracy scores) using multiple splits of the data, exposing the fallacy of using single data-splitting for validation. Aliferis et al. (2009) used a semi-proper accuracy score (\(c\)-index) and they repeated 10-fold cross-validation 100 times instead of using highly volatile data splitting. They showed that the gene microarrays did indeed have predictive signals.2

2 Aliferis et al. (2009) also used correct statistical models for time-to-event data that properly accounted for variable follow-up/censoring.

Michiels et al. (2005)                               Aliferis et al. (2009)
% classified correctly                               \(c\)-index
Single split-sample validation                       Multiple repeats of 10-fold CV
Wrong tests (censoring, failure times)               Correct tests
5 of 7 published microarray studies had no signal    6 of 7 have signals

18.5 Value of Continuous Markers

  • Avoid arbitrary cutpoints
  • Better risk spectrum
  • Provides gray zone
  • Increases power/precision

18.5.1 Prognosis in Prostate Cancer

Code
load('~/doc/Talks/infoAllergy/kattan.rda')
attach(kattan)
t   <- t.stg
gs  <- bx.glsn
psa <- preop.psa
t12 <- t.stg %in% Cs(T1C,T2A,T2B,T2C)

s <- score.binary(t12 & gs<=6 & psa<10,
                  t12 & gs<=6 & psa >=10 & psa < 20,
                  t12 & gs==7 & psa < 20,
                  (t12 & gs<=6 & psa>=20) |
                  (t12 & gs>=8 & psa<20),
                  t12 & gs>=7 & psa>=20,
                  t.stg=='T3')
levels(s) <- c('none','I', 'IIA', 'IIB', 'IIIA', 'IIIB', 'IIIC')
u <- is.na(psa + gs) | is.na(t.stg)
s[s=='none'] <- NA
s <- s[drop=TRUE]
s3 <- s
levels(s3) <- c('I','II','II','III','III','III')
# table(s3)
units(time.event) <- 'month'
dd <- datadist(data.frame(psa, gs)); options(datadist='dd')
S <- Surv(time.event, event=='YES')
label(psa) <- 'PSA'; label(gs) <- 'Gleason Score'
f <- cph(S ~ rcs(sqrt(psa), 4), surv=TRUE, x=TRUE, y=TRUE)
p <- Predict(f, psa, time=24, fun=function(x) 1 - x)
h <- cph(S ~ s3, surv=TRUE)
z <- 1 - survest(h, times=24)$surv
ggplot(p, rdata=data.frame(psa), xlim=c(0,60),
       ylab='2-year Disease Recurrence Risk') +
  geom_hline(yintercept=unique(z), col='red', size=0.2)
Figure 18.5: Relationship between pre-op PSA level and 2-year recurrence risk. Horizontal lines represent the only prognoses provided by the new staging system. Data are courtesy of M Kattan from JNCI 98:715; 2006. Modification of AJCC staging by Roach et al. 2006.

Now examine the entire spectrum of estimated prognoses from various models and from discontinuous staging systems.

Code
d <- data.frame(S, psa, s3, s, gs, t.stg, time.event, event, u)
f <- cph(S ~ rcs(sqrt(psa),4) + pol(gs,2), surv=TRUE, data=d)
g <- function(form, lab) {
  f <- cph(form, surv=TRUE, data=subset(d, ! u))
  cat(lab,'\n'); print(coef(f))
  s <- f$stats
  cat('N:', s['Obs'],'\tL.R.:', round(s['Model L.R.'],1),
      '\td.f.:',s['d.f.'],'\n\n')
  prob24 <- 1 - survest(f, times=24)$surv
  prn(sum(!is.na(prob24)))
  p2 <<- c(p2, prob24[2])  # save est. prognosis for one subject
  p1936 <<- c(p1936, prob24[1936])
  C <- rcorr.cens(1-prob24, S[!u,])['C Index']
  data.frame(model=lab, chisq=s['Model L.R.'], d.f.=s['d.f.'],
             C=C, prognosis=prob24)
}
p2 <- p1936 <- NULL
w <-          g(S ~ t.stg, 'Old Stage')
Old Stage 
t.stg=T2A t.stg=T2B t.stg=T2C  t.stg=T3 
0.2791987 1.2377218 1.0626197 1.7681393 
N: 1978     L.R.: 70.5  d.f.: 4 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ s3, 'New Stage'))
New Stage 
   s3=II   s3=III 
1.225296 1.990355 
N: 1978     L.R.: 135.8     d.f.: 2 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ s, 'New Stage, 6 Levels'))
New Stage, 6 Levels 
   s=IIA    s=IIB   s=IIIA   s=IIIB   s=IIIC 
1.181824 1.248864 1.829265 2.410810 1.954420 
N: 1978     L.R.: 140.3     d.f.: 5 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ pol(gs,2),        'Gleason'))
Gleason 
         gs        gs^2 
-0.42563792  0.07857747 
N: 1978     L.R.: 90.3  d.f.: 2 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ rcs(sqrt(psa),4), 'PSA'))
PSA 
         psa         psa'        psa'' 
 -0.03605674   4.34054135 -14.63415302 
N: 1978     L.R.: 95.3  d.f.: 3 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ rcs(sqrt(psa),4) + pol(gs,2), 'PSA+Gleason'))
PSA+Gleason 
         psa         psa'        psa''           gs         gs^2 
 -0.06275304   3.57869793 -11.81685711  -0.20430862   0.05458591 
N: 1978     L.R.: 160.1     d.f.: 5 


sum(!is.na(prob24))

[1] 1978
Code
w <- rbind(w, g(S ~ rcs(sqrt(psa),4) + pol(gs,2) + t.stg,
                'PSA+Gleason+Old Stage'))
PSA+Gleason+Old Stage 
        psa        psa'       psa''          gs        gs^2   t.stg=T2A 
 0.16859186  2.36244764 -8.31695008 -0.01536731  0.03516561  0.27360400 
  t.stg=T2B   t.stg=T2C    t.stg=T3 
 0.93982804  0.69117036  1.07549638 
N: 1978     L.R.: 186.9     d.f.: 9 


sum(!is.na(prob24))

[1] 1978
Code
w$z <- paste(w$model, '\n',
             'X2-d.f.=',round(w$chisq-w$d.f.),
             '  C=', sprintf("%.2f", w$C), sep='')
w$z <- with(w, factor(z, unique(z)))
require(lattice)
stripplot(z ~ prognosis, data=w, lwd=1.5,
          panel=function(x, y, ...) {
            llines(p2, 1:7, col=gray(.6))
            ## llines(p1936, 1:7, col=gray(.8), lwd=2)
            ## panel.stripplot(x, y, ..., jitter.data=TRUE, cex=.5)
            for(iy in unique(unclass(y))) {
              s <- unclass(y)==iy
              histSpike(x[s], y=rep(iy,sum(s)), add=TRUE, grid=TRUE)
            }
            panel.abline(v=0, col=gray(.7))
          },
          xlab='Predicted 2-year\nDisease Recurrence Probability')
Figure 18.6: Prognostic spectrum from various models with model \(\chi^2\) - d.f., and generalized \(c\)-index. The mostly vertical segmented line connects different prognostic estimates for the same man.

18.6 Harm from Ignoring Information

18.6.1 Case Study: Cardiac Anti-arrhythmic Drugs

  • Premature ventricular contractions were observed in patients surviving acute myocardial infarction
  • Frequent PVCs \(\uparrow\) incidence of sudden death

Moore (1995), p. 46

Arrhythmia Suppression Hypothesis

Any prophylactic program against sudden death must involve the use of anti-arrhythmic drugs to subdue ventricular premature complexes.
— Bernard Lown
Widely accepted by 1978

Moore (1995), p. 49; Multicenter Postinfarction Research Group (1983)

Are PVCs independent risk factors for sudden cardiac death?

Researchers developed a 4-variable model for prognosis after acute MI

  • left ventricular ejection fraction (EF) \(< 0.4\)
  • PVCs \(>\) 10/hr
  • Lung rales
  • Heart failure class II,III,IV

Multicenter Postinfarction Research Group (1983)

Dichotomania Caused Severe Problems

  • EF alone provides same prognostic spectrum as the researchers’ model
  • Did not adjust for EF!; PVCs \(\uparrow\) when EF\(<0.2\)
  • Arrhythmias prognostic in isolation, not after adjustment for continuous EF and anatomic variables
  • Arrhythmias predicted by local contraction abnorm., then global function (EF)

Multicenter Postinfarction Research Group (1983); Califf et al. (1982)

18.6.2 CAST: Cardiac Arrhythmia Suppression Trial

  • Randomized placebo, moricizine, and Class IC anti-arrhythmic drugs flecainide and encainide
  • Cardiologists: unethical to randomize to placebo
  • Placebo group included after vigorous argument
  • Tests designed as one-tailed; did not entertain possibility of harm
  • Data and Safety Monitoring Board recommended early termination of flecainide and encainide arms
  • Deaths \(\frac{56}{730}\) drug, \(\frac{22}{725}\) placebo, RR 2.5 (Investigators (1989))
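The relative risk follows directly from the death counts. The confidence interval below uses the usual large-sample approximation on the log scale; it is added for illustration and is not taken from the trial report.

```r
deaths <- c(drug = 56, placebo = 22)
n      <- c(drug = 730, placebo = 725)

risk <- deaths / n                                  # proportion dying in each arm
rr   <- unname(risk['drug'] / risk['placebo'])      # relative risk, drug : placebo

# Wald-type 0.95 interval on the log scale (large-sample approximation; an assumption)
se_log <- sqrt(sum(1 / deaths) - sum(1 / n))
ci     <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log)

round(c(RR = rr, lower = ci[1], upper = ci[2]), 2)
```

A two-tailed analysis of these counts points clearly toward harm, the possibility that the one-tailed design did not entertain.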

Conclusions: Class I Anti-Arrhythmics

Estimate of excess deaths from Class I anti-arrhythmic drugs: 24,000–69,000
Estimate of excess deaths from Vioxx: 27,000–55,000

Arrhythmia suppression hypothesis refuted; PVCs merely indicators of underlying, permanent damage

Moore (1995), pp. 289,49; D Graham, FDA

18.7 Case Study in Faulty Dichotomization of a Clinical Outcome: Statistical and Ethical Concerns in Clinical Trials for Crohn’s Disease

18.7.1 Background

Many clinical trials are underway for studying treatments for Crohn’s disease. The primary endpoint for these studies is a discontinuous, information-losing transformation of the Crohn’s Disease Activity Index (CDAI; Best et al. (1976)), which was developed in 1976 by using an exploratory stepwise regression method to predict four levels of clinicians’ impressions of patients’ current status3. The first level (“very well”) was assumed to indicate the patient was in remission. The model was overfitted and was not validated. The model’s coefficients were scaled and rounded, resulting in the following scoring system (see www.ibdjohn.com/cdai).

3 Ordinary least squares regression was used for the ordinal response variable. The levels of the response were assumed to be equally spaced in severity on a numerical scale of 1, 3, 5, 7 with no justification.


The original authors plotted the predicted scores against the four clinical categories as shown below.

The authors arbitrarily assigned a cutoff of 150, below which indicates “remission.”4 It can be seen that “remission” includes a good number of patients actually classified as “fair to good” or “poor.” A cutoff only exists when there is a break in the distribution of scores. As an example, data were simulated from a population in which every patient having a score below 100 had a probability of response of 0.2 and every patient having a score above 100 had a probability of response of 0.8. Histograms showing the distributions of non-responders (just above the \(x\)-axis) and responders (at the top of the graph) appear in the figure below. A flexibly fitted logistic regression model relating observed scores to actual response status is shown, along with 0.95 confidence intervals for the fit.

4 However, the authors intended for CDAI to be used on a continuum: “… a numerical index was needed, the numerical value of which would be proportional to degree of illness … it could be used as the principal measure of response to the therapy under trial … the CDAI appears to meet those needs. … The data presented … is an accurate numerical expression of the physician’s over-all assessment of degree of illness in a large group of patients … we believe that it should be useful to all physicians who treat Crohn’s disease as a method of assessing patient progress.”.

Code
require(rms)
set.seed(4)
n <- 900
X <- rnorm(n, 100, 20)
dd <- datadist(X); options(datadist='dd')

p <- ifelse(X < 100, .2, .8)
y <- ifelse(runif(n) <= p, 1, 0)

f <- lrm(y ~ rcs(X, c(90,95,100,105,110)))
hs <- function(yval, side)
  histSpikeg(yhat ~ X, data=subset(data.frame(X, y), y == yval),
             side = side, ylim = c(0, 1),
             frac = function(f) .03 * f / max(f))
ggplot(Predict(f, fun=plogis), ylab='Probability of Response') +
  hs(0, 1) + hs(1, 3) + geom_vline(xintercept=100, col=gray(.7))

One can see that the fitted curves justify the use of a cut-point of 100. However, the original scores from the development of CDAI do not justify the existence of a cutoff. The fitted logistic model used to relate “very well” to the other three categories is shown below.

Code
# Points from published graph were defined in code not printed
g <- trunc(d$x)
g <- factor(g, 0:3, c('very well', 'fair to good', 'poor', 'very poor'))
remiss <- 1 * (g == 'very well')
CDAI <- d$y
label(CDAI) <- "Crohn's Disease Activity Index"
label(remiss) <- 'Remission'
dd <- datadist(CDAI,remiss); options(datadist='dd')
f <- lrm(remiss ~ rcs(CDAI,4))
ggplot(Predict(f, fun=plogis), ylab='Probability of Remission')

It is readily seen that no cutoff exists, and one would have to be below CDAI of 100 for the probability of remission to reach even 0.5. The probability does not exceed 0.9 until the score falls below 25. Thus there is no clinical justification for the 150 cut-point.

18.7.2 Loss of Information from Using Cut-points

The statistical analysis plans in the Crohn’s disease protocols specify that efficacy will be judged by comparing two proportions after classifying patients’ CDAIs as above or below the cutoff of 150. Even if one could justify a certain cutoff from the data, the use of the cutoff is usually not warranted. This is because of the huge loss of statistical efficiency, precision, and power from dichotomizing continuous variables as discussed in more detail in Section 18.3.4. If one were forced to dichotomize a continuous response \(Y\), the cut-point that loses the least efficiency is the population median of \(Y\) combining treatment groups. That implies a statistical efficiency of \(\frac{2}{\pi}\) or 0.637 when compared to the efficient two-sample \(t\)-test if the data are normally distributed5. In other words, the optimum cut-point would require studying 158 patients after dichotomizing the response variable to get the same power as analyzing the continuous response variable in 100 patients.
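The sample size figure follows from the efficiency by simple arithmetic, sketched here. The variable names are hypothetical, and rounding up to a whole patient is assumed.

```r
eff <- 2 / pi                    # efficiency of median dichotomization vs. the t-test, normal data
n_continuous   <- 100            # patients analyzed on the continuous scale
n_dichotomized <- ceiling(n_continuous / eff)   # patients needed for the same power

c(efficiency = round(eff, 3), n_dichotomized = n_dichotomized)
```

Any cut-point other than the median, such as the fixed cutoff of 150, can only make this ratio worse.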

5 Note that the efficiency of the Wilcoxon test compared to the \(t\)-test is \(\frac{3}{\pi}\) and the efficiency of the sign test compared to the \(t\)-test is \(\frac{2}{\pi}\). Had analysis of covariance been used instead of a simple two-group comparison, the baseline level of CDAI could have been adjusted for as a covariate. This would have increased the power of the continuous scale approach to even higher levels.

18.7.3 Ethical Concerns and Summary

The CDAI was based on a sloppily-fit regression model predicting a subjective clinical impression. Then a cutoff of 150 was used to classify patients as in remission or not. The choice of this cutoff is in opposition to the data used to support it. The data show that one must have CDAI below 100 to have a chance of remission of only 0.5. Hence the use of CDAI\(<150\) as a clinical endpoint was based on a faulty premise that apparently has never been investigated in the Crohn’s disease research community. CDAI can easily be analyzed as a continuous variable, preserving all of the power of the statistical test for efficacy (e.g., two-sample \(t\)-test). The results of the \(t\)-test can readily be translated to estimate any clinical “success probability” of interest, using efficient maximum likelihood estimators.6

6 Given \(\bar{x}\) and \(s\) as estimates of \(\mu\) and \(\sigma\), the estimate of the probability that CDAI \(< 150\) is simply \(\Phi(\frac{150-\bar{x}}{s})\), where \(\Phi\) is the cumulative distribution function of the standard normal distribution. For example, if the observed mean were 150, we would estimate the probability of remission to be 0.5.
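The estimator in the footnote is a one-line calculation. In this sketch the helper `p_below` and the standard deviation of 80 are hypothetical, chosen only to show the mechanics under the normality assumption.

```r
# Hypothetical helper: estimated P(CDAI < cutoff) under a normal model,
# given the sample mean xbar and sample SD s
p_below <- function(xbar, s, cutoff = 150) pnorm((cutoff - xbar) / s)

p_at_cutoff <- p_below(150, 80)   # observed mean equal to the cutoff
p_higher    <- p_below(200, 80)   # observed mean above the cutoff

c(mean_150 = p_at_cutoff, mean_200 = p_higher)
```

Computing this separately for each treatment group gives estimated "remission" probabilities that use all of the information in the continuous scores.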

There are substantial ethical questions that ought to be addressed when statistical power is wasted:

  1. Patients are not consenting to be put at risk for a trial that doesn’t yield valid results.
  2. A rigorous scientific approach is necessary in order to allow enrollment of individuals as subjects in research.
  3. Investigators are obligated to reduce the number of subjects exposed to harm and the amount of harm to which each subject is exposed. It is not known whether investigators are receiving per-patient payments for studies in which sample size is inflated by dichotomizing CDAI.

18.8 Information May Sometimes Be Costly

When the Missionaries arrived, the Africans had the Land and the Missionaries had the Bible. They taught how to pray with our eyes closed. When we opened them, they had the land and we had the Bible.
— Jomo Kenyatta, founding father of Kenya; also attributed to Desmond Tutu

Information itself has a liberal bias.
— The Colbert Report 2006-11-28

18.9 Other Reading

Bordley (2007), Briggs & Zaretzki (2008), Vickers (2008)

Notes

This material is from “Information Allergy” by FE Harrell, presented as the Vanderbilt Discovery Lecture 2007-09-13 and presented as invited talks at Erasmus University, Rotterdam, The Netherlands, University of Glasgow (Mitchell Lecture), Ohio State University, Medical College of Wisconsin, Moffitt Cancer Center, U. Pennsylvania, Washington U., NIEHS, Duke, Harvard, NYU, Michigan, Abbott Labs, Becton Dickinson, NIAID, Albert Einstein, Mayo Clinic, U. Washington, MBSW, U. Miami, Novartis, VCU, FDA. Material is added from “How to Do Bad Biomarker Research” by FE Harrell, presented at the NIH NIDDK Conference Towards Building Better Biomarkers—Statistical Methodology, 2014-12-02.