High-Level View
- There are several approaches for analyzing patient-oriented outcome
scales longitudinally
- The oldest approaches are WIN ratio/odds and DOOR (desirability of
outcome rankings)
- Newer approaches are time-savings (TS) and ordinal longitudinal
models (OLM)
- TS uses either composite outcome measures or statistical summaries
of separate measures
- When an outcome scale changes linearly over time for both active and
control patients, TS’s estimate of time saved is the number of days
earlier that the control group reached the outcome level that the
treated group reached at the end of the study.
- Both the TS method and ordinal longitudinal models (OLM) will offer
significant power gains/sample size reductions over prevailing
single-outcome-scale primary analyses.
- OLMs are less parametric, use more formal methods with more explicit
assumptions, and provide readouts that are more clinical.
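The linear case of time savings described above can be illustrated with a small calculation. This is only a sketch under stated assumptions: both arms start at the same baseline, both decline linearly, and the slopes and study length are hypothetical numbers chosen for illustration, not values from any trial. The function name `time_saved_linear` is also hypothetical.

```python
def time_saved_linear(slope_control, slope_treated, study_days):
    """Days earlier that control reached the treated arm's end-of-study level.

    Assumes (for illustration only) a common baseline and linear change
    in both arms, with slopes in scale points per day.
    """
    if slope_control == 0:
        raise ValueError("control slope must be nonzero")
    # Change from baseline in the treated arm at the end of the study
    treated_change = slope_treated * study_days
    # Day on which the control arm had already changed by that amount
    control_day = treated_change / slope_control
    return study_days - control_day


# Hypothetical slopes: control loses 0.10 points/day, treated 0.075 points/day
print(time_saved_linear(-0.10, -0.075, 360))  # about 90 days
```

Under linearity the result depends only on the ratio of the two slopes and the study length, which is why the readout becomes harder to define once the curves bend.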
Example Readouts from an OLM
- Mean covariate-adjusted time in outcome level y or better,
separately by treatment
- Covariate-adjusted difference in mean days in outcome level y or
better (in COVID-19 we estimated the mean days unwell, which is similar
to time to recovery but properly handles death and allows for
un-recovery for a recovered patient)
- Probability of being in outcome level y or worse by time and
treatment, for any y
- Treated:control odds ratio for transitioning to worse outcome states
from one visit to the next
- When the outcome scale is interval-valued, the differences in mean
scale values over time, by treatment
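As a rough sketch of how the time-in-state readouts above relate to state occupancy probabilities: once a model supplies the probability of each outcome level on each day, the mean number of days in level y or better is the sum over days of the probability of being in level y or better. The probabilities below are made-up numbers standing in for model output, the three states (well, unwell, dead) are hypothetical, and the function name is an assumption for illustration; covariate adjustment and uncertainty are not shown.

```python
import numpy as np


def mean_days_in_level_or_better(occupancy, y):
    """Sum over days of P(state <= y).

    occupancy[t, k] = P(state k on day t), with states ordered from
    best (0) to worst (last column).
    """
    occupancy = np.asarray(occupancy, dtype=float)
    if not np.allclose(occupancy.sum(axis=1), 1.0):
        raise ValueError("each day's probabilities must sum to 1")
    return float(occupancy[:, : y + 1].sum())


# Hypothetical occupancy probabilities over 4 days: columns are well, unwell, dead
control = [[0.20, 0.80, 0.00], [0.30, 0.65, 0.05], [0.40, 0.50, 0.10], [0.50, 0.40, 0.10]]
treated = [[0.20, 0.80, 0.00], [0.40, 0.57, 0.03], [0.55, 0.40, 0.05], [0.65, 0.30, 0.05]]

days_well_control = mean_days_in_level_or_better(control, 0)
days_well_treated = mean_days_in_level_or_better(treated, 0)
print(days_well_treated - days_well_control)  # difference in mean days well
```

Because death is simply the worst column, it counts against time well on every day after it occurs, with no special handling.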
WIN and DOOR
Like the Wilcoxon test, WIN and DOOR provide treatment effectiveness
metrics that do not have meaning outside of the study, i.e., they provide
no clinical readouts such as treatment differences on the original
scale, or reduction in time unwell. They allow one to estimate how often
a randomly chosen treated patient fares better than a randomly chosen
control patient, but do not tell the researcher about how much better.
For the case of a response having a normal distribution with equal
variance for the two treatments, the concordance probability that is the
essence of Wilcoxon, WIN, and DOOR is a function of the difference in
means divided by the standard deviation. The concordance probability
does not reveal the clinical effectiveness (difference in means). WIN
and DOOR also do not have a way to handle missing component data and
often make the tie-breaking choices too difficult, e.g., how does one
rank an early myocardial infarction against a later non-debilitating
stroke? OLMs only require ranking of various patient states within a
single day of assessment, as time is handled by explicit trajectory
modeling. DOOR and WIN try to rank times and amounts jointly.
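The normal-theory point above can be made concrete. For two independent normal responses with a common standard deviation, the concordance probability is Φ(δ / (σ√2)), where δ is the difference in means. The sketch below, with hypothetical numbers, shows two very different mean differences giving the same concordance because their ratio to the standard deviation is the same.

```python
from math import sqrt
from statistics import NormalDist


def concordance_normal(mean_diff, sd):
    """P(treated response > control response) for independent normal
    responses with a common standard deviation."""
    return NormalDist().cdf(mean_diff / (sd * sqrt(2)))


# Hypothetical settings: a 5-point difference with SD 10,
# and a 0.5-point difference with SD 1
c_large = concordance_normal(5.0, 10.0)
c_small = concordance_normal(0.5, 1.0)
print(round(c_large, 3), round(c_small, 3))  # the two values agree
```

The concordance alone therefore cannot distinguish a 5-point benefit from a 0.5-point benefit.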
Comparison of TS and OLM
What I have been working on since 2020 and have used in several
ACTIV-6 COVID-19 therapeutic trials is a flexible OLM that OB and I are
now trying in the reanalysis of an ALS study. This model is somewhat of
a formalization of the time savings (TS) approach, with these
differences:
- TS best applies when treatment has the effect of expanding the time
axis, i.e., progression happens slower in a uniform way. The TS approach
may have difficulty defining exactly what is saved when the
time-response curves are not linear. Example: if both control and
treated response curves are initially linear with a faster progression
for placebo, but both curves come together by study end, time saved will
be zero but the treatment patients enjoyed more months of better
function (as would be indicated by a mean time in state from an
OLM).
- OLM directly estimates mean time in state y or worse, e.g., mean
number of months in which a patient has disability level greater than a
certain number, or is dead. The model does this for all possible values
of y, so multiple interpretable clinical readouts are provided.
- TS can’t fully handle missing component data
- The TS approach to handling longitudinal trajectories and
intra-patient correlation is a bit ad hoc
- OLM fits correlation patterns like those seen in multiple
longitudinal studies
- TS can’t handle clinical overrides, e.g., doesn’t know how to handle
deaths or other serious events when a clinical event makes it impossible
or meaningless to assess a functional outcome scale
- OLM using state transition models is the only available statistical
method that completely takes absorbing states such as death into
account
- TS does not allow formal examination of how the treatment may affect
different parts of the outcome scale differently
- TS may use multiple outcome scales, possibly having different units
of measurement, and combines them by some sort of averaging. When the
scales are unequally correlated with each other, the weighting used in
the combination process will not be correct.
- OLM tries not to average outcome scales but to use up-front
judgments that create a hierarchical outcome scale that captures the
worst thing that happens to a patient at a given time interval; OLM
could use the average of several scales as a component if there is
strong clinical consensus on how the various scales are weighted
- OLM reduces to a Cox proportional hazards model if there is only one
outcome and it is binary and terminal (e.g., death)
- OLM reduces to the Wilcoxon test if there is only one time
point
- OLM provides a unified way to do covariate adjustment
- OLM can be done in both frequentist and Bayesian settings and allows
the use of extra-study information (e.g., borrowing information from
adults in a pediatric study)
- TS can only be done in a frequentist setting and cannot borrow
information
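The converging-curves example in the first bullet above can be sketched numerically. The mean trajectories here are entirely hypothetical, and the calculation works on mean curves only; it is not an OLM fit, just an illustration of why end-of-study time saved can be zero while treated patients still spend more time at better function. The threshold of 28 points is likewise an arbitrary illustrative choice.

```python
import numpy as np

months = np.arange(0, 13)
# Hypothetical mean function scores (higher is better)
control = 40 - 2 * months                                   # steady decline
treated = np.where(months <= 8, 40 - months, 32 - 4 * (months - 8))  # slow, then fast

# Time saved: how much earlier control reached the treated end-of-study level.
# np.interp needs increasing x values, so the decreasing curve is reversed.
control_time = np.interp(treated[-1], control[::-1], months[::-1])
time_saved = months[-1] - control_time

# Monthly visits at which mean function is at or above the threshold
visits_control = int((control >= 28).sum())
visits_treated = int((treated >= 28).sum())

# Area between the curves (point-months), by the trapezoidal rule
diff = treated - control
area = float(((diff[1:] + diff[:-1]) / 2 * np.diff(months)).sum())

print(time_saved, visits_treated - visits_control, area)
```

Both curves end at the same value, so the time-saved summary sees no benefit, while the threshold count and the area between curves both favor treatment.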
Approaches to Constructing Composite Outcome Scales
Consider a patient status scale that is intended to capture important
aspects of what patients are experiencing in a single time period. A
gold standard approach is to present various scenarios to carefully
chosen participants in a cross-sectional study. For each scenario a
triangulation process is used to elicit the person’s time trade-off,
i.e., how many months of life would she sacrifice to be in perfect
health rather than to be in that scenario. What is learned from the time
trade-off experiment is used to assign utilities to patient status at
each assessment, and these utilities are analyzed using ordinal
regression (preferred because of odd distributions of utilities
including floor and ceiling effects) or linear models. The utilities
over time form the basis for efficacy assessment. In an ordinal
analysis, death is assigned a utility or may just be considered to be
the worst outcome (it doesn’t matter how much worse) if the
time trade-off experiment did not find states worse than death.
The best statistical approach to analyzing patients’ trajectories for
efficacy assessment needs to approximate the fully utility-based
approach outlined above. OLMs tend to do that.
When utilities are not available, there are other approaches for
constructing good composite outcome scales, e.g.,
- Present a large number of patients’ current outcomes to a large
number of medical experts. These can be presented in pairs, and one
records the pairwise determination of which outcome is worse. Discrete
choice models can put all such data together to derive an overall
ranking of outcome severities that can be used as a per-visit ordinal
scale in an OLM. The choice model can also provide evidence for
sufficient consensus of rankings among experts.
- When the number of levels of an outcome scale is not very large,
have experts rank all the levels (without providing a series of separate
scenarios to them) and average the rankings, checking for adequate
consensus of rankings.
- For a large cohort study with accurately collected data, in which
there is an ultimate gold-standard outcome such as time to death, use a
regression model to relate all the candidate outcome scales to the
gold-standard outcome. Use the regression coefficients to compute a
time-to-death-anchored disease severity scale, and use this derived
scale in the RCT.
- Using a sufficiently large patient observational cohort in which a
series of outcome scales are assessed at a single follow-up time, take
one of the scales as the anchor. Predict the anchor scale from all of
the others, one-at-a-time, to obtain a recalibration of each non-anchor
scale to the anchor. Translate each non-anchor scale to the anchor scale
metric. Using the observed correlation matrix among all the scales,
compute an optimum weighted average from the anchor scale and all the
recalibrated non-anchor scales. Take the new weighted average scale as
an ordinal patient response, and add clinical event overrides at the top
of that scale.
- As is often done in migraine headache therapeutic studies, score all
the potential outcome scales on a similar Likert scale. At the time of
randomization, ask each patient in the RCT which facet of the disease
process is most important to them. Use the chosen Likert scale as an
ordinal outcome (with clinical event overrides) in an OLM. Thus each
patient may be scored on a different scale, but all the scales are
Likert-scored to be as comparable as possible.
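For the pairwise expert comparison approach in the first bullet above, one simple discrete choice model is the Bradley-Terry model. It is used here purely as an illustration; the text does not say which choice model is intended. The sketch fits it with the standard iterative (minorization-maximization) update. The three outcomes and the judgment counts are hypothetical, and the function assumes every outcome was judged worse at least once.

```python
import numpy as np


def bradley_terry(worse_counts, n_iter=500, tol=1e-10):
    """Fit Bradley-Terry severity scores from pairwise judgments.

    worse_counts[i, j] = number of times outcome i was judged worse than
    outcome j (diagonal must be 0). Returns scores that sum to 1; a larger
    score means the outcome tends to be judged worse.
    Assumes every outcome was judged worse at least once.
    """
    w = np.asarray(worse_counts, dtype=float)
    n = w + w.T                       # total comparisons for each pair
    times_worse = w.sum(axis=1)       # how often each outcome "lost"
    p = np.full(len(w), 1.0 / len(w))
    for _ in range(n_iter):
        denom = (n / (p[:, None] + p[None, :])).sum(axis=1)
        p_new = times_worse / denom
        p_new /= p_new.sum()
        converged = np.max(np.abs(p_new - p)) < tol
        p = p_new
        if converged:
            break
    return p


# Hypothetical judgments from 10 experts per pair
outcomes = ["mild symptoms", "hospitalized", "ventilated"]
worse_counts = [[0, 1, 1],
                [9, 0, 2],
                [9, 8, 0]]
severity = bradley_terry(worse_counts)
ranking = [outcomes[i] for i in np.argsort(severity)]  # least to most severe
print(ranking)
```

Only the resulting ordering would feed the per-visit ordinal scale; checking the degree of consensus among experts would need additional analysis not shown here.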
A Completely Different Bayesian Approach
It’s worth pausing a moment to consider a completely different
approach to multiple outcome scales that Bayesian joint modeling can
provide. This approach is a bit more complex but may align well with
regulatory decision making.
- Specify a statistical model for each outcome scale and outcome
event
- Specify how the different outcomes are correlated with each other
(this is often done in the context of copulas)
- For m outcome measures compute the Bayesian posterior probability
that the treatment benefits the patient on at least q of these
- Example: if there are 5 outcome measures, a drug might be considered
effective if there is evidence that it benefits the patient on at least
3 of the outcomes, without having to specify up-front which 3.
- This doesn’t require multiple different scales to be transformed to
a single scale, and there is no multiplicity even in a frequentist
sense
- This approach doesn’t work so well when interrupting events (e.g.,
death) are common within the study’s follow-up duration
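Given joint posterior draws of the treatment effect on each outcome, the "at least q of m" probability described above is just the fraction of draws in which q or more effects fall on the beneficial side. The sketch below assumes positive effects mean benefit and uses simulated correlated normal draws as a stand-in for real MCMC output; the means, covariance, and function name are all hypothetical.

```python
import numpy as np


def prob_benefit_on_at_least(draws, q):
    """Posterior probability of benefit on at least q outcomes.

    draws[s, k] = posterior draw s of the treatment effect on outcome k,
    coded so that a positive value means benefit (an assumption here).
    """
    draws = np.asarray(draws, dtype=float)
    n_benefits = (draws > 0).sum(axis=1)   # outcomes with benefit, per draw
    return float((n_benefits >= q).mean())


# Stand-in for joint posterior draws on m = 5 outcomes (hypothetical values)
rng = np.random.default_rng(0)
means = [0.30, 0.20, 0.25, 0.00, -0.05]
cov = 0.02 * np.eye(5) + 0.02              # correlated effects
posterior_draws = rng.multivariate_normal(means, cov, size=4000)

print(prob_benefit_on_at_least(posterior_draws, q=3))
```

Because the count is taken within each joint draw, the correlation among outcomes is carried through automatically, and no choice of which 3 outcomes is needed in advance.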
General Considerations for Candidate Outcome Scales
No matter which analytic approach is used, it is important to
choose outcome elements for their sensitivity to detect disease
progression, and to avoid giving too much weight to scales that
change little over time.