Views on Composite Outcome Scales and Statistical Approaches for Analyzing Them

Frank Harrell

High-Level View

Example Readouts from an OLM


Like the Wilcoxon test, WIN and DOOR provide treatment effectiveness metrics that have no meaning outside of the study, i.e., they provide no clinical readouts such as treatment differences on the original scale or reduction in time unwell. They allow one to estimate how often a randomly chosen treated patient fares better than a randomly chosen control patient, but do not tell the researcher how much better. For a response that has a normal distribution with equal variance for the two treatments, the concordance probability that is the essence of Wilcoxon, WIN, and DOOR is a function of the difference in means divided by the standard deviation. The concordance probability therefore does not by itself reveal the clinical effectiveness (difference in means). WIN and DOOR also have no way to handle missing component data and often make the tie-breaking choices too difficult, e.g., how does one rank an early myocardial infarction against a later non-debilitating stroke? OLMs only require ranking of various patient states within a single day of assessment, as time is handled by explicit trajectory modeling. DOOR and WIN try to rank times and amounts jointly.
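To make the normal-distribution point concrete, here is a minimal sketch of the concordance probability under the stated assumptions (normal responses, common standard deviation, larger values better). In that case the difference between a treated and a control response is normal with variance twice the common variance, so the concordance probability is the standard normal CDF evaluated at the mean difference divided by (SD times the square root of 2). The function names and the two hypothetical trials are illustrative only and do not come from any WIN, DOOR, or OLM software.

```python
from math import erf, sqrt


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))


def concordance_probability(mean_diff: float, sd: float) -> float:
    """P(treated response > control response) when both responses are
    normal with common standard deviation `sd` and the treated mean
    exceeds the control mean by `mean_diff` (assumes larger is better)."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    # The treated-minus-control difference has standard deviation sd * sqrt(2).
    return normal_cdf(mean_diff / (sd * sqrt(2.0)))


# Two hypothetical trials: mean differences of 2 and 10 units, but the
# same ratio of mean difference to standard deviation (0.5).
c_small = concordance_probability(mean_diff=2.0, sd=4.0)
c_large = concordance_probability(mean_diff=10.0, sd=20.0)
print(c_small, c_large)
```

Both hypothetical trials give the same concordance probability even though the mean differences differ fivefold, which is why the concordance probability alone cannot be translated back to a difference on the original scale without also knowing the standard deviation.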

Comparison of TS and OLM

What I have been working on since 2020 and have used in several ACTIV-6 COVID-19 therapeutic trials is a flexible OLM that OB and I are now trying in the reanalysis of an ALS study. This model is somewhat of a formalization of the time savings (TS) approach, with these differences:

Approaches to Constructing Composite Outcome Scales

Consider a patient status scale that is intended to capture important aspects of what patients are experiencing in a single time period. A gold standard approach is to present various scenarios to carefully chosen participants in a cross-sectional study. For each scenario a triangulation process is used to elicit the person’s time trade-off, i.e., how many months of life she would sacrifice to be in perfect health rather than to be in that scenario. What is learned from the time trade-off experiment is used to assign utilities to patient status at each assessment, and these utilities are analyzed using ordinal regression (preferred because of odd distributions of utilities, including floor and ceiling effects) or linear models. The utilities over time form the basis for efficacy assessment. In an ordinal analysis, death is assigned a utility or may just be considered to be the worst outcome (it doesn’t matter how much worse) if the time trade-off experiment did not find states worse than death.
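As a rough illustration of how time trade-off responses might be turned into utilities, the sketch below uses the conventional conversion of one minus the fraction of the time horizon a person would give up. The 120-month horizon, the state names, the elicited months, and the choice to place death at zero are all hypothetical assumptions made for the example; they are not taken from any actual elicitation study, and the simple formula does not cover states judged worse than death.

```python
def tto_utility(months_sacrificed: float, horizon_months: float) -> float:
    """Conventional time trade-off utility: 1 minus the fraction of the
    horizon the person would give up to be in perfect health.
    Assumption: this form cannot represent states worse than death."""
    if horizon_months <= 0:
        raise ValueError("horizon_months must be positive")
    if not 0 <= months_sacrificed <= horizon_months:
        raise ValueError("months_sacrificed must lie within the horizon")
    return 1.0 - months_sacrificed / horizon_months


# Hypothetical elicited trade-offs (months given up out of 120).
elicited_months = {
    "mild symptoms": 6.0,
    "hospitalized": 48.0,
    "ventilated": 96.0,
}
utilities = {
    state: tto_utility(months, horizon_months=120.0)
    for state, months in elicited_months.items()
}
utilities["death"] = 0.0  # assumption: death anchors the bottom of the scale

# An ordinal analysis uses only the ordering of states, worst to best.
ranked_states = sorted(utilities, key=utilities.get)
print(ranked_states)
```

Only the ordering in `ranked_states` would enter an ordinal analysis, which is why the exact utility attached to death does not matter as long as no state is judged worse than death.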

The best statistical approach to analyzing patients’ trajectories for efficacy assessment needs to approximate the fully utility-based approach outlined above. OLMs tend to do that.

When utilities are not available, there are other approaches for constructing good composite outcome scales, e.g.,

A Completely Different Bayesian Approach

It’s worth pausing a moment to consider a completely different approach to multiple outcome scales that Bayesian joint modeling can provide. This approach is a bit more complex but may align closely with regulatory decision making.

General Considerations for Candidate Outcome Scales

No matter which analytic approach is used, it is important to choose the outcome elements to include in light of their sensitivity to detect disease progression, and to avoid giving too much weight to scales that change little over time.