- There are several approaches for analyzing patient-oriented outcome scales longitudinally
- The oldest approaches are WIN ratio/odds and DOOR (desirability of outcome rankings)
- Newer approaches are time-savings (TS) and ordinal longitudinal models (OLM)
- TS uses either composite outcome measures or statistical summaries of separate measures
- When an outcome scale changes linearly over time for both active and control patients, TS’s estimate of time saved is the number of days earlier that the control group suffered the same fate as the treated group did at end of study.
- Both TS and OLMs offer substantial power gains and sample size reductions over prevailing single-outcome-scale primary analyses.
- OLMs are less parametric, use more formal methods with more explicit assumptions, and provide readouts that are more clinical.

Clinical readouts available from an OLM include:

- Mean covariate-adjusted time in outcome level y or better, separately by treatment
- Covariate-adjusted difference in mean days in outcome level y or better (in COVID-19 we estimated the mean days unwell, which is similar to time to recovery but properly handles death and allows for un-recovery for a recovered patient)
- Probability of being in outcome level y or worse by time and treatment, for any y
- Treated:control odds ratio for transitioning to worse outcome states from one visit to the next
- When the outcome scale is interval-valued, the differences in mean scale values over time, by treatment
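As a rough illustration of how readouts such as state occupancy probabilities and mean time in state can be derived, the sketch below assumes a first-order Markov proportional odds transition model with an absorbing worst state. The four-state scale, the parameter values (`ALPHA`, `GAMMA`, `BETA`), the starting state, and the 28 daily visits are all hypothetical, and covariate adjustment is omitted for brevity. It is a minimal sketch, not the model used in any particular trial.

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def transition_matrix(alpha, gamma, beta, treated):
    """P[j, y] = Pr(Y_t = y | Y_{t-1} = j) under a proportional odds model.

    States are 0..K-1 with higher = worse; state K-1 (e.g., death) is absorbing.
    alpha: decreasing intercepts for Pr(Y_t >= y), y = 1..K-1
    gamma: effect of each non-absorbing previous state (length K-1)
    beta:  log odds ratio (treated:control) for transitioning to worse states
    """
    alpha = np.asarray(alpha, float)
    K = len(alpha) + 1
    P = np.zeros((K, K))
    for j in range(K - 1):
        exceed = expit(alpha + gamma[j] + beta * treated)  # Pr(Y_t >= y)
        cum = np.concatenate(([1.0], exceed, [0.0]))
        P[j] = cum[:-1] - cum[1:]                          # Pr(Y_t = y)
    P[K - 1, K - 1] = 1.0                                  # absorbing state
    return P

def state_occupancy(P, start, n_visits):
    """Row t holds Pr(Y = y) at visit t+1, given the starting distribution."""
    p = np.asarray(start, float)
    probs = np.empty((n_visits, len(p)))
    for t in range(n_visits):
        p = p @ P
        probs[t] = p
    return probs

def mean_time_in_state_or_worse(probs, y, days_per_visit=1.0):
    """Sum over visits of Pr(Y >= y), scaled by the visit spacing."""
    return days_per_visit * probs[:, y:].sum()

# Hypothetical values for illustration only
ALPHA = [0.5, -1.5, -4.0]      # intercepts (must be decreasing)
GAMMA = [-2.0, 0.0, 1.5]       # previous-state effects for states 0, 1, 2
BETA = np.log(0.7)             # odds ratio 0.7 favoring treatment
START = [0.0, 1.0, 0.0, 0.0]   # everyone starts in state 1
N_VISITS = 28

occ_control = state_occupancy(transition_matrix(ALPHA, GAMMA, BETA, 0), START, N_VISITS)
occ_treated = state_occupancy(transition_matrix(ALPHA, GAMMA, BETA, 1), START, N_VISITS)
days_unwell_control = mean_time_in_state_or_worse(occ_control, y=1)
days_unwell_treated = mean_time_in_state_or_worse(occ_treated, y=1)
print("Mean days unwell (control, treated):", days_unwell_control, days_unwell_treated)
print("Difference in mean days unwell:", days_unwell_control - days_unwell_treated)
```

Because the worst state is absorbing, a patient who dies keeps contributing to "time in state y or worse" for the remainder of follow-up, which is how this kind of readout accounts for death.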

Like the Wilcoxon test, WIN and DOOR provide treatment effectiveness metrics that do not have meaning outside of the study, i.e., they provide no clinical readouts such as treatment differences on the original scale, or reduction in time unwell. They allow one to estimate how often a randomly chosen treated patient fares better than a randomly chosen control patient, but do not tell the researcher how much better. For the case of a response having a normal distribution with equal variance for the two treatments, the concordance probability that is the essence of Wilcoxon, WIN, and DOOR is a function of the difference in means divided by the standard deviation. The concordance probability does not reveal the clinical effectiveness (difference in means). WIN and DOOR also do not have a way to handle missing component data and often make the tie-breaking choices too difficult, e.g., how does one rank an early myocardial infarction against a later non-debilitating stroke? OLMs only require ranking of various patient states within a single day of assessment, as time is handled by explicit trajectory modeling. DOOR and WIN try to rank times and amounts jointly.
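To make the normal-distribution case concrete: with equal standard deviation σ and mean difference δ, the concordance probability is Φ(δ / (σ√2)), where Φ is the standard normal distribution function. The short sketch below (the function name and the numeric values are illustrative assumptions) shows two hypothetical studies with very different mean differences on the original scale that yield exactly the same concordance probability.

```python
from math import erf

def concordance_normal(delta, sigma):
    """Pr(treated response > control response) for two normal distributions
    whose means differ by delta and that share standard deviation sigma."""
    # Phi(z) with z = delta / (sigma * sqrt(2)), using Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1.0 + erf(delta / (2.0 * sigma)))

# Hypothetical studies: mean difference 0.5 with SD 1 versus mean difference 5 with SD 10
c_small_scale = concordance_normal(delta=0.5, sigma=1.0)
c_large_scale = concordance_normal(delta=5.0, sigma=10.0)
print(c_small_scale, c_large_scale)
```

The two values are identical because only the ratio δ/σ enters the formula, so the concordance probability alone cannot recover the difference in means.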

What I have been working on since 2020 and have used in several ACTIV-6 COVID-19 therapeutic trials is a flexible OLM that OB and I are now trying in the reanalysis of an ALS study. This model is somewhat of a formalization of the time savings (TS) approach, with these differences:

- TS best applies when treatment has the effect of expanding the time axis, i.e., progression happens more slowly in a uniform way. The TS approach may have difficulty defining exactly what is saved when the time-response curves are not linear. Example: if both control and treated response curves are initially linear with a faster progression for placebo, but both curves come together by study end, time saved will be zero even though the treated patients enjoyed more months of better function (as would be indicated by a mean time in state from an OLM).
- OLM directly estimates mean time in state y or worse, e.g., mean number of months in which a patient has disability level greater than a certain number, or is dead. The model does this for all possible values of y, so multiple interpretable clinical readouts are provided.
- TS can’t fully handle missing component data
- The TS approach to handling longitudinal trajectories and intra-patient correlation is a bit ad hoc
- OLM fits correlation patterns like those seen in multiple longitudinal studies
- TS can’t handle clinical overrides, e.g., it doesn’t know how to handle deaths or other serious events when a clinical event makes it impossible or meaningless to assess a functional outcome scale
- OLM using state transition models is the only available statistical method that completely takes absorbing states such as death into account
- TS does not allow formal examination of how the treatment may affect different parts of the outcome scale differently
- TS may use multiple outcome scales, possibly having different units of measurement, and combines them by some sort of averaging. When the scales are unequally correlated with each other, the weighting used in the combination process will not be correct.

- OLM tries not to average outcome scales but to use up-front judgments that create a hierarchical outcome scale that captures the worst thing that happens to a patient at a given time interval; OLM could use the average of several scales as a component if there is strong clinical consensus on how the various scales are weighted
- OLM reduces to a Cox proportional hazards model if there is only one outcome and it is binary and terminal (e.g., death)
- OLM reduces to the Wilcoxon test if there is only one time point
- OLM provides a unified way to do covariate adjustment
- OLM can be done in both frequentist and Bayesian settings and allows the use of extra-study information (e.g., borrowing information from adults in a pediatric study)
- TS can only be done in a frequentist setting and cannot borrow information
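The converging-curves example in the first bullet above can be checked numerically. The sketch below uses hypothetical piecewise-linear mean function curves (higher = better) over a 24-month study and a hypothetical "good function" threshold of 80. Time saved follows the definition given earlier (how much earlier the control group reached the level the treated group had at study end), and time above the threshold is a simple deterministic stand-in for a mean time in state from an OLM.

```python
import numpy as np

STUDY_MONTHS = 24.0
THRESHOLD = 80.0                      # hypothetical cutoff for "good function"
t = np.linspace(0.0, STUDY_MONTHS, 2401)

# Hypothetical mean curves: control declines 2 points/month throughout;
# treated declines 1 point/month for 12 months, then 3 points/month,
# so the two curves meet at study end.
control = 100.0 - 2.0 * t
treated = np.where(t <= 12.0, 100.0 - t, 88.0 - 3.0 * (t - 12.0))

def time_saved_at_end(t, control, treated):
    """Months earlier that control reached the treated group's end-of-study level.
    Assumes the control curve is strictly decreasing so it can be inverted."""
    target = treated[-1]
    # np.interp needs increasing x values, so reverse the decreasing control curve
    t_control_reaches_target = np.interp(target, control[::-1], t[::-1])
    return t[-1] - t_control_reaches_target

def months_above(curve, threshold, duration):
    """Approximate time spent above the threshold as a fraction of the time grid."""
    return duration * np.mean(curve > threshold)

ts = time_saved_at_end(t, control, treated)
good_months_control = months_above(control, THRESHOLD, STUDY_MONTHS)
good_months_treated = months_above(treated, THRESHOLD, STUDY_MONTHS)
print("Time saved at study end (months):", ts)
print("Months above threshold (control, treated):", good_months_control, good_months_treated)
```

Under these assumed curves, time saved at study end is zero, while the treated group spends several more months above the threshold than the control group.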

Consider a patient status scale that is intended to capture important aspects of what patients are experiencing in a single time period. A gold standard approach is to present various scenarios to carefully chosen participants in a cross-sectional study. For each scenario a triangulation process is used to elicit the person’s time trade-off, i.e., how many months of life would she sacrifice to be in perfect health rather than to be in that scenario. What is learned from the time trade-off experiment is used to assign utilities to patient status at each assessment, and these utilities are analyzed using ordinal regression (preferred because of odd distributions of utilities including floor and ceiling effects) or linear models. The utilities over time form the basis for efficacy assessment. In an ordinal analysis, death is assigned a utility or may just be considered to be the worst outcome (it doesn’t matter how much worse) if the time trade-off experiment did not find states worse than death.

The best statistical approach to analyzing patients’ trajectories for efficacy assessment needs to approximate the fully utility-based approach outlined above. OLMs tend to do that.

When utilities are not available, there are other approaches for constructing good composite outcome scales, e.g.,

- Present a large number of patients’ current outcomes to a large number of medical experts. These can be presented in pairs, and one records the pairwise determination of which outcome is worse. Discrete choice models can put all such data together to derive an overall ranking of outcome severities that can be used as a per-visit ordinal scale in an OLM. The choice model can also provide evidence for sufficient consensus of rankings among experts.
- When the number of levels of an outcome scale is not very large, have experts rank all the levels (without providing a series of separate scenarios to them) and average the rankings, checking for adequate consensus of rankings.
- For a large cohort study with accurately collected data, in which there is an ultimate gold-standard outcome such as time to death, use a regression model to relate all the candidate outcome scales to the gold-standard outcome. Use the regression coefficients to compute a time-to-death-anchored disease severity scale, and use this derived scale in the RCT.
- Using a sufficiently large patient observational cohort in which a series of outcome scales are assessed at a single follow-up time, take one of the scales as the anchor. Predict the anchor scale from all of the others, one-at-a-time, to obtain a recalibration of each non-anchor scale to the anchor. Translate each non-anchor scale to the anchor scale metric. Using the observed correlation matrix among all the scales, compute an optimum weighted average from the anchor scale and all the recalibrated non-anchor scales. Take the new weighted average scale as an ordinal patient response, and add clinical event overrides at the top of that scale.
- As is often done in migraine headache therapeutic studies, score all the potential outcome scales on a similar Likert scale. At the time of randomization, ask each patient in the RCT which facet of the disease process is most important to them. Use the chosen Likert scale as an ordinal outcome (with clinical event overrides) in an OLM. Thus each patient may be scored on a different scale, but all the scales are Likert-scored to be as comparable as possible.
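For the pairwise-comparison approach in the first bullet above, one simple discrete choice model is the Bradley-Terry model. The sketch below is an assumption about how such an analysis might be set up, not a prescribed method: `wins[i, j]` is a hypothetical count of expert judgments in which outcome `i` was rated worse than outcome `j`, and the fit uses the standard iterative (minorization-maximization) update. It assumes every outcome was judged worse at least once and every outcome appears in at least one comparison.

```python
import numpy as np

def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry severity scores from pairwise 'worse than' counts.

    wins[i, j] = number of times outcome i was judged worse than outcome j.
    Returns scores that sum to 1; larger means judged more severe.
    """
    wins = np.asarray(wins, float)
    n_pairs = wins + wins.T                  # comparisons made between each pair
    total_wins = wins.sum(axis=1)            # times each outcome was judged worse
    p = np.full(len(wins), 1.0 / len(wins))
    for _ in range(n_iter):
        denom = (n_pairs / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p

# Hypothetical data: 4 outcomes, each pair judged by 10 experts
wins = np.array([
    [0, 3, 1, 0],
    [7, 0, 4, 2],
    [9, 6, 0, 3],
    [10, 8, 7, 0],
])
scores = bradley_terry(wins)
severity_order = np.argsort(scores)          # least severe first
print("Severity scores:", scores)
print("Outcomes from least to most severe:", severity_order)
```

The resulting ordering could serve as the per-visit ordinal scale. Judging whether the experts agree sufficiently would need a separate check, for example comparing fits across subsets of experts.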

It’s worth pausing a moment to consider a completely different approach to multiple outcome scales that Bayesian joint modeling can provide. This approach is a bit more complex but is possibly very aligned with regulatory decision making.

- Specify a statistical model for each outcome scale and outcome event
- Specify how the different outcomes are correlated with each other (this is often done in the context of copulas)
- For m outcome measures compute the Bayesian posterior probability that the treatment benefits the patient on at least q of these
- Example: if there are 5 outcome measures, a drug might be considered effective if there is evidence that it benefits the patient on at least 3 of the outcomes, without having to specify up-front which 3.
- This doesn’t require multiple different scales to be transformed to a single scale, and there is no multiplicity even in a frequentist sense
- This approach doesn’t work so well when interrupting events (e.g., death) are common within the study’s follow-up duration
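Once posterior draws of the treatment effects on all m outcomes are available from a joint model, the "at least q of m" probability is a simple count over draws. The sketch below assumes each effect is oriented so that positive values mean benefit. The simulated multivariate normal draws merely stand in for real posterior output, and their means, variances, and correlation are hypothetical.

```python
import numpy as np

def prob_benefit_on_at_least(draws, q):
    """Posterior probability that treatment benefits on at least q outcomes.

    draws: array of shape (n_draws, m); column k holds posterior draws of the
           treatment effect on outcome k, oriented so that > 0 means benefit.
    """
    draws = np.asarray(draws, float)
    n_benefits = (draws > 0).sum(axis=1)     # outcomes showing benefit, per draw
    return float(np.mean(n_benefits >= q))

# Hypothetical stand-in for posterior draws from a joint model with m = 5 outcomes
rng = np.random.default_rng(0)
m = 5
means = np.array([0.3, 0.2, 0.1, 0.0, -0.1])
cov = 0.04 * (0.7 * np.eye(m) + 0.3 * np.ones((m, m)))   # SD 0.2, correlation 0.3
posterior_draws = rng.multivariate_normal(means, cov, size=4000)

p_at_least_3 = prob_benefit_on_at_least(posterior_draws, q=3)
print("Pr(benefit on at least 3 of 5 outcomes):", p_at_least_3)
```

Because the count is taken within each joint draw, the correlations among outcomes are carried through automatically, and there is no need to name in advance which outcomes must show benefit.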

No matter which analytic approach is used, it is important to consider which outcome elements to include in light of sensitivity to detect disease progression and not giving too much weight to scales that change little over time.