13  Ordinal Logistic Regression

Resources: fharrell.com/post/rpo

13.1 Background

  • Levels of \(Y\) are ordered; no spacing assumed
  • If no model assumed, one can still assess association between \(X\) and \(Y\)
  • Example: \(Y=0,1,2\) corresponds to no event, heart attack, death. A test of association between race (3 levels) and outcome (3 levels) can be obtained from a \(2 \times 2 = 4\) d.f. \(\chi^2\) test for a contingency table
  • If willing to assume an ordering of \(Y\) and a model, can test for association using \(2 \times 1 = 2\) d.f.
  • Proportional odds model: generalization of Wilcoxon-Mann-Whitney-Kruskal-Wallis-Spearman
  • Can have \(n\) categories for \(n\) observations!
  • Continuation ratio model: discrete proportional hazards model

13.2 Ordinality Assumption

  • Assume \(X\) is linearly related to some appropriate log odds
  • Estimate mean \(X | Y\) with and without assuming the model holds
  • For simplicity assume \(X\) discrete
  • Let \(P_{jx} = \Pr(Y=j | X=x, \text{model})\)
\[\begin{aligned} \Pr(X=x | Y=j) &= \Pr(Y=j | X=x)\, \frac{\Pr(X=x)}{\Pr(Y=j)} \\ E(X | Y=j) &= \sum_{x} x P_{jx} \frac{\Pr(X=x)}{\Pr(Y=j)} , \end{aligned}\]

and the expectation can be estimated by \[ \hat{E}(X | Y=j) = \sum_{x} x \hat{P}_{jx} f_{x} / g_{j}, \] where \(f_{x}\) is the frequency of \(X=x\), \(g_{j}\) is the frequency of \(Y=j\), and \(\hat{P}_{jx}\) is the estimate of \(P_{jx}\) from the one-predictor model. Equivalently, without grouping the data by \(X\), \[ \hat{E}(X | Y=j) = \sum_{i=1}^{n} x_{i} \hat{P}_{jx_{i}} / g_{j} . \] If the model holds, these estimates should be close to the simple stratified means of \(X | Y=j\).
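To make the estimator concrete, here is a minimal R sketch for a single predictor; the variable names x and y are hypothetical, and complete data are assumed so that row counts line up.

Code
require(rms)
# one-predictor PO model; y is ordinal with levels 0..k, x is the predictor
f <- lrm(y ~ x)
P <- predict(f, type='fitted.ind')   # n x (k+1) matrix of Pr(Y=j | x_i)
g <- table(y)                        # frequencies g_j of Y=j
Ehat <- colSums(x * P) / g           # model-based estimate of E(X | Y=j)
Ebar <- tapply(x, y, mean)           # stratified means, assuming no model
rbind(model=Ehat, stratified=Ebar)   # close agreement supports ordinality in x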

13.3 Proportional Odds Model

13.3.1 Model

  • Walker & Duncan (1967) — most popular ordinal response model

  • For convenience \(Y=0, 1, 2, \ldots, k\) \[ \Pr[Y \geq j | X] = \frac{1}{1 + \exp[-(\alpha_{j} + X\beta)]} = \text{expit}(\alpha_{j} + X\beta) \] where \(j=1, 2, \ldots, k\).

  • \(\alpha_j\) is the logit of Prob\([Y \geq j]\) when all \(X\)s are zero

  • Odds\([Y \geq j | X] = \exp(\alpha_{j}+X\beta)\)

  • Odds\([Y \geq j | X_{m}=a+1]\) / Odds\([Y \geq j | X_{m}=a] = e^{\beta_{m}}\)

  • Same odds ratio \(e^{\beta_{m}}\) for every \(j=1,2,\ldots,k\)

  • Odds\([Y \geq j | X]\) / Odds\([Y \geq v | X] = \frac{e^{\alpha_{j}+X\beta}}{e^{\alpha_{v}+X\beta}} = e^{\alpha_{j}-\alpha_{v}}\)

  • Odds\([Y \geq j | X]\) = constant \(\times\) Odds\([Y \geq v | X]\)

  • Assumes OR for 1 unit increase in age is the same when considering the probability of death as when considering the probability of death or heart attack

  • The PO model uses only the ranks of \(Y\): the \(\hat{\beta}\)s are unchanged if \(Y\) is monotonically transformed, and the model is robust to outliers in \(Y\)

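To make these properties concrete, here is a brief sketch using rms::lrm and the SUPPORT dataset analyzed later in this chapter; the choice of predictors simply mirrors the later examples.

Code
require(rms)
require(Hmisc)
getHdata(support)
sfdm <- as.integer(support$sfdm2) - 1   # ordinal outcome 0..4
f <- lrm(sfdm ~ adlsc + sex + age + meanbp, data=support)
f                        # k intercepts alpha_j, but only one slope per predictor
exp(coef(f)['adlsc'])    # common OR for Y >= j per unit increase in adlsc, for every j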

13.4 Assumptions and Interpretation of Parameters

13.4.1 Estimation

13.4.2 Residuals

  • Construct binary events \(Y\geq j, j=1,2,\ldots,k\) and use corresponding predicted probabilities \[ \hat{P}_{ij} = \text{expit}(\hat{\alpha}_{j}+X_{i}\hat{\beta}), \]

  • Score residual for subject \(i\), predictor \(m\):


\[ U_{im} = X_{im} ( [Y_{i}\geq j] - \hat{P}_{ij} ), \]

  • For each column of \(U\) plot mean \(\bar{U}_{\cdot m}\) and C.L. against \(Y\)
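In rms this plot can be obtained directly; a sketch, assuming the SUPPORT fit above is refit with x=TRUE, y=TRUE so residuals can be computed:

Code
f <- lrm(sfdm ~ adlsc + sex + age + meanbp, data=support, x=TRUE, y=TRUE)
resid(f, 'score.binary', pl=TRUE)   # mean score residuals with C.L., by cutoff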
  • Partial residuals are more useful, as they can also be used to estimate covariable transformations (Collett, 2002; Landwehr et al., 1984):

\[ r_{im} = \hat{\beta}_{m}X_{im} + \frac{Y_{i} - \hat{P}_{i}}{\hat{P}_{i} (1 - \hat{P}_{i})}, \] where

\[ \hat{P}_{i} = \frac{1}{1 + \exp[-(\hat{\alpha} + X_{i}\hat{\beta})]} = \text{expit}(\hat{\alpha} + X_{i}\hat{\beta}). \]

  • Smooth \(r_{im}\) vs. \(X_{im}\) to estimate how \(X_{m}\) relates to the log relative odds that \(Y=1 | X_{m}\)
  • For ordinal \(Y\), compute binary model partial residuals for all cutoffs \(j\):

\[ r_{im} = \hat{\beta}_{m}X_{im} + \frac{ [Y_{i} \geq j] - \hat{P}_{ij}}{\hat{P}_{ij} (1 - \hat{P}_{ij})}, \]

Li & Shepherd (2012) have a residual for ordinal models that serves for the entire range of \(Y\) without the need to consider cutoffs. Their residual is useful for checking functional form of predictors but not the proportional odds assumption.
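A sketch of the corresponding rms calls, continuing the fit above; one smoothed curve per cutoff \(j\) is drawn so parallelism can be judged, and the Li–Shepherd residual type is assumed to be available in the installed rms version:

Code
resid(f, 'partial', pl=TRUE)   # smoothed partial residuals for all cutoffs j
resid(f, 'li.shepherd')        # Li-Shepherd residuals, one per observation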

13.4.3 Assessment of Model Fit

  • The method of Section 13.2: compare model-based \(\hat{E}(X | Y=j)\) with stratified means of \(X | Y\)
  • Examine stratified proportions of \(Y \geq j,\ j = 1, 2, \ldots, k\): under PO, \(\text{logit}(Y \geq j | X) - \text{logit}(Y \geq i | X) = \alpha_{j} - \alpha_{i}\) for any constant \(X\), so the logits of these proportions should keep a constant separation
Code
spar(ps=7)
require(Hmisc)
getHdata(support)
sfdm <- as.integer(support$sfdm2) - 1
sf <- function(y)
  c('Y>=1'=qlogis(mean(y >= 1)), 'Y>=2'=qlogis(mean(y >= 2)),
    'Y>=3'=qlogis(mean(y >= 3)))
s <- summary(sfdm ~ adlsc + sex + age + meanbp, fun=sf, data=support)
plot(s, which=1:3, pch=1:3, xlab='logit', vnames='names', main='',
     width.factor=1.5)
Figure 13.1: Checking PO assumption separately for a series of predictors. The circle, triangle, and plus sign correspond to \(Y \geq 1, 2, 3\), respectively. PO is checked by examining the vertical constancy of distances between any two of these three symbols. Response variable is the severe functional disability scale sfdm2 from the \(1000\)-patient SUPPORT dataset, with the last two categories combined because of low frequency of coma/intubation.

Note that computing ORs for various cutoffs, and observing disagreements among them, can lead reviewers to confuse sampling variation (random chance) with lack of fit. For a 4-level \(Y\) having a given vector of cell probabilities in the control group, assume PO holds with a true OR of 3 and simulate 10 experiments to show the variation of observed ORs over all cutoffs of \(Y\). First do this for a sample size of n=10,000, then for n=200.

Code
p <- c(.1, .2, .3, .4)
set.seed(7)
simPOcuts(10000, odds.ratio=3, p=p)
                  y>=2     y>=3     y>=4
Simulation 1  2.822446 2.996466 2.868826
Simulation 2  2.869895 2.990059 3.046847
Simulation 3  3.116256 2.883418 3.259568
Simulation 4  3.124129 3.169854 3.088819
Simulation 5  2.918214 3.075153 3.019572
Simulation 6  2.927433 2.990027 2.818097
Simulation 7  3.336213 3.221263 3.006214
Simulation 8  2.772421 3.110541 3.075125
Simulation 9  3.083166 3.226093 2.958005
Simulation 10 3.666162 3.248635 2.995330
Code
simPOcuts(  200, odds.ratio=3, p=p)
                  y>=2     y>=3     y>=4
Simulation 1  1.879121 2.203704 2.793400
Simulation 2  2.666667 2.666667 2.720430
Simulation 3       Inf 5.664179 3.304527
Simulation 4  4.260870 2.068376 3.672414
Simulation 5  3.272727 3.006689 4.515625
Simulation 6  5.705882 8.327586 3.618243
Simulation 7  1.642055 2.545894 2.071429
Simulation 8  9.791209 2.047619 3.073733
Simulation 9  2.811594 2.041667 2.666667
Simulation 10 2.966292 3.328767 2.470588

A better approach for discrete Y is to show the impact of making the PO assumption:

  • Select a set of covariate settings over which to evaluate accuracy of predictions
  • Vary at least one of the predictors, i.e., the one for which you want to assess the impact of the PO assumption
  • Fit a PO model the usual way
  • Fit other models that relax the PO assumption
    • to relax the PO assumption for all predictors fit a multinomial logistic model
    • to relax the PO assumption for a subset of predictors fit a partial PO model (Peterson & Harrell, 1990) (here using the vglm function in the R VGAM package)
  • For all the covariate combinations evaluate predicted probabilities for all levels of Y using the PO model and the relaxed assumption models
  • Use the bootstrap to compute confidence intervals for the difference in predicted probabilities between a PO and a relaxed model. This guards against over-emphasis of differences when the sample size does not support estimation, especially for the relaxed model with more parameters. Note that the same problem occurs when comparing predicted unadjusted probabilities to observed proportions, as observed proportions can be noisy.

Example: re-do the assessment above. Note that the VGAM package vglm function did not converge when fitting the partial proportional odds model. So in what follows we only compare the fully PO model with the fully non-PO multinomial model.

Code
spar(ps=7)
require(rms)
require(ggplot2)
# One headache: since using a non-rms fitting function need to hard
# code knots in splines.  This is not necessary in rms 6.5-0 and later
kq   <- seq(0.05, 0.95, length=4)
kage <- quantile(support$age,    kq, na.rm=TRUE)
kbp  <- quantile(support$meanbp, kq, na.rm=TRUE)

d    <- expand.grid(adlsc=0:6, sex='male', age=65, meanbp=78)

# Because of very low frequency (7) of sfdm=3, combine categories 3, 4
support$sfdm3 <- pmin(sfdm, 3)

done.impact <- TRUE
if(done.impact) w <- readRDS('impactPO.rds') else {
  set.seed(1)
  w <- impactPO(sfdm3 ~ pol(adlsc, 2) + sex + rcs(age, kage) +
                  rcs(meanbp, kbp),
                newdata=d, B=300, data=support)
  saveRDS(w, 'impactPO.rds')
}
w
                          PO      Multinomial
Deviance                  1871.70 1795.93    
d.f.                      12      30         
AIC                       1895.70 1855.93    
p                          9      27         
LR chi^2                  124.11  199.89     
LR - p                    115.11  172.89     
LR chi^2 test for PO              75.77      
  d.f.                            18         
  Pr(>chi^2)                      <0.0001    
MCS R2                    0.137   0.212      
MCS R2 adj                0.128   0.186      
McFadden R2               0.062   0.100      
McFadden R2 adj           0.053   0.073      
Mean |difference| from PO         0.036      

Covariate combination-specific mean |difference| in predicted probabilities

       method adlsc  sex age meanbp Mean |difference|
1 Multinomial     0 male  65     78             0.020
2 Multinomial     1 male  65     78             0.030
3 Multinomial     2 male  65     78             0.035
4 Multinomial     3 male  65     78             0.030
5 Multinomial     4 male  65     78             0.021
6 Multinomial     5 male  65     78             0.032
7 Multinomial     6 male  65     78             0.081

Bootstrap 0.95 confidence intervals for differences in model predicted
probabilities based on 300 bootstraps


  adlsc  sex age meanbp
1     0 male  65     78

PO - Multinomial probability estimates

           0      1      2      3
Lower -0.052 -0.032 -0.011 -0.040
Upper  0.027  0.060  0.035 -0.002

  adlsc  sex age meanbp
2     1 male  65     78

PO - Multinomial probability estimates

           0     1      2      3
Lower -0.082 0.019 -0.037 -0.032
Upper -0.009 0.091  0.025  0.004

  adlsc  sex age meanbp
3     2 male  65     78

PO - Multinomial probability estimates

           0     1      2      3
Lower -0.083 0.032 -0.071 -0.030
Upper -0.004 0.101  0.018  0.017

  adlsc  sex age meanbp
4     3 male  65     78

PO - Multinomial probability estimates

           0     1      2      3
Lower -0.059 0.015 -0.084 -0.035
Upper  0.016 0.093  0.017  0.021

  adlsc  sex age meanbp
5     4 male  65     78

PO - Multinomial probability estimates

           0      1      2      3
Lower -0.030 -0.023 -0.071 -0.055
Upper  0.045  0.072  0.027  0.017

  adlsc  sex age meanbp
6     5 male  65     78

PO - Multinomial probability estimates

          0      1      2      3
Lower 0.005 -0.097 -0.047 -0.080
Upper 0.106  0.031  0.048  0.012

  adlsc  sex age meanbp
7     6 male  65     78

PO - Multinomial probability estimates

          0      1     2      3
Lower 0.049 -0.262 -0.01 -0.075
Upper 0.189 -0.039  0.07  0.051
Code
# Reverse levels of y so stacked bars have higher y located higher
revo <- function(z) {
  z <- as.factor(z)
  factor(z, levels=rev(levels(as.factor(z))))
}
ggplot(w$estimates, aes(x=method, y=Probability, fill=revo(y))) +
    facet_wrap(~ adlsc) + geom_col() +
    xlab('') + guides(fill=guide_legend(title='')) +
    theme(legend.position='bottom')
Figure 13.2: Checking the impact of the PO assumption by comparing predicted probabilities of all outcome categories from a PO model with a multinomial logistic model that assumes PO for no variables

AIC indicates that a model assuming PO nowhere is better than one that assumes PO everywhere. In an earlier run in which vglm converged, the partial PO (PPO) model, with far fewer parameters, was just as good as the multinomial model and also better than the PO model, indicating non-PO with respect to adlsc. The fit of the PO model is such that cell probabilities become more inaccurate for higher outcome levels. This can also be seen in the increasing mean absolute differences from the PO model's probability estimates. Bootstrap nonparametric percentile confidence intervals (300 resamples, not all of which converged) for differences in predicted cell probabilities between the PO model and a relaxed model are also shown above. Some of these intervals exclude 0, in line with the other evidence for non-PO.

See fharrell.com/post/impactpo for a similar example.

When \(Y\) is continuous or almost continuous and \(X\) is discrete, the PO model assumes that the logits of the cumulative distribution functions of \(Y\) are parallel across categories of \(X\). The corresponding, more rigid, assumptions of the ordinary linear model (here, parametric ANOVA) are parallelism and linearity of the normal-inverse-transformed cumulative distribution functions across categories of \(X\). As an example consider the web site's diabetes dataset, where we examine the distribution of log glycohemoglobin across subjects' body frames.

Code
getHdata(diabetes)
d <- subset(diabetes, ! is.na(frame))
require(data.table)   # provides setDT, data.table, and :=
setDT(d)  # make it a data table
w <- d[, ecdfSteps(glyhb, extend=c(2.6, 16.2)), by=frame]
# ecdfSteps is in Hmisc
# Duplicate ECDF points for trying 2 transformations
u <- rbind(data.table(trans='paste(Phi^-1, (F[n](x)))', w[, z := qnorm(y) ]),
           data.table(trans='logit(F[n](x))',           w[, z := qlogis(y)]))
# See https://hbiostat.org/rflow/graphics.html#sec-graphics-ggplot2
ggplot(u, aes(x, z, color=frame)) + geom_step() +
  facet_wrap(~ trans, label='label_parsed', scale='free_y') +
  ylab('Transformed ECDF') +
  xlab(hlab(glyhb))   # hlab is in Hmisc; looks up label in d
Figure 13.3: Transformed empirical cumulative distribution functions stratified by body frame in the diabetes dataset. Left panel: checking all assumptions of the parametric ANOVA. Right panel: checking all assumptions of the PO model (here, Kruskal–Wallis test).

13.4.4 Quantifying Predictive Ability

13.4.5 Describing the Model

For PO models there are four and sometimes five types of relevant predictions:

  1. logit\([Y \geq j | X]\), i.e., the linear predictor
  2. Prob\([Y \geq j | X]\)
  3. Prob\([Y = j | X]\)
  4. Quantiles of \(Y | X\) (e.g., the median¹)
  5. \(E(Y | X)\) if \(Y\) is interval scaled.
  • ¹ If \(Y\) does not have very many levels, the median will be a discontinuous function of \(X\) and may not be satisfactory.
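Here is a sketch of obtaining each type of prediction with rms, reusing the SUPPORT variables from earlier; orm is used because the Quantile generator requires an orm fit, and datadist is set up for the plotting functions used below.

Code
require(rms)
dd <- datadist(support); options(datadist='dd')  # needed by Predict/summary/nomogram
f  <- orm(sfdm ~ adlsc + sex + age + meanbp, data=support, x=TRUE, y=TRUE)
lp <- predict(f)                 # 1. linear predictor
predict(f, type='fitted')        # 2. Pr(Y >= j | X) for all j
predict(f, type='fitted.ind')    # 3. Pr(Y = j | X) for all j
qu <- Quantile(f); qu(0.5, lp)   # 4. median of Y | X
M  <- Mean(f);     M(lp)         # 5. E(Y | X), if Y is interval-scaled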

  • Graphics:

    1. Partial effect plot (prob. scale or mean)
    2. Odds ratio chart
    3. Nomogram (possibly including the mean)
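A sketch of these three displays, continuing the orm fit and datadist setup above (fun=M puts the partial effect plot and the nomogram axis on the mean scale):

Code
M <- Mean(f)
ggplot(Predict(f, adlsc, fun=M))              # 1. partial effect plot on the mean scale
plot(summary(f))                              # 2. odds ratio chart
plot(nomogram(f, fun=M, funlabel='Mean Y'))   # 3. nomogram including the mean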

13.4.6 Validating the Fitted Model

13.4.7 R Functions

The rms package's lrm and orm functions fit the PO model directly, assuming that the levels of the response variable (e.g., the levels of a factor variable) are listed in the proper order. predict computes all types of estimates except for quantiles. orm allows for more link functions than the logistic and is intended to efficiently handle hundreds of intercepts, as happens when \(Y\) is continuous.

The R functions popower and posamsize (in the Hmisc package) compute power and sample size estimates for ordinal responses using the proportional odds model.

The function plot.xmean.ordinaly in rms computes and graphs the quantities described in Section 13.2. It plots simple \(Y\)-stratified means overlaid with \(\hat{E}(X | Y=j)\), with \(j\) on the \(x\)-axis. The \(\hat{E}\)s are computed for both PO and continuation ratio ordinal logistic models.
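A usage sketch, reusing the variables from Figure 13.1; the cr argument is assumed to add the continuation ratio model estimates:

Code
plot.xmean.ordinaly(sfdm ~ adlsc + sex + age + meanbp, data=support, cr=TRUE)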

The Hmisc package's summary.formula function is also useful for assessing the PO assumption.

Generic rms functions such as validate, calibrate, and nomogram work with PO model fits from lrm as long as the analyst specifies which intercept(s) to use.

rms has a special function generator Mean for constructing an easy-to-use function for getting the predicted mean \(Y\) from a PO model. This is handy with plot and nomogram. If the fit has been run through bootcov, it is easy to use the Predict function to estimate bootstrap confidence limits for predicted means.
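A sketch of that pattern (B=300 is an arbitrary choice; the fit must store x and y):

Code
f <- lrm(sfdm ~ adlsc + sex + age + meanbp, data=support, x=TRUE, y=TRUE)
M <- Mean(f)               # converts the linear predictor to E(Y | X)
b <- bootcov(f, B=300)     # refit on 300 bootstrap resamples, saving coefficients
Predict(b, adlsc, fun=M)   # bootstrap confidence limits for the predicted mean Y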

13.5 Continuation Ratio Model

13.5.1 Model

Unlike the PO model, which is based on cumulative probabilities, the continuation ratio (CR) model is based on conditional probabilities. The (forward) CR model (Armstrong & Sloan, 1989; Berridge & Whitehead, 1991; Fienberg, 2007) is stated as follows for \(Y=0,\ldots,k\):

\[\begin{aligned} \Pr(Y=j | Y \geq j, X) &= \text{expit}(\theta_{j} + X\gamma) \\ \text{logit}(Y=0 | Y \geq 0, X) &= \text{logit}(Y=0 | X) = \theta_{0} + X\gamma \\ \text{logit}(Y=1 | Y \geq 1, X) &= \theta_{1} + X\gamma \\ &\ \ \vdots \\ \text{logit}(Y=k-1 | Y \geq k-1, X) &= \theta_{k-1} + X\gamma . \end{aligned} \tag{13.1}\]

The CR model has been said to be likely to fit ordinal responses when subjects have to "pass through" one category to get to the next. The CR model is a discrete version of the Cox proportional hazards model. The discrete hazard function is defined as \(\Pr(Y=j | Y \geq j)\).
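For intuition, the discrete hazard can be estimated empirically; a minimal sketch using the sfdm outcome defined earlier (complete cases assumed):

Code
tab    <- table(sfdm)            # frequencies of Y = j
atrisk <- rev(cumsum(rev(tab)))  # frequencies of Y >= j
tab / atrisk                     # empirical discrete hazards Pr(Y=j | Y>=j)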

Advantage of CR model: easy to allow unequal slopes across \(Y\) for selected \(X\).

13.5.2 Assumptions and Interpretation of Parameters

13.5.3 Estimation

13.5.4 Residuals

To check CR model assumptions, binary logistic model partial residuals are again valuable. We separately fit a sequence of binary logistic models using a series of binary events and the corresponding applicable (increasingly small) subsets of subjects, and plot smoothed partial residuals against \(X\) for all of the binary events. Parallelism in these plots indicates that the CR model's assumption of a constant \(\gamma\) is satisfied.

13.5.5 Assessment of Model Fit

13.5.6 Extended CR Model

13.5.7 Role of Penalization in Extended CR Model

13.5.8 Validating the Fitted Model

13.5.9 R Functions

The cr.setup function in rms returns a list of vectors useful in constructing a dataset used to trick a binary logistic function such as lrm into fitting CR models.
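The standard pattern is sketched below; the choice of predictors is hypothetical and complete data are assumed. cr.setup expands each subject into one row per conditioning event \(Y \geq j\) that applies to that subject.

Code
require(rms)
u <- cr.setup(sfdm)        # returns y (binary events), cohort (Y>=j stratum), subs (row indices)
d <- support[u$subs, ]     # replicate each subject's predictor rows to match the expansion
d$y      <- u$y
d$cohort <- u$cohort
f <- lrm(y ~ cohort + adlsc + age, data=d)   # constant gamma: the CR model
g <- lrm(y ~ cohort * adlsc + age, data=d)   # lets the adlsc slope vary with cutoff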