Inner Workings of the Bayesian Proportional Odds Model

Ben Goodrich put a function Pr inside the Stan code. This is used just for checking that probabilities add up to one. You can use rstan::expose_stan_functions('foo.stan') to access those functions from R.

Existing Stan Approaches

QR Decomposition for Orthonormalization

From Ben Goodrich

The last \(\beta\) parameter is unchanged so its \(\beta\) equals its \(\theta\) parameter in the Stan program. This will sometimes create a warning message (e.g. when using pairs() that “beta[…] is duplicative”).

However, compiling an entire Stan program just to obtain a rescaled QR decomposition is overkill because the same can be accomplished with just R code. So, if you replace your stanQr() function with this function that does what your qr.stan program does:

stanQr <- function(X, center = TRUE) {
  p <- ncol(X)
  N <- nrow(X)
  if (center) X <- scale(X, center = TRUE, scale = FALSE)
  QR <- qr(X)
  Q <- qr.Q(QR)
  R <- qr.R(QR)
  sgns <- sign(diag(R))
  Q_ast <- sweep(Q, MARGIN = 2, STATS = sgns, FUN = `*`)
  R_ast <- sweep(R, MARGIN = 1, STATS = sgns, FUN = `*`)
  corner <- R_ast[p, p]
  R_ast_inverse <- backsolve(R_ast, diag(p))
  Q_ast <- Q_ast * corner
  R_ast <- R_ast / corner
  R_ast_inverse <- R_ast_inverse * corner
  return(list(Xqr = Q_ast, R = R_ast, R_inv = R_ast_inverse))

Then you also get the necessary equivalence. Moreover, the bottom right element of w\(R and w\)R_inv is 1.0 so the last coefficient is the same even though the design matrices are different.

All of the columns of X are different from all of the columns of Q_ast (which is what you are calling Xqr). They have to be; otherwise they wouldn’t be orthogonal. But the sequence of equalities still holds

eta = X * beta = (Q * R) * beta = (Q * corner * 1 / corner * R) * beta = (Q_ast * R_ast) * beta = Q_ast * (R_ast * beta) = Q_ast * theta

where theta = R_ast * beta. So, if you do the model in terms of Q_ast and get (the posterior distribution of) the K-vector theta. When you pre-multiply theta by the upper-triangular matrix R_ast^{-1}, since the lower-right corner of both R_ast and R_ast^{-1} is 1.0, beta[K] = 0 * theta[1] + 0 * theta[2] + … + 0 * theta[K - 1] + 1.0 * theta[K] = theta[K]. In other words, although you changed the design matrix, you left the K-th coefficient as is and changed the preceding K - 1 coefficients.

In the case of a continuous outcome and no link function, the same idea is used to get least-squares estimates

where theta = Q’ * y and then you pre-multiply theta by R^{-1}. We just added the renormalization to make the lower-right element 1.0 so that it is easy to specify a prior on theta[K].

Q: Does this guarantee that the prior for the last beta does really get applied to the treatment effect and only the treatment effect?

A: Yes, if the priors on the elements of theta are independent, which is the way we have written it in the .stan programs and would seem to be the only sensible choice since the columns of Q_ast are orthogonal to each other.

Q: If I were to split the design matrices into two matrices (and for this case the 2nd matrix has one column), apply QR to the first and not the second, and combine into one log likelihood evaluation, would I get exactly the same results as the “corner QR” method?

Generally, no. In the example you gave earlier, if you split the third column of X out, then it remains highly correlated with the first two:

w <- stanQr(X[,1:2])
zapsmall(cor(cbind(w$Xqr, scale(X[,3], center = TRUE, scale = FALSE))))
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.0000000 0.5685352
[2,] 0.0000000 1.0000000 0.5932746
[3,] 0.5685352 0.5932746 1.0000000

In the case where the last variable is randomized, then the correlation would be close to zero, but it isn’t exactly the same as orthogonalizing.

General Information About Random Effects

MLEs and Optimization for Random Effects Models

Prediction With Random Effects

AR(1) Modeling

The equation in the last reference p. 5 allows specification of the random effect at time t without passing through all the previous random effects, so it accounts for unequally spaced and missing time points. It suggests this model:

Would this specification lead to sampling problems? Do \(\sigma_\gamma\) and \(\sigma_w\) compete too much?

Ben Goodrich re warnings about Pareto k diagnostics from loo(): The loo() function is trying to estimate what would happen if 1 patient were dropped from the analysis and predicted conditional on all the other patients. But its importance sampling has infinite variance when the Pareto k for the left-out observation is greater than 1 and has too high variance if the Pareto k is greater than 0.7. The Pareto k pertains to how much the posterior distribution would change if one observation were left out. In these models, if one person were left out the corresponding gamma and that patient’s column of eps_raw would revert to their prior distributions because there would no longer be any information in the data to update them with. Thus, this posterior distribution is too sensitive to the particular patients being conditioned on to estimate the expected log predictive density of future patients well.

To overcome this problem, we can do K-fold with K equal to the number of patients or redo the loo calculation to use a likelihood function that integrates the random effects out like in a Frequentist estimator, as described in that Psychometrica article I sent you the link to. But we can still use Stan to first obtain the posterior draws conditional on the random effects.

FH question on homescedasticity of random effects: With random effect for subject \(i\) at the first time \(t=1\) having variance \(\sigma^2_\gamma\), we can use the recursive relationship \(V(X_t) = \rho^2 V(X_{t-1}) + \sigma^2_\epsilon\) to get variances at other times, where \(\sigma^2_\epsilon\) is the within-subject white noise variance. The variance of the random effect at time \(t=2\) is \(\rho^2 \sigma^2_\gamma + \sigma^2_\epsilon\) and equating the two successive variances results in \(\sigma_\epsilon = \sigma_\gamma \sqrt{1 - \rho^2}\). The same equation \(\times \rho^2\) results from equating the variance at \(t=3\) to the variance at \(t=2\) and the same equation $\(\rho^4\) results when comparing \(t=4\) to \(t=3\). So it is reasonable to not make \(\sigma_\epsilon\) a free parameter but instead to derive it from \(\sigma_\gamma\) and \(\rho\)? Would this make posterior sampling behave much better too? On the other hand, this homoscedasticity restriction may effectively limit the magnitude of correlation so perhaps the restriction should not be used.

The variances of \(X_t\) for \(t=1,2,3,4\) are: \(\sigma^2_\gamma\), \(\rho^2 \sigma^2_\gamma + \sigma^2_\epsilon\), \(\rho^4 \sigma^2_\gamma + (1 + \rho^2) \sigma^2_\epsilon\), \(\rho^6 \sigma^2_\gamma + (\rho^4 + \rho^2 + 1) \sigma^2_\epsilon)\).

Trade-offs for Various Longitudinal Model Formulations

Lee and Daniels Marginal PO State Transition Model


Ordinary Random Effects (Random Intercepts)

Simple Serial Dependence Model on Y Scale

Using a first-order Markov process, a serial dependence model looks like the following. Let \(Y_0\) denote the baseline version of the ordinal outcome variable \(Y\). Note that follow-up measurements will typically have more levels than \(Y_0\) has. For example, the top category of \(Y\) may be death but patients must be alive to enter the study. This should not present a problem, especially if there is some stability of transition tendencies over time. Let \(g(Y)\) be a function of previous outcome level \(Y\). This function can be a vector function with multiple parameters to estimate. For example if \(Y\) has 3 levels 1, 2, 3, the function could be a saturated one such as \(g(Y) = \tau_1 [Y=2] + \tau_2 [Y=3]\) where \([a]\) is 1 if \(a\) is true, 0 otherwise. If \(Y\) consisted of a semi-continuous initial range \(Y=0, 1, \dots, 20\) and then severe outcomes \(Y=21, 22\), one could use a discontinuous linear spline: \(g(Y) = \tau_1 Y + \tau_2 [Y=21] + \tau_3 [Y=22]\).

If measurement times are unequally spaced or there are missing outcome measurements at some of the times, the \(g\) function can be generalized to account for differing time lags. For example if there are \(d\) days from the last measurement to the current measurement at time \(t\) one could use for the first example \(g(Y) = \tau_1 [Y=2] + \tau_2 [Y=3] + \tau_3 [Y=2] \times d + \tau_4 [Y=3] \times d\).

For an absorbing state (e.g., \(Y=3\)), we need to determine whether a partial proportional odds model is needed so that the impact of the absorbing state can be \(y\)-dependent. Or for an absorbing state at the maximum value of \(Y=3\) it may be as simple as \(g(Y) = \tau_1 [Y=2] + 100 [Y=3]\) which requires a model offset, not an extra parameter \(\tau_2\).

The first order (lag one) serial dependence model for time \(t\) for any \(g\) would then be

\[\Pr(Y_t >= y | X, Y_{t-1}) = \text{expit}(\alpha_y + X \beta + g(Y_{t-1}))\]

Pros and cons of this approach are

One would typically pick an important time \(t\) and a series of covariate settings for computing such results.

Partial Proportional Odds Model

The PPO model of Peterson and Harrell (1990) in its unconstrained form (Eq. (5) of the reference) has this specification for a single observation when \(Y=1, 2, ..., k\) when \(j > 1\) (the paper uses a different coding, for \(Y=0, ..., k\) so their \(k\) is our \(k-1\))):

\[P(Y \geq j | X) = \text{expit}(\alpha_{j-1} + X\beta + [j > 2] Z\tau_{j-2}) = \text{expit}(c_{j})\]

For the entire dataset \(Z\) is an \(n\times q\) design matrix specifying the form of departures from proportional odds, and to honor the hierarchy principle for interactions must be a subset of the columns of \(X_{n, p}\). With regard to \(Z\) the model is multinomial instead of ordinal, and so unlike the PO model there are issues with cell sizes in the \(Y\) frequency distribution. The unconstrained PPO model is strictly for discrete ordinal \(Y\) and there must be at least \(k=3\) levels of \(Y\). The \(\alpha\)s are the intercepts for \(Y \geq 2\) (and thus their negatives are intercepts for \(Y=1\)).

Likelihood components are as follows:

In Stan code prmqrcppo.stan, \(Z\) is orthonormalized by the QR decomposition, and the PPO parameters on this new scale are \(\omega\) instead of \(\tau\) just as \(\theta\) is substituted for \(\beta\). This code implements cluster (random) effects, so if there are no repeated observations per subject the user needs to specify a very small mean for the exponential prior for the variance of random effects (e.g., 0.001).

The \(p\)-vector of normal prior standard deviations for \(\theta\) is sds and a separate \(k-2 \times q\) matrix of normal prior SDs for \(\omega\) is given by sdsppo. If a binary treatment variable is present and one wants full control over its prior SD, be sure to put treatment as the last parameter for \(X\), and if treatment is allowed to violate the PO assumption also put treatment as the last parameter for \(Z\).

To some extent the constrained PPO model may be obtained by using very skeptical priors on the \(\omega\) (transformed \(\tau\)) parameters, e.g., standard deviations < 1.0. This will lower the effective number of model parameters.

Constrained Partial Proportional Odds Model

This is Eq. (6) of Peterson & Harrell (1990) and is akin to including interactions between covariates and \(\log(t)\) in a Cox proportional hazards model. As discussed in Tutz and Berger this is like having separate regression models for the mean and the variance.

Let \(f(y)\) denote a continuous or discontinuous function of the response variable value \(y\). If \(Y\) is continuous this will usually be a continuous function, and if \(Y\) is discrete it may be linear in the integer category numbers or may be an indicator variable to allow one category of \(Y\) to not operate in proportional odds for specific covariates \(Z\). If \(Y=1, 2, \ldots, k\), the model is \(P(Y \geq y) = \text{expit}(\alpha_{y-1} + X\beta + f(y)Z\tau)\) One can see a great reduction in the number of non-PO parameters \(\tau\) to estimate as \(\tau\) is now a \(q\)-vector instead of a \((k-2) \times q\) matrix.

The likelihood components for \(Y=j^{th}\) ordered distinct value of \(Y\) are as follows:

In the blrm function the f function is the parameter cppo, and to help along the MCMC poseterior sampling this function is automatically normalized to have mean 0 and SD 1 over the dataset (or user-provided center and scale parameters can be used). In order to do this the user must define f so that the last line of the function (the returned value) is e.g. (value - center) / scale. If center and scale are not arguments to f, these arguments will be added automically and computed from the observed mean and SD (but the last line that does the scaling must appear).

Censored Data

Assume that all possible categories of Y are completely observed in at least one observation. This implies that if \(Y=1,\ldots,k\) left censoring at \(a\) is interval censored at \([1, a]\) and right censoring at \(b\) is interval censored at \([b, k]\). Consider an observation with Y interval censored in \([a,b]\). Then

Censored Data in Constrained Partial PO Model

Likelihood components are as follows:

Since \(Y\) is coded internally as \(1, \ldots, k\), \(f(b+) = f(b+1)\).

Pertinent Stan Documentation

Especially Relevant Functions

Function Computes
log1m_exp log(1 - exp(x))
log1p_exp log(1 + exp(x))
log_diff_exp log(exp(x) - exp(y))
-log1p_exp(-x) log(expit(x))

Also study bernoulli_logit and categorical_logit.

To look up Stan function near-equivalents to R functions use rstan::lookup('ncol') for example.

Some answers to my coding questions appear here, e.g.:

Other Useful Links