--- title: "`rmsb` Package Examples using `rstan`" author: - name: Frank E Harrell Jr url: https://hbiostat.org affiliation: Department of Biostatistics
Vanderbilt University School of Medicine
VICTR Research Methods Program date: last-modified format: html: self-contained: true number-sections: true number-depth: 3 anchor-sections: true code-tools: false code-fold: false code-link: false theme: journal toc: true toc-depth: 3 toc-location: left bibliography: harrelfe.bib csl: apa-annotated-bibliography.csl link-citations: true --- # Overview and Setup The `rmsb` package is the Bayesian companion to the `rms` package, and uses many of its functions for post-model-fitting analysis and graphics. The sole fitting function in `rmsb` at present is `blrm`. `blrm` is for Bayesian binary and ordinal proportional odds logistic regression and the @pet90par partial proportional odds model that relaxes the proportional odds assumption. It is the analog and generalization of `rms::lrm` and for ordinal responses is intended for outcomes with up to a several hundred ordinal levels. The Bayesian approach has a number of advantage over traditional frequentist models, including 1. the use of exact calculations (to within simulation error) instead of large sample (e.g., normal theory) approximations to p-values and confidence intervals 1. exact and more intuitive inference when random effects are included in the model 1. the ability to make probability statements about parameters and combinations of parameters, which includes computations of the probabilities of assertions such as "the effect of $x_1$ exceeds 1.2 **and** the effect of $x_2$ exceeds 0.7" 1. capturing more sources of uncertainty. For example, the `blrm` function automatically computes highest posterior density intervals on a variety of statistical indexes such as the Brier score and AUROC ($c$-index). Note: By default these intervals are computed using only 400 posterior draws to save time. For a `blrm` fit object `f` you can specify how many samples to draw, to get more accurate intervals, by specifying for example `print(f, ns=2000)`. 1. the ability to incorporate external information or beliefs about parameters using prior distributions 1. by using [Stan](https://mc-stan.org) language to specify the likelihood components, one can not only do posterior distribution sampling but can quickly compute maximum likelihood estimates (Bayesian posterior modes) `blrm` uses normal priors for the $\beta$ parameters, and the user may specify the standard deviation of these priors separately for each model parameters. The default is $\sigma=100$ which is nearly flat. When there are random effects, the analyst may specify the mean of the exponential distribution that serves as the prior for the standard deviation of the random effects, with the default mean being 1.0, a reasonable number for the logit scale. You can also use a half-$t$ distribution for the SD of random effects. Regarding uncertainties about statistical indexes of model performance such as AUROC, the Bayesian posterior distributions computed by `rmsb` account for coefficient uncertainty for the sample at hand's ability to uncover the true unknown data generating coefficients and how strongly those coefficients can predict outcomes in the sample. The posterior distributions do not pertain to sample-to-sample variability of model fit or model validation. `blrm` uses Stan code written by Ben Goodrich of Columbia University and Frank Harrell. This code is precompiled when the `rmsb` package is built. You must install the `rstan` package to use `rmsb`. You can use [`cmdstan`](https://mc-stan.org/users/interfaces/cmdstan) and the [`cmdstanr`](https://mc-stan.org/cmdstanr) R package instead. This is generally recommended, as `cmdstan` and `cmdstanr` make use of the latest versions of Stan. Here are the steps I used for installing these on Linux, using the latest version of `cmdstan` as of 2022-11-25 (2.31.0): ``` cd ~ sudo R install.packages('cmdstanr', repos='https://mc-stan.org/r-packages', lib='/usr/local/lib/R/site-library') # Default location for cmdstan is in `~/.cmdstan` in the user's home directory # This is not such a good idea cmdstanr::install_cmdstan(cores=10, dir='/usr/local/bin') # When running sudo, default is /root/.cmdstan which is not easily accessible ``` Add the following line in `.bashrc` so that `cmdstanr` will know where to find `cmdstan`: ``` export CMDSTAN=/usr/local/bin/cmdstan-2.31.0 ``` `cmdstanr` will try to put compiled Stan code in the same directory where `cmdstan` is stored. With the above setup, this location is not user-writeable, so it will not work. The `blrm` function by default puts compiled code in the `.rmsb` directory under the user's home directory. Specify `options(rmsbdir='some another location')` to change this. The default behavior puts all compiled code in a system-wide location. If you desire to archive a project and reproduce the result even if the Stan code inside `rmsb` is updated, you may want to specify a project-specific location fo `rmsbdir`. Here is a typical setup code chunk for using `rmsb`. When using `rstan`, it is not recommended to use `options(auto_write=TRUE)` as when running a series of regressions, the `rstan` package tends to recompile Stan code even when the code doesn't change.

```{r setup}
require(rmsb)
options(mc.cores = parallel::detectCores())   # use max # CPUs
```

```{r setup} require(rmsb) knitrSet(lang='markdown', w=7, h=7, fig.path='png/') options(prType='html') # for rms output: print, anova, summary, etc. options(mc.cores = parallel::detectCores()) # use max # CPUs ``` If you want to use `cmdstan` by default, avoiding having to use the `backend` argument to `blrm`, you can add this to the setup chunk: ```{r, eval=FALSE} options(rmsb.backend='cmdstan') ``` ## Running Fits Only When Something Changes You'll see `file='...'` in the longer running of the `blrm` calls below. If the file already exists and none of the data nor the options sent to `rstan` nor the underlying Stan code have changed from what were used to create the fit object stored in that file (as judged by their `md5` hash), that saved fit object is returned immediately without running the `rstan` code, often saving a great deal of execution time. This works well with the workflow of long R markdown reports making it so that only portions of Bayesian analysis that have changed are run. Note that using the `knitr` `cache=TRUE` option does not work well as the cache files for this script were about a GB in size, and the system does not accurately recognize when a model fit hasn't changed and doesn't need to be run when `rstan` is involved. A restored fit object does not contain the `rstan` or `cmdstanr` object, saving tens of megabytes of storage. Standard Stan diagnostics are stored in the fit object separately, and it is assumed that if the user wanted to run `rstan::loo` `loo=TRUE` would have been specified to `blrm` so that `loo` is run and its (small) result stored in the `blrm` fit object. If you want to run `pairs` to get more graphical diagnostics, intead of relying on the `rstan` object always being available, specify `pairs=TRUE` or `pairs='filename.png'` to `blrm` to graph the pair plots. The latter is recommended, and one can put `knitr::include_graphics('filename.png')` in the R code chunk to render the graph in the report even if `blrm` was not re-run. When `file='filename.rds'` is specified to `blrm` and the file does not exist or analysis inputs have changed since the file was created, the `blrm` fit object will be saved in `saveRDS` `.rds` binary format in your current working directory at the completion of `blrm`'s work. The `rstan` component is omitted from the saved file. The utility function `fitIf` available from [here](https://github.com/harrelfe/rscripts) is another way to efficiently manage Bayesian simulation workflow. The `fitIf` approach runs the analysis only if the file doesn't already exist. If it exists but the data or the model have changed, the `fitIf` approach is not intelligent enough to re-run the analysis (unlike the `file='...'` approach above). Whether using `fitIf` or `file=`, you lose the ability to run the `rms::stanDxplot(..., rstan=TRUE)` on restored objects, so the `stanDxplot` function tries to find an existing trace plot image file corresponding to the current R markdown chunk when the fit object no longer has the `rstan` component. For most purposes this doesn't matter, because running \code{stanDxplot} using the defaults shows the needed non-burnin samples which are always stored in fit objects. The `file=filename.rds` approach is preferred. # Proportional Odds and Binary Logistic Regression ## Example: 10-level Ordinal Outcome Simulate a dataset with three predictors and one 10-level ordinal outcome. Run the frequentist proportional odds model then the Bayesian one. ```{r sim} set.seed(1) n <- 500 x1 <- runif(n, -1, 1) x2 <- runif(n, -1, 1) x3 <- sample(0 : 1, n, TRUE) y <- x1 + 0.5 * x2 + x3 + rnorm(n) y <- as.integer(cut2(y, g=10)) dd <- datadist(x1, x2, x3); options(datadist='dd') f <- lrm(y ~ x1 + pol(x2, 2) + x3, eps=1e-7) # eps to check against rstan f ``` Before getting posterior distributions of parameters, use `rstan` to just get maximum likelihood estimates and compare them with those from `lrm`. Do this for increasingly flat priors for the $\beta$s. The default prior SD for `blrm` is 100. Running `method='optimizing'` is a quick way to study the effect of priors on the posterior modes for non-intercepts when there are no random effects in the model. It is also a way to get a sense of the scaling of parameters used in the actual calculations (but see a later section for how to show standard deviations of variables on the transformed scale given to Stan). Except for variables listed in the `keepsep` argument to `blrm`, prior standard deviations apply to the QR orthonormalized version of the design matrix. ```{r stanmle} for(psd in c(0.25, 1, 10, 100, 10000)) { cat('\nPrior SD:', psd, '\n') g <- blrm(y ~ x1 + pol(x2, 2) + x3, method='optimizing', priorsd=psd) cat('-2 log likelihood:', g$deviance, '\n') print(g$coefficients) } # Compare with ordinary MLEs and deviance f$deviance coef(f) ``` Fit the model with Dirichlet priors on intercepts and wide normal priors on the $\beta$s. Show the model fit summary. Note that the indexes of predictive discrimination/accuracy include 0.95 highest posterior density intervals. In frequentist inference we pretend that quantities such as AUROC and $R^2$ are estimated without error, which is far from the case. In several places you will see an index named `Symmetry`. This is a measure of the symmetry of a posterior distribution. Values farther from 1.0 indicate asymmetry, which indicates that the use of standard errors and the use of a normal approximation for the posterior distribution are not justified. The symmetry index is the ratio of the gap between the posterior mean and the 0.95 quantile of the posterior distribution to the gap between the 0.05 quantile and the mean. ```{r blrmsim} bs <- blrm(y ~ x1 + pol(x2, 2) + x3, file='bs.rds') bs ``` ```{r blrmStats} # Show more detailed analysis of model performance measures blrmStats(bs, pl=TRUE) ``` Show basic Stan diagnostics. Had `stanDxplots(bs, rstan=TRUE)` been used, intercepts would have been shifted from what is in `g` because of subtractions of covariate means before passing data to `rstan`. ```{r standx} stanDxplot(bs) stanDx(bs) ``` Here are the posterior distributions, calculated using kernel density estimates from posterior draws. Posterior models, shown as vertical lines, are parameter values that maximize the log posterior density (using `rstan::optimizing` in the original model fit) so do not necessarily coincide with the peak of the kernel density estimates. ```{r stanpost} plot(bs) # Also show 2-d posterior density contour for two collinear terms plot(bs, c('x2', 'x2^2'), bivar=TRUE) # assumes ellipse plot(bs, c('x2', 'x2^2'), bivar=TRUE, bivarmethod='kernel') # kernel density # Print frequentist side-by-side with Bayesian posterior mean, median, mode cbind(MLE=coef(f), t(bs$param)) # Compare covariance matrix of posterior draws with MLE round(diag(vcov(f)) / diag(vcov(bs)), 2) range(vcov(f) / vcov(bs)) ``` Next show frequentist and Bayesian contrasts. For the Bayesian contrast the point estimate is the posterior mean, and the 0.95 highest posterior density interval is computed. Instead of a p-value, the posterior probability that the contrast is positive is computed. ```{r contrast} contrast(f, list(x1=0, x3=1), list(x1=.25, x3=0)) k <- contrast(bs, list(x1=0:1, x3=1), list(x1=.25, x3=0)) k ``` For Bayesian contrasts we can also plot the posterior densities for the contrasts, and 2-d highest-density contour. ```{r pcontrast} plot(k) plot(k, bivar=TRUE) # applicable when exactly 2 contrasts plot(k, bivar=TRUE, bivarmethod='kernel') ``` Compute posterior probabilities for various assertions about unknown true parameter values. The `PostF` function is a function generator that effectively evaluates the assertion to a 0/1 value and computes the mean of these binary values over posterior draws. As is the case with inference about the quadratic effect of `x2` below, when the assertion does not evaluate to a binary 0/1 or logical `TRUE/FALSE` value, it is taken as a quantity that is derived from one or more model parameters, and a posterior density is drawn for the derived parameter. We use that to get a posterior distribution on the vertex of the quadratic `x2` effect. ```{r postprobs} P <- PostF(bs, pr=TRUE) # show new short legal R names P(b3 > 0 & b1 > 1.5) P(b3 > 0) P(abs(b3) < 0.25) # evidence for small |nonlinearity| mean(bs$draws[, 'x2^2'] > 0, na.rm=TRUE) # longhand calculation # Plot posterior distribution for the vertex of the quadratic x2 effect # This distribution should be wide because the relationship is linear # (true value of b3 is zero) plot(P(-b2 / (2 * b3))) ``` ```{r postprobs2} # Recreate the P function using original parameter names # (which may not be legal R name) P <- PostF(bs, name='orig') P(`x2^2` > 0) P(`x2^2` > 0 & x1 > 1.5) # Remove rstan results from fit. Compute savings in object size. # Note: this will only be accurate when running the fits for # the first time (not when restoring shortened forms of them from # disc) # Result: 33.8MB before, 0.5MB after s1 <- format(object.size(bs), 'MB') bs$rstan <- NULL s2 <- format(object.size(bs), 'MB') cat('Before:', s1, ' After:', s2, '\n') ``` ## Bayesian Wilcoxon Test Since the proportional odds ordinal (PO) logistic model is a generalization of Wilcoxon/Kruskal-Wallis tests one can use Bayesian proportional odds regression to get the Bayesian equivalent to the Wilcoxon test. Even if not adjusting for covariates (impossible with the Wilcoxon test) there are advantages to putting this in a modeling framework as detailed in [Section 7.6](http://hbiostat.org/doc/bbr.pdf) of BBR. A major advantage is estimation ability. One can estimate group-specific means, quantiles, and exceedance probabilities. And Bayesian inference provides exact uncertainty intervals (highest posterior density intervals in what follows) for these. The PO model does not require there to be any ties among the Y values, so it handles continuous data very well (the `orm` function in the `rms` package efficiently handles many thousands of distinct Y levels requiring many thousands of intercepts in the model). Let's re-analyze the calprotectin data in Section 7.3.1 of BBR to mimic the frequentist PO analysis in Section 7.6. ```{r cpo} # Fecal Calprotectin: 2500 is above detection limit # When detection limits occur at a single value, the PO # model easily handles this in terms of estimating group # differences (but not for estimating the mean Y) calpro <- c(2500, 244, 2500, 726, 86, 2500, 61, 392, 2500, 114, 1226, 2500, 168, 910, 627, 2500, 781, 57, 483, 30, 925, 1027, 2500, 2500, 38, 18) # Endoscopy score: 1 = No/Mild, 2=Mod/Severe Disease # Would have been far better to code dose as 4 ordinal levels endo <- c(2, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1) endo <- factor(endo, 1 : 2, c("No or Mild Activity", "Moderate or Severe Activity")) dd <- datadist(endo, calpro); options(datadist='dd') bcalpro <- blrm(calpro ~ endo, file='bcalpro.rds') print(bcalpro, intercepts=TRUE) # print.blrm defaults to not showing intercepts if more than 9 of them summary(bcalpro) ``` One can see that the posterior probability of a positive group difference exceeds 0.99. Now compute the posterior mean estimate of the mean and median calprotectin levels for the unknown data generating process, stratified by group and compare with sample estimates. ```{r bcalp} # Sample estimates tapply(calpro, endo, mean) tapply(calpro, endo, median) # Now compute estimates and 0.95 HPD intervals assuming PO # The first method is exact newdata <- data.frame(endo=levels(endo)) bar <- Mean(bcalpro) predict(bcalpro, newdata, fun=bar) quant <- Quantile(bcalpro) med <- function(lp, ...) quant(lp=lp, ...) Predict(bcalpro, endo, fun=med) ``` The `contrast` function in `rms` now allows one to get posterior distributions of differences in nonlinearly transformed parameters, as follows. ```{r bcalp2} k <- contrast(bcalpro, list(endo=levels(endo)[1]), list(endo=levels(endo)[2]), fun=bar) k plot(k, which='diff') ``` Peeking ahead a later section, we can use the constrained partial proportional odds model to assess the proportional odds assumption. Let's assume that departures from proportional odds (constant increments in log odds) are modeled as linear in square root of calprotectin level. There is only one predictor (`endo`), so there is only one variable that might act non-proportionally. Hence the second formula has the same right-hand-side as the first formula in the `blrm` call. ```{r bcalcppo} bcalp <- blrm(calpro ~ endo, ~ endo, cppo=sqrt) # use cppo=function(y) y for linear x by y interaction bcalp ``` The probability that the effect of moderate or severe activity decreases for higher levels of calprotectin is 0.99. Let's graph the logit of the group-stratified empirical CDFs to see some visual evidence for this. ```{r calecdf} Ecdf(~ calpro, group=endo, fun=qlogis) ``` Non-parallelism is readily seen, indicating non-proportional odds. Note that allowing for a systematic departure from proportional odds is like having unequal group variances in a 2-sample $t$-test, but is more general. With the constrained partial proportional odds model one can examine for a given y the evidence for a group effect in the tail of the distribution of Y beyond y. For ordinal logistic models the group effect is captured by the relative odds that Y $\geq$ y. We can use the `rms` `contrast` function to do the calculations for varying y. `contrast` will evaluate the `cppo` function as needed (which is a scaled and centered $\sqrt{}$). ```{r calcon,mfrow=c(1,2),h=3.5,w=7.5} ys <- seq(100, 2500, by=100) k <- contrast(bcalp, list(endo='Moderate or Severe Activity'), list(endo='No or Mild Activity'), ycut=ys) xl <- 'Calprotectin Cutoff' par(mfrow=c(1,2)) with(k, plot(ycut, Contrast, xlab=xl, ylab='log OR', type='l')) with(k, plot(ycut, PP, xlab=xl, ylab='P(OR > 1)', type='l')) ``` ## Binary Regression with Restricted Cubic Splines Turn to the `support` dataset and fit a binary logistic model to predict the probability of in-hospital death of critically ill adults. `blrm` keeps posterior sampling efficient by orthonormalizing the design matrix before doing the sampling (this is done internally in the Stan code). This allows for arbitrary collinearities, for example in the basis functions used in restricted cubic splines. When there are such collinearities, expect to see some disagreements in estimates between `blrm` and `lrm`, because the latter does not do orthonormalization (only normalization to mean 0 variance 1). Collinearity implies that there are many different solutions to the equations, all giving almost the same predicted values. ```{r support,w=7,h=7} getHdata(support) dd <- datadist(support); options(datadist='dd') f <- lrm(hospdead ~ dzgroup + rcs(crea, 5) + rcs(meanbp, 5), data=support, eps=1e-4, x=TRUE) specs(f) f # Compute the apparent standard error of Dxy (not accounting for overfitting) # for comparison with the Bayesian HPD interval for Dxy rcorr.cens(predict(f), support$hospdead) Function(f) bsup <- blrm(hospdead ~ dzgroup + rcs(crea, 5) + rcs(meanbp, 5), data=support, file='bsup.rds') specs(bsup) Function(bsup) # by default uses posterior mode parameter values # To add an intercept use e.g. Function(bsup, intercept=coef(g, 'mode')[5]) bsup stanDx(bsup) stanDxplot(bsup) plot(bsup) ``` Show approximate relative explained variation (REV) and compare this with Wald statistics from the frequentist `lrm` model. REV is less accurate the more the multivariate posterior distribution differs from a multivariate normal distribution. On a given posterior draw, REV for a term in the model is the Wald $\chi^2$ statistic divided by the Wald statistic for the whole model. ```{r rev,h=2.5} a <- anova(bsup) a plot(a) anova(f) ``` **Note**: To get a Bayesian equivalent of a likelihood ratio test for comparing two models use the `rms` function `compareBmods`. Now compute odds ratios over default inter-quartile ranges for continuous predictors, based on posterior mode parameters. Also show 0.95 HPD intervals. Note that unlike the `print` method, the `plot` method for `summary` doesn't actually compute HPD intervals, but approximates them by assuming normality and using the standard deviation of the posterior samples. Compare the plot with the ordinary `lrm` result. ```{r summary, top=2.5} s <- summary(bsup) s plot(s) plot(summary(bsup)) ``` Draw partial effect plots with 0.95 HPD intervals. Point estimates are posterior modes (which can be easily changed). ```{r peffect} ggplot(Predict(bsup)) ``` Compute estimated mortality probabilities at all levels of `dzgroup` adjusting covariates to medians/modes. Need `funint=FALSE` to tell `Predict` that `fun` is a simple 1-1 function of the linear predictor (unlike `Mean`, `Quantile`, etc.). ```{r peffect2} Predict(bsup, dzgroup, fun=plogis, funint=FALSE) ``` Draw a nomogram from posterior mode parameter values. ```{r nomogram} p <- nomogram(bsup, fun=plogis, funlabel='P(death)') plot(p) ``` For comparison here is a nomogram based on maximum likelihood estimates of parameters rather than posterior modes. ```{r nomfreq} plot(nomogram(f, fun=plogis, funlabel='P(death)')) ``` # Partial Proportional Odds Model The proportional odds (PO) assumption is a parallelism assumption reflecting the belief that the effect of baseline variables on, say, $Y \geq 3$ is the same effect on $Y \geq 4$. To relax that assumption, @pet90par developed the [partial proportional odds model](https://www.jstor.org/stable/2347760) (PPO) model. The `blrm` function accepts a second model formula in the argument named `ppo` that specifies the subset of predictors for which PO is not to be assumed but for which the model is effectively polytomous (multinomial). Note that for frequentist modeling the R `VGAM` package handles the PPO model (as will be shown below). `VGAM` is more flexible than what `blrm` can do in allowing for all sorts of model restrictions. The presence of the second formula triggers fitting a PPO model. The default is the unconstrained PPO model, which has $(k - 2) \times q$ extra parameter for $k$ category $Y$ and $q$ equal to the number of columns in the induced design matrix from the second formula. This is normally too many parameters. More typically, the _constrained PPO model_ (@pet90par Eq. 6) is used. This model is fitted when a function is provided as the `cppo` argument. Generalizing Peterson and Harrell Eq. 6 the `cppo` function can be a continuous function of $Y$ as well as being a discontinuous function such as an indicator variable that allows a category of $Y$ (typically the first or last) to have a special effect of selected covariates. When $Y$ is continuous, `cppo` would typically be the continuous value of $Y$ or a monotonic function of it. This induces a model that is akin to having a treatment $\times$ time interaction in a Cox proportional hazards model, or systematic heteroskedasticity in a linear model. When Y is very skewed, it may be more reasonable to use something like `cppo = function(y) y^(1/3)`. Note that `blrm` internally centers (by the mean) and scales (by the SD) the function value, so the user need not concern herself with location or scale of the PO-modification function. Note also that post-fitting functions for estimation and prediction are not implemented for the unconstrained partial PO model. Note that the `cppo` function used to trigger the use of the constrained PPO model is never evaluated at the lowest value of Y. That is because in the data likelihood, the probability element for observations at the minimum Y is one minus the probability element at the second lowest value of Y. So you can't give a special effect at the minimum Y. If for example you have a scale ordered as death, sick, well and you want to allow for a special effect of a covariate on death, reverse the order of levels in the factor variable representing Y and specify `cppo=function(y) y == 'dead'`. ## Unconstrained Partial PO Model Let's look first at the unconstrained PPO model, which is more suitable for categorical Y with not too many categories. Consider a $2\times 3$ table of proportions (2 treatment groups, 3 ordered outcome levels) where the treatment effect is not in PO. We will fit a PO model and see how well it tries to reconstruct the 6 proportions, then fit a PPO model. As of 2020-05-11 `blrm` has not implemented `predict`-type functions for PPO models, so you will se predictions done the long way (which does better show how PPO works). Note: the `VGAM` `vgam` function parameterizes PPO effects by $y$-specific covariate effects whereas `blrm` like Peterson and Harrell parameterize the model by estimating increments in log odds for the covariate effect for $y \geq y$ over and above the effect for $y \geq 2$ if $y$ has values 1, 2, ..., $k$. Bayesian model specification is detailed [here](https://github.com/harrelfe/stan/blob/master/notes.md). Not shown here is that `blrm` also allows for random effects with PPO models to handle longitudinal data if a compound symmetry correlation pattern is reasonable. ```{r ppo} p0 <- c(.4, .2, .4) p1 <- c(.3, .1, .6) m <- 50 # observations per cell m0 <- p0 * m # from proportions to frequencies m1 <- p1 * m x <- c(rep(0, m), rep(1, m)) y0 <- c(rep(1, m0[1]), rep(2, m0[2]), rep(3, m0[3])) y1 <- c(rep(1, m1[1]), rep(2, m1[2]), rep(3, m1[3])) y <- c(y0, y1) table(x, y) ``` ```{r ppo2} # A PO model cannot reproduce the original proportions f <- lrm(y ~ x) f ``` ```{r ppo3} predict(f, data.frame(x=c(0, 1)), type='fitted.ind') require(VGAM) fv <- vgam(y ~ x, cumulative(reverse=TRUE, parallel=TRUE)) coef(fv) predict(fv, data.frame(x=c(0, 1)), type='response') # Now fit a PPO model that will reproduce all cell proportions fvppo <- vgam(y ~ x, cumulative(reverse=TRUE, parallel=FALSE)) coef(fvppo) predict(fvppo, data.frame(x=c(0, 1)), type='response') # perfect recovery # Function to manually compute cell probablities pprop <- function(co, type, centered=FALSE) { x <- if(centered) c(-0.5, 0.5) else 0:1 switch(type, vgam = { pge2 <- plogis(co[1] + x * co[3]) peq3 <- plogis(co[2] + x * co[4]) rbind(c(1 - pge2[1], pge2[1] - peq3[1], peq3[1]), c(1 - pge2[2], pge2[2] - peq3[2], peq3[2])) }, blrm = { pge2 <- plogis(co[1] + x * co[3]) peq3 <- plogis(co[2] + x * (co[3] + co[4])) rbind(c(1 - pge2[1], pge2[1] - peq3[1], peq3[1]), c(1 - pge2[2], pge2[2] - peq3[2], peq3[2])) } ) } co <- coef(vgam(y ~ x, cumulative(reverse=TRUE, parallel=FALSE))) pprop(co, type='vgam') ``` ```{r ppo4} # Now try blrm # First mimic PO model by penalizing PPO term to nearly zero # Quickly get maximum likelihood estimates (posterior modes) b <- blrm(y ~ x, ~x, priorsd=1000, priorsdppo=0.001, method='opt') coef(b) pprop(coef(b), type='blrm') # Now really fit PPO model, at first only getting MLE # Do full posterior sampling b <- blrm(y ~ x, ~ x, priorsd=1000, priorsdppo=1000, method='opt') coef(b) # also the posterior mode coef(b)[3] + coef(b)[4] bppo <- blrm(y ~ x, ~ x, priorsd=1000, priorsdppo=1000, file='bppo.rds') # take differences in last 2 coefficients to get our scheme # Check recovery of proportions, using posterior mode/mean/median pprop(coef(bppo, 'mode'), type='blrm') pprop(coef(bppo, 'mean'), type='blrm') pprop(coef(bppo, 'median'), type='blrm') ``` ```{r ppo5} bppo ``` ## Constrained Partial PO Model Consider the same dataset analyzed above. Specify a constrained PPO model that in this particular case is really unconstrained because it has a total of two parameters to handle the group effect, and there are only $k=3$ levels of Y. ```{r cppo} # cppo function specifies that there is a special effect of x for y=3 bcppo <- blrm(y ~ x, ~ x, cppo=function(y) y == 3, file='bcppo.rds') cppo <- bcppo$cppo # automatically normalized to have mean 0 SD 1 cppo b <- coef(bcppo, 'mode') rbind(Mode=b, Mean=coef(bcppo)) # Compute the 4 cumulative probabilities using the posterior mode (MLE) # b[3] and b[4] are multiplied x (hence drop out of the x=0 row) L <- rbind('x=0' = c(b[1], b[2]), 'x=1' = c(b[1] + b[3] + cppo(2) * b[4], b[2] + b[3] + cppo(3) * b[4])) plogis(L) ``` Now consider the severity of nausea data from @pet90par. ```{r nausea} d0 <- data.frame(tx=0, y=c(rep(0, 43), rep(1, 39), rep(2, 13), rep(3, 22), rep(4, 15), rep(5, 29))) d1 <- data.frame(tx=1, y=c(rep(0, 7), rep(1, 7), rep(2, 3), rep(3, 12), rep(4, 15), rep(5, 14))) d <- rbind(d0, d1) d$tx <- factor(d$tx, 0:1, c('No cisplatin', 'cisplatin')) dd <- datadist(d); options(datadist='dd') with(d, table(tx, y)) # Allow for a different effect of tx at y=5 g <- function(y) y==5 # max(y)=5 and y is discrete # Check against maximum likelihood estimates in Peterson & Harrell f <- blrm(y ~ tx, ~ tx, cppo=g, data=d, method='opt') # Compute the treatment effect log(OR) for y=1, 2, 3, 4, 5 h <- f$cppo # normalized version of g (mean 0 SD 1) k <- coef(f) k k[6] + h(1:5) * k[7] # matches paper # Now get posterior distributions of parameters fp <- blrm(y ~ tx, ~ tx, cppo=g, data=d, file='bnausea.rds') rbind(coef(f), coef(fp, 'mode'), coef(fp, 'mean')) k <- coef(fp) # posterior means k[6] + h(1:5) * k[7] # close to paper dat <- data.frame(tx=levels(d$tx)) # Get posterior mean and 0.95 HPD intervals for treatment # effects at all levels of y contrast(fp, list(tx='cisplatin'), list(tx='No cisplatin'), ycut=1:5) Predict(fp, tx) ``` If the response variable is discrete and has character strings for the ordered factor levels, you can use these strings in the `cppo` function definition. For example, suppose that Y was a `factor` variable with these levels: `"ok"`, `"in pain"`, `"stroke"`, or `"death"`. To allow a treatment to a have a different effect for the last two levels while adjusting for a covariate `age` that is assumed to operate in proportional odds, one can code ``` f <- blrm(y ~ age + tx, ~ tx, cppo=function(y) y %in% c('stroke', 'death')) ``` There are special arguments to some of the `rms` functions for getting estimates or predictions for partial PO models. For the `Predict` function you can specify `ycut` or `kint` to specify the response variable value such that the logit or probability of $Y\geq y$ is being estimated. For `contrast` (full name `contrast.rms`) you can specify the argument `y` as a scalar or vector. When it is a scalar, that value of the y cutoff is used for all contrasts. When it is a vector, it is assume to either (1) have length equal to the number of contrasts being specified so that the appropriate value of `y` is used for each contrast, or (2) when only one contrast is requested, the contrast is repeated `length(y)` times to estimate the effect of varying `y`. This was done in the last part of the example above. When the response variable is continuous, it is more flexible to specify `ycut` to `Predict` than to specify `kint`, because for `ycut` you can specify a value of Y that did not occur in the data (when the `cppo` function is continuous) to take into account the degree of non-proportional odds at that value. # Longitudinal Data Examples: Random Effects ## Schizophrenia Dataset from `mixor ` Package The R `mixor` package fits frequentist random effects proportional odds models. Let's analyze the first dataset discussed in the `mixor` package vignette. The outcome is a 4-level severity of illness scale. ```{r mixor} require(mixor) data(schizophrenia) d <- schizophrenia f <- mixor(imps79o ~ sqrt(Week) * TxDrug, id=id, link='logit', data=d) f sqrt(diag(vcov(f))) ``` Fit the same model using the Bayesian approach. ```{r mixorb} bmixor <- blrm(imps79o ~ sqrt(Week) * TxDrug + cluster(id), data=d, file='bmixor.rds') bmixor <- blrm(imps79o ~ sqrt(Week) * TxDrug + cluster(id), data=d, file='/tmp/bmixorrep.rds') # 34 sec bmixor ``` The square of the posterior median for $\sigma_\gamma$ is `r median(bmixor$omega[,'sigmag'])^2` which compares well with the `mixor` estimate of 3.774. The $\hat{\beta}$s also compare very well. Note that the model is stated differently, which makes two of the intercepts have different meanings across packages. Repeat the fit using `cmdstan`. ```{r mixorbc} # Sometimes RStudio can't find the CMDSTAN environment variable cmdstanr::set_cmdstan_path('/usr/local/bin/cmdstan-2.31.0') bmixor2 <- blrm(imps79o ~ sqrt(Week) * TxDrug + cluster(id), data=d, backend='cmdstan', file='bmixor2.rds') # 29 sec once compiled code stored bmixor2 ``` ## Simulated Random Effects Longitudinal Data Let's generate some data with repeatedly measured outcome per subject where the outcome is binary and the random effects have a $N(0, 0.25^2)$ distribution. 500 subjects have 10 measurements each. ```{r re} n <- 500 # subjects set.seed(2) re <- rnorm(n) * 0.25 # worked fine also with rnorm(n) * 4 X <- runif(n) # baseline covariate, will be duplicated over repeats m <- 10 # measurements per subject id <- rep(1 : n, each = m) x <- X[id] L <- x + re[id] # actual logit y <- ifelse(runif(n * m) <= plogis(L), 1, 0) f <- lrm(y ~ x, x=TRUE, y=TRUE) # ordinary fit f # now use cluster sandwich covariance estimator: g <- robcov(f, id) # covariance matrix adjusted for clustering g ``` We first fit an inappropriate Bayesian model in which the random effects are omitted. ```{r bayesreo} # Note: loo defaults to FALSE when n > 1000 as in this case # Need loo for compareBmods breo <- blrm(y ~ x, loo=TRUE, file='breo.rds') breo ``` Now use a proper Bayesian random effects model. The prior distribution for the standard deviation $\sigma_{\gamma}$ of the random effects ($\gamma$s) is assumed to be exponential when `psigma=2`, and we will use the default mean of 1.0. ```{r bayesre} bre <- blrm(y ~ x + cluster(id), psigma=2, loo=TRUE, file='bre.rds') bre plot(bre) ``` Before delving more into the random effects model, let's compare this new model with the previous model that erroneously omitted the random effects. ```{r compare} compareBmods(breo, bre) ``` Roughly speaking, of the two models, the one with random effects has a probability of 0.93 of being the correct one. See `rstan::loo` and `loo::loo.array` for details. Now let's get into more details from the random effects model fit. ```{r bayesreh} # Plot distribution of the 500 estimated random effects (posterior medians) hist(bre$gammas, xlab='Estimated Random Effects', nclass=40) ``` Now generate similar data except for a bimodal random effects distribution. This will fool the random effects normal prior into having a wider variance for a single normal distribution but will still result in estimated random effects that are somewhat realistic. ```{r re2, w=7,h=6} n <- 500 set.seed(3) re <- c(rnorm(n/2, mean=-1.75), rnorm(n/2, mean=1.75)) * 0.25 cat('SD of real random effects:', round(sd(re), 4), '\n') X <- runif(n) # baseline covariate, will be duplicated over repeats m <- 10 # measurements per subject id <- rep(1 : n, each = m) x <- X[id] L <- x + re[id] # actual logit y <- ifelse(runif(n * m) <= plogis(L), 1, 0) breb <- blrm(y ~ x + cluster(id), file='breb.rds') breb par(mfrow=c(2, 2)) hist(breb$gammas, xlab='Estimated Random Effects', nclass=40, main='') hist(re, xlab='Real Random Effects', nclass=40, main='') plot(re, breb$gammas, xlab='Real', ylab='Estimated') abline(a=0, b=1) ``` ## Absorbing State in Mixed Effects Ordinal Regression `blrm` is not designed to handle this situation but let's see how it performs. For an ordinal outcome y=0, 1, 2, 3, 4, 5 suppose that y=5 represents an absorbing state such as death. Suppose that subjects are observed for 10 days, and if death occurs within those days, all later values of y for that subject are set to 5. Generate repeated outcomes under a $N(0, 0.25^2)$ random effects model with two treatments: `a` and `b`. The `b:a` odds ratio is 0.65 and the cell probabilities are 0.3, 0.3, 0.1, 0.1, 0.1, 0.1 corresponding to y=0-5, when the random effect is zero. ```{r os} # Generate data as if there is no absorbing state n <- 1000 set.seed(6) pa <- c(.3, .3, .1, .1, .1, .1) # P(Y=0-5 | tx=a, random effect=0) pb <- pomodm(p=pa, odds.ratio=0.65) # P(Y=0-5 | tx=b, re=0) # Hmisc round(pb, 3) re <- rnorm(n) * 0.25 tx <- c(rep('a', n/2), rep('b', n/2)) # will be duplicated over repeats m <- 10 # measurements per subject id <- rep(1 : n, each = m) time <- rep(1 : m, n) or <- exp(log(0.65) * (tx[id] == 'b') + re[id]) y <- integer(n * m) for(j in 1 : (n * m)) { p <- pomodm(p=pa, odds.ratio=or[j]) y[j] <- sample(0:5, 1, p, replace=TRUE) } Tx <- tx[id] table(Tx, y) ``` The first Bayesian proportional odds model fitted is the one that exactly matches the data generation model, as we have not yet imposed an absorbing state, so that outcomes with y < 5 can appear after a y=5 outcome for the subject. ```{r noabs} bst <- blrm(y ~ Tx + cluster(id), file='bst.rds') bst ``` ```{r noabsdx} stanDx(bst) stanDxplot(bst, 'ALL') ``` If `time` were to be added to the above model, you'll see that its regression coefficient is very small ($\hat{\beta}=0.009$ in this case), in alignment with the data generating model. Now assume that state y=5 is an absorbing state. Change observations after the first y=5 within subject to also have y=5. ```{r absorb} require(data.table) g <- function(x) if(length(x)) min(x, na.rm=TRUE) else 99L u <- data.table(id, time, Tx, y, key='id') # Add variable 'first' which is time of first y=5 for subject (99 if never) w <- u[, .(first=g(time[y == 5])), by=id] d <- u[w] # Show distribution of first time of y=5 table(d[time == 1, first]) # Set all observations after the first y=5 to also have y=5 z <- d z[time > first, y:=5] table(u$y); table(d$y); table(z$y) ``` ```{r absorb2} bcf <- blrm(y ~ Tx + cluster(id), data=z, file='bcf.rds') bcf hist(bcf$gammas, xlab='Estimated Random Effects', nclass=40, main='') ``` The regression coefficient for treatment is too large (the true value is log(0.65) = `r round(log(0.64), 3)`). The standard deviation of random effects is large (the true value is 0.25), reflecting increased dependence of outcomes without subject due to the duplication of y=5 records. **However** the data being analyzed were not formally generated with the model that has a treatment odds ratio of 0.65. Repeated correlated ordinal outcomes were generated with that odds ratio and with a random effect standard deviation of 0.25, but then the outcomes were overridden in the following fashion: The first time within a subject that y=5 causes suppression of all later records. The histogram of estimated subject random effects (posterior medians) shows some bimodality with heavy right tail due to the y=5 absorbing state. Let s also plot the random effects against the time of death (99 if the subject did not die, recoded here to 15). ```{r revst} t5 <- subset(z, time == 1)$first t5 <- ifelse(t5 == 99, 15, t5) plot(t5, bcf$gammas, xlab='Time of y=5', ylab='Random Effect') ``` What happens with time is added to this model? ```{r absorb3} bcft <- blrm(y ~ Tx + time + cluster(id), data=z, file='bcft.rds') bcft ``` We see that the slope of time is very large, but the treatment effect and random effect standard deviation are still very large. Look at random effects again. ```{r bcftre} hist(bcft$gammas, xlab='Estimated Random Effects', nclass=40, main='') plot(t5, bcft$gammas, xlab='Time of y=5', ylab='Random Effect') ``` Next we truncate patient records so that y=5 is not carried forward. ```{r nocarry} zt <- z[time <= first] bnc <- blrm(y ~ Tx + cluster(id), data=zt, file='bnc.rds') bnc ``` Finally, add time to the above model. ```{r nocarryt} bnct <- blrm(y ~ Tx + time + cluster(id), data=zt, file='bnct.rds') bnct ``` The time effect is very weak, and adding it did not change the already-accurate (with respect to the first part of the data generating mechanism) treatment effect posterior mean. # Censored Data The `blrm` function handles left-, right-, and interval-censored ordinal categorical or continuous Y. This opens up numerous possibilities, for example * one can analyze the usual right-censored time-to-event outcome but assuming, for example, prpoportional odds instead of proportional hazards * there can be more than one lower limit of detectability for a lab measurement * in a longitudinal study certain ranges of the outcome scale may not be assessed on certain days As an example of the third situation, suppose that Y is defined as follows: | Level of Y | Meaning | |------------|---------| | 0 | best quality of life | | 1 | very good QOL | | 2 | fair QOL | | 3 | poor QOL | | 4 | myocardial infarction | | 5 | stroke | | 6 | death | Suppose that Y were assessed weekly and that the clinical events of MI, stroke, or death are always known when they occur. But suppose that QOL is only assessed once per month. Instead of dealing with complex missing data methods, consider Y to be partially assessed by the use of left censoring. On weeks of non-assessment of QOL consider Y to just be known to be < 4 when the participant is event-free. `blrm` uses the `Ocens` function ("ordinal censoring") to handle censored Y. The notation is `Ocens(a, b)` where `a` is the lowest value that Y might be, and `b` is the highest value it might be for a certain observation. If `a=b`, the observation is uncensored. If in a given week a patient has a clinical event, that event overrides any level of QOL and will result in Y=5, for example, for stroke. For a week for which a clinical event does not occur, we know that Y < 4. When QOL is assessed, we know which of Y=0,1,2,3 pertains. When QOL is not assessed, Y = `Ocens(0, 4)`. It should be clear that if QOL were not assessed for any participant, the dependent variable is really a 4-level outcome (3 clinical outcomes, with Y=0 denoting that no bad outcome occurred for the participant). In full-likelihood models such as our extended PO model, censored data are easily handled. One just has to compute the contribution to the log-likelihood for each observation from the information it provides. An observation interval-censored with Y $\in [3,4]$ has a likelihood equal to the model's probability that Y is between 3 and 4 inclusive. For a cumulative probability model this easily derived from $P(Y \geq 3) - P(Y > 4)$ which is a difference in expits. Here is an example where we simulate a dataset with a 6-level ordinal response variable with no censored values, and fit the PO model to it. Then the dataset is duplicated and all the observations with y=1 or 2 are left censored at <= 2 (which is the same as interval censoring on [1,2]), all those with y=3 or 4 are interval censored in [3,4], and all those with y=5 or 6 are right censored at y >= 5. The model is re-fitted to see if the posterior mean regression coefficients remain relatively unchanged. ```{r cens} set.seed(1) n <- 500 x <- rnorm(n) y <- as.integer(cut2(x + rnorm(n), g=6)) f <- blrm(y ~ x, file='bnocens.rds') m <- length(x) y2 <- y[1 : m] a <- b <- y2 # Left censor obs with y <= 2 i <- y2 <= 2 a[i] <- 1 b[i] <- 2 # Interval censor obs with y = 3 or 4 i <- y2 == 3 | y2 == 4 a[i] <- 3 b[i] <- 4 # Right censor obs with y = 5 or 6 i <- y2 >= 5 a[i] <- 5 b[i] <- 6 table(y2, paste0('[', a, ',', b, ']')) Y <- Ocens(c(y, a), c(y, b)) x <- c(x, x[1 : m]) g <- blrm(Y ~ x, file='bcens.rds') rbind('No cens:mode'=coef(f, 'mode'), 'No cens:mean'=coef(f, 'mean'), 'Cens:mode' =coef(g, 'mode'), 'Cens:mean' =coef(g, 'mean')) ``` # Multiple Imputation When possible, full joint Bayesian modeling of (possibly missing) covariates and the outcome variable should be used to get exact inference in the presence of missing covariate values. Another good approach is to use multiple imputation with stacking of posterior draws after running the Bayesian model for each completed dataset. When doing posterior inference on the stacked posterior draws the uncertainty from multiple imputation is fully taken into account, and so is the change in posterior distribution. Frequentist inference requires complex adjustments, as multiple imputation alters the sampling distribution of model parameter estimates. For example, regression coefficient estimates that have a normal distribution with complete data may have a $t$-like distribution after multiple imputation. `blrm` works with the `stackMI` function to make posterior stacking easier. It works with `Hmisc::aregImpute` and the `mice` package. `stackMI` is the analog of the `Hmisc::fit.mult.impute` but is much simpler due to the use of the Bayesian paradigm. Here is an example adapted from the `aregImpute` help file. The `aregImpute` is stored so that the random seed-initiated multiple imputation process will not be re-run if not necessary. That will allow the Stan code to not be run again until the underlying data changes or the code in the chunks changes. ```{r aregi,w=7,h=7} set.seed(2) n <- 1000 x1 <- factor(sample(c('a','b','c'), n, TRUE)) x2 <- (x1=='b') + 3*(x1=='c') + rnorm(n,0,2) x3 <- rnorm(n) xbeta <- 0.35 * (x2 + 1 * (x1 == 'c') + 0.2 * x3) y <- ifelse(runif(n) <= plogis(xbeta), 1, 0) x1[1:250] <- NA x2[251:350] <- NA d <- data.frame(x1,x2,x3,y, stringsAsFactors=TRUE) mithere <- file.exists('mi.rds') mi <- if(mithere) readRDS('mi.rds') else aregImpute(~ y + x1 + x2 + x3, nk=3, data=d, B=10, n.impute=5, pr=FALSE) if(! mithere) saveRDS(mi, 'mi.rds', compress='xz') mi # Note: The following model will be re-run every time aregImpute runs # because imputations have randomness bmi <- stackMI(y ~ x1 + x2 + x3, blrm, mi, data=d, refresh=50, file='bmi.rds') stanDx(bmi) stanDxplot(bmi, which='x1=c', rev=TRUE, stripsize=5) ``` ```{r aregip,w=7,h=6} plot(bmi) ``` One can see that the 5 individual posterior distributions for the frequently missing variables `x1` and `x2` vary a lot, but not so for the never missing variable `x3`. Computations done on the `bmi` object will automatically use the full stacked posterior distribution. # Scaling and Intepretation of Priors on $\beta$s `blrm` orthonormalizes the data design matrix to greatly improve posterior distribution sampling in the case of collinearities among predictors (especially among spline basis functions). Through the `keepsep` argument one can hold out selected columns of the design matrix from QR orthogonalization so that prior standard deviations will apply to known quantities. When no columns are held out in this way, one can get a sense of the new QR scale on the translated columns by printing the standard deviations of the QR-transformed design matrix columns, which is stored in the `xqrsd` object in the fit object. The following example compares this to the SDs from the original design matrix (which is stored in the `x` object in the fit). ```{r qrscale} set.seed(1) n <- 200 x <- runif(n, 0, 10) tx <- rep(0:1, n/2) y <- sample(0:1, n, TRUE) f <- blrm(y ~ tx + pol(x, 3), iter=500) rbind(apply(f$x, 2, sd), f$xqrsd) ``` The `selectedQr` function can be used to recompute the transformed data matrix for further examination. You'll see that when centering is used (as with `blrm`) the transformed columns are orthogonal. # Speed of `blrm` For Large Numbers of Y Levels When there is a large number of intercepts in the model, the speed of `blrm` will decrease. What about the speed of using `blrm` just to get (potentially penalized) maximum likelihood estimates? Let's try fitting a progressively more continuous dependent variable. ```{r manyint} set.seed(1) n <- 1000 x <- rnorm(n) y <- x + rnorm(n) for(g in c(2, 4, 8, 16, 32, 64, 128, 256)) { cat('\n', g, 'distinct values of y\n') yg <- cut2(y, g=g) print(system.time(f <- blrm(yg ~ x, method='optimizing'))) } ``` This is impressive. For g=256 compare with the execution time of the Newton-Raphson method making optimum use of sparse matrices. Also compare coefficients. When sampling is done, the default Dirichlet distribution concentration parameter for the intercepts is selected to make the posterior means agree with maximum likelihood estimates, sacrificing some performance of posterior modes. When `method='optimizing'`, instead a concentration parameter of 1.0 for the Dirichlet prior distribution for intercepts is always used, which seems to make optimization agree more with maximum likelihood estimates. This optimization is used to get posterior modes when random effects are not present. ```{r morm} system.time(g <- orm(yg ~ x)) plot(coef(g), coef(f), xlab='Coefficients from orm', ylab='Coefficients from blrm') abline(a=0, b=1, col=gray(0.8)) ``` See how long it takes to do posterior sampling with `rstan` when there are 16, 64, or 128 levels of y. ```{r sampmany} for(g in c(16, 64, 128)) { cat('\n', g, 'distinct values of y\n') yg <- cut2(y, g=g) print(system.time(h <- blrm(yg ~ x))) } ``` # Computing Environment ```{r echo=FALSE} markupSpecs$html$session() ``` # References