Data source: The Titanic Passenger List edited by Michael A. Findlay, originally published in Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, and expanded with the help of the Internet community. The original html files were obtained from Philip Hind (1999). The dataset was compiled and interpreted by Thomas Cason. It is available in R and spreadsheet formats from hbiostat.org/data under the name titanic3.
12.1 Descriptive Statistics
Code
require(rms)options(prType='html') # for print, summary, anovagetHdata(titanic3) # get dataset from web site# List of names of variables to analyzev <-c('pclass','survived','age','sex','sibsp','parch')t3 <- titanic3[, v]units(t3$age) <-'years'describe(t3)
t3 Descriptives
t3
6 Variables 1309 Observations
pclass
n
missing
distinct
1309
0
3
Value 1st 2nd 3rd
Frequency 323 277 709
Proportion 0.247 0.212 0.542
For the frequency table, variable is rounded to the nearest 0
Code
spar(ps=6,rt=3)dd <-datadist(t3)# describe distributions of variables to rmsoptions(datadist='dd')s <-summary(survived ~ age + sex + pclass +cut2(sibsp,0:3) +cut2(parch,0:3), data=t3)plot(s, main='', subtitles=FALSE)
Show 4-way relationships after collapsing levels. Suppress estimates based on \(<25\) passengers.
12.2 Exploring Trends with Nonparametric Regression
Code
b <-scale_size_discrete(range=c(.1, .85))yl <-ylab(NULL)p1 <-ggplot(t3, aes(x=age, y=survived)) +histSpikeg(survived ~ age, lowess=TRUE, data=t3) +ylim(0,1) + ylp2 <-ggplot(t3, aes(x=age, y=survived, color=sex)) +histSpikeg(survived ~ age + sex, lowess=TRUE,data=t3) +ylim(0,1) + ylp3 <-ggplot(t3, aes(x=age, y=survived, size=pclass)) +histSpikeg(survived ~ age + pclass, lowess=TRUE,data=t3) + b +ylim(0,1) + ylp4 <-ggplot(t3, aes(x=age, y=survived, color=sex,size=pclass)) +histSpikeg(survived ~ age + sex + pclass,lowess=TRUE, data=t3) + b +ylim(0,1) + ylgridExtra::grid.arrange(p1, p2, p3, p4, ncol=2) # combine 4
Code
top <-theme(legend.position='top')p1 <-ggplot(t3, aes(x=age, y=survived, color=cut2(sibsp,0:2))) +stat_plsmo() + b +ylim(0,1) + yl + top +scale_color_discrete(name='siblings/spouses')p2 <-ggplot(t3, aes(x=age, y=survived, color=cut2(parch,0:2))) +stat_plsmo() + b +ylim(0,1) + yl + top +scale_color_discrete(name='parents/children')gridExtra::grid.arrange(p1, p2, ncol=2)
12.3 Binary Logistic Model with Casewise Deletion of Missing Values
First fit a model that is saturated with respect to age, sex, pclass
Insufficient variation in sibsp, parch to fit complex interactions or nonlinearities.
With age appearing in so many terms, giving too many parameters to age creates instabilities and makes many bootstrap repetitions fail to converge or to yield singular covariance matrices
Use AIC to determine the global number of knots for age that is “best for the money” in terms of being the most likely to cross-validate well
4 knots has best (lowest) AIC and we’ll use that going forward
Refit that model with x=TRUE, y=TRUE so can do likelihood ratio (LR) tests
But start with Wald tests
Code
f1 <-lrm(survived ~ sex*pclass*rcs(age,4) +rcs(age,4)*(sibsp + parch), data=t3, x=TRUE, y=TRUE)print(f1, r2=1:4) # print all 4 R^2 measures that use only the global LR chi-square
Logistic Regression Model
lrm(formula = survived ~ sex * pclass * rcs(age, 4) + rcs(age,
4) * (sibsp + parch), data = t3, x = TRUE, y = TRUE)
Frequencies of Missing Values Due to Each Variable
But models almost always provide better descriptive statistics
Code
m <-lrm(is.na(age) ~ sex * pclass + survived + sibsp + parch,data=t3)m
Logistic Regression Model
lrm(formula = is.na(age) ~ sex * pclass + survived + sibsp +
parch, data = t3)
Model Likelihood
Ratio Test
Discrimination
Indexes
Rank Discrim.
Indexes
Obs 1309
LR χ2 114.99
R2 0.133
C 0.703
FALSE 1046
d.f. 8
R28,1309 0.078
Dxy 0.353
TRUE 263
Pr(>χ2) <0.0001
R28,630.5 0.156
γ 0.451
max |∂log L/∂β| 5×10-6
Brier 0.148
τa 0.131
β
S.E.
Wald Z
Pr(>|Z|)
Intercept
-2.2030
0.3641
-6.05
<0.0001
sex=male
0.6440
0.3953
1.63
0.1033
pclass=2nd
-1.0079
0.6658
-1.51
0.1300
pclass=3rd
1.6124
0.3596
4.48
<0.0001
survived
-0.1806
0.1828
-0.99
0.3232
sibsp
0.0435
0.0737
0.59
0.5548
parch
-0.3526
0.1253
-2.81
0.0049
sex=male × pclass=2nd
0.1347
0.7545
0.18
0.8583
sex=male × pclass=3rd
-0.8563
0.4214
-2.03
0.0422
Code
anova(m)
Wald Statistics for is.na(age)
χ2
d.f.
P
sex (Factor+Higher Order Factors)
5.61
3
0.1324
All Interactions
5.58
2
0.0614
pclass (Factor+Higher Order Factors)
68.43
4
<0.0001
All Interactions
5.58
2
0.0614
survived
0.98
1
0.3232
sibsp
0.35
1
0.5548
parch
7.92
1
0.0049
sex × pclass (Factor+Higher Order Factors)
5.58
2
0.0614
TOTAL
82.90
8
<0.0001
pclass and parch are the important predictors of missing age.
12.5 Single Conditional Mean Imputation
Single imputation is not the preferred approach here. Click below to see this section.
Single Imputation and Analysis Result
First try: conditional mean imputation Default spline transformation for age caused distribution of imputed values to be much different from non-imputed ones; constrain to linear. Also force discrete numeric variables to be linear because knots are hard to determine for them.
transcan(x = ~I(age) + sex + pclass + I(sibsp) + I(parch), imputed = TRUE,
pr = FALSE, pl = FALSE, data = t3)
Iterations: 4
R-squared achieved in predicting each variable:
age sex pclass sibsp parch
0.236 0.075 0.232 0.200 0.173
Adjusted R-squared:
age sex pclass sibsp parch
0.233 0.072 0.229 0.197 0.170
Coefficients of canonical variates for predicting each (row) variable
age sex pclass sibsp parch
age 1.33 5.98 -3.16 -0.85
sex 0.04 -0.67 -0.04 -0.80
pclass 0.08 -0.32 0.14 0.02
sibsp -0.02 -0.01 0.08 0.39
parch 0.00 -0.15 0.01 0.28
Summary of imputed values
Starting estimates for imputed values:
age sex pclass sibsp parch
28 2 3 0 0
Code
# Look at mean imputed values by sex,pclass and observed means# age.i is age, filled in with conditional mean estimatesage.i <-with(t3, impute(xtrans, age, data=t3))i <-is.imputed(age.i)with(t3, tapply(age.i[i], list(sex[i],pclass[i]), mean))
sex × pclass × age.i (Factor+Higher Order Factors)
11.88
6
0.0647
Nonlinear
10.57
4
0.0318
TOTAL NONLINEAR
40.70
16
0.0006
TOTAL INTERACTION
108.27
23
<0.0001
TOTAL NONLINEAR + INTERACTION
117.26
25
<0.0001
TOTAL
649.29
31
<0.0001
Figure 12.9: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean imputation (top). is set to zero for these predicted values.
Figure 12.10: Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean imputation (top). is set to zero for these predicted values.
D
12.6 Multiple Imputation
The following uses aregImpute with predictive mean matching. By default, aregImpute does not transform age when it is being predicted from the other variables. Four knots are used to transform age when used to impute other variables (not needed here as no other missings were present). Since the fraction of observations with missing age is \(\frac{263}{1309} = 0.2\) we use 20 imputations.
Force sibsp and parch to be linear for imputation, because their highly discrete distributions make it difficult to choose knots for splines.
Code
set.seed(17) # so can reproduce random aspectsmi <-aregImpute(~ age + sex + pclass +I(sibsp) +I(parch) + survived,data=t3, n.impute=20, nk=4, pr=FALSE)mi
Multiple Imputation using Bootstrap and PMM
aregImpute(formula = ~age + sex + pclass + I(sibsp) + I(parch) +
survived, data = t3, n.impute = 20, nk = 4, pr = FALSE)
n: 1309 p: 6 Imputations: 20 nk: 4
Number of NAs:
age sex pclass sibsp parch survived
263 0 0 0 0 0
type d.f.
age s 1
sex c 1
pclass c 2
sibsp l 1
parch l 1
survived l 1
Transformation of Target Variables Forced to be Linear
R-squares for Predicting Non-Missing Values for Each Variable
Using Last Imputations of Predictors
age
0.294
Code
# Print the first 10 imputations for the first 10 passengers# having missing agemi$imputed$age[1:10, 1:10]
This method takes final \(\hat{\beta}\) from a single model fit on 20 stacked completed datasets
But standard errors come from the usual Rubin’s rule and the 20 fits
rms::processMI computes the LR statistics from special information saved by fit.mult.impute triggered by lrt=TRUE
The Hmisc package runifChanged function is used to save the result and not spend 1m running it again until an input changes
The rmsLRupdate function is run to fix likelihood ratio-related statistics (LR test, its \(p\)-value, various \(R^2\) measures) using the overall Chan & Meng model LR \(\chi^2\) computed by processMI
Two of the \(R^2\) printed use an effective sample size of 927 for the unbalanced binary survived variable
sex (Factor+Higher Order Factors) pclass (Factor+Higher Order Factors)
1.017 1.044
age (Factor+Higher Order Factors) TOTAL
0.932 1.010
G
Using all available data resulted in increases in predictive information for sex, pclass and strangely a reduction for age
For each completed dataset run bootstrap validation of model performance indexes and the nonparametric calibration curve. Because the 20 analyses of completed datasets help to average out some of the noise in bootstrap estimates we can use fewer bootstrap repetitions (100) than usual (300 or so).
spar(bot=1, top=0.5, ps=8)# Get predicted values for certain types of passengerss <-summary(f.mi, age=c(1,30), sibsp=0:1)# override default ranges for 3 variablesplot(s, log=TRUE, main='')
Code
phat <-predict(f.mi, combos <-expand.grid(age=c(2,21,50),sex=levels(t3$sex),pclass=levels(t3$pclass),sibsp=0, parch=0), type='fitted')# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,# sibsp=0, fun=plogis)$yhatoptions(digits=1)data.frame(combos, phat)
We can also get predicted values by creating an R function that will evaluate the model on demand, but that only works if there are no 3rd-order interactions.
I
Code
pred.logit <-Function(f.mi)# Note: if don't define sibsp to pred.logit, defaults to 0plogis(pred.logit(age=c(2,21,50), sex='male', pclass='3rd'))
A nomogram could be used to obtain predicted values manually, but this is not feasible when so many interaction terms are present.
J
12.8 Bayesian Analysis
Repeat the multiple imputation-based approach but using a Bayesian binary logistic model
Using default blrm function normal priors on regression coefficients with zero mean and large SD making the priors almost flat
blrm uses the rcmdstan and rstan packages that provides the full power of Stan to R
Here we use cmdstan with rcmdstan
rmsb has its own caching mechanism that efficiently stores the model fit object (and all its posterior draws) and reads it back from disk install of running it again, until one of the inputs change
Could use smaller prior SDs to get penalized estimates
Using 4 independent Markov chain Hamiltonion posterior sampling procedures each with 1000 burn-in iterations that are discarded, and 1000 “real” iterations for a total of 4000 posterior sample draws
Use the first 10 multiple imputations already developed above (object mi), running the Bayesian procedure separately for 10 completed datasets
Merely have to stack the posterior draws into one giant sample to account for imputation and get correct posterior distribution
K
Code
# Use all available CPU cores less 1. Each chain will be run on its# own core.require(rmsb)options(mc.cores=parallel::detectCores() -1, rmsb.backend='cmdstan')cmdstanr::set_cmdstan_path(cmdstan.loc)# cmdstan.loc is defined in ~/.Rprofile# 10 Bayesian analyses took 3m on 11 coresset.seed(21)bt <-stackMI(survived ~ sex * pclass *rcs(age, 4) +rcs(age, 4) * (sibsp + parch), blrm, mi, data=t3, n.impute=10, refresh=25,file='bt.rds')bt
Bayesian Logistic Model
Dirichlet Priors With Concentration Parameter 0.541 for Intercepts
```{r include=FALSE}require(Hmisc)require(qreport)options(qproject='rms', prType='html')getRs('qbookfun.r')hookaddcap()knitr::set_alias(w ='fig.width', h ='fig.height', cap ='fig.cap', scap ='fig.scap')```# Logistic Model Case Study: Survival of Titanic Passengers {#sec-titanic}`r mrg(sound("titanic-1"))`**Data source**: _The Titanic Passenger List_ edited byMichael A. Findlay, originally published in Eaton \& Haas (1994) _Titanic: Triumph and Tragedy_, Patrick Stephens Ltd, and expanded withthe help of the Internet community. The original `html` files wereobtained from [Philip Hind (1999)](http://atschool.eduweb.co.uk/phind). The dataset wascompiled and interpreted by Thomas Cason. It is available in `R` andspreadsheet formats from [hbiostat.org/data](https://hbiostat.org/data) under the name `titanic3`.## Descriptive Statistics```{r desc}require(rms)options(prType='html') # for print, summary, anovagetHdata(titanic3) # get dataset from web site# List of names of variables to analyzev <-c('pclass','survived','age','sex','sibsp','parch')t3 <- titanic3[, v]units(t3$age) <-'years'describe(t3)``````{r summary,cap='Univariable summaries of Titanic survival'}#| label: fig-titanic-summaryspar(ps=6,rt=3)dd <-datadist(t3)# describe distributions of variables to rmsoptions(datadist='dd')s <-summary(survived ~ age + sex + pclass +cut2(sibsp,0:3) +cut2(parch,0:3), data=t3)plot(s, main='', subtitles=FALSE)```Show 4-way relationships after collapsing levels. Suppress estimates `r ipacue()` based on $<25$ passengers.```{r dot,h=3,w=6,cap='Multi-way summary of Titanic survival'}#| label: fig-titanic-dotrequire(ggplot2)tn <-transform(t3,agec =ifelse(age <21, 'child', 'adult'),sibsp=ifelse(sibsp ==0, 'no sib/sp', 'sib/sp'),parch=ifelse(parch ==0, 'no par/child', 'par/child'))g <-function(y) if(length(y) <25) NAelsemean(y)s <-with(tn, summarize(survived,llist(agec, sex, pclass, sibsp, parch), g))# llist, summarize in Hmisc packageggplot(subset(s, agec !='NA'),aes(x=survived, y=pclass, shape=sex)) +geom_point() +facet_grid(agec ~ sibsp * parch) +xlab('Proportion Surviving') +ylab('Passenger Class') +scale_x_continuous(breaks=c(0, .5, 1))```## Exploring Trends with Nonparametric Regression`r mrg(sound("titanic-2"))````{r plsmoa,h=6.5,w=6.5,cap='Nonparametric regression (`loess`) estimates of the relationship between age and the probability of surviving the Titanic, with tick marks depicting the age distribution. The top left panel shows unstratified estimates of the probability of survival. Other panels show nonparametric estimates by various stratifications.',scap='Nonparametric regression for age, sex, class, and passenger survival'}#| label: fig-titanic-plsmoab <- scale_size_discrete(range=c(.1, .85))yl <- ylab(NULL)p1 <- ggplot(t3, aes(x=age, y=survived)) + histSpikeg(survived ~ age, lowess=TRUE, data=t3) + ylim(0,1) + ylp2 <- ggplot(t3, aes(x=age, y=survived, color=sex)) + histSpikeg(survived ~ age + sex, lowess=TRUE, data=t3) + ylim(0,1) + ylp3 <- ggplot(t3, aes(x=age, y=survived, size=pclass)) + histSpikeg(survived ~ age + pclass, lowess=TRUE, data=t3) + b + ylim(0,1) + ylp4 <- ggplot(t3, aes(x=age, y=survived, color=sex, size=pclass)) + histSpikeg(survived ~ age + sex + pclass, lowess=TRUE, data=t3) + b + ylim(0,1) + ylgridExtra::grid.arrange(p1, p2, p3, p4, ncol=2) # combine 4``````{r plsmob, w=5, h=2.5,cap='Relationship between age and survival stratified by the number of siblings or spouses on board (left panel) or by the number of parents or children of the passenger on board (right panel).',scap='Relationship between age and survival stratified by family size variables'}#| label: fig-titanic-plsmobtop <- theme(legend.position='top')p1 <- ggplot(t3, aes(x=age, y=survived, color=cut2(sibsp, 0:2))) + stat_plsmo() + b + ylim(0,1) + yl + top + scale_color_discrete(name='siblings/spouses')p2 <- ggplot(t3, aes(x=age, y=survived, color=cut2(parch, 0:2))) + stat_plsmo() + b + ylim(0,1) + yl + top + scale_color_discrete(name='parents/children')gridExtra::grid.arrange(p1, p2, ncol=2)```## Binary Logistic Model with Casewise Deletion of Missing Values {#sec-titanic-cd}`r mrg(sound("titanic-3"))`<!-- NEW -->* First fit a model that is saturated with respect to `age, sex, pclass`* Insufficient variation in `sibsp`, `parch` to fit complex interactions or nonlinearities.* With `age` appearing in so many terms, giving too many parameters to `age` creates instabilities and makes many bootstrap repetitions fail to converge or to yield singular covariance matrices* Use AIC to determine the global number of knots for `age` that is "best for the money" in terms of being the most likely to cross-validate well```{r anova3}for(k in3:5) { f <-lrm(survived ~ sex*pclass*rcs(age, k) +rcs(age, k)*(sibsp + parch), data=t3)cat('k=', k, ' AIC=', AIC(f), '\n')}```* 4 knots has best (lowest) AIC and we'll use that going forward* Refit that model with x=TRUE, y=TRUE so can do likelihood ratio (LR) tests* But start with Wald tests```{r}f1 <-lrm(survived ~ sex*pclass*rcs(age,4) +rcs(age,4)*(sibsp + parch), data=t3, x=TRUE, y=TRUE)print(f1, r2=1:4) # print all 4 R^2 measures that use only the global LR chi-squareanova(f1)```Compute the slightly more time-consuming LR tests```{r}af1 <-anova(f1, test='LR')print(af1, which='subscripts')```* In the RMS text, 5 knots were used for `age` and only Wald tests were performed* Large $p$-value for the 3rd order interaction was used to justify exclusion of these highest-order interactions from the model (and one other term)* More evidence for 3rd order interaction from the more accurate LR test* Keep this modelShow the many effects of predictors. `r ipacue()````{r plot1, h=3, w=6, cap='Effects of predictors on probability of survival of Titanic passengers, estimated for zero siblings/spouses and zero parents/children',scap='Effects of predictors on probability of surviving the Titanic'}#| label: fig-titanic-plotap <-Predict(f1, age, sex, pclass, sibsp=0, parch=0, fun=plogis)ggplot(p)``````{r plot2,cap='Effect of number of siblings and spouses on the log odds of surviving, for third class males',scap='Effect of number of siblings/spouses on survival'}#| label: fig-titanic-plotbggplot(Predict(f1, sibsp, age=c(10,15,20,50), conf.int=FALSE))#```Note that children having many siblings apparently had lowersurvival. Married adults had slightly higher survival than unmarriedones.`r ipacue()`**But** moderate problem with missing data must be dealt with## Examining Missing Data Patterns {#sec-titanic-naclus}`r mrg(sound("titanic-4"))````{r napatterns,h=5,w=7,cap='Patterns of missing data. Upper left panel shows the fraction of observations missing on each predictor. Lower panel depicts a hierarchical cluster analysis of missingness combinations. The similarity measure shown on the $Y$-axis is the fraction of observations for which both variables are missing. Right panel shows the result of recursive partitioning for predicting `is.na(age)`. The `rpart` function found only strong patterns according to passenger class.',scap='Patterns of missing Titanic data'}#| label: fig-titanic-napatternsspar(mfrow=c(2,2), top=1, ps=11)na.patterns <- naclus(titanic3)require(rpart) # Recursive partitioning packagewho.na <- rpart(is.na(age) ~ sex + pclass + survived + sibsp + parch, data=titanic3, minbucket=15)naplot(na.patterns, 'na per var')plot(who.na, margin=.1); text(who.na)plot(na.patterns)``````{r summary-na,w=4,h=4,cap='Univariable descriptions of proportion of passengers with missing age'}#| label: fig-titanic-summary-naspar(ps=7, rt=3)plot(summary(is.na(age) ~ sex + pclass + survived + sibsp + parch, data=t3))```But models almost always provide better descriptive statistics```{r nalrm}m <-lrm(is.na(age) ~ sex * pclass + survived + sibsp + parch,data=t3)manova(m)````pclass` and `parch` are the important predictors of missing age.## Single Conditional Mean ImputationSingle imputation is not the preferred approach here. Click below to see this section.::: {.callout-note collapse="true"}### Single Imputation and Analysis Result`r ipacue()`First try: conditional mean imputation <br>Default spline transformation for age caused distribution of imputed values to be much different from non-imputed ones; constrain to linear. Also force discrete numeric variables to be linear because knots are hard to determine for them.```{r transcan}xtrans <-transcan(~I(age) + sex + pclass +I(sibsp) +I(parch),imputed=TRUE, pl=FALSE, pr=FALSE, data=t3)summary(xtrans)# Look at mean imputed values by sex,pclass and observed means# age.i is age, filled in with conditional mean estimatesage.i <-with(t3, impute(xtrans, age, data=t3))i <-is.imputed(age.i)with(t3, tapply(age.i[i], list(sex[i],pclass[i]), mean))with(t3, tapply(age, list(sex,pclass), mean, na.rm=TRUE))``````{r fit.si}dd <-datadist(dd, age.i)f.si <-lrm(survived ~ sex * pclass *rcs(age.i, 4) +rcs(age.i, 4) * (sibsp + parch), data=t3, x=TRUE, y=TRUE)print(f.si, coefs=FALSE)``````{r h=6, w=6, cap='Predicted probability of survival for males from fit using casewise deletion (bottom) and single conditional mean imputation (top). `sibsp` is set to zero for these predicted values.',scap='Predicted log odds of survival in Titanic using casewise deletion'}#| label: fig-titanic-nasinglespar(ps=12)p1 <- Predict(f1, age, pclass, sex, sibsp=0, fun=plogis)p2 <- Predict(f.si, age.i, pclass, sex, sibsp=0, fun=plogis)p <- rbind('Casewise Deletion'=p1, 'Single Imputation'=p2, rename=c(age.i='age')) # creates .set. variableggplot(p, groups='sex', ylab='Probability of Surviving')anova(f.si, test='LR')```:::## Multiple Imputation`r mrg(sound("titanic-5"))`The following uses `aregImpute` with predictive mean matching. Bydefault, `aregImpute` does not transform `age` when it is beingpredicted from the other variables. Four knots are used to transform`age` when used to impute other variables (not needed here as noother missings were present). Since the fraction of observations withmissing age is $\frac{263}{1309} = 0.2$ we use 20 imputations. [Force `sibsp` and `parch` to be linear for imputation, because their highly discrete distributions make it difficult to choose knots for splines.]{.aside}```{r aregi}set.seed(17) # so can reproduce random aspectsmi <- aregImpute(~ age + sex + pclass + I(sibsp) + I(parch) + survived, data=t3, n.impute=20, nk=4, pr=FALSE)mi# Print the first 10 imputations for the first 10 passengers# having missing agemi$imputed$age[1:10, 1:10]```Show the distribution of imputed (black) and actual ages (gray). `r ipacue()````{r ageDist,w=3.5,h=3.25,cap='Distributions of imputed and actual ages for the Titanic dataset. Imputed values are in black and actual ages in gray.',scap='Distribution of imputed and actual ages'}#| label: fig-titanic-agedistplot(mi)Ecdf(t3$age, add=TRUE, col='gray', lwd=2,subtitles=FALSE)```* Fit logistic models for 20 completed datasets and print the ratio `r ipacue()` of imputation-corrected variances to average ordinary variances.* Use method of [Chan & Meng](missing#lrt) to get LR tests* This method takes final $\hat{\beta}$ from a single model fit on 20 stacked completed datasets* But standard errors come from the usual Rubin's rule and the 20 fits* `rms::processMI` computes the LR statistics from special information saved by `fit.mult.impute` triggered by `lrt=TRUE`* The `Hmisc` package `runifChanged` function is used to save the result and not spend 1m running it again until an input changes* The `rms``LRupdate` function is run to fix likelihood ratio-related statistics (LR test, its $p$-value, various $R^2$ measures) using the overall Chan & Meng model LR $\chi^2$ computed by `processMI`* Two of the $R^2$ printed use an effective sample size of 927 for the unbalanced binary `survived` variable```{r fmi}runmi <-function()fit.mult.impute(survived ~ sex * pclass *rcs(age, 4) +rcs(age, 4) * (sibsp + parch), lrm, mi, data=t3, pr=FALSE, lrt=TRUE) # lrt implies x=TRUE y=TRUE + moreseed <-17f.mi <-runifChanged(runmi, seed, mi, t3)afmi <-processMI(f.mi, 'anova')# Print imputation penalty indexesprmiInfo(afmi)```* None of the denominator d.f. is small enough for us to worry about the $\chi^2$ approximation* Take the ratio of selected LR statistics after multiple imputation to that from casewise deletion```{r}afmif.mi <-LRupdate(f.mi, afmi)print(f.mi, r2=1:4) # print all 4 imputation-adjusted R^2round(afmi[c(1,3,5,30), 'Chi-Square'] / af1[c(1,3,5,30), 'Chi-Square'], 3)````r ipacue()`* Using all available data resulted in increases in predictive information for `sex, pclass` and strangely a reduction for `age`For each completed dataset run bootstrap validation of model performance indexes and the nonparametric calibration curve. Because the 20 analyses of completed datasets help to average out some of the noise in bootstrap estimates we can use fewer bootstrap repetitions (100) than usual (300 or so).```{r fmival}val <-function(fit)list(validate =validate (fit, B=100),calibrate =calibrate(fit, B=100) )runmi <-function()fit.mult.impute( # 1m survived ~ sex * pclass *rcs(age,4) +rcs(age,4) * (sibsp + parch), lrm, mi, data=t3, pr=FALSE,fun=val, fitargs=list(x=TRUE, y=TRUE))seed <-19f <-runifChanged(runmi, seed, mi, t3, val)````r mrg(sound("titanic-6"))`* Display the 20 bootstrap internal validations averaged over the multiple imputations.* Show the 20 individual calibration curves then the first 3 in more detail followed by the overall calibration estimate```{r}val <-processMI(f, 'validate')print(val, digits=3)``````{r}#| label: fig-titanic-calibrate#| fig.cap: "Estimated calibration curves for the Titanic risk model, accounting for multiple imputation"#| fig.width: 8.5#| fig.height: 6#| column: screen-inset-rightspar(mfrow=c(2,2), top=1, bot=2)cal <-processMI(f, 'calibrate', nind=3)# plot(cal) for full-size final calibration curve```Return to the stacked fit and compare it to the fit from single imputation```{r namult, h=6, w=6, cap='Predicted probability of survival for males from fit using single conditional mean imputation again (top) and multiple random draw imputation (bottom). Both sets of predictions are for `sibsp`=0.',scap='Predicted Titanic survival using multiple imputation'}#| label: fig-titanic-namultp1 <- Predict(f.si, age.i, pclass, sex, sibsp=0, fun=plogis)p2 <- Predict(f.mi, age, pclass, sex, sibsp=0, fun=plogis)p <- rbind('Single Imputation'=p1, 'Multiple Imputation'=p2, rename=c(age.i='age'))ggplot(p, groups='sex', ylab='Probability of Surviving')```## Summarizing the Fitted ModelShow odds ratios for changes in predictor values `r ipacue()````{r ors, w=5, h=3, cap='Odds ratios for some predictor settings'}#| label: fig-titanic-orsspar(bot=1, top=0.5, ps=8)# Get predicted values for certain types of passengerss <- summary(f.mi, age=c(1,30), sibsp=0:1)# override default ranges for 3 variablesplot(s, log=TRUE, main='')``````{r phat}phat <-predict(f.mi, combos <-expand.grid(age=c(2,21,50),sex=levels(t3$sex),pclass=levels(t3$pclass),sibsp=0, parch=0), type='fitted')# Can also use Predict(f.mi, age=c(2,21,50), sex, pclass,# sibsp=0, fun=plogis)$yhatoptions(digits=1)data.frame(combos, phat)options(digits=5)```We can also get predicted values by creating an R function that will `r ipacue()`evaluate the model on demand, but that only works if there are no 3rd-order interactions.```{r pred.logit,eval=FALSE}pred.logit <-Function(f.mi)# Note: if don't define sibsp to pred.logit, defaults to 0plogis(pred.logit(age=c(2,21,50), sex='male', pclass='3rd'))```A nomogram could be used to obtain predicted values manually, but this `r ipacue()`is not feasible when so many interaction terms are present.## Bayesian Analysis* Repeat the multiple imputation-based approach but using a `r ipacue()` Bayesian binary logistic model* Using default `blrm` function normal priors on regression coefficients with zero mean and large SD making the priors almost flat* `blrm` uses the `rcmdstan` and `rstan` packages that provides the full power of Stan to `R`* Here we use `cmdstan` with `rcmdstan`* `rmsb` has its own caching mechanism that efficiently stores the model fit object (and all its posterior draws) and reads it back from disk install of running it again, until one of the inputs change* See [this](https://hbiostat.org/R/rmsb) for more about the `rmsb` package* Could use smaller prior SDs to get penalized estimates* Using 4 independent Markov chain Hamiltonion posterior sampling procedures each with 1000 burn-in iterations that are discarded, and 1000 "real" iterations for a total of 4000 posterior sample draws* Use the first 10 multiple imputations already developed above (object `mi`), running the Bayesian procedure separately for 10 completed datasets* Merely have to stack the posterior draws into one giant sample to account for imputation and get correct posterior distribution```{r blrm}#| column: page-inset-right# Use all available CPU cores less 1. Each chain will be run on its# own core.require(rmsb)options(mc.cores=parallel::detectCores() -1, rmsb.backend='cmdstan')cmdstanr::set_cmdstan_path(cmdstan.loc)# cmdstan.loc is defined in ~/.Rprofile# 10 Bayesian analyses took 3m on 11 coresset.seed(21)bt <-stackMI(survived ~ sex * pclass *rcs(age, 4) +rcs(age, 4) * (sibsp + parch), blrm, mi, data=t3, n.impute=10, refresh=25,file='bt.rds')bt```* Note that fit indexes have HPD uncertainty intervals `r ipacue()`* Everthing above accounts for imputation* Look at diagnostics::: {.callout-note collapse="true"}### Separate Diagnostics for Each of 10 Imputed Datasets```{r}stanDx(bt)```:::```{r blrmdx,w=8,h=15}#| column: screen-right# Look at convergence of only 2 parametersstanDxplot(bt, c('sex=male', 'pclass=3rd', 'age'), rev=TRUE)```* Difficult to see but there are 40 traces (10 imputations `r ipacue()` $\times$ 4 chains)* Diagnostics look good; posterior samples can be trusted* Plot posterior densities for select parameters* Also shows the 10 densities before stacking```{r btdens,w=7,h=5.5}plot(bt, c('sex=male', 'pclass=3rd', 'age'), nrow=2)```* Plot partial effect plots with 0.95 highest posterior density intervals `r ipacue()````{r btpe,w=6,h=4.5}p <-Predict(bt, age, sex, pclass, sibsp=0, fun=plogis, funint=FALSE)ggplot(p)```* Compute approximate measure of explained outcome variation for predictors `r ipacue()````{r bev,w=5.25,h=3.75}plot(anova(bt))```* Contrast second class males and females, both at 5 years and 30 `r ipacue()` years of age, all other things being equal* Compute 0.95 HPD interval for the contrast and a joint uncertainty region* Compute P(both contrasts < 0), both < -2, and P(either one < 0)```{r brcon}k <-contrast(bt, list(sex='male', age=c(5, 30), pclass='2nd'),list(sex='female', age=c(5, 30), pclass='2nd'),cnames =c('age 5 M-F', 'age 30 M-F'))kplot(k)``````{r brcon2,fig.show='hold',out.width='3.5in'}plot(k, bivar=TRUE) # assumes an ellipseplot(k, bivar=TRUE, bivarmethod='kernel') # doesn'tP <-PostF(k, pr=TRUE)P(`age 5 M-F`<0&`age 30 M-F`<0) # note backticksP(`age 5 M-F`<-2&`age 30 M-F`<-2)P(`age 5 M-F`<0|`age 30 M-F`<0)```* Show posterior distribution of predicted survival probability for a 21 year old male in third class with `sibsp=0`* `Predict` summarizes with a posterior mean (set `posterior.summary='median'` to use posterior median)* Frequentist multiple imputation estimate was 0.1342```{r}#| fig.height: 3.5#| fig.width: 4.75pmean <-Predict(bt, age=21, sex='male', pclass='3rd', sibsp=0, parch=0,fun=plogis, funint=FALSE)pmeanp <-predict(bt,data.frame(age=21, sex='male', pclass='3rd', sibsp=0, parch=0),posterior.summary='all', fun=plogis, funint=FALSE)plot(density(p), main='',xlab='Pr(survival) For One Covariate Combination')abline(v=with(pmean, c(yhat, lower, upper)), col=alpha('blue', 0.5))```* Compute Pr(survival probability > 0.2) for this man```{r}mean(p >0.2)```| Package | Purpose | Functions ||-----|-----|-----||`Hmisc`| Miscellaneous functions |`summary,plsmo,naclus,llist,latex, summarize,Dotplot,describe`||`Hmisc`| Imputation |`transcan,impute,fit.mult.impute,aregImpute,stackMI`||`rms`| Modeling |`datadist,lrm,rcs`||| Accounting for imputation |`processMI, LRupdate`||| Model presentation |`plot,summary,nomogram,Function,anova`||| Estimation |`Predict,summary,contrast`||| Model validation |`validate,calibrate`||`rmsb`| Misc. Bayesian |`blrm`, `stanDx`,`stanDxplot`,`plot`||`rpart`^[Written by Atkinson and Therneau]| Recursive partitioning |`rpart`|: `R` software used```{r echo=FALSE}saveCap('12')```