Bayesian Titanic Data Analysis

require(rms)
getHdata(titanic3)

# List variables to be analyzed
v <- c('pclass', 'survived', 'age', 'sex', 'sibsp', 'parch')
t3 <- titanic3[, v]
units(t3$age) <- 'years'

html(describe(t3))

t3

6 Variables 1309 Observations

pclass

n	missing	distinct
1309	0	3

 Value        1st   2nd   3rd
 Frequency    323   277   709
 Proportion 0.247 0.212 0.542

survived: Survived

n	missing	distinct	Info	Sum	Mean	Gmd
1309	0	2	0.708	500	0.382	0.4725

age: Age years

n	missing	distinct	Info	Mean	Gmd	.05	.10	.25	.50	.75	.90	.95
1046	263	98	0.999	29.88	16.06	5	14	21	28	39	50	57

lowest : 0.1667 0.3333 0.4167 0.6667 0.7500 , highest: 70.5000 71.0000 74.0000 76.0000 80.0000

sex

n	missing	distinct
1309	0	2

 Value      female   male
 Frequency     466    843
 Proportion  0.356  0.644

sibsp: Number of Siblings/Spouses Aboard

n	missing	distinct	Info	Mean	Gmd
1309	0	7	0.67	0.4989	0.777

 Value          0     1     2     3     4     5     8
 Frequency    891   319    42    20    22     6     9
 Proportion 0.681 0.244 0.032 0.015 0.017 0.005 0.007

parch: Number of Parents/Children Aboard

n	missing	distinct	Info	Mean	Gmd
1309	0	8	0.549	0.385	0.6375

 Value          0     1     2     3     4     5     6     9
 Frequency   1002   170   113     8     6     6     2     2
 Proportion 0.765 0.130 0.086 0.006 0.005 0.005 0.002 0.002

Casewise deletion of missing values

The models are implemented using the brms (Bayesian Regression Models using ‘Stan’) package which is a wrapper for the more general probabilistic programming language Stan and its R implementation rstan. The syntax of the main brms function brm() uses R formula notation is similar to other regression functions such as glm().

The code below produces a Bayesian logistic model for the binary survival outcome with a linear term for age and indicators for sex and passenger class (pclass). For simplicity the siblings/spouse/parent variables are not used. Notice that the brm() function produces a warning that ‘Rows containing NAs were excluded from the model’ which indicates casewise deletion was used.

require(brms)

# Recommended rstan options
#For execution on a local, multicore CPU with excess RAM
options(mc.cores = parallel::detectCores())

fit1 <- brm(survived ~ age + sex + pclass, 
            family = bernoulli(), data=t3, 
            chains = 2, iter = 2000, refresh = 0)

## Warning: Rows containing NAs were excluded from the model.

## Compiling the C++ model

## Start sampling

The summary and plot don’t show any major convergence issues. Using prior_summary() we can see that a (weakly informative) student-t prior was used for the Intercept and (improper) flat priors were used for the other coefficients.

# summarize posterior
summary(fit1)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: survived ~ age + sex + pclass 
##    Data: t3 (Number of observations: 1046) 
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept     3.55      0.32     2.93     4.18       1445 1.00
## age          -0.03      0.01    -0.05    -0.02       1794 1.00
## sexmale      -2.52      0.16    -2.86    -2.20       2059 1.00
## pclass2nd    -1.29      0.22    -1.72    -0.87       1679 1.00
## pclass3rd    -2.31      0.22    -2.75    -1.89       1470 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

# posterior densities and MCMC traces
plot(fit1)

# priors 
prior_summary(fit1)

##                 prior     class      coef group resp dpar nlpar bound
## 1                             b                                      
## 2                             b       age                            
## 3                             b pclass2nd                            
## 4                             b pclass3rd                            
## 5                             b   sexmale                            
## 6 student_t(3, 0, 10) Intercept

We can also add non-linear terms for age using natural splines and all two-way interactions between sex, pclass, and the splines for age.

require(splines)

## Loading required package: splines

# fit model with natural splines
fit3 <- brm(survived ~ (sex + pclass + ns(age,df=4))^2, 
          family = bernoulli(), data=t3, 
          chains = 2, iter=2000, refresh = 0)

## Warning: Rows containing NAs were excluded from the model.

## Compiling the C++ model

## recompiling to avoid crashing R session

## Start sampling

The marginal effects plot shows the predicted median survival by age, sex, and class along with 95% credible intervals.

summary(fit3)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: survived ~ (sex + pclass + ns(age, df = 4))^2 
##    Data: t3 (Number of observations: 1046) 
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept                 4.13      2.11     0.60     8.98        481 1.00
## sexmale                  -2.11      0.97    -4.03    -0.19        864 1.00
## pclass2nd                 9.65      5.19     1.49    21.43        667 1.00
## pclass3rd                -4.10      2.07    -8.77    -0.57        559 1.00
## nsagedfEQ41              -1.21      2.00    -5.83     2.13        594 1.00
## nsagedfEQ42               0.09      1.81    -3.65     3.45        716 1.00
## nsagedfEQ43              -0.94      4.41   -11.11     6.74        561 1.00
## nsagedfEQ44              -2.18      3.06    -7.67     4.46        913 1.00
## sexmale:pclass2nd        -0.84      0.74    -2.30     0.60       1506 1.00
## sexmale:pclass3rd         2.20      0.65     1.00     3.56       1200 1.00
## sexmale:nsagedfEQ41      -0.85      0.83    -2.44     0.88       1134 1.00
## sexmale:nsagedfEQ42      -2.07      1.41    -4.81     0.79       1049 1.00
## sexmale:nsagedfEQ43      -5.12      2.23    -9.52    -1.02       1004 1.00
## sexmale:nsagedfEQ44      -0.74      3.13    -7.28     4.87        986 1.00
## pclass2nd:nsagedfEQ41   -10.25      4.93   -21.29    -2.51        678 1.00
## pclass3rd:nsagedfEQ41     1.06      1.94    -2.20     5.47        635 1.00
## pclass2nd:nsagedfEQ42    -6.76      3.56   -14.32    -0.55        747 1.00
## pclass3rd:nsagedfEQ42    -1.33      1.66    -4.44     1.99        723 1.00
## pclass2nd:nsagedfEQ43   -23.61     10.48   -47.37    -7.18        669 1.00
## pclass3rd:nsagedfEQ43     0.15      4.47    -7.99     9.72        661 1.00
## pclass2nd:nsagedfEQ44    -6.20      4.92   -16.17     2.95       1560 1.00
## pclass3rd:nsagedfEQ44     1.49      3.55    -6.00     7.60       1360 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

# similar to Fig 12.5
marginal_effects(fit3, effects ='age:sex', robust = TRUE,
                 conditions = data.frame(pclass=c('1st','2nd','3rd')))

The Bayesian model predictions from brm() and the frequentist model predictions from lrm() are nearly identical.

dd <- datadist(t3)
options(datadist='dd')

fit3_lrm <- lrm(survived ~ (sex + pclass + rcs(age,5))^2, data=t3)

plt_lrm <- Predict(fit3_lrm, age, sex, pclass,fun=plogis)

ggplot(plt_lrm)

The Bayesian model can also be used to get the posterior survival probability distribution for an individual passenger with a given set of covariates. In the plots below the posterior survival probabilities for 3 types of passenger are shown. The red line is the point estimate from the frequentist model.

combos <- expand.grid(age=c(2,21,50), sex=levels(t3$sex), 
                      pclass=levels(t3$pclass))

phat_brm <- fitted(fit3, newdata = combos, summary=FALSE)
phat_lrm <- predict(fit3_lrm, combos, type="fitted")

op<-par(mfrow=c(3,1)) 

# 2yr, female, 1st class
hist(phat_brm[,1], xlab="Pr(survive)",
     main="Predicted survival for 2yr, female, 1st class") 
abline(v=phat_lrm[1], col="red")

# 21yr, male, 2nd class
hist(phat_brm[,11], xlab="Pr(survive)",
     main="Predicted survival for 21yr, male, 2nd class") 
abline(v=phat_lrm[11], col="red")

# 50yr, female, 3rd class
hist(phat_brm[,15], xlab="Pr(survive)", 
     main="Predicted survival for 50yr, female, 3rd class") 
abline(v=phat_lrm[15], col="red")

par(op)

Imputation during model fitting

Rather than performing casewise deletion, we can impute the values of missing age during the model fitting. The overall model includes two sub-models, one for the survival outcome and one for the missing ages. To start, we specify a model for missing age with only an intercept term. This is similar to replacing all the missing ages with the mean of the observed ages.

# outcome model
bf_outcome <- bf(survived ~ mi(age) + sex + pclass) + bernoulli()

# imputation model for age with just intercept
bf_imp <- bf(age|mi() ~ 1) + gaussian() 

# set_rescor(FALSE) --- no residual correlation between multivariate responses vars (age and survived)
fit4 <- brm(bf_outcome + bf_imp + set_rescor(FALSE), data=t3, 
            chains=2, iter=2000, refresh = 0)

## Compiling the C++ model

## Start sampling

The summary for this model looks similar to previous models, but includes coefficients for both the survival outcome model and the age imputation model.

summary(fit4)

##  Family: MV(bernoulli, gaussian) 
##   Links: mu = logit
##          mu = identity; sigma = identity 
## Formula: survived ~ mi(age) + sex + pclass 
##          age | mi() ~ 1 
##    Data: t3 (Number of observations: 1309) 
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                    Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## survived_Intercept     3.42      0.31     2.80     4.05        981 1.00
## age_Intercept         29.92      0.44    29.08    30.79       2253 1.00
## survived_sexmale      -2.53      0.15    -2.82    -2.23       2258 1.00
## survived_pclass2nd    -1.22      0.21    -1.65    -0.80       1305 1.00
## survived_pclass3rd    -2.20      0.20    -2.61    -1.81       1264 1.00
## survived_miage        -0.03      0.01    -0.05    -0.02       1065 1.00
## 
## Family Specific Parameters: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_age    14.42      0.32    13.81    15.06       1709 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

We can specify a more complex imputation for age by modifying the second sub-model.

# imputation model for age with all main effects and 2-way interactions
bf_imp2 <- bf(age|mi() ~ (sex+pclass+survived)^2) + gaussian()

fit5 <- brm(bf_outcome + bf_imp2 + set_rescor(FALSE), data=t3, 
            chains=2, iter=2000, refresh = 0)

## Compiling the C++ model

## Start sampling

The outcome model includes linear terms for age, sex, and pclass while the imputation model for age includes main effects and two-way interactions for sex, pclass, and survival status.

summary(fit5)

##  Family: MV(bernoulli, gaussian) 
##   Links: mu = logit
##          mu = identity; sigma = identity 
## Formula: survived ~ mi(age) + sex + pclass 
##          age | mi() ~ (sex + pclass + survived)^2 
##    Data: t3 (Number of observations: 1309) 
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                        Estimate Est.Error l-95% CI u-95% CI Eff.Sample
## survived_Intercept         3.85      0.33     3.21     4.54       1569
## age_Intercept             40.95      2.88    35.14    46.43        841
## survived_sexmale          -2.52      0.15    -2.83    -2.22       3944
## survived_pclass2nd        -1.38      0.21    -1.80    -0.97       1909
## survived_pclass3rd        -2.46      0.22    -2.90    -2.06       1825
## age_sexmale                2.95      2.74    -2.28     8.53        885
## age_pclass2nd             -3.74      3.33   -10.05     2.97        972
## age_pclass3rd            -16.84      2.86   -22.42   -11.22        892
## age_survived              -4.11      2.78    -9.47     1.42        905
## age_sexmale:pclass2nd     -7.23      3.10   -13.43    -1.37       1147
## age_sexmale:pclass3rd      0.12      2.63    -5.07     5.22        977
## age_sexmale:survived      -3.64      2.21    -8.10     0.68       1701
## age_pclass2nd:survived    -6.78      3.11   -12.74    -0.82       1139
## age_pclass3rd:survived     0.48      2.53    -4.59     5.26       1124
## survived_miage            -0.04      0.01    -0.06    -0.03       1509
##                        Rhat
## survived_Intercept     1.00
## age_Intercept          1.00
## survived_sexmale       1.00
## survived_pclass2nd     1.00
## survived_pclass3rd     1.00
## age_sexmale            1.00
## age_pclass2nd          1.00
## age_pclass3rd          1.00
## age_survived           1.00
## age_sexmale:pclass2nd  1.00
## age_sexmale:pclass3rd  1.00
## age_sexmale:survived   1.00
## age_pclass2nd:survived 1.00
## age_pclass3rd:survived 1.00
## survived_miage         1.00
## 
## Family Specific Parameters: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_age    12.77      0.29    12.21    13.35       2402 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The distribution of the imputed ages looks reasonable compared to the observed ages given that the imputation model used only 3 variables.

# compare imputed and observed ages
imp_age<-fitted(fit5,newdata=t3[is.na(t3$age),])[,,'age']
obs_age<-t3[!is.na(t3$age),'age']

ages <- data.frame(age = c(imp_age[,1],obs_age),
                 imputed = factor(c(rep(1,nrow(imp_age)),
                                    rep(0,length(obs_age))),
                                   labels=c("no","yes")))

ggplot(data=ages,aes(x=imputed,y=age))+geom_boxplot()

Unfortunately the brms formula syntax doesn’t currently allow imputed variables to have non-linear effects in the main outcome model (survived ~ ns(mi(age),df=4) + sex + pclass won’t work). One possible solution is to examine and modify the underlying Stan code. Running stancode(fit3) will show the Stan code underlying the brms model for fit3 (see ?rstan for more details on using Stan directly in R). This option can be tricky because the R function used to define the splines needs to be rewritten as a function within Stan (https://mc-stan.org/users/documentation/case-studies/splines_in_stan.html).

Imputation before model fitting

The second possible solution is to separate the imputation from the model by first performing traditional multiple imputation and then combining the results with brm_multiple(). Posterior draws from all the imputation models are ‘stacked’ and inference is performed using this combined posterior dataset. The downside with this approach is that the model needs to be re-fit for each imputation dataset which can be computational intensive with a complex model and many imputations. Below, we use aregImpute to get 10 datasets with imputed values for the missing age variable.

# number of imputation datasets
n_imp <- 10
aregimp <- aregImpute(~ age + sex + pclass + survived, 
                 data=t3, n.impute=n_imp, nk=4, pr=FALSE)

# format imputation datasets as list of data.frames
imputed <- lapply(1:n_imp, function(x) impute.transcan(aregimp, imputation=x, data=t3, 
                           list.out=TRUE, pr=FALSE, check=FALSE))

mk_df <- function(x){
  data.frame(age=imputed[[x]][1],sex=imputed[[x]][2],
           pclass=imputed[[x]][3],survived=imputed[[x]][4])
}

imp_lst <- lapply(1:n_imp, mk_df)

fit_imp1 <- brm_multiple(survived ~ (sex + pclass + ns(age,df=4))^2, 
          family = bernoulli(), data = imp_lst, 
          chains = 2, iter = 2000, refresh = 0)

## Compiling the C++ model

## recompiling to avoid crashing R session

## Start sampling

## the number of chains is less than 1; sampling not done

## Fitting imputed model 1

## Start sampling

## Fitting imputed model 2

## Start sampling

## Fitting imputed model 3

## Start sampling

## Fitting imputed model 4

## Start sampling

## Fitting imputed model 5

## Start sampling

## Fitting imputed model 6

## Start sampling

## Fitting imputed model 7

## Start sampling

## Fitting imputed model 8

## Start sampling

## Fitting imputed model 9

## Start sampling

## Fitting imputed model 10

## Start sampling

The results of the combined multiple imputations and the casewise deletion model are similar. Note that the combined model may issue false positive convergence warnings (see ?brm_multiple)

Casewise deletion model

summary(fit3)

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: survived ~ (sex + pclass + ns(age, df = 4))^2 
##    Data: t3 (Number of observations: 1046) 
## Samples: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 2000
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept                 4.13      2.11     0.60     8.98        481 1.00
## sexmale                  -2.11      0.97    -4.03    -0.19        864 1.00
## pclass2nd                 9.65      5.19     1.49    21.43        667 1.00
## pclass3rd                -4.10      2.07    -8.77    -0.57        559 1.00
## nsagedfEQ41              -1.21      2.00    -5.83     2.13        594 1.00
## nsagedfEQ42               0.09      1.81    -3.65     3.45        716 1.00
## nsagedfEQ43              -0.94      4.41   -11.11     6.74        561 1.00
## nsagedfEQ44              -2.18      3.06    -7.67     4.46        913 1.00
## sexmale:pclass2nd        -0.84      0.74    -2.30     0.60       1506 1.00
## sexmale:pclass3rd         2.20      0.65     1.00     3.56       1200 1.00
## sexmale:nsagedfEQ41      -0.85      0.83    -2.44     0.88       1134 1.00
## sexmale:nsagedfEQ42      -2.07      1.41    -4.81     0.79       1049 1.00
## sexmale:nsagedfEQ43      -5.12      2.23    -9.52    -1.02       1004 1.00
## sexmale:nsagedfEQ44      -0.74      3.13    -7.28     4.87        986 1.00
## pclass2nd:nsagedfEQ41   -10.25      4.93   -21.29    -2.51        678 1.00
## pclass3rd:nsagedfEQ41     1.06      1.94    -2.20     5.47        635 1.00
## pclass2nd:nsagedfEQ42    -6.76      3.56   -14.32    -0.55        747 1.00
## pclass3rd:nsagedfEQ42    -1.33      1.66    -4.44     1.99        723 1.00
## pclass2nd:nsagedfEQ43   -23.61     10.48   -47.37    -7.18        669 1.00
## pclass3rd:nsagedfEQ43     0.15      4.47    -7.99     9.72        661 1.00
## pclass2nd:nsagedfEQ44    -6.20      4.92   -16.17     2.95       1560 1.00
## pclass3rd:nsagedfEQ44     1.49      3.55    -6.00     7.60       1360 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

marginal_effects(fit3, effects ='age:sex', robust = TRUE,
                 conditions = data.frame(pclass=c('1st','2nd','3rd')))

Multiple imputation model

summary(fit_imp1)

## Warning: The model has not converged (some Rhats are > 1.1). Do not analyse the results! 
## We recommend running more iterations and/or setting stronger priors.

##  Family: bernoulli 
##   Links: mu = logit 
## Formula: survived ~ (sex + pclass + ns(age, df = 4))^2 
##    Data: imp_lst (Number of observations: 1309) 
## Samples: 20 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 20000
## 
## Population-Level Effects: 
##                       Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept                 4.79      2.21     1.02     9.69        110 1.07
## sexmale                  -2.86      1.10    -5.26    -0.91         26 1.28
## pclass2nd                 9.19      5.38     0.51    21.56       4143 1.01
## pclass3rd                -4.68      2.16    -9.51    -0.99        121 1.06
## nsagedfEQ41              -1.83      2.14    -6.59     1.82         87 1.08
## nsagedfEQ42              -0.11      1.75    -3.68     3.23        111 1.06
## nsagedfEQ43              -2.08      4.71   -12.45     6.22        120 1.06
## nsagedfEQ44              -2.58      2.96    -7.94     3.85        120 1.06
## sexmale:pclass2nd        -0.61      0.71    -1.98     0.82      11106 1.01
## sexmale:pclass3rd         2.13      0.62     0.99     3.43        488 1.02
## sexmale:nsagedfEQ41      -0.19      1.01    -1.95     1.95         20 1.42
## sexmale:nsagedfEQ42      -1.83      1.29    -4.35     0.68         92 1.07
## sexmale:nsagedfEQ43      -3.78      2.45    -8.35     1.51         27 1.28
## sexmale:nsagedfEQ44      -0.41      2.92    -6.64     4.87        115 1.06
## pclass2nd:nsagedfEQ41   -10.09      5.13   -21.86    -1.83       4531 1.01
## pclass3rd:nsagedfEQ41     1.57      2.11    -2.05     6.28         70 1.10
## pclass2nd:nsagedfEQ42    -6.36      3.61   -14.30    -0.19       4290 1.01
## pclass3rd:nsagedfEQ42    -0.86      1.74    -4.20     2.67         64 1.10
## pclass2nd:nsagedfEQ43   -22.88     10.94   -47.61    -5.06       3434 1.01
## pclass3rd:nsagedfEQ43     1.21      4.58    -7.02    11.37        226 1.04
## pclass2nd:nsagedfEQ44    -6.52      5.00   -16.97     2.50       2189 1.01
## pclass3rd:nsagedfEQ44     1.30      3.28    -5.91     7.03        450 1.02
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

marginal_effects(fit_imp1, effects ='age:sex', robust = TRUE,
                 conditions = data.frame(pclass=c('1st','2nd','3rd')))