6  Comparing Two Proportions

6.1 Overview

  • Compare dichotomous independent variable with a dichotomous outcome
    • Independent variables: Exposed/Not, Treatment/Control, Knockout/Wild Type, etc.
    • Outcome (dependent) variables: Diseased/Not or any Yes/No outcome
  • Continuous outcomes often dichotomized for analysis (bad idea)
    • Consider \(t\)-tests (Chapter 5) or Non-parametric methods (Chapter 7)

6.2 Normal-Approximation Test

  • Two independent samples
Sample 1 Sample 2
Sample size \(n_1\) \(n_2\)
Population probability of event \(p_1\) \(p_2\)
Sample probability of event \(\hat{p}_1\) \(\hat{p}_2\)
  • Null Hypothesis, \(H_{0}: p_{1}=p_{2}=p\)
  • Estimating the variance
    • Variance of \(\hat{p}_{i} = p_{i}(1-p_{i})/n_{i}\) for \(i = 1, 2\)
    • Variance of \(\left(\hat{p}_1 - \hat{p}_2\right)\) is the sum of the variances, which under \(H_{0}\) is \[ p(1-p)[\frac{1}{n_{1}}+\frac{1}{n_{2}}] \]
    • We estimate this variance by plugging \(\hat{p}\) into \(p\), where \[ \hat{p} = \frac{n_{1}\hat{p}_{1}+n_{2}\hat{p}_{2}}{n_{1}+n_{2}} \] is the pooled estimate of the probability under \(H_{0}: p_1 = p_2 = p\)
  • Test statistic which has approximately a normal distribution under \(H_{0}\) if \(n_{i}\hat{p}_{i}\) are each large enough: \[ z = \frac{\hat{p}_{1} - \hat{p}_{2}}{\sqrt{\hat{p}(1-\hat{p})[\frac{1}{n_{1}}+\frac{1}{n_{2}}]}} \]
  • To test \(H_{0}\) we see how likely it is to obtain a \(z\) value as far or farther out in the tails of the normal distribution than \(z\) is
  • We don’t recommend using the continuity correction
  • Example:
    Test whether the population of women whose age at first birth \(\leq 29\) has the same probability of breast cancer as women whose age at first birth was \(\geq 30\). This dichotomization is highly arbitrary and we should really be testing for an association between age and cancer incidence, treating age as a continuous variable.
  • Case-control study (independent and dependent variables interchanged); \(p_{1}=\) probability of age at first birth \(\geq 30\), etc.
with Cancer without Cancer
Total # of subjects \(3220~~(n_1)\) \(10245~~(n_2)\)
# age \(\geq 30\) \(683\) \(1498\)
Sample probabilities \(0.212~~(\hat{p}_1\)) \(0.146~~(\hat{p}_2\))
Pooled probability \(\frac{683 + 1498}{3220 + 10245} = 0.162\)
  • Estimate the variance
    • \(\mathrm{variance}(\hat{p}_{1}-\hat{p}_{2}) = \hat{p}(1-\hat{p})\times \left[\frac{1}{n_{1}}+\frac{1}{n_{2}}\right] = 5.54 \times 10^{-5}\)
    • \(SE = \sqrt{\mathrm{variance}} = 0.00744\)
  • Test statistic
    • \(z = \frac{0.212-0.146}{0.00744} = 8.85\)
  • 2-tailed \(P\)-value is \(< 10^{-4}\)
n1 <- 3220;     n2 <- 10245
p1 <- 683 / n1; p2 <- 1498 / n2
pp <- (n1 * p1 + n2 * p2) / (n1 + n2)
se <- sqrt(pp * (1 - pp) * (1 / n1 + 1 / n2))
z  <- (p1 - p2) / se
pval <- 2 * (1 - pnorm(abs(z)))
round(c(p1=p1, p2=p2, pooled=pp, se=se, z=z, pval=pval), 4)
    p1     p2 pooled     se      z   pval 
0.2121 0.1462 0.1620 0.0074 8.8527 0.0000 
  • We do not use a \(t\)-distribution because there is no \(\sigma\) to estimate (and hence no “denominator d.f.” to subtract)

6.3 \(\chi^2\) Test

  • If \(z\) has a normal distribution, \(z^2\) has a \(\chi^2\) distribution with 1 d.f. (are testing a single difference against zero)
  • The data we just tested can be shown as a \(2\times 2\) contingency table
Cancer + Cancer -
Age \(\leq 29\) 2537 8747
Age \(\geq 30\) 683 1498
3220 10245
  • In general, the \(\chi^2\) test statistic is given by \[\sum_{ij} \frac{(\mathrm{Obs}_{ij}-\mathrm{Exp}_{ij})^2}{\mathrm{Exp}_{ij}}\]
  • \(\mathrm{Obs}_{ij}\) is the observed cell frequency for row \(i\) column \(j\)
  • \(\mathrm{Exp}_{ij}\) is the expected cell frequency for row \(i\) column \(j\)
    • Expected cell frequencies calculating assuming \(H_0\) is true
    • \(\mathrm{Exp}_{ij} = \frac{\mathrm{row }~i~\mathrm{ total} \times \mathrm{column}~j~\mathrm{total}}{\mathrm{grand total}}\)
    • e.g. \(\mathrm{Exp}_{11} = \frac{11284 \times 3220}{13465} = 2698.4\)
  • For \(2 \times 2\) tables, if the observed cell frequencies are labeled
\(a\) \(b\)
\(c\) \(d\)

the \(\chi^{2}\) test statistic simplifies to \[ \frac{N[ad - bc]^{2}}{(a+c)(b+d)(a+b)(c+d)}, \] where \(N=a+b+c+d\). Here we get \(\chi^{2}_{1} = 78.37 = z^{2}\) from above.

x <- matrix(c(2537, 8747, 683, 1498), nrow=2, byrow=TRUE)
     [,1] [,2]
[1,] 2537 8747
[2,]  683 1498
chisq.test(x, correct=FALSE)

    Pearson's Chi-squared test

data:  x
X-squared = 78.37, df = 1, p-value < 2.2e-16
# Also compute more accurate P-value based on 1M Monte-Carlo simulations
chisq.test(x, correct=FALSE, simulate.p.value=TRUE, B=1e6)

    Pearson's Chi-squared test with simulated p-value (based on 1e+06

data:  x
X-squared = 78.37, df = NA, p-value = 1e-06
  • Don’t need Yates’ continuity correction
  • Note that even though we are doing a 2-tailed test we use only the right tail of the \(\chi^{2}_{1}\) distribution; that’s because we have squared the difference when computing the statistic, so the sign is lost.
  • This is the ordinary Pearson \(\chi^2\) test

6.4 Fisher’s Exact Test

  • Is a misnomer in the sense that it computes probabilities exactly, with no normal approximation, but only after changing what is being tested to condition on the number of events and non-events
  • Because frequencies are discrete and because of the conditioning, the test is conservative (\(P\)-values too large)
  • Is exact only in the sense that actual type I error probability will not exceed the nominal level
  • The ordinary Pearson \(\chi^2\) works fine (even when an expected cell frequency is as low as 1.0, contrary to popular belief)
  • We don’t use Yates’ continuity correction because it was developed to make the normal approximation test yield \(P\)-values that are more similar to Fisher’s test, i.e., to be more conservative
  • The attempt to obtain exact unconditional \(P\)-values for the simple \(2\times 2\) contingency table has stumped frequentist statisticians for many decades Choi et al. (2015)
  • By contrast, Bayesian posterior probabilities for the true unconditional quantity of interest are exact
    • Frequentist confidence limits and \(P\)-values are approximate because they use the sample space, and the sample space is discrete when the response variable is categorical
    • Bayes does not consider the sample space, only the parameter space, which is almost always continuous
  • See stats.stackexchange.com/questions/14226 for discussion

6.5 Sample Size and Power for Comparing Two Independent Samples

  • Power \(\uparrow\) as
    • \(n_{1}, n_{2} \uparrow\)
    • \(\frac{n_{2}}{n_{1}} \rightarrow 1.0\) (usually)
    • \(\Delta = |p_{1}-p_{2}| \uparrow\)
    • \(\alpha \uparrow\)
  • There are approximate formulas such as the recommended methods in Altman based on transforming \(\hat{p}\) to make it have a variance that is almost independent of \(p\)
  • Example: Using current therapy, 0.5 of the population is free of infection at 24 hours. Adding a new drug to the standard of care is expected to increase the percentage infection-free to 0.7. If we randomly sample 100 subjects to receive standard care and 100 subjects to receive the new therapy, what is the probability that we will be able to detect a certain difference between the two therapies at the end of the study?

\[ p_{1}=.5, p_{2}=.7, n_{1}=n_{2}=100 \] results in a power of 0.83 when \(\alpha=0.05\)

bpower(.5, .7, n1=100, n2=100)
  • When computing sample size to achieve a given power, the sample size \(\downarrow\) when
    • power \(\downarrow\)
    • \(\frac{n_{2}}{n_{1}} \rightarrow 1.0\)
    • \(\Delta \uparrow\)
    • \(\alpha \uparrow\)
  • Required sample size is a function of both \(p_{1}\) and \(p_{2}\)
  • Example:

How many subjects are needed to detect a 0.8 fold decrease in the probability of colorectal cancer if the baseline probability of cancer is \(0.0015\)? Use a power of 0.8 and a type-I error probability of 0.05.

\[\begin{array}{c} p_{1}=0.0015, p_{2}=0.8\times p_{1}=0.0012, \alpha=0.05, \beta=0.2 \\ n_{1}=n_{2}=235,147 \end{array}\]

(Rosner estimated 234,881)

bsamsize(.0015, 0.8 * .0015, alpha=0.05, power=0.8)
      n1       n2 
235147.3 235147.3 

Formulas for power and sample size may be seen as R code found at

6.6 Confidence Interval

An approximate \(1-\alpha\) 2-sided CL is given by \[\hat{p}_{1}-\hat{p}_{2} \pm z_{1-\alpha/2} \times\sqrt{\frac{\hat{p}_{1}(1-\hat{p}_{1})}{n_{1}}+\frac{\hat{p}_{2}(1-\hat{p}_{2})}{n_{2}}}\] where \(z_{1-\alpha/2}\) is the critical value from the normal distribution (1.96 when \(\alpha=0.05\)).

The CL for the number of patients needed to be treated to save one event may simply be obtained by taking the reciprocal of the two confidence limits.1

1 If a negative risk reduction is included in the confidence interval, set the NNT to \(\infty\) for that limit instead of quoting a negative NNT. There is more to this; see bit.ly/datamethods-nnt.

6.7 Sample Size for a Given Precision

  • Goal: Plan a study so that the margin of error is sufficiently small
  • The margin of error (\(\delta\)) is defined to be half of the confidence interval width. For two proportions, \[ \delta = z_{1-\alpha/2} \times\sqrt{\frac{\hat{p}_{1}(1-\hat{p}_{1})}{n_{1}}+\frac{\hat{p}_{2}(1-\hat{p}_{2})}{n_{2}}} \]
  • Basing the sample size calculations on the margin of error can lead to a study that gives scientifically relevant results even if the results are not statistically significant.
  • Example: Suppose that the infection risk in a population is \(0.5\) and a reduction to \(0.4\) is believed to be a large enough reduction that it would lead to a change in procedures. A study of a new treatment is planned so that enough subjects will be enrolled for the margin of error is \(0.05\). Consider these two possible outcomes:
  1. The new treatment is observed to decrease infections by \(0.06\) (\(0.95\) CI: \([0.11, 0.01]\)). The confidence interval does not contain \(0\), so we have indirect evidence2 that the new treatment is effective at reducing infections. \(0.1\) is also within the confidence interval limits.
  2. The new treatment is observed to decrease infections by only \(0.04\) (\(0.95\) CI: \([0.09, -0.01]\)). The confidence interval now contains \(0\), so we do not have enough evidence to reject the supposition that there is no effect of the treatment on reducing infections if we are bound to an arbitrary \(\alpha=0.05\). However, the confidence interval also does not contain \(0.10\), so we are able to indirectly rule out a scientifically relevant decrease in infections.

2 To obtain direct evidence requires Bayesian posterior probabilities.

  • For fixed \(n_{1} = n_2 = n\), confidence intervals for proportions have the maximum width, when \(p_{1} = p_{2} = 0.5\). This can be shown by:
    • Recall that the variance formula for the difference in two proportions when calculating a confidence interval is \[ \frac{p_{1}(1-p_{1})}{n_{1}}+\frac{p_{2}(1-p_{2})}{n_{2}} \]
    • When \(p_{1} = p_{2} = p\) and \(n_{1} = n_2 = n\), the variance formula simplifies to \[ \frac{p(1 - p)}{n}+\frac{p (1- p)}{n} = 2\frac{p(1 - p)}{n} \]
    • Then, for any fixed value of \(n\) (\(e.g. n = 1\) or \(10\)), \(2\frac{p(1 - p)}{n}\) is largest when \(p = 0.5\). With \(p= 0.5\), the variance formula further simplifies to \[ 2\frac{.25}{n} = \frac{1}{2n} \]
  • Using \(\alpha = 0.05\) (\(z_{1-\alpha/2} = 1.96\)), the worst-case margin of error will be \[ \delta = 1.96 \sqrt{\frac{1}{2n}} \]
  • By solving for \(n\), we can rearrange this formula to be \[ n = \frac{1.92}{\delta^2} \]
  • This formula then gives the number of subjects needed in each group \(n\) to obtain a given margin of error \(\delta\). For a margin of error of \(0.05\) (\(\delta = 0.05\)), \(n = \frac{1.92}{0.05^2} = 768\) subjects in each group.
diff <- .05
qnorm(.975)^2 / 2 / (diff ^ 2)
[1] 768.2918

6.8 Relative Effect Measures

  • We have been dealing with risk differences which are measures of absolute effect
  • Measures of relative effect include risk ratios and odds ratios
  • Risk ratios are easier to interpret but only are useful over a limited range of prognosis (i.e., a risk factor that doubles your risk of lung cancer cannot apply to a subject having a risk above 0.5 without the risk factor)
  • Odds ratios can apply to any subject
  • In large clinical trials treatment effects on lowering probability of an event are often constant on the odds ratio scale
  • OR = Odds ratio = \(\frac{\frac{p_{1}}{1-p_{1}}}{\frac{p_{2}}{1-p_{2}}}\)
  • Testing \(H_{0}\): OR=1 is equivalent to testing \(H_{0}:p_{1}=p_{2}\)
  • There are formulas for computing confidence intervals for odds ratios
  • Odds ratios are most variable when one or both of the probabilities are near 0 or 1
  • We compute CLs for ORs by anti-logging CLs for the log OR
  • In the case where \(p_{1}=p_{2}=0.05\) and \(n_{1}=n_{2}=n\), the standard error of the log odds ratio is approximately \(\sqrt{\frac{42.1}{n}}\)
  • The common sample size \(n\) needed to estimate the true OR to within a factor of 1.5 is 984 with \(p\)s in this range
  • To show the multiplicative margins of error3 for a range of sample sizes and values of \(p\). For each scenario, the margin of error assumes that both unknown probability estimates equal \(p\).

3 Value by which to multiply the observed odds ratio to obtain the upper 0.95 confidence limit or to divide the observed odds ratio to obtain the lower 0.95 limit. The confidence interval for a ratio should always be constructed on the log scale (unless using the excellent profile likelihood interval which is transformation-invariant). We compute the margin of error on the log ratio scale just as we do for means. The margin of error can be defined as \(\frac{1}{2}\) of the width of the confidence interval for the log ratio. If you anti log that margin of error you get the multiplicative factor, i.e., the multiplicative margin of error. Suppose that the MMOE is 1.5. This means that you can get the 0.95 confidence interval for the ratio by taking the ratio’s point estimate and dividing it by 1.5 and then multiplying it by 1.5. You can loosely say that our point estimate can easily be off by a factor of 1.5.

d <- expand.grid(n=c(seq(10, 1000, by=10), seq(1100, 50000, by=100)),
                 p=c(.02, .05, .075, .1, .15, .2, .25, .3, .4, .5))
d$selor <- with(d, sqrt(2 / (p * (1 - p) * n)))
d$mmoe  <- with(d, exp(qnorm(0.975) * selor))
mb <- c(1, 1.25, 1.5, 2, 2.5, 3, 4, 5, 10, 20, 30, 40, 50, 100, 400)
ggplot(aes(x=n, y=mmoe, color=factor(p)), data=d) +
  geom_line() +
                  20000,50000)) +
  scale_y_log10(breaks=mb, labels=as.character(mb)) +
  xlab(expression(n)) + ylab('Multiplicative Margin of Error for OR') +
Figure 6.1: Multiplicative margin of error related to 0.95 confidence limits of an odds ratio, for varying \(n\) and \(p\) (different curves), assuming the unknown true probability in each group is no lower than \(p\)

6.9 Comprehensive example

6.9.1 Study Description

  • Consider patients who will undergo coronary artery bypass graft surgery (CABG)
  • Mortality risk associated with open heart surgery
  • Study question: Do emergency cases have a surgical mortality that is different from that of non-emergency cases?
  • Population probabilities
    • \(p_1\): Probability of death in patients with emergency priority
    • \(p_2\): Probability of death in patients with non-emergency priority
  • Statistical hypotheses
    • \(H_0: p_1 = p_2\) (or \(\mathrm{OR} = 1\))
    • \(H_1: p_1 \neq p_2\) (or \(\mathrm{OR} \neq 1\))

6.9.2 Power and Sample Size

  • Prior research shows that just over \(0.1\) of surgeries end in death
  • Researchers want to be able to detect a 3 fold increase in risk
  • For every 1 emergency priority, expect to see 10 non-emergency
  • \(p_1 = 0.3\), \(p_2 = 0.1\), \(\alpha = 0.05\), and power} = 0.90
  • Calculate sample sizes using the PS software for these values and other combinations of \(p_1\) and \(p_2\)
\((p_1, p_2)\) (0.3, 0.1) (0.4, 0.2) (0.03, 0.01) (0.7, 0.9)
\(n_1\) 40 56 589 40
\(n_2\) 400 560 5890 400

Check PS calculations against the R Hmisc package’s bsamsize function.

round(bsamsize(.3, .1, fraction=1/11, power=.9))
 n1  n2 
 40 399 
round(bsamsize(.4, .2, fraction=1/11, power=.9))
 n1  n2 
 56 561 
round(bsamsize(.7, .9, fraction=1/11, power=.9))
 n1  n2 
 40 399 

6.9.3 Collected Data

In-hospital mortality figures for emergency surgery and other surgery

Surgical Priority Dead Alive
Emergency 6 19
Other 11 100
  • \(\hat{p}_1 = \frac{6}{25} = 0.24\)
  • \(\hat{p}_2 = \frac{11}{111} = 0.10\)

6.9.4 Statistical Test

n1 <- 25;     n2 <- 111
p1 <- 6 / n1; p2 <- 11 / n2
or <- p1 / (1 - p1) / (p2 / (1 - p2))
[1] 2.870813
# Standard error of log odds ratio:
selor <- sqrt(1 / (n1 * p1 * (1 - p1)) + 1 / (n2 * p2 * (1 - p2)))
# Get 0.95 confidence limits
cls <- exp(log(or) + c(-1, 1) * qnorm(0.975) * selor)
[1] 0.946971 8.703085
tcls <- paste0(round(or, 2), ' (0.95 CI: [', round(cls[1], 2),
               ', ', round(cls[2], 2), '])')
# Multiplying a constant by the vector -1, 1 does +/-
x <- matrix(c(6, 19, 11, 100), nrow=2, byrow=TRUE)
     [,1] [,2]
[1,]    6   19
[2,]   11  100
chisq.test(x, correct=FALSE)

    Pearson's Chi-squared test

data:  x
X-squared = 3.7037, df = 1, p-value = 0.05429
  • Interpretation
    • Compare odds of death in the emergency group \(\left(\frac{\hat{p}_1}{1-\hat{p}_1}\right)\) to odds of death in non-emergency group \(\left(\frac{\hat{p}_2}{1-\hat{p}_2}\right)\)
    • Emergency cases have 2.87 (0.95 CI: [0.95, 8.7]) fold increased odds of death during surgery compared to non-emergency cases.

Fisher’s Exact Test

Observed marginal totals from emergency surgery dataset

Emergency \(a\) \(b\)
Other \(c\) \(d\)
  • With fixed marginal totals, there are 18 possible tables (\(a = 0, 1, \ldots 17\))
  • Can calculated probability of each of these tables
    • \(p\)-value: Probability of observing data as extreme or more extreme than we collected in this experiment
  • Exact test: \(p\)-value can be calculated “exactly” (not using the \(\chi^2\) distribution to approximate the \(p\)-value)

    Fisher's Exact Test for Count Data

data:  x
p-value = 0.08706
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.7674155 9.6831351
sample estimates:
odds ratio 

Note that the odds ratio from Fisher’s test is a conditional maximum likelihood estimate, which differs from the unconditional maximum likelihood estimate we obtained earlier. * Fisher’s test more conservative than Pearson’s \(\chi^2\) test (larger \(P\)-value)

6.10 Logistic Regression for Comparing Proportions

  • When comparing \(\geq 2\) groups on the probability that a categorical outcome variable will have a certain value observed (e.g., \(P(Y=1)\)), one can use the Pearson \(\chi^2\) test for a contingency table (or the less powerful Fisher’s “exact” test)
  • Such analyses can also be done with a variety of regression models
  • It is necessary to use a regression model when one desires to analyze more than the grouping variable, e.g.
    • analyze effects of two grouping variables
    • adjust for covariates
  • For full generality, the regression model needs to have no restrictions on the regression coefficients
    • Probabilities are restricted to be in the interval \([0,1]\) so an additive risk model cannot fit over a broad risk range
    • Odds (\(\frac{p}{1-p}\)) are restricted to be in \([0, \infty]\)
    • Log odds have no restrictions since they can be in \([-\infty, \infty]\)
  • So log odds is a good basis for regression analysis of categorical \(Y\)
    • default assumption of additivity of effects needs no restrictions
    • will still translate to probabilities in \([0,1]\)
  • The binary logistic regression model is a model to estimate the probability of an event as a flexible function of covariates
  • Let the outcome variable \(Y\) have the values \(Y=0\) (non-event) or \(Y=1\) (event) \[\mathrm{Prob}(Y = 1 | X) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \ldots))}\]
  • The sum inside the inner () is called the linear predictor (LP)
  • The binary logistic model relates LP (with no restrictions) to the event probability as so:
lp <- seq(-5, 5, length=150)
ggplot(mapping=aes(lp, plogis(lp))) + geom_line() +
  xlab('Linear Predictor') + ylab('P(Y=1)')
Figure 6.2: The logistic function
  • Notes about LP:
    • for a given problem the range may be much narrower than \([-4,4]\)
    • when any predictor \(X_j\) with a non-zero \(\beta_j\) is categorical, LP cannot take on all possible values within its range, so the above plot will have gaps
  • As a special case the model can estimate and compare two probabilities as done above, through an odds ratio
  • Logistic regression seems like an overkill here, but it sets the stage for more complex frequentist analysis as well as Bayesian analysis
  • For our case the model is as follows
  • Consider groups A and B, A = reference group
    \([x]\) denotes 1 if \(x\) is true, 0 if \(x\) is false
    Define the expit function as the inverse of the logit function, or \(\mathrm{expit}(x) = \frac{1}{1 + \exp(-x)}\) \[\begin{array}{ccc} P(Y = 1 | \mathrm{group}) &=& \frac{1}{1 + \exp(-(\beta_0 + \beta_1 [\mathrm{group } B]))} \\ &=& \mathrm{expit}(\beta_0 + \beta_1 [\mathrm{group } B])\\ P(Y = 1 | \mathrm{group A}) &=& \mathrm{expit}(\beta_0)\\ P(Y = 1 | \mathrm{group B}) &=& \mathrm{expit}(\beta_0 + \beta_1) \end{array}\]

\(\beta_0 =\) log odds of probability of event in group A \(= \log(\frac{p_1}{1 - p_1}) = \mathrm{logit}(p_1)\)
\(\beta_1 =\) increase in log odds in going from group A to group B =
\(\log(\frac{p_2}{1 - p_2}) - \log(\frac{p_1}{1 - p_1}) = \mathrm{logit}(p_2) - \mathrm{logit}(p_1)\)
\(\exp(\beta_1) =\) group B : group A odds ratio \(= \frac{\frac{p_2}{1 - p_2}}{\frac{p_1}{1 - p_1}}\)
expit\((\beta_0) = p_1\)
expit\((\beta_0 + \beta_1) = p_2\)

  • Once the \(\beta\)s are estimated, one quickly gets the B:A odds ratio and \(\hat{p}_1\) and \(\hat{p}_2\)
  • This model is saturated
    • has the maximum number of parameters needed to fully describe the situation (here, 2 since 2 groups)
    • saturated models must fit the data if the distributional and independence assumptions are met
    • logistic model has no distributional assumption
  • Logistic regression is more general and flexible than the specialized tests for proportions
    • allows testing association on continuous characteristics
    • easily extends to more than two groups
    • allows adjustment for covariates
  • Examples:
    • assess effects of subjects’ sex and country (Canada vs. US) on \(P(Y=1)\) denoted by \(p\)
      logit \(p\) = constant + logit male effect + logit Canada effect

    • same but allow for interaction
      logit \(p\) = constant + logit male effect + logit Canada effect + special effect of being male if Canadian

    • latter model is saturated with 3 d.f. so fits as well as a model with 4 proportions

      • unlike the overall Pearson \(\chi^2\) test, allows testing interaction and
      • separate effects of sex and country (e.g., 2 d.f. chunk test for whether there is a sex difference for either country, allowing for the sex effect to differ by country)

6.10.1 Test Statistics

For frequentist logistic models there are 3 types of \(\chi^2\) test statistics for testing the same hypothesis:

  • likelihood ratio (LR) test (usually the most accurate) and is scale invariant4
    • Can obtain the likelihood ratio \(\chi^2\) statistic from either the logistic model or from a logarithmic equation in the two proportions and sample sizes
  • Score test (identical to Pearson test for overall model if model is saturated)
  • Wald test (square of \(\frac{\hat{\beta}}{\mathrm{s.e.}}\) in the one parameter case; misbehaves for extremely large effects)
  • Wald test is the easiest to compute but \(P\)-values and confidence intervals from it are not as accurate. The score test is the way to exactly reproduce the Pearson \(\chi^2\) statistic from the logistic model.
  • As with \(t\)-test vs. a linear model, the special case tests are not needed once you use the logistic model framework
  • Three usages of any of these test statistics:
    • individual test, e.g. sex effect in sex-country model without interaction
    • chunk test, e.g. sex + sex \(\times\) country interaction 2 d.f. test
      tests overall sex effect
    • global test of no association, e.g. 3 d.f. test for whether sex or country is associated with \(Y\)

4 The LR test statistic is the same whether testing for an absolute risk difference of 0.0, or for an odds or risk ratio of 1.0.

6.10.2 Frequentist Analysis Example

  • Consider again our emergency surgery example
  • String the observations out to get one row = one patient, binary \(Y\)
priority <- factor(c(rep('emergency', 25), rep('other', 111)), c('other', 'emergency'))
death <- c(rep(0, 19), rep(1, 6), rep(0, 100), rep(1, 11))
table(priority, death)
priority      0   1
  other     100  11
  emergency  19   6
d  <- data.frame(priority, death)
dd <- datadist(d); options(datadist='dd')
# rms package needs covariate summaries computed by datadist
f <- lrm(death ~ priority)

Logistic Regression Model

lrm(formula = death ~ priority)
Model Likelihood
Ratio Test
Rank Discrim.
Obs 136 LR χ2 3.20 R2 0.044 C 0.597
0 119 d.f. 1 R21,136 0.016 Dxy 0.193
1 17 Pr(>χ2) 0.0737 R21,44.6 0.048 γ 0.483
max |∂log L/∂β| 4×10-9 Brier 0.106 τa 0.043
β S.E. Wald Z Pr(>|Z|)
Intercept  -2.2073  0.3177 -6.95 <0.0001
priority=emergency   1.0546  0.5659 1.86 0.0624
  • Compare the LR \(\chi^2\) of 3.2 with the earlier Pearson \(\chi^2\) of 3.70
  • The likelihood ratio (LR) \(\chi^2\) test statistic and its \(P\)-value are usually a little more accurate than the other association tests
    • but \(\chi^2\) distribution is still only an approximation to the true sampling distribution
  • See that we can recover the simple proportions from the fitted logistic model:
    \(\hat{p}_1 = \frac{11}{111} = 0.099 = \mathrm{expit}(-2.2073)\)
    \(\hat{p}_2 = \frac{6}{25} = 0.24 = \mathrm{expit}(-2.2073 + 1.0546)\)
Effects   Response: death
Low High Δ Effect S.E. Lower 0.95 Upper 0.95
priority --- emergency:other 1 2 1.055 0.5659 -0.05449 2.164
Odds Ratio 1 2 2.871 0.94700 8.703
  • The point estimate and CLs for the odds ratio is the same as what we obtained earlier.

Add a random binary variable to the logistic model—one that is correlated with the surgical priority—to see the effect on the estimate of the priority effect

randomUniform <- runif(length(priority))
random <- ifelse(priority == 'emergency', randomUniform < 1/3,
                                          randomUniform > 1/3) * 1
d$random <- random
table(priority, random)
priority     0  1
  other     39 72
  emergency 17  8
lrm(death ~ priority + random)

Logistic Regression Model

lrm(formula = death ~ priority + random)
Model Likelihood
Ratio Test
Rank Discrim.
Obs 136 LR χ2 6.03 R2 0.082 C 0.659
0 119 d.f. 2 R22,136 0.029 Dxy 0.319
1 17 Pr(>χ2) 0.0489 R22,44.6 0.086 γ 0.464
max |∂log L/∂β| 1×10-8 Brier 0.103 τa 0.070
β S.E. Wald Z Pr(>|Z|)
Intercept  -1.6851  0.4157 -4.05 <0.0001
priority=emergency   0.7811  0.5909 1.32 0.1862
random  -0.9319  0.5621 -1.66 0.0973

The effect of emergency status is diminished, and the random grouping variable, created to have no relation to death in the population, has a large apparent effect.

6.10.3 Bayesian Logistic Regression Analysis

  • Has several advantages
    • All calculations are exact (to within simulation error) without changing the model
    • Can incorporate external information
    • Intuitive measures of evidence
    • Automatically handles zero-frequency cells (priors shrink probabilities a bit away from 0.0)
  • Simple to use beta priors for each of the two probabilities
  • But we’d need to incorporate a complex dependency in the two priors because we know more about how the two probabilities relate to each other than we know about each absolute risk
  • Simpler to have a wide prior on \(p_1\) and to have a non-flat prior on the log odds ratio
    • could back-solve and show dependency between knowledge of \(p_1\) and \(p_2\)
  • Use the same data model as above
  • As before we use the R brms package which makes standard modeling easy
  • Need two priors: for intercept \(\beta_0\) and for log odds ratio \(\beta_1\)
  • \(\beta_0\): use a normal distribution that makes \(p_1 = 0.05\) the most likely value (put mean at logit(0.05) = -2.944) and allows only a 0.1 chance that \(p_1 > 0.2\); solve for SD \(\sigma\) that accomplishes that
# Given mu and value, solve for SD so that the tail area of the normal distribution beyond value is prob
normsolve <- function(mu, value, prob) (value - mu) / qnorm(1 - prob)
normsolve(qlogis(0.05), qlogis(0.2), 0.1)  # qlogis is R logit()
[1] 1.215827
  • We round \(\sigma\) to 1.216
  • For \(\beta_1\) put a prior that has equal chance for OR \(< 1\) as for OR \(> 1\), i.e., mean for log OR of zero
    Put a chance of only 0.1 that OR \(> 3\)
normsolve(0, log(3), 0.1)
[1] 0.8572517
  • Round to 0.857 Compute the correlation between prior evidence for \(p_1\) and \(p_2\) by drawing 100,000 samples from the prior distributions. Also verify that prior probability \(p_2 > p_1\) is \(\frac{1}{2} =\) prob. OR \(> 1\).
b0 <- rnorm(100000, qlogis(0.05), 1.216)
b1 <- rnorm(100000, 0, 0.857)
p1 <- plogis(b0)
p2 <- plogis(b0 + b1)
cor(b0, b1, method='spearman')
[1] 0.003724938
cor(p1, p2, method='spearman')
[1] 0.8066422
# Define functions for posterior probability operator and posterior mode
P <- mean   # proportion of posterior draws for which a condition holds
pmode <- function(x) {
  z <- density(x)

P(p2 > p1)
[1] 0.50106
P(b1 > 0)
[1] 0.50106
P(exp(b1) > 1)
[1] 0.50106

To show that prior knowledge about \(p_1\) and \(p_2\) is uncorrelated when we don’t know anything about the odds ratio, repeat the above calculation use a SD of 1000 for the log odds ratio:

b1 <- rnorm(100000, 0, 1000)
p2 <- plogis(b0 + b1)
cor(p1, p2, method='spearman')
[1] 0.001448724

Now do Bayesian logistic regression analysis.

# Tell brms/Stan to use all available CPU cores less 1 and to use
# cmdstan instead of rstan
options(mc.cores=parallel::detectCores() - 1, brms.backend='cmdstanr')
p <- c(prior(normal(-2.944, 1.216), class='Intercept'),
       prior(normal(0, 0.857),      class='b'))
f <- brm(death ~ priority, data=d, prior=p, family='bernoulli', seed=123)
 Family: bernoulli 
  Links: mu = logit 
Formula: death ~ priority 
   Data: d (Number of observations: 136) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            -2.20      0.31    -2.82    -1.63 1.00     2907     2772
priorityemergency     0.72      0.49    -0.28     1.64 1.00     2895     2630

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

# Bring out posterior draws
w <- as.data.frame(f)
b0 <- w[, 'b_Intercept']
b1 <- w[, 'b_priorityemergency']
r  <- rbind(c(mean(b0), median(b0), pmode(b0)),
            c(mean(b1), median(b1), pmode(b1)),
            c(mean(exp(b1)), median(exp(b1)), pmode(exp(b1))))
colnames(r) <- c('Posterior Mean', 'Posterior Median', 'Posterior Mode')
rownames(r) <- c('b0', 'b1', 'OR')
round(r, 3)
   Posterior Mean Posterior Median Posterior Mode
b0         -2.204           -2.195         -2.212
b1          0.724            0.732          0.745
OR          2.323            2.079          1.612

Because the prior on the OR is conservative, the Bayesian posterior mode for the OR is smaller than the frequentist maximum likelihood estimate of 2.87.

Below notice how easy it is to do Bayesian inference on derived quantities p1 and p2 which are functions of b0 and b1.

# 0.95 credible interval for log odds ratio and odds ratio
quantile(b1, c(0.025, 0.975))
      2.5%      97.5% 
-0.2759717  1.6387450 
quantile(exp(b1), c(.025, 0.975))
     2.5%     97.5% 
0.7588348 5.1487040 
exp(quantile(b1,  c(0.025, 0.975)))
     2.5%     97.5% 
0.7588344 5.1487038 
# Posterior density of emergency:other odds ratio
ggplot(mapping=aes(x=exp(b1))) + geom_density() +
  xlab('OR') + ylab('') +
  geom_vline(xintercept=c(1, pmode(exp(b1))), alpha=I(0.25))
# Probability that OR > 1
P(exp(b1) > 1)
[1] 0.92825
# Probability it is > 1.5
P(exp(b1) > 1.5)
[1] 0.74075
# Probability that risk with emergency surgery exceeds that of
# non-emergency (same as P(OR > 1))
# plogis in R is 1/(1 + exp(-x))
P(plogis(b0 + b1) > plogis(b0))
[1] 0.92825
# Prob. that risk with emergency surgery elevated by more than 0.03
P(plogis(b0 + b1) > plogis(b0) + 0.03)
[1] 0.817
Figure 6.3: Posterior distribution of odds ratio

Even though the priors for the intercept and log odds ratio are independent, the connection of these two parameters in the data likelihood makes the posteriors dependent as shown with Spearman correlations of the posterior draws below. Also get the correlation between evidence for the two probabilities. These have correlated priors even though they are unconnected in the likelihood function. Posteriors for \(p_1\) and \(p_2\) are less correlated than their priors.

cor(b0, b1, method='spearman')
[1] -0.4228229
cor(plogis(b0), plogis(b0 + b1), method='spearman')
[1] 0.1881691

To demonstrate the effect of a skeptical prior:

  • Add random grouping to model as we did with the frequentist analysis
  • Make use of prior information that this variable is unlikely to be important
  • Put a prior on the log OR for this variable centered at zero with chance that the OR \(> 1.25\) of only 0.05
normsolve(0, log(1.25), 0.05)
[1] 0.1356616
p <- c(prior(normal(-2.944, 1.216), class='Intercept'),
       prior(normal(0, 0.857),      class='b', coef='priorityemergency'),
       prior(normal(0, 0.136),      class='b', coef='random'))

f2 <- brm(death ~ priority + random, data=d, prior=p, family='bernoulli',
 Family: bernoulli 
  Links: mu = logit 
Formula: death ~ priority + random 
   Data: d (Number of observations: 136) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Population-Level Effects: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept            -2.16      0.31    -2.78    -1.58 1.00     3445     3192
priorityemergency     0.70      0.48    -0.26     1.62 1.00     4014     3109
random               -0.06      0.13    -0.32     0.19 1.00     4180     3011

Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
  • Effect of random is greatly discounted
  • Posterior mean priority effect and its credible interval is virtually the same as the model that excluded random

James Rae and Nils Reimer have written a nice tutorial on using the R brms package for binary logistic regression available at bit.ly/brms-lrm