# 2 Background

> Data are noisy and inferences are probabilistic. — Kruschke (2015), p. 19

## 2.1 Probabilistic Thinking and Decision Making

Optimum decisions require good data, a statistical model, and a utility/loss/cost function. The optimum decision maximizes expected utility and is a function of the Bayesian posterior distribution and the utility function. Utility functions are extremely difficult to specify in late-phase drug development. Even though we seldom provide this function explicitly, we need to provide the input (the posterior probability distribution) to decisions that will be based on an informal utility function.

- Statistics: judgment and decision making under uncertainty
- The decision maker must understand the key uncertainties
- To understand uncertainty one must understand probability
- Know the conditions being assumed and the event or assertion whose probability we are estimating
- Most useful probabilities: P(something unknown), conditioning on or assuming something known
- Forward time/information flow

Interactive demo of conditional probability

Compare backwards and forwards probabilities. PP = Bayesian posterior ("post-data") probability.

| Type | Backwards Probability | Forwards Probability |
|------|-----------------------|----------------------|
| Forecast | P(current state \| future event occurs) | P(future event \| current state) |
| Diagnosis | P(positive test \| disease) (sensitivity) | P(disease \| positive test) |
| Disease incidence | P(black \| has diabetes) | P(diabetes \| black) |
| General | P(data \| assertion X) | P(assertion X true \| data) |
| P-value vs. PP | P(data in general more extreme \| no effect) | P(effect \| these data) |
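The diagnosis row can be made concrete: sensitivity and specificity are backwards probabilities, and Bayes' theorem converts them into the forward probability P(disease | positive test). A minimal R sketch, in which all numeric inputs are assumed values chosen purely for illustration:

```r
# Convert backwards probabilities (sensitivity, specificity) into the
# forward probability P(disease | positive test) with Bayes' theorem.
# All numeric inputs are assumed values chosen for illustration.
sens <- 0.90   # P(positive test | disease)
spec <- 0.95   # P(negative test | no disease)
prev <- 0.01   # P(disease): pre-test probability
ppos <- sens * prev + (1 - spec) * (1 - prev)  # P(positive test), by total probability
post <- sens * prev / ppos                     # P(disease | positive test)
round(post, 3)   # about 0.154: small despite high sensitivity and specificity
```

Note how the forward probability depends on the prevalence, which the backwards probabilities alone cannot provide.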

Aside:

- Optimum medical diagnosis and treatment = maximizing expected utility (Vickers, 2008)
- Expected utility is a function of the forward P(disease) and not of sensitivity and specificity

The order of conditioning is all-important.

| Probability Statement | Meaning | Value |
|-----------------------|---------|-------|
| P(female \| senator) | of senators, what proportion are female | 21/100 |
| P(senator \| female) | of females, what proportion are senators | 21/165M |

Translation of assertions into probability statements:

| Assertion | Probability Statement |
|-----------|-----------------------|
| 50 year old has disease now | P(disease \| age=50) |
| disease-free 50 y.o. will get the disease within 5y | P(T ≤ 5 \| age=50), where T = time until disease |
| 50 y.o. male has disease | P(disease \| age=50 and male) |
| a drug really lowers blood pressure | P(θ < 0 \| data), where θ = unknown BP change and data = RCT data |
| drug ↓ blood pressure or ↑ exercise time | P(θ₁ < 0 or θ₂ > 0 \| data), where θ₁ = unknown BP Δ and θ₂ = unknown exercise time Δ |

Spiegelhalter (1986) is a classic paper on the power of forward probabilities in decision making about patient management and clinical trials.

Advantages when the probability is P(assertion of direct interest):

- The forward probability is self-contained and defines its own error probability for decision making
  - P(efficacy) = 0.96 ⇒ P(error) = P(drug ineffective or harmful) = 0.04
- vs. 1 − type I error = P(others will get data less extreme than ours if there is exactly zero effect of the drug)

> Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions. — Greenland et al. (2016)

## 2.2 Bayes’ Theorem

- Basis for Bayesian estimation and inference
- Is a condition-reversal formula
- Consider condition $A$ vs. not $A$ (denoted $\overline{A}$)
- Consider condition $B$ vs. not $B$ (denoted $\overline{B}$)
- The theorem comes from the laws of conditional and total probability
- $P(A|B)$ = probability that condition $A$ holds given that condition $B$ holds, and similarly for $P(B|A)$
- $P(A)$ is the probability that $A$ holds regardless of $B$, and likewise for $P(B)$

Bayes' theorem is stated as

$$P(B|A) = \frac{P(A|B) \times P(B)}{P(A)}$$

To avoid dealing with the marginal probability of $A$, Bayes re-expressed this equation. Consider the case where there are only two possibilities, $B$ and $\overline{B}$:

$$P(B|A) = \frac{P(A|B) \times P(B)}{(1-P(B)) \times P(A|\overline{B}) + P(B) \times P(A|B)}$$

This can be written as posterior odds = prior odds $\times$ likelihood ratio, since a probability equals $\frac{\text{odds}}{1 + \text{odds}} = \frac{1}{\frac{1}{\text{odds}} + 1}$:

$$\frac{P(B|A)}{1 - P(B|A)} = \frac{P(B)}{1 - P(B)} \times \frac{P(A|B)}{P(A|\overline{B})}$$
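As a quick numeric check that the odds route reproduces the direct form of Bayes' theorem, here is a small R sketch (all numeric inputs are assumed values for illustration):

```r
# Check that posterior odds = prior odds x likelihood ratio gives the same
# answer as the direct form of Bayes' theorem. Numeric inputs are assumed.
pB    <- 0.10   # prior P(B)
pA_B  <- 0.80   # P(A | B)
pA_nB <- 0.20   # P(A | not B)
direct  <- pA_B * pB / (pA_B * pB + pA_nB * (1 - pB))  # direct Bayes
odds    <- (pB / (1 - pB)) * (pA_B / pA_nB)            # posterior odds
viaodds <- odds / (1 + odds)                           # odds back to a probability
c(direct, viaodds)   # both about 0.308
```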

Bayes' rule is visualized in the contingency table below (from Wikimedia Commons), in which $w$, $x$, $y$, and $z$ represent frequency counts. Think of condition $A$ as representing a positive diagnostic test, condition not $A$ ($\overline{A}$) as a negative test, case $B$ as "diseased", and case $\overline{B}$ as "non-diseased".

- Will use Bayes' theorem to make a probability statement about the drug effect given the data
- The data model is a function of the unknown drug effect
- Video tutorial on Bayes basics

Bayes’ theorem dictates that two individuals starting with the same beliefs (distribution) about an unknown parameter, who are given the same data, use the same data model, and agree not to redefine their prior beliefs after seeing the data, must have identical beliefs about the parameter (same conclusion about drug effectiveness) after analyzing the data.

## 2.3 What is a Posterior Probability?

- Bayes incorporates baseline beliefs (the prior) with observed data (trial results) to get a posterior probability (PP)
- Flexibility to ask multiple and complex questions, e.g.
  - what is P(drug has at least a specific effect size and avoids a specific adverse effect)?
- Posterior density function: like a histogram
  - x-axis: drug effect size
  - y-axis: probability or relative degree of belief in that effect size
- Uses:
  - area under the curve from a to b is P(effect between a and b)
  - area to the right of b is P(effect > b)
- Answers the question clinicians and regulators are most interested in:
  - what is P(drug does X)?

  not
  - P(observing an even larger effect if the drug does nothing)
- Frequentist probabilities:
  - are long-run relative frequencies
  - don't apply to one-time events, e.g. experiments that cannot be repeated
- Bayesian PPs:
  - can be fully objective/mechanistic/long-run relative frequencies, as in the upcoming coin flipping example
  - usually we don't have unassailable truth for prior knowledge, so the anchor is quantified by a degree of belief
    - but if we have a certain pre-data belief, Bayes' theorem dictates the post-data degree of belief we must have
  - use new evidence to update a prior probability to a PP
  - the entire Bayesian machine can be derived from the most basic axioms of probability (e.g., that the sum of probabilities over all possibilities is 1.0)
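The "area under the curve" idea above can be made concrete. A minimal sketch, assuming (purely for illustration) that the posterior density for a drug's blood pressure effect θ is normal with mean −3 and SD 1.5:

```r
# Posterior probabilities as areas under an assumed posterior density.
# The normal(-3, 1.5) posterior here is an assumption for illustration.
m <- -3; s <- 1.5
p_benefit <- pnorm(0, m, s)        # P(theta < 0 | data): area to the left of 0
p_large   <- pnorm(-2, m, s)       # P(theta < -2 | data): a larger BP reduction
p_between <- p_benefit - p_large   # P(-2 < theta < 0 | data): area between -2 and 0
round(c(p_benefit, p_large, p_between), 3)
```

Each quantity is a direct forward probability about the drug effect, obtained by integrating the posterior density over the region of interest.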

## 2.4 Example: Prior to Posterior

- A novelty coin maker makes biased coins with P(heads) = 0.6
- He randomly mixes in fair coins; 3/10 of the coins are fair
- Choose a coin from the whole batch at random and toss it 40 times
- Observe 23 heads out of 40 tosses
- Let θ = unknown probability of heads
- Only 2 values of θ are possible: 0.5 and 0.6
  - the prior distribution for θ has two values where the probability is nonzero
  - the prior probability is zero for θ other than 0.5 and 0.6
- Let y = observed number of heads and n = number of tosses
- PP(θ = 0.5) is proportional to $0.3 \times 0.5^{y} \times 0.5^{n - y}$
- PP(θ = 0.6) is proportional to $0.7 \times 0.6^{y} \times 0.4^{n - y}$
- Summing these two quantities provides the normalizing constant that makes the two probabilities sum to 1.0
- The posterior probabilities are shown as a function of y in the figure below, for n = 40
```r
spar(bty='l')                   # spar: plot-parameter helper from the course setup
prior1 <- 0.3; prior2 <- 0.7    # prior P(fair), prior P(biased)
n <- 40
y <- 5:35
post1 <- prior1 * (0.5 ^ y) * (0.5 ^ (n - y))   # fair coin: P(heads) = 0.5
post2 <- prior2 * (0.6 ^ y) * (0.4 ^ (n - y))   # biased coin: P(heads) = 0.6
total <- post1 + post2          # normalizing constant
post1 <- post1 / total
post2 <- post2 / total
plot(y, post1, type='n',
     xlab=expression(y), ylab='Posterior Probability')
abline(v=c(20, 23), col=gray(.9))
abline(h=prior2, col=gray(.9))
lines(y, post1)
lines(y, post2, col='blue')
text(23.5, 0.75, 'P(coin is 0.6 unfair) ↑\nwith # heads', adj=0)
text(23.5, 0.28, 'P(coin is fair) ↓\nwith # heads', adj=0)
```
```r
op1 <- round(post1[y == 23], 2)   # posterior P(coin is fair)   when y = 23
op2 <- round(post2[y == 23], 2)   # posterior P(coin is biased) when y = 23
```
- With y = 23 the PPs are 0.22 and 0.78, so the evidence for fairness has lessened relative to the prior (which gave fairness probability 0.3)
- Breakeven point y = 20: the coin is about equally likely to be fair or biased, because the data at y = 20 (half heads) slightly favor fairness, which offsets the prior predisposition toward a biased coin

## 2.5 Bayesian Inference Model: General Case

- The example above had only a limited number of possibilities for θ
- Usually we have a continuous parameter space for θ (a difference in means, etc.)
- Must use probability density functions instead of discrete probabilities
  - pdf: P(θ in a small interval) divided by the width of the interval, as the width → 0
- General Bayes formula: p(θ | data y) ∝ p(y | θ) × p(θ) = data likelihood × prior

  prior belief about θ $\stackrel{\text{data}}{\longrightarrow}$ current belief about θ
- See Wasserman for a nice overview of the general Bayes equation
- If you have degree of belief p(θ) before seeing data from the experiment, you must have degree of belief p(θ | y) after unveiling the data
- Summarize the posterior density p(θ | y) by posterior means, quantiles, cumulative probabilities, and interval probabilities
  - must integrate over θ to get such quantities
  - it is easier to sample from the posterior than to derive the posterior mathematically
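The last point can be sketched with a grid approximation: evaluate likelihood × prior over a fine grid of θ values, normalize, and draw samples from the result instead of deriving the posterior analytically. The data here (23 heads in 40 tosses, echoing the coin example) and the flat prior are assumptions chosen for illustration:

```r
# Grid approximation to a posterior for a binomial proportion theta,
# followed by sampling from that posterior. Data (23 successes in 40
# trials) and the flat prior are assumptions for illustration.
set.seed(1)
theta <- seq(0.001, 0.999, length=1000)   # fine grid over the parameter space
prior <- rep(1, length(theta))            # flat prior
like  <- dbinom(23, 40, theta)            # data likelihood at each grid value
post  <- like * prior
post  <- post / sum(post)                 # normalize so grid probabilities sum to 1
draws <- sample(theta, 10000, replace=TRUE, prob=post)
mean(draws > 0.5)                # simulation estimate of P(theta > 0.5 | data)
quantile(draws, c(.025, .975))   # 0.95 posterior interval for theta
```

With a flat prior this posterior is Beta(24, 18), so the simulated quantities can be checked against the exact ones; posterior sampling generalizes to models where no closed form exists.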

### 2.5.2 One-Sample Mean Examples

The following examples are for the simple case where the data come from a normal distribution with standard deviation 1.0, and we are interested in estimating the unknown mean μ that generated the data. Large values of μ are "better", and we assume a prior distribution for μ that is normal with mean zero and with various values of σ. In this "normal data, normal prior" case, the posterior distribution is also normal. Precisions (reciprocals of variances) add: the reciprocal of the posterior variance is the sum of the reciprocals of the prior variance and the variance of $\overline{x}$ (which in our situation is 1/n). The mean of the posterior is a weighted average of the prior mean and $\overline{x}$, with weights equal to, respectively, the ratio of the posterior variance to the prior variance and the ratio of the posterior variance to the variance of $\overline{x}$. When the prior mean is zero, the posterior mean is just a shrunken version of $\overline{x}$.
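As a numeric check of this update rule (the values n = 10, $\overline{x}$ = 1, and prior SD 1 are assumed for illustration; the data SD is 1 as in the text):

```r
# Closed-form normal-data, normal-prior update: posterior precision is the
# sum of prior and data precisions, and the posterior mean is a
# precision-weighted average. n, xbar, and the prior SD are assumed values.
n      <- 10
xbar   <- 1
vprior <- 1                                    # prior variance (prior mean 0)
vxbar  <- 1 / n                                # variance of the sample mean
vpost  <- 1 / (1 / vprior + 1 / vxbar)         # precisions add
mpost  <- vpost * (0 / vprior + xbar / vxbar)  # precision-weighted mean
c(mpost, vpost)   # posterior mean 10/11: xbar shrunk toward the prior mean 0
```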

In specifying a prior distribution it is often convenient to specify an effect size and the probability that the effect exceeds that size. The prior SD σ is computed here so that the prior probability P(μ > 2) matches a specified value (either 0.01, 0.05, or 0.25). Larger chances of μ being large correspond to larger σ. The observed sample mean $\overline{x}$ is set at three possible values: 0, 1, 2.

```r
require(plotly)
ppost <- function(n) {
  d <- expand.grid(xbar=0:2, Plarge=c(.01, .05, .25))
  p <- plot_ly()
  for(i in 1 : nrow(d)) {
    w      <- d[i, ]
    xbar   <- w$xbar
    plarge <- w$Plarge
    psigma <- 2 / qnorm(1 - plarge)    # prior SD such that P(mu > 2) = plarge
    vs     <- 1 / n                    # variance of xbar when data SD = 1
    vpre   <- psigma ^ 2               # prior variance
    vpost  <- 1 / (1 / vpre + 1 / vs)  # posterior variance: precisions add
    mpost  <- xbar * vpost / vs        # posterior mean (prior mean is 0)
    mu     <- seq(-3, 3, by=0.02)
    lab    <- paste0('<span style="text-decoration: overline">x</span>=', xbar,
                     ' prior P(μ > 2)=', plarge)
    pp     <- dnorm(mu, mpost, sqrt(vpost))
    priorp <- dnorm(mu, 0, psigma)
    vis <- if(i == 1) TRUE else 'legendonly'
    p <- p %>% add_lines(x=~mu, y=~priorp, hoverinfo='none', visible=vis,
                         legendgroup=lab, name=paste0(lab, ':', 'prior'),
                         opacity=0.2, data=data.frame(mu, priorp))
    pexceed <- 1 - pnorm(mu, mpost, sqrt(vpost))
    txt <- paste0('μ=', round(mu, 2), ' P(μ > ', round(mu, 2), ')=',
                  round(pexceed, 2))
    p <- p %>% add_lines(x=~mu, y=~pp, text=~txt, hoverinfo='text',
                         visible=vis, legendgroup=lab,
                         name=paste0(lab, ':', 'posterior'),
                         data=data.frame(mu, pp, txt))
  }
  p %>% layout(xaxis=list(title='μ'),
               yaxis=list(title='Probability Density Function',
                          range=c(0, 1.35)))
}
ppost(5)
```

```r
ppost(10)
```

```r
ppost(25)
```