5 Statistical Inference

ABD 6

5.1 Overview

Inferential modes
- hypothesis testing
- relative model fit/relative support for hypotheses (likelihood ratios, Bayes factors)
- estimation (including interval estimation; often more appropriate than hypothesis testing)
- Bayesian probability of an effect in the right direction (more relevant to decision making; more actionable)
Contrast inference with decision making:
- acting as if something is true whether or not it is actually true
Hypothesis testing is most useful for inferential “existence” questions
- is indirect for other types of questions
Majority of hypothesis tests are on a single point
These place asymmetric importance on “special values” such as zero treatment effect
What does it mean to “not reject a hypothesis”?
- often very little
What does it mean to reject a hypothesis with a statistical test?
- in statistics it means that some aspect of a model is under suspicion (e.g., normality assumption)
- even if all other assumptions of the data model are satisfied, when the hypothesis involves a single point (e.g. zero effect), the alternative space is infinite so what have we learned about a specific alternative (e.g., that a treatment lowers blood pressure by 10 mmHg)?

Statistical hypothesis testing involves a model for data:

Parametric tests have very specific models
Nonparametric tests have semi-specific models without a distribution assumption
Permutation tests make an assumption about the best data summarization (e.g., mean, and the mean may not be the best summary for a heavy-tailed data distribution)

This chapter covers parametric tests for the following reasons:

historical
they are very useful for sample size estimation
occasionally one has prior information that the raw data, or differences from pre to post, actually follow a normal distribution, and with large effects one can get quite significant results in very small samples with parametric tests
to show that Bayesian parametric tests make fewer assumptions about the data model Nonparametric methods are covered in Chapter Chapter 7.

Back to data model—the primary basis for statistical inference:

Assume a model for how the data were generated
Model contains
- main parameters of interest (e.g., means)
- auxiliary parameters for other variables (e.g., confounders)
- parameters for sources of between-subject variability
- if parametric, a distribution function such as Gaussian (normal)
- if nonparametric/semiparametric, a connection between distributions for different types of subjects (link function)
- if longitudinal, a correlation pattern for multiple measurements within a subject
- assumptions about censoring, truncation, detection limits if applicable
- …
Example (2-sample $t$-test): $Y = \mu_{0} + \delta [\textrm{treatment B}] + \epsilon$
- $\mu_{0}$: unknown data-generating mean for treatment A
- $\delta$: unknown data-generating difference in means (B-A)
- $[\textrm{treatment B}]$: an indicator variable (1 if observation is from treatment B, 0 if from treatment A)
- $\epsilon$: irreducible error, i.e., unaccountable subject-to-subject variation; biologic variability; has variance $\sigma^2$
- Primary goal: uncover the hidden value of $\delta$ generating our dataset in the presence of noise $\epsilon$
  - higher $\sigma^{2} \rightarrow$ larger $|\epsilon| \rightarrow$ harder to uncover $\delta$ (the signal) from the noise
  - were $\epsilon$ always zero (no uncertainties), one directly observes $\delta$ and no statistical analysis is needed
Rejecting a straw-man hypothesis implies that some aspect of the model is in doubt
- that aspect may be the distribution or an assumption about equal variances, not the difference in means you care about

Common error: Using a two-stage approach to select which data model to use:

Assumes the data contain rich enough information to lead to a correct decision
Alters the operating characteristics of the final test
Fails to realize that nonparametric tests have excellent power

Example: Testing normality to decide on whether to use a $t$-test vs. a Wilcoxon-Mann-Whitney two-sample rank test. A two-stage test with a test for normality in the first stage assumes that

the test for normality has power near 1.0 for our sample size
if the test rejects $H_{0}$:normality, the magnitude of non-normality is worth bothering about
pre-testing for normality does not modify the type I error of the overall testing procedure
nonparametric tests are less efficient than parametric tests In fact it may be that none of these assumptions is true (4. is very seldom true). As will be seen later, a full Bayesian model can be completely honest and provide exactly the right amount of caution:

flexible parametric model allowing uncertainty about equal variances for groups A and B
allows uncertainty about the normality assumption
still results in inference about $\delta$, properly more cautious (wider credible interval) because
- we don’t know if normality truly holds
- we don’t know if equality of variance truly holds

5.1.1 Central Limit Theorem

The CLT is frequently misunderstood by practitioners when they use a limit theorem on their finite data to compute falsely symmetric confidence intervals. Asymmetric data must lead to asymmetric confidence intervals, otherwise confidence coverage will be inaccurate in at least one of the two tails. When the SD is unknown and must be estimated from the data, practitioners assume that the SD is a good measure of dispersion (it isn’t, for asymmetric or heavy-tailed distributions) and the CLT assumes the mean and SD are uncorrelated (they aren’t, for asymmetric distributions). An example where things go terribly wrong is when you should have taken logs but didn’t.

The essence of the CLT is

Assume observations are independent with the same distribution and have finite variance
Assume that the sample size $n$ can increase without bound
Limiting distribution of the sample mean is normal

The CLT is frequently used to justify the use of parametric statistical tests and confidence intervals even when their assumptions are violated. But the CLT (and the $t$ distribution) are much less helpful for computing confidence intervals and $P$–values than it seems Wilcox et al. (2013):

since the standard deviation is unknown, one must estimate it while estimating the mean to use the CLT
SD may be a completely inappropriate summary of dispersion (e.g., if one should have first log-transformed but computed SD on the raw scale)
if the data distribution is asymmetric, the SD is not independent of the mean (so the $t$ distribution does not hold) and the SD is not a good dispersion measure
the sample size for which the CLT is an adequate approximation is unknown for any given situation
example: log-normal distribution—the CLT is not accurate even for $n=50000$ (see below)
even when the CLT provides somewhat accurate $P$-values, it provides no comfort with regard to statistical power. Example: analyzing $Y$ when you should have analyzed $\log(Y)$ will result in a devastating increase in type II error.

Example simulation to compute confidence-coverage when using the $t$-distribution confidence interval for $n=50,000$ when analyzing data from a log-normal distribution but not taking logs. We want the confidence limits to be such that the fraction of samples for which the true population mean is to the left of the lower limit is 0.025, and the fraction to the right of the upper limit to also be 0.025.

For more about runifChanged go here. The way runifChanged is used below, the results are stored in a file whose name is the name of the Quarto/knitr chunk followed by .clt.

Code

n    <- 50000
nsim <- 5000            # number of simulations
mul <- 0; sdl <- 1.65   # on log scale
mu <- exp(mul + sdl * sdl / 2)   # population mean on orig. scale
count <- c(lower=0, upper=0)
seed <- 1
set.seed(seed)
z <- qt(0.975, n - 1)   # t critical value (near 1.96)

g <- function() {
  for(i in 1 : nsim) {
    x <- exp(rnorm(n, mul, sdl))
    ci <- mean(x) + c(-1, 1) * z * sqrt(var(x) / n)
    count[1] <- count[1] + (ci[1] > mu)
    count[2] <- count[2] + (ci[2] < mu)
  }
    count / nsim  # non-coverage prob. in left and right tails
}
f <- runifChanged(g, n, nsim, mul, sdl, seed, z)  # runifChanged in Hmisc
as.vector(f)   # should be 0.025 0.025 if CLT is accurate

[1] 0.0182 0.0406

See stats.stackexchange.com/questions/204585 and here for more information and discussion.

5.2 Hypotheses

5.2.1 Scientific Hypotheses and Questions

Scientific questions are usually stated with a direction or with regard to an expected effect, not with respect to a null effect. Scientific questions often involve estimating a quantify of interest. Here are some examples:

Does the risk of death increase when serum triglyceride increases?
To what extent is mechanism $x$ responsible for regulation of physiologic parameter $y$?
What is the average decrease in blood pressure when the dose of a drug goes from 0 to 10mg/day to 20mg/day?

5.2.2 Statistical Hypotheses

Hypothesis: usually a statement to be judged of the form
“population value = specified constant”
- $\mu = 120$mmHg
- $\mu_{1} - \mu_{2} = 0$mmHg
- Correlation between wealth and religiosity = 0
Null hypothesis is usually a hypothesis of no effect but can be $H_{0}: \mu =$ constant or $H_{0}:$ Probability of heads $=\frac{1}{2}$;
$H_{0}$ is often a straw man; something you hope to disprove
Alternative hypothesis: $H_1$; e.g.: $H_{1}: \mu \neq 120$mmHg
One-sided hypothesis (tested by 1-tailed test): $H_{1}$ is an inequality in one direction ($H_{1}: \mu > 120$mmHg)
Two-sided hypothesis (2-tailed test, most common type): $H_{1}$ involves values away from the hypothesized value in either direction

5.3 Branches of Statistics

Classical (frequentist or sampling statistics)
- Emphasizes (overemphasizes?) hypothesis testing
- Assumes $H_{0}$ is true and tries to amass evidence casting doubt on this assumption
- Conceives of data as one of many datasets that might have happened; considers the process by which the sample arose
- Inference is based on long-run operating characteristics not about direct evidence from the sample at hand
  - probability of making an assertion of an effect if there is no effect
  - proportion of the time that varying confidence intervals over replicates of the experiment cover the true unknown parameter; no statement about the chance that the current confidence interval covers the true parameter
- See if data are consistent with $H_{0}$
- Are data extreme or unlikely if $H_{0}$ is really true?
- Proof by contradiction: if assuming $H_{0}$ is true leads to results that are “bizarre” or unlikely to have been observed, casts doubt on premise
- Evidence summarized through a single statistic capturing a tendency of data, e.g., $\bar{x}$
- Look at probability of getting a statistic as or more extreme than the calculated one (results as or more impressive than ours) if $H_{0}$ is true (the $P$-value)¹
- If the statistic has a low probability of being more extreme we say that if $H_{0}$ is true we have acquired data that are very improbable, i.e., have witnessed a low probability event
- William Briggs described a basic flaw in the logic of $P$-value-guided hypothesis testing dating all the way back to Fisher:

¹ We could drop the “as” and just say “more extreme” because for continuous data the probability of getting a result exactly as extreme as ours is exactly zero.

A version of an argument given first by Fisher appears in every introductory statistics book. The original argument is this, [33]:

“Belief in a null hypothesis as an accurate representation of the population sampled is confronted by a logical disjunction: Either the null hypothesis is false, or the p-value has attained by chance an exceptionally low value.”

A logical disjunction would be a proposition of the type “Either it is raining or it is not raining.” Both parts of the proposition relate to the state of rain. The proposition “Either it is raining or the soup is cold” is a disjunction, but not a logical one because the first part relates to rain and the second to soup. Fisher’s “logical disjunction” is evidently not a logical disjunction because the first part relates to the state of the null hypothesis and the second to the p-value. Fisher’s argument can be made into a logical disjunction, however, by a simple fix. Restated: Either the null hypothesis is false and we see a small p-value, or the null hypothesis is true and we see a small p-value. Stated another way, “Either the null hypothesis is true or it is false, and we see a small p-value.” The first clause of this proposition, “Either the null hypothesis is true or it is false”, is a tautology, a necessary truth, which transforms the proposition to (loosely) “TRUE and we see a small p-value.” Adding a logical tautology to a proposition does not change its truth value; it is like multiplying a simple algebraic equation by 1. So, in the end, Fisher’s dictum boils down to: “We see a small p-value.” In other words, in Fisher’s argument a small p-value has no bearing on any hypothesis (any hypothesis unrelated to the p-value itself, of course). Making a decision about a parameter or data because the p-value takes any particular value is thus always fallacious: it is not justified by Fisher’s argument, which is a non sequitur.
— William Briggs

Ignoring all that and plunging ahead:
- $P$-value is a measure of surprise that is well described by Nicholas Maxwell²: “A p value is a measure of how embarrassing the data are to the null hypothesis”
- Then evidence mounts against $H_{0}$ and we might reject it
- A failure to reject does not imply that we have gathered evidence in favor of $H_0$ — many reasons for studies to not be impressive, including small sample size ($n$)
- Ignores clinical significance
- Is fraught with how to deal with multiplicity problems
  - No principled recipe for how they should be handled
  - Arise because
    type I error is fixed at a number $> 0$
    backward time ordering of information (transposed conditional)
  - Evidence about one question is changed according to whether other questions are asked (regardless of their answers)
Classical parametric statistics: assumes the data to arise from a certain distribution, often the normal (Gaussian distribution)
Nonparametric statistics: does not assume a data distribution; generally looks at ranks rather than raw values
Bayesian statistics:
- Considers the sample data, not how it arose from a sequence of samples but rather the data generating mechanism for this sample
- Computes the probability that a clinically interesting statement is true, e.g. that the new drug lowers population mean SBP by at least 5mmHg, given what we observed in the data
- Instead of trying to amass evidence against a single hypothesized effect size, Bayes tries to uncover the hidden parameter generating the data aside from noise (e.g., treatment effect) whatever its value
  - Provides evidence for all possible values of an unknown parameter
- More natural and direct approach but requires more work
- Because respects forward flow of time/information there is no need for nor availability of methods for correcting for multiplicity³
- Evidence about one question is not tilted by whether other questions are asked
- Can formally incorporate knowledge from other studies as well as skepticism from a tough audience you are trying to convince to use a therapy
- Starting to catch on (only been available for about 240 years) and more software becoming available
Likelihood inference⁴:
- Considers the sample data, not how it arose
- Akin to Bayesian but without the prior
- Interval estimates are based on relative likelihood (from the likelihood function) and are called likelihood support intervals
- For testing, allows both type I and type II errors $\rightarrow 0$ as $n \rightarrow \infty$, whereas with frequentist methods the type I error never shrinks as $n \rightarrow \infty$
- This greatly reduces problems with multiplicities
- Likelihood methods do not deal well with complex assertions (e.g., either the treatment reduces mortality by any amount or reduces blood pressure by at least 10 mmHg) and do not allow the use of external information
Bayesian and likelihood inference use the likelihood principle; frequentist inference does not
- Likelihood principle: All of the evidence in a sample relevant to model parameters is in the likelihood function
- If we want to know our current location, frequentist inference asks the following: If I am in Nashville, what fraction of routes to here involved the southern route I took? There are many paths to get where we are, and frequentists have to consider all possible relevant paths. Bayesian and likelihood inference states it differently: Where am I now? This involves an assessment of current evidence about my location. Asking “how did I get here?” (i.e., how did the data arise?) involves multiplicity issues that answering the simpler question does not.
- Consider a sequentially monitored randomized experiment. Bayesians and likelihoodists can make infinitely many assessments of efficacy with no penalty. On the other hand, a frequentist must think the following way:

² Data Matters: Conceptual Statistics for a Random World. Emeryville CA: Key College Publishing, 2004.

³ Bayesian inference assumes only that the prior distribution is “well calibrated” in the sense that one sticks to the pre-specified prior no matter what information is unveiled.

⁴ A key thinker and researcher in the field is Richard Royall.

I am at the first interim analysis. I am going to make later assessments of efficacy so I need to discount the current assessment and be more conservative or I will spend all my $\alpha$ already.
…
I am at the fifth interim analysis. I made four previous efficacy assessments, and even though none of them mattered, because they may have mattered I spent $\alpha$ so I need to discount the current assessment and be more conservative.

We will deal with classical parametric and nonparametric statistical tests more than Bayesian methods only because of time and because of abundance of easy-to-use software for the former

5.4 Errors in Hypothesis Testing; $P$ Values

Can attempt to reject a formal hypothesis or just compute $P$-value
Type I error: rejecting $H_0$ when it is true
$\alpha$ is the probability of making this error (typically set at $\alpha=0.05$—for weak reasons)
- To be specific: $\alpha$ = P(asserting an effect exists when in fact it doesn’t)
- So it is an assertion probability or false alarm probability like 1 minus specificity
- It is not a false positive probability (P(effect=0) given an assertion that it is nonzero) since $\alpha$ is based on an assumption that effect=0
Type II error: failing to reject $H_0$ when it is false
probability of this is $\beta$

	True state of $H_0$
Decision	$H_0$ true	$H_0$ false
Reject $H_0$	Type I error ($\alpha$)	Correct
Do Not Reject $H_0$	Correct	Type II error ($\beta$)

Power: $1 - \beta$: probability of (correctly) rejecting $H_0$ when it is false

Within the frequentist framework of statistics there are two schools. One, the Neyman-Pearson school, believes that the type I error should be pre-set at $\alpha$ (say $\alpha=0.05$) so that binary decisions can be made (reject/fail to reject). The other school due to Fisher believes that one should compute $P$-values and quote the result in the report or publication. This is the more popular approach, being less dichotomous.

Simplest definition of $\alpha$: probability that a $P$-value will be less than it if the treatment is ignorable

A $P$-value is something that can be computed without speaking of errors. It is the probability of observing a statistic as or more extreme than the observed one if $H_0$ is true, i.e., if the population from which the sample was randomly chosen had the characteristics posited in the null hypothesis.⁵ The $P$-value to have the usual meaning assumes that the data model is correct.

⁵ Note that Rosner’s Equation 7.4 in his section 7.3 is highly problematic. Classifications of “significant” or “highly significant” are arbitrary, and treating a $P$-value between 0.05 and 0.1 as indicating a “trend towards significance” is bogus. If the $P$-value is 0.08, for example, the 0.95 confidence interval for the effect includes a “trend” in the opposite (harmful) direction.

5.4.1 Problems With Type I Error

It’s not really an “error”
- Error = being wrong in asserting an effect
- Type I = P(asserting effect $| H_{0}$) = “effect assertion probability”
Frequentist designs attempt to preserve type I error
This is not the probability of making a mistake in concluding an effect is present
When $\alpha=0.05$ the probability of asserting an effect when there is none never decreases even as $n \rightarrow \infty$
(statistical vs. clinical significance problem)
$\alpha$ does not depend on any observed data. It is a pre-study concept.
$\alpha$ increases because of chances you give data to be more extreme (multiplicity), not because of chances you give hypotheses to be false. Bayesian and likelihood approaches do not look at sampling (sample space giving rise to data extremes). See this
Probability of making a mistake in asserting an effect, given the data, is one minus the Bayesian posterior probability of efficacy

5.4.2 Misinterpretation of $P$-values

$P$-values have come under extreme criticism since 2000, partially because they are often misinterpreted. Greenland et al. (2016) is the best paper that summarizes the misinterpretations and explains them. Some quotes from this paper are below, with their explanations for the first two.

The $P$ value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave $P$ = 0.01, the null hypothesis has only a 1% chance of being true; if instead it gave $P$ = 0.40, the null hypothesis has a 40% chance of being true. No! The $P$ value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The $P$ value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus $P$ = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while $P$ = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation.

The $P$ value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the $P$ value for the null hypothesis is 0.08, there is an 8% probability that chance alone produced the association. No! This is a common variation of the first fallacy and it is just as false. To say that chance alone produced the observed association is logically equivalent to asserting that every assumption used to compute the $P$ value is correct, including the null hypothesis. Thus to claim that the null $P$ value is the probability that chance alone produced the observed association is completely backwards: The $P$ value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the $P$ value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions. Note: One often sees “alone” dropped from this description (becoming “the $P$ value for the null hypothesis is the probability that chance produced the observed association”), so that the statement is more ambiguous, but just as wrong.

A significant test results ($P\leq 0.05$) means that the test hypothesis is false or should be rejected. No!

A nonsignificant test results ($P > 0.05$) means that the test hypothesis is true or should be accepted. No!

A large $P$ value is evidence in favor of the test hypothesis. No!

A null-hypothesis $P$ value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated. No!

Statistical significance indicates a scientifically or substantively important relation has been detected. No!

Lack of statistical significance indicates that the effect size is small. No!

The $P$ value is the chance of our data occurring if the test hypothesis is true; for example, $P$ = 0.05 means that the observed association would occur only 5% of the time under the test hypothesis. No!

If you reject the test hypothesis because $P \leq 0.05$, the change you are in error (the chance your “significant finding” is a false positive) is 5%. No!

$P = 0.05$ and $P \leq 0.05$ mean the same thing. No!

$P$ values are properly reported as inequalities (e.g., report “$P < 0.02$” when $P = 0.015$ or report $P > 0.05$ when $P = 0.06$ or $P = 0.70$). No!

Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. No!

One should always use two-sided $P$ values. No!

When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all $P > 0.05$), the overall evidence supports the hypothesis. No!

When the same hypothesis is tested in two different populations and the resulting $P$ values are on opposite sides of 0.05, the results are conflicting. No!

When the same hypothesis is tested in two different populations and the same $P$ values are obtained, the results are in agreement. No!

If one observes a small $P$ value, there is a good chance that the next study will produce a $P$ value at least as small for the same hypothesis. No!

The specific 95% confidence interval presented by a study has a 95% chance of containing the true effect size. No!

An effect size outside the 95% confidence interval has been refuted (or excluded) by the data. No!

If two confidence intervals overlap, the difference between the two estimates or studies is not significant. No!

An observed 95% confidence interval predicts that 95% of the estimates from future studies will fall inside the observed interval. No!

If one 95% confidence interval includes the null value and another excludes that value, the interval excluding the null is the more precise one. No!

If you accept the null hypothesis because the null $P$ value exceeds 0.05 and the power of your test is 90%, the chance you are in error (the chance that your finding is a false negative) is 10%. No!

If the null $P$ value exceeds 0.05 and the power of this test is 90% at an alternative, the results support the null over the alternative. counterexamples are easy to construct

5.5 Interval Estimation

5.5.1 Frequentist Confidence Intervals

A $1-\alpha$ two-sided confidence interval is an interval computed such that

if the experiment were repeated $N$ times one would expect $N \times (1 - \alpha)$ of the recomputed varying intervals to contain the true unknown quantity of interest
equivalently the set of all unknown population parameter values that if null hypothesized one would not reject that null hypothesis at the $\alpha$ level in a two-sided test
- e.g. when estimating the population mean $\mu$, the set $\mu_{0}$ such that the test $H_{0}: \mu = \mu_{0}$ has a $P$-value $> \alpha$ For this reason confidence intervals may better be called compatibility intervals.

Pros:

The $P$-value can be computed from the confidence interval but not vice-versa, so confidence limits have more information
Confidence intervals do not allow the “absence of evidence is not evidence of absence error”
- large $P$-values can come from small $n$ or large $\sigma^2$
- these make confidence intervals wide, giving a rational sense of uncertainty
- A confidence interval that is compatible with both large benefit and large detriment indicates that we don’t know much
  - large $P$-value means nothing more than “get more data”

Cons:

Confidence intervals have only a long-run interpretation over many experiments
They don’t provide a probability statement about whether a single interval includes the true population value
- In the frequentist world, the probability that the parameter is in a given interval is either zero or one
They are often taken to provide a measure of precision of a statistical estimate, but they’re not really that either
The experimenter controls the long-run inclusion probability $1 - \alpha$ and gets interval endpoints, but is often more interested in the probability that the population effect is in a pre-chosen fixed interval
It is very difficult to make confidence intervals incorporate known uncertainties (e.g., amount of non-normality), making them often too narrow (overoptimistic)

5.5.2 Bayesian Credible Intervals

Credible intervals have the interpretation that most researchers seek when they compute confidence intervals
A $1 - \alpha$ credible interval is an interval $[a, b]$ (computed, under a certain prior distribution encapsulating prior knowledge about the parameter $\mu$) so that
$P(a \leq \mu \leq b | \textrm{ data}) = 1 - \alpha$

Pros:

Pertains to the single, current dataset and does not provide just long-run operating characteristics
Provides a true probability statement even when the experiment could never be repeated
Is symmetric with Bayesian posterior probabilities of pre-chosen limits
- Researchers can specify $a, b$ and then compute whatever probability it is that the unknown parameter is in that interval
The credible interval can take into account all sorts of uncertainties (e.g., non-normality) that make it (correctly) wider

Cons:

One must specify a prior distribution

5.6 One Sample Test for Mean

5.6.1 Frequentist Method

Assuming continuous response from a normal distribution
One sample tests for $\mu =$ constant are unusual except when data are paired, e.g., each patient has a pre– and post–treatment measurement and we are only interested in the mean of post - pre values
$t$ tests in general: t = $$
The standard deviation of a summary statistic is called its standard error, which is the $\sqrt{}$ of the variance of the estimate
The one-sample $t$ statistic for testing a single population mean against a constant $\mu_0$ ($H_{0}$: $\mu = \mu_{0}$; often $\mu_{0} = 0$) is \[ t = \frac{\bar{x} - \mu_{0}}{se} \] where $se = \frac{s}{\sqrt{n}}$, is the standard error of the mean (SEM) and $\bar{x}$ is the sample mean

When your data comes from a normal distribution and $H_{0}$ holds, the $t$ ratio statistic follows the $t$ distribution
With small sample size ($n$), the $t$ ratio is unstable because the sample standard deviation ($s$) is not precise enough in estimating the population standard deviation ($\sigma$; we are assuming that $\sigma$ is unknown)
This causes the $t$ distribution to have heavy tails for small $n$
As $n\uparrow$ the $t$ distribution becomes the normal distribution with mean zero and standard deviation one
The parameter that defines the particular $t$ distribution to use as a function of $n$ is called the degrees of freedom or d.f.
d.f. = $n$ - number of means being estimated
For one-sample problem d.f. = $n-1$
Symbol for distribution $t_{n-1}$

Code

x <- seq(-3, 3, length=200)
w <- list(Normal     = list(x=x, y=dnorm(x)),
          't (50 df)'= list(x=x, y=dt(x, 50)),
          't (5 df)' = list(x=x, y=dt(x, 5)),
          't (2 df)' = list(x=x, y=dt(x,2)))
labcurve(w, pl=TRUE, keys='lines', col=c(1,2,4,5), lty=c(2,1,1,1),
         xlab=expression(x), ylab='')

Figure 5.1: Comparison of probability densities for $t_2$, $t_5$, $t_{50}$, and normal distributions

Two-tailed $P$-value: probability of getting a value from the $t_{n-1}$ distribution as big or bigger in absolute value than the absolute value of the observed $t$ ratio
Computer programs can compute the $P$-value given $t$ and $n$.⁶ R can compute all probabilities or critical values of interest. See the help files for pt,pnorm,pf,pchisq.
- don’t say “$P <$ something” but instead $P=$ something
In the old days tables were used to provide critical values of $t$, i.e., a value $c$ of $t$ such that Prob$[|t| > c] = \alpha$ for “nice” $\alpha$ such as 0.05, 0.01.
Denote the critical value by $t_{n-1;1-\alpha/2}$ for a 2-tailed setup
For large $n$ (say $n \geq 500$) and $\alpha=0.05$, this value is almost identical to the value from the normal distribution, 1.96
Example: We want to test if the mean tumor volume is 190 mm$^3$ in a population with melanoma, $H_0: \mu = 190$ versus $H_1: \mu \neq 190$. \[\begin{array}{c} \bar{x}=181.52, s=40, n=100, \mu_{0}=190 \\ t = \frac{181.52 - 190}{40/\sqrt{100}} = -2.12 \\ t_{99,.975}=1.984 \rightarrow \mathrm{reject~at~arbitrary~\alpha=.05~if~using~Neyman-Pearson~paradigm} \\ P=0.037 \end{array}\]

⁶ R has the function pt for the cumulative distribution function for the $t$ distribution, so the 2-tailed $P$-value would be obtained using 2*(1-pt(abs(t),n-1)).

Code

xbar  <- 181.52
s     <- 40
n     <- 100
mu0   <- 190
tstat <- (xbar - mu0) / (s / sqrt(n))
pval  <- 2 * (1 - pt(abs(tstat), n - 1))
c(tstat=tstat, pval=pval)

      tstat        pval 
-2.12000000  0.03650607

5.6.2 Bayesian Methods

All aspects of Bayesian probabilistic inference follow from the general form of Bayes’ rule allowing for the response $Y$ to be continuous:
The probability density function for the unknown parameters given the data and prior is proportional to the density function for the data multiplied by the density function for the prior⁷
The Bayesian counterpart to the frequentist $t$-test-based approach can use the same model
But brings extra information for the unknowns—$\mu, \sigma$—in the form of prior distributions for them
Since the raw data have a somewhat arbitrary scale, one frequently uses nearly flat “weakly-informative” priors for these parameters
- But it would be easy to substitute a prior for $\mu$ that rules out known-to-be impossible values or that assumes very small or very large $\mu$ are very unlikely
Note: the assumption of a Gaussian distribution for the raw data $Y$ is a strong one
- Bayes makes it easy to relax the normality assumption but still state the inference in familiar terms (e.g., about a population mean)⁸
- See later for an example
For now assume that normality is known to hold, and use relatively uninformative priors

⁷ When $Y$ is a discrete categorical variable we use regular probabilities instead of densities in Bayes’ formula. The density at $x$ is the limit as $\epsilon \rightarrow 0$ of the probability of the variable being in the interval $[x, x + \epsilon]$ divided by $\epsilon$.

⁸ Some analysts switch to trimmed means in the presence of outliers but it is hard to interpret what they are estimating.

In most Bayesian analyses that follow we use the R brms package by Paul-Christian Bürkner that is based on the general Stan system because brms is easy to use, makes good default choices for priors, and uses the same notation as used in frequentist models in Rfootnote{Thanks to Nathan James of the Vanderbilt Department of Biostatistics for providing brms code for examples in this chapter.}.

Except for the one-sample proportion example in the next section, our Bayesian calculations are general and do not assume that the posterior distribution has an analytic solution. Statistical samplers (Markov chain Monte Carlo, Gibbs, and many variants) can sample from the posterior distribution only by knowing the part of Bayes’ formula that is the simple product of the data likelihood and the prior, without having to integrate to get a marginal distribution⁹. We are generating 4000 samples from the posterior distribution (4000 random draws) in this chapter. When high precision of posterior probabilities is required, one can ensure that all probability calculations have a margin of simulation error of $< 0.01$ by using 10,000 samples, or margin of error $< 0.005$ by taking 40,000 draws¹⁰.

⁹ The simple product is proportional to the correct posterior distribution and just lacks a normalizing constant that makes the posterior probabilities integrate to 1.0.

¹⁰ These numbers depend on the sampler having converged and the samples being independent enough so that the number of draws is the effective sample size. Stan and brms have diagnostics that reveal the effective sample size in the face of imperfect posterior sampling.

Example

To introduce the Bayesian treatment for the one-sample continuous $Y$ problem consider the following made-up dataset that has an “outlier”:

98 105 99 106 102 97 103 132

Start with regular frequentist analysis
Most interested in confidence interval
Also test $H_{0}: \mu = 110$

Five measures of dispersion are computed¹¹.

¹¹ In a standard normal(0,1) distribution, the measures are, respectively, 1.0, 0.8, 0.8, 1.13, and 0.67.

Code

y <- c(98, 105, 99, 106, 102, 97, 103, 132)
median(y)

[1] 102.5

Code

sd(y)

[1] 11.28526

Code

mean(abs(y - mean(y)))

[1] 6.875

Code

mean(abs(y - median(y)))

[1] 6.25

Code

GiniMd(y)   # Gini's mean difference is more robust to outliers

[1] 10.85714

Code

 # It is the mean absolute difference over all pairs of observations
median(abs(y - median(y)))   # Even more robust, not as efficient

[1] 3.5

Code

t.test(y, mu=110)


    One Sample t-test

data:  y
t = -1.1905, df = 7, p-value = 0.2727
alternative hypothesis: true mean is not equal to 110
95 percent confidence interval:
  95.81528 114.68472
sample estimates:
mean of x 
   105.25

Now consider a Bayesian counterpart. A one-sample problem is a linear model containing only an intercept, and the intercept represents the overall unknown data-generating (population) mean of $Y$. In R an intercept-only model has ~ 1 on the right-hand side of the model formula.

For the prior distribution for the mean we assume a most likely value (mean of distribution, since symmetric) of 150 and a standard deviation of 50, indicating the unknown mean is likely to be between 50 and 250. This is a weakly informative prior.

Code

# Tell brms/Stan/rstanarm to use all available CPU cores except for 1
options(mc.cores=parallel::detectCores() - 1)

Code

require(brms)
d <- data.frame(y)
priormu <- prior(normal(150,50), class='Intercept')
h <- function() brm(y ~ 1, family=gaussian, prior=priormu, data=d, seed=1)
f <- runifChanged(h, priormu, d)

See which prior distributions are being assumed. The data model is Gaussian. Also display parameter distribution summaries. Estimate represents the mean of the posterior distribution.

Code

prior_summary(f)

                prior     class coef group resp dpar nlpar lb ub  source
      normal(150, 50) Intercept                                     user
 student_t(3, 0, 5.2)     sigma                             0    default

Code

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ 1 
   Data: d (Number of observations: 8) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   105.59      4.12    97.43   114.28 1.00     2272     1849

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma    11.27      2.94     7.19    18.72 1.00     2342     2165

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

draws <- as.data.frame(f)
mu    <- draws$b_Intercept
sigma <- draws$sigma
length(mu)

[1] 4000

Note how Bayes provides uncertainty/precision information about $\sigma$
Credible intervals are quantiles of posterior samples, e.g. we can duplicate the above credible interval (CI) for $\mu$ (the intercept) using the following:

Code

quantile(mu, c(.025, 0.975))

     2.5%     97.5% 
 97.43253 114.28276

Compare 0.95 credible interval with the 0.95 frequentist confidence interval above
Compare posterior means for $\mu, \sigma$ with the point estimates from the traditional analysis
- Posterior modes (the most likely values) may be more relevant¹². These are printed below along with sample estimates. The posterior mode is computed by fitting a nonparametric kernel density estimator to the posterior draws for the parameter of interest and finding the peak. The number of draws needed to compute the mode accurately is more than the number we are using here.

¹² The sample mean is the maximum likelihood estimate (MLE) and the sample standard deviation is, except for a factor of $\frac{n-1}{n}$, the maximum likelihood estimate. When priors are flat, MLEs equal posterior modes

Code

# Function to compute posterior mode given draws
pmode <- function(x) {
  z <- density(x)
  z$x[which.max(z$y)]
  }
n <- length(y)
c(mean(y), sd(y), sd(y)*sqrt((n-1)/n))

[1] 105.25000  11.28526  10.55640

Code

c(pmode(mu), pmode(sigma))

[1] 105.841080   9.851647

Instead of a hypothesis test we compute direct evidence for $\mu$ exceeding 110
- posterior probability that $\mu > 110$ given the data and priors is approximated (to within simulation error) by the proportion of posterior draws for $\mu$ for which the value of $\mu$ exceeded 110
- define a “probability operator” P that is just the proportion

Code

P <- mean
P(mu > 110)   # compare to 1-tailed p-value: 0.136

[1] 0.121

Here are posterior distributions for the two parameters along with convergence diagnostics

Code

plot(f)

Figure 5.2: Posterior distributions for $\mu$ and $\sigma$ using a normal data model with weak priors (left panels), and convergence diagnostics for posterior sampling (right panels)

Normality is a strong assumption
Heavy tails can hurt validity of the estimate of the mean, uncertainty intervals, and $P$-values
Easy to allow for heavy tails by adding a single parameter to the model¹³
Assume the data come from a $t$ distribution with unknown degrees of freedom $\nu$
- Use of $t$ for the raw data distribution should not be confused with the use of the same $t$ distribution for computing frequentist probabilities about the sample mean.
- John Kruschke has championed this Bayesian $t$-test approach in BEST (Bayesian Estimation Supersedes the $t$-Test)
$\nu > 20 \rightarrow$ almost Gaussian
Have a prior for $\nu$ that is a gamma distribution with parameters $\alpha=2, \beta=0.1$ with $\nu$ constrained to be $> 1$
Prior $P(\nu > 20)$ is (in R code) pgamma(20,2,0.1,lower.tail=FALSE) which is 0.41. So our prior probability that normality approximately holds is a little less than $\frac{1}{2}$.

¹³ One could use a data distribution with an additional parameter that allows for asymmetry of the distribution. This is not addressed here.

Code

h <- function() brm(y ~ 1, family=student, prior=priormu, data=d, seed=2)
g <- runifChanged(h, priormu, d)

Code

prior_summary(g)

                prior     class coef group resp dpar nlpar lb ub  source
      normal(150, 50) Intercept                                     user
        gamma(2, 0.1)        nu                             1    default
 student_t(3, 0, 5.2)     sigma                             0    default

Code

 Family: student 
  Links: mu = identity; sigma = identity; nu = identity 
Formula: y ~ 1 
   Data: d (Number of observations: 8) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept   103.40      3.63    97.68   111.44 1.00     1373     1231

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     7.71      3.39     2.69    15.01 1.00     1181     1958
nu       11.60     11.30     1.43    42.90 1.00     1324     1776

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

draws <- as.data.frame(g)
mu    <- draws$b_Intercept
P(mu > 110)

[1] 0.04

Code

plot(g)
nu    <- draws$nu
snu   <- c(mean(nu), median(nu), pmode(nu))
ssd   <- c(sd(y), pmode(sigma), pmode(draws$sigma))
fm    <- function(x) paste(round(x, 2), collapse=', ')

Figure 5.3: Posterior distributions for $\mu, \sigma, \nu$ for a data model that is the $t$ distribution with $\nu$ d.f. (left panels), and convergence diagnostics (right panels)

Posterior means discount the high outlier a bit and shows lower chance of $\mu > 110$
The credible interval for $\mu$ is significantly smaller than the confidence interval due to allowance for heavy tails
The posterior mean, median, and mode of $\nu$ (11.6, 7.45, 3.52) provide evidence that the data come from a distribution that is heavier tailed than the normal. Posterior $P(\nu > 20) =$ 0.18 which is essentially the probability of approximate normality under a $t$ data model.
The traditional SD estimate, posterior median SD assuming normal $Y$, and posterior median SD assuming $Y$ has a $t$-distribution are respectively 11.29, 9.85, 6.9. The ordinary SD is giving too much weight to the high outlier.

This latter analysis properly penalizes for not knowing normality in $Y$ to hold (i.e., not knowing the true value of $\nu$).

Decoding the Effect of the Prior

One can compute the effective number of observations added or subtracted by using, respectively, an optimistic or a skeptical prior
This is particularly easy when the prior is normal, the data model is normal, and the data variance is known
Let
$\mu_{0} =$ prior mean
$\sigma_{0} =$ prior standard deviation
$\overline{Y} =$ sample mean of the response variable
$\sigma =$ population SD of $Y$
$n =$ sample size for $\overline{Y}$
Then the posterior variance of $\mu$ is
$\sigma^{2}_{p} = \frac{1}{\frac{1}{\sigma^{2}_{0}} + \frac{1}{\frac{\sigma^{2}}{n}}}$
The posterior mean is
$\mu_{p} = \frac{\sigma^{2}_{p}}{\sigma^{2}_{0}} \times \mu_{0} + \frac{\sigma^{2}_{p}}{\frac{\sigma^{2}}{n}} \times \overline{Y}$
For a given $\overline{Y}$, sample size $n$ and posterior $P(\mu > 110)$, what sample size $m$ would yield the same posterior probability under a flat prior?
For a flat prior $\sigma_{0} = \infty$ so the posterior variance of $\mu$ would be $\frac{\sigma^{2}}{n}$ and the posterior mean would be $\overline{Y}$
Equating $P(\mu > 110)$ for the weakly informative prior at sample size $n$ to that from noninformative prior at a sample size $m$ is the same as setting $P(\mu < 110)$ to be equal
The probability that a Gaussian random variable with mean $a$ and standard deviation $b$ is less than 110 is $\Phi(\frac{110-a}{b})$ where $\Phi$ is the standard normal cumulative distribution function
Solve for $m$ for a variety of $n, \overline{Y}, \sigma$

Code

calc <- function(n, m, ybar, sigma, mu0, sigma0, cutoff) {
  vpost1 <- 1 / ((1 / (sigma0^2)) + 1 / ((sigma^2) / n))
  mupost1 <- mu0 * vpost1 / (sigma0 ^ 2) + ybar * vpost1 / ((sigma ^ 2) / n)
  vpost2 <- (sigma ^ 2) / m
  (110 - mupost1) / sqrt(vpost1) - (110 - ybar) / sqrt(vpost2)
}

# For a given n, ybar, sigma solve for m to get equal post. prob.
m <- function(n, ybar, sigma, sigma0=50, cutoff=110)
  round(uniroot(calc, interval=c(n / 2, 2 * n), n=n, ybar=ybar, sigma=sigma,
          mu0=150, sigma0=sigma0, cutoff=cutoff)$root, 1)
# Make sure that m=n when prior variance is huge
m(8, 100, 10, sigma0=50000)

[1] 8

Code

# From here on use the original prior standard deviation of 50
# Now compute m when n=8 using our original prior assuming sigma=10 ybar=105
m(8, 105, 10)

[1] 7.3

Code

# What about for two other sigmas
m(8, 105, 5)

[1] 7.8

Code

m(8, 105, 15)

[1] 6.6

Code

# What about two other ybar
m(8, 80, 10)

[1] 7.9

Code

m(8, 120, 10)

[1] 8.3

Code

# What if n were larger
m(15, 105, 10)

[1] 14.3

Code

m(30, 105, 10)

[1] 29.3

Code

m(300, 105, 10)

[1] 299.3

Code

m(3000, 105, 10)

[1] 2999.3

Typical effect of prior in this setting is like adding or dropping one observation
Simpler example where $\overline{Y}$ is irrelevant: $\sigma=1, \mu_{0}=0$ and posterior probability of interest is $P(\mu > 0)$
Vary the prior variance $\sigma^{2}_{0}$ and for each prior variance compute the prior probability that $\mu > 1$ (with prior centered at zero, lower variance $\rightarrow$ lower $P(\mu > 1)$)

Code

z    <- NULL
n    <- seq(1, 100, by=2)
vs   <- c(.05, .1, .25, .5, 1, 4, 100)
# Pad first part of text to 6 characterts so columns will line up
pad  <- function(x) substring(paste(x, '    '), 1, 6)
vlab <- paste0('v=', pad(vs), ' P(mu>1)=',
                format(1 - pnorm(sqrt(1 / vs)), digits=3, scientific=1))
names(vlab) <- as.character(vs)
for(v in vs)
  z <- rbind(z,
             data.frame(v, n, y = 0.5 * (n + sqrt(n^2 + 4 * n / v)) - n))
z$v <- factor(z$v, as.character(vs), vlab)
ggplot(z, aes(x=n, y=y, color=v)) + geom_line() +
  xlab('Sample Size With No Skepticism') +
  ylab('Extra Subjects Needed Due to Skepticism') +
  theme(legend.text = element_text(family='mono'))

Figure 5.4: Effect of discounting by a skeptical prior with mean zero and variance v: the increase needed in the sample size in order to achieve the same posterior probability of $\mu > 0$ as with the flat (non-informative) prior. Prior variance v=0.05 corresponds to a very skeptical prior, given almost no chance to a large $\mu$ $(\mu > 1)$.

For typical priors the effect of skepticism is like dropping 3 observations, and this is not noticeable when $n > 20$¹⁴
Note: For a specific problem one can just run the brm function again with a non-informative prior to help judge the effect of the informative prior

¹⁴ See here for the derivation.

5.6.3 Power and Sample Size

Bayesian power is the probability of hitting a specified large posterior probability and is usually obtained by simulation
Frequentist power, though often arbitrary (and inviting optimistic single point values for an effect to detect) is easier to compute and will provide a decent approximation for Bayesian sample size calculations when the main prior is weakly informative
Frequentist power $\uparrow$ when
- allow larger type I error ($\alpha$; trade-off between type I and II errors)
- true $\mu$ is far from $\mu_{0}$
- $\sigma \downarrow$
- $n \uparrow$
Power for 2-tailed test is a function of $\mu, \mu_{0}$ and $\sigma$ only through $|\mu - \mu_{0}| / \sigma$
Sample size to achieve $\alpha=0.05$, power $=0.9$ is approximately \[n = 10.51 [\frac{\sigma}{\mu-\mu_{0}}]^{2}\]
Some power calculators are here
PS program by Dupont and Plummer (see also G*Power)
Example: The mean forced expiratory volume (FEV) in a population of asthmatics is 2.5 liters per second and the population standard deviation is assumed to be 1. Determine the number of subjects needed if a new drug is expected to increase FEV to 3.0 liters per second ($\alpha = .05, \beta = 0.1$)

\[\begin{array}{c} \mu=2.5, \mu_0=3, \sigma = 1 \\ n = 10.51 \left[\frac{1}{3-2.5}\right]^{2} = 42.04 \end{array}\]

Rounding up, we need 43 subjects to have 0.9 power (42 subjects would have less than 0.9 power)

Code

sigma <- 1
mu    <- 2.5
mu0   <- 3
n     <- 10.51 * (1 / (mu - mu0)) ^ 2
# General formula for approximate power of 1-sample t-test
# Approximate because it uses the normal distribution throughout,
# not the t distribution
alpha <- 0.05
power <- 0.9
delta <- mu - mu0
za    <- qnorm(1 - alpha / 2)
zb    <- qnorm(power)
n     <- ((za + zb) * sigma / delta) ^ 2
c(alpha=alpha, power=power, delta=delta, za=za, zb=zb, n=n)

    alpha     power     delta        za        zb         n 
 0.050000  0.900000 -0.500000  1.959964  1.281552 42.029692

A slightly more accurate estimate can be obtained using the $t$ distribution, requiring iterative calculations programmed in R packages such as pwr.

Code

# Make sure pwr package is installed
require(pwr)
pwr.t.test(d = delta / sigma, power = 0.9, sig.level = 0.05, type='one.sample')


     One-sample t test power calculation 

              n = 43.99548
              d = 0.5
      sig.level = 0.05
          power = 0.9
    alternative = two.sided

5.6.4 Confidence Interval

A 2-sided $1-\alpha$ confidence interval for $\mu$ under normality for $Y$ is \[\bar{x} \pm t_{n-1,1-\alpha/2} \times \mathrm{se}\] The $t$ constant is the $1-\alpha/2$ level critical value from the $t$-distribution with $n-1$ degrees of freedom. For large $n$ it equals 1.96 when $\alpha=0.05$.

ABD 6.6

An incorrect but common way to interpret this is that we are 0.95 confident that the unknown $\mu$ lies in the above interval. The exact way to say it is that if we were able to repeat the same experiment 1000 times and compute a fresh confidence interval for $\mu$ from each sample, we expect 950 of the samples to actually contain $\mu$. The confidence level is about the procedure used to derive the interval, not about any one interval. Difficulties in providing exact but still useful interpretations of confidence intervals has driven many people to Bayesian statistics.

The 2-sided $1-\alpha$ CL includes $\mu_{0}$ if and only if a test of $H_{0}: \mu=\mu_{0}$ is not rejected at the $\alpha$ level in a 2-tailed test.

If a 0.95 CL does not contain zero, we can reject $H_0: \mu = 0$ at the $\alpha = 0.05$ significance level

$1 - \alpha$ is called the confidence level or confidence coefficient, but it is better to refer to compatibility

5.6.5 Sample Size for a Given Precision

There are many reasons for preferring to run estimation studies instead of hypothesis testing studies. A null hypothesis may be irrelevant, and when there is adequate precision one can learn from a study regardless of the magnitude of a $P$-value. A nearly universal property of precision estimates is that, all other things being equal, increasing the sample size by a factor of four improves the precision by a factor of two.

ABD 14.7

May want to estimate $\mu$ to within a margin of error of $\pm \delta$ with 0.95 confidence¹⁵
“0.95 confident” that a confidence interval includes the true value of $\mu$
If $\sigma$ were known but we still used the $t$ distribution in the formula for the interval, the confidence interval would be $\bar{x}\pm \delta$ where \[\delta = \frac{t_{n-1,1-\alpha/2}\sigma}{\sqrt{n}}\]
Solving for $n$ we get \[n = \left[\frac{t_{n-1,1-\alpha/2} \sigma}{\delta}\right]^{2}\]
If $n$ is large enough and $\alpha=0.05$, required $n=3.84[\frac{\sigma}{\delta}]^{2}$
Example: if want to be able to nail down $\mu$ to within $\pm 1$mmHg when the patient to patient standard deviation in blood pressure is 10mmHg, $n = 384$

¹⁵ Adcock (1997) presents both frequentist and Bayesian methods and for precision emphasizes solving for $n$ such that the probability of being within $\epsilon$ of the true value is controlled, as opposed to using confidence interval widths explicitly.

Code

sigma <- 10
delta <- 1
3.84 * (sigma / delta) ^ 2

[1] 384

Advantages of planning for precision rather than power¹⁶
- do not need to select a single effect to detect
- many studies are powered to detect a miracle and nothing less; if a miracle doesn’t happen, the study provides no information
- planning on the basis of precision will allow the resulting study to be interpreted if the $P$-value is large, because the confidence interval will not be so wide as to include both clinically significant improvement and clinically significant worsening

¹⁶ See Borenstein M: J Clin Epi 1994; 47:1277-1285.

5.7 One Sample Method for a Probability

ABD 7

5.7.1 Frequentist Methods

Estimate a population probability $p$ with a sample estimate $\hat{p}$
Data: $s$ “successes” out of $n$ trials
Maximum likelihood estimate of $p$ is $\hat{p} = \frac{s}{n}$ (value of $p$ making the data most likely to have been observed) = Bayesian posterior mode under a flat prior
Approximate 2-sided test of $H_{0}: p=p_{0}$ obtained by computing a $z$ statistic
A $z$-test is a test assuming that the statistic has a normal distribution; it is a $t$-test with infinite ($\infty$) d.f. \[z = \frac{\hat{p} - p_{0}}{\sqrt{p_{0}(1-p_{0})/n}}\]
The $z$-test follows the same general form as the $t$-test \[z = \frac{\textrm{estimate - hypothesized value}}{\textrm{standard deviation of numerator}}\]
Example: $n=10$ tosses of a coin, 8 heads; $H_{0}$: coin is fair ($p = p_{0}=\frac{1}{2}$) \[z = \frac{.8 - .5}{\sqrt{(\frac{1}{2})(\frac{1}{2})/10}} = 1.897\]
$P\textrm{-value} = 2 \times$ area under a normal curve to the right of $1.897 = 2 \times 0.0289 = 0.058$ (this is also the area under the normal curve to the right of $1.897$ + the area to the left of $-1.897$)

Code

p  <- 0.8
p0 <- 0.5
n  <- 10
z  <- (p - p0) / sqrt(p0 * (1 - p0) / n)
c(z=z, Pvalue=2 * pnorm(-abs(z)))

         z     Pvalue 
1.89736660 0.05777957

Approximate probability of getting 8 or more or 2 or fewer heads if the coin is fair is 0.058. This is indirect evidence for fairness, is not the probability the null hypothesis is true, and invites the “absence of evidence is not evidence of absence” error.
Use exact methods if $p$ or $n$ is small

Code

# Pr(X >= 8) = 1 - Pr(X < 8) = 1 - Pr(X <= 7)
pbinom(2, 10, 0.5) + 1 - pbinom(7, 10, 0.5)

[1] 0.109375

Code

# Also compute as the probability of getting 0, 1, 2, 8, 9, 10 heads
sum(dbinom(c(0, 1, 2, 8, 9, 10), 10, 0.5))

[1] 0.109375

Confidence interval for $p$
- Wilson’s method without continuity correction is recommended
- for 8 of 10 heads here is the Wilson interval in addition to the exact binomial and normal approximation. The Wilson interval is the most accurate of the three (Andersson, 2023)

Code

binconf(8, 10, method='all')

           PointEst     Lower     Upper
Exact           0.8 0.4439045 0.9747893
Wilson          0.8 0.4901625 0.9433178
Asymptotic      0.8 0.5520820 1.0479180

5.7.2 Bayesian

The single unknown probability $p$ for binary $Y$-case is a situation where there is a top choice for prior distributions
Number of events follows a binomial distribution with parameters $p, n$
The beta distribution is for a variable having a range of $[0, 1]$, has two parameters $\alpha, \beta$ for flexibility, and is conjugate to the binomial distribution
- The posterior distribution is simple: another beta distribution
The mean of a beta-distributed variable is $\frac{\alpha}{\alpha + \beta}$ and its standard deviation is
$\sqrt{\frac{\alpha\beta}{\alpha+\beta+1}} / (\alpha + \beta)$
Using a beta prior is equivalent to adding $\alpha$ successes and $\beta$ failures to the data
Posterior distribution of $p$ is beta$(s + \alpha, n - s + \beta)$
A uniform prior sets $\alpha = \beta = 1$
In general an intuitive way to set the prior is to preset the mean then solve for the parameters that force $P(p > c) = a$ for given $c$ and $a$
For the 10 coin toss example let’s set the prior mean of P$($heads$)=\frac{1}{2}$ and $P(p > 0.8) = 0.05$, i.e. only a 0.05 chance that the probability of heads exceeds 0.8. Here are $\alpha$ and $\beta$ satisfying these requirements:

Code

# alpha/(alpha + beta) = 0.5 -> alpha=beta
alphas <- seq(0, 20, length=100000)
exceedanceProb <- 1 - pbeta(0.8, alphas, alphas)
alpha <- alphas[which.min(abs(exceedanceProb - 0.05))]
beta  <- alpha
alpha.post <- 8 + alpha
beta.post  <- 10 - 8 + beta

The solution is $\alpha=\beta=3.26$
With the data $s=8$ out of $n=10$ the posterior distribution is beta (11.26, 5.26)

Code

p <- seq(0, 1, length=300)
prior <- dbeta(p, alpha, beta)
post  <- dbeta(p, alpha.post, beta.post)
curves <- list(Prior=list(x=p, y=prior),
               Posterior=list(x=p, y=post))
labcurve(curves, pl=TRUE, xlab='p', ylab='Probability Density')

Figure 5.5: Prior and posterior distribution for the unknown probability of heads. The posterior is based on tossing 8 heads out of 10 tries.

From the posterior distribution we can get the credible interval, the probability that the probability of heads exceeds $\frac{1}{2}$, and the probability that the probability of heads is within $\pm 0.05$ of fairness:

Code

qbeta(c(.025, .975), alpha.post, beta.post)

[1] 0.4464199 0.8751905

Code

1 - pbeta(0.5, alpha.post, beta.post)

[1] 0.9374297

Code

pbeta(0.55, alpha.post, beta.post) - pbeta(0.45, alpha.post, beta.post)

[1] 0.100748

Unlike the frequentist analysis, these are direct measures that are easier to interpret
Instead of just providing evidence against a straw-man assertion, Bayesian posterior probabilities measure evidence in favor (as well as against) all possible assertions

5.7.3 Power and Sample Size

Power $\uparrow$ as $n \uparrow$, $p$ departs from $p_{0}$, or $p_{0}$ departs from $\frac{1}{2}$
$n \downarrow$ as required power $\downarrow$ or $p$ departs from $p_{0}$

5.7.4 Sample Size for Given Precision

Approximate 0.95 CL: $\hat{p} \pm 1.96 \sqrt{\hat{p}(1-\hat{p})/n}$
Assuming $p$ is between 0.3 and 0.8, it would not be far off to use the worst case standard error $\sqrt{1/(4n)}$ when planning
$n$ to achieve a margin of error $\delta$ in estimating $p$: \[n = \frac{1}{4}\left[\frac{1.96}{\delta}\right]^{2} = \frac{0.96}{\delta^{2}}\]
Example: $\delta=.1 \rightarrow n=96$ to achieve a margin of error of $\pm 0.1$ with 0.95 confidence

Code

nprec <- function(delta) round(0.25 * (qnorm(0.975) / delta) ^ 2)
nprec(0.1)

[1] 96

To achieve a margin of error of $\pm 0.05$ even in the worst case where $p=0.5$ one needs $n=384$.

For Bayesian precision calculations we can solve $n$ that achieves a given width of, say, a 0.95 credible interval:

Use a flat beta prior, i.e., with $\alpha=\beta=1$
Posterior distribution for $p$ is $\textrm{beta}(s + 1, n - s + 1)$
Compute CI half-widths for varying $n$ for selected values of $s$

Code

n <- seq(10, 400, by=5)
k <- c(1/8, 1/4, 1/2)
ck <- paste0('s=', c('1/8', '1/4', '1/2'), ' n')
r <- NULL
for(i in 1 : 3) {
   ciu <- qbeta(0.975, k[i] * n + 1, n - k[i] * n + 1)
   cil <- qbeta(0.025, k[i] * n + 1, n - k[i] * n + 1)
     r   <- rbind(r, data.frame(n, y=(ciu - cil) / 2, s=ck[i]))
}
r$s <- factor(r$s, ck)
ggplot(r, aes(x=n, y, col=s)) + geom_line() +
  ylab('Precision') +
  geom_abline(slope=0, intercept=c(0.05, 0.1), alpha=I(0.25))

Figure 5.6: Half-widths of 0.95 credible intervals for $p$ using a flat prior

As with confidence intervals, precision is worst when $\frac{1}{2}$ of observations are successes, so often best to plan on worst case
Same sample size needed as with frequentist (since prior is flat)
Easy to modify for other priors

To put this in the context of relative errors, suppose that one wants to estimate the odds that an event will occur, to within a certain multiplicative margin of error (MMOE) with 0.95 confidence using frequentist methods. What is the MMOE as a function of the unknown $p$ when $n=384$? The standard error of the log odds is approximately $\sqrt{\frac{1}{n p (1 - p)}}$, and the half-width of a 0.95 confidence interval for the log odds is approximately 1.96 times that. Fix $n=384$ and vary $p$ to get the MMOE that is associated with the same sample size as a universal absolute margin of error of 0.05.

Code

p <- seq(0.01, 0.99, length=200)
mmoe <- exp(1.96 / sqrt(384 * p * (1 - p)))
plot(p, mmoe, type='l', xlab='Unknown Probability p', ylab='MMOE')
minor.tick()

Figure 5.7: Multiplicative margin of error in estimating odds when $n=384$ and the margin of error in estimating the absolute probability is $\leq 0.05$.

5.8 Paired Data and One-Sample Tests

ABD 11

To investigate the relationship between smoking and bone mineral density, Rosner presented a paired analysis in which each person had a nearly perfect control which was his or her twin
Data were normalized by dividing differences by the mean density in the twin pair (need to check if this normalization worked)
Note: It is almost never appropriate to compute mean percent change (a 100% increase is balanced by a 50% decrease) but we plunge ahead anyway
Computed density in heavier smoking twin minus density in lighter smoking one
Mean difference was $-5\%$ with se=$2.0\%$ on $n=41$
The $t$ statistic we’ve been using works here, once within-pair differences are formed
$H_{0}:$ mean difference between twins is zero ($\mu_{0} = 0$) \[\begin{array}{c} t_{40} = \frac{\bar{x} - \mu_{0}}{se} = -2.5 \\ P = 0.0166 \end{array}\]

Code

xbar  <- -5
se    <- 2
n     <- 41
mu0   <- 0
tstat <- (xbar - mu0) /se
pval  <- 2 * (1 - pt(abs(tstat), n - 1))
c(tstat=tstat, Pvalue=pval)

      tstat      Pvalue 
-2.50000000  0.01662035

5.9 Two Sample Test for Means

ABD 12

Two groups of different patients (unpaired data)
Much more common than one-sample tests
As before we are dealing for now with parametric tests assuming the raw data arise from a normal distribution
We assume for now that the two groups have the same spread or variability in the distributions of responses¹⁷

¹⁷ Rosner covers the unequal variance case very well. As nonparametric tests have advantages for comparing two groups and are less sensitive to the equal spread assumption, we will not cover the unequal variance case here.

5.9.1 Frequentist $t$-Test

Test whether population 1 has the same mean as population 2
Example: pop. 1=all patients with a certain disease if given the new drug, pop. 2=standard drug
$H_{0}: \mu_{1}=\mu_{2}$ (this can be generalized to test $\mu_{1}=\mu_{2}+\delta$, i.e., $\mu_{1}-\mu_{2}=\delta$). The quantity of interest or QOI is $\mu_{1}-\mu_{2}$
2 samples, of sizes $n_{1}$ and $n_{2}$ from two populations
Two-sample (unpaired) $t$-test assuming normality and equal variances—recall that if we are testing against an $H_0$ of no effect, the form of the $t$ test is \[t = \frac{\mathrm{point~estimate~of~QOI}}{\mathrm{se~of~numerator}}\]
Point estimate QOI is $\bar{x}_{1} - \bar{x}_{2}$
As with 1-sample $t$-test the difference in the numerator is judged with respect to the precision in the denominator (combination of sample size and subject-to-subject variability); like a signal:noise ratio
Variance of the sum or difference of two independent means is the sum of the variance of the individual means
This is $\frac{\sigma^{2}}{n_{1}} + \frac{\sigma^{2}}{n_{2}} = \sigma^{2}[\frac{1}{n_{1}} + \frac{1}{n_{2}}]$
Need to estimate the single $\sigma^{2}$ from the two samples
We use a weighted average of the two sample variances: \[s^{2} = \frac{(n_{1}-1)s^{2}_{1} + (n_{2}-1)s^{2}_{2}}{n_{1}+n_{2}-2}\]
True standard error of the difference in sample means: $\sigma \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}$
Estimate: $s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}$, so \[t = \frac{\bar{x}_{1} - \bar{x}_{2}}{s \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}}\]
d.f. is the sum of the individual d.f., $n_{1}+n_{2}-2$, where the $-2$ is from our having to estimate the center of two distributions
If $H_{0}$ is true $t$ has the $t_{n_{1}+n_{2}-2}$ distribution
To get a 2-tailed $P$-value we compute the probability that a value from such a distribution is farther out in the tails of the distribution than the observed $t$ value is (we ignore the sign of $t$ for a 2-tailed test)
Example: $n_{1}=8, n_{2}=21, s_{1}=15.34, s_{2}=18.23, \bar{x}_{1}=132.86, \bar{x}_{2}=127.44$ \[\begin{array}{ccc} s^{2} &=& \frac{7(15.34)^{2}+20(18.23)^{2}}{7+20} = 307.18 \\ s &=& \sqrt{307.18} = 17.527 \\ se &=& 17.527 \sqrt{\frac{1}{8}+\frac{1}{21}} = 7.282 \\ t &=& \frac{5.42}{7.282} = 0.74 \end{array}\]

on 27 d.f.

$P = 0.463$ (see R code below)
Chance of getting a difference in means as larger or larger than 5.42 if the two populations really have the same means is 0.463
$\rightarrow$ little evidence for concluding the population means are different

Code

n1    <- 8;         n2 <- 21
xbar1 <- 132.86; xbar2 <- 127.44
s1    <- 15.34;     s2 <- 18.23
s     <- sqrt(((n1 - 1) * s1 ^ 2 + (n2 - 1) * s2 ^ 2) / (n1 + n2 - 2))
se    <- s * sqrt(1 / n1 + 1 / n2)
tstat <- (xbar1 - xbar2) / se
pval  <- 2 * (pt(- abs(tstat), n1 + n2 - 2))
c(s=s, se=se, tstat=tstat, Pvalue=pval)

         s         se      tstat     Pvalue 
17.5265589  7.2818380  0.7443176  0.4631137

5.9.2 Confidence Interval

Assuming equal variances

\[\bar{x}_{1}-\bar{x}_{2} \pm t_{n_{1}+n_{2}-2,1-\alpha/2} \times s \times \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}\]

is a $1-\alpha$ CL for $\mu_{1}-\mu_{2}$, where $s$ is the pooled estimate of $\sigma$, i.e., $s \sqrt{\ldots}$ is the estimate of the standard error of $\bar{x}_{1}-\bar{x}_{2}$

5.9.3 Probability Index

The probability index or concordance probability $c$ for a two-group comparison is a useful unitless index of an effect. If $Y_1$ and $Y_2$ represent random variables from groups 1 and 2 respectively, $c = \Pr(Y_{2} > Y_{1}) + \frac{1}{2} \Pr(Y_{2} = Y_{1})$ which is the probability that a randomly chosen subject from group 2 has a response exceeding that of a randomly chosen subject from group 1, plus $\frac{1}{2}$ the probability they are tied. $c$ can be estimated nonparametrically by the fraction of all possible pairs of subjects (one from group 1, one from group 2) with $Y_{2} > Y_{1}$ plus $\frac{1}{2}$ the fraction of ties. This is related to Somers’ $D_{yx}$ rank correlation coefficient by $D_{yx} = 2 \times (c - \frac{1}{2})$.

When $Y_i$ comes from a normal distribution with mean $\mu_i$ and variance $\sigma^2$ (equal variance case), the probability of a tie is zero, the SD of $Y_{2}-Y_{1}$ is $\sqrt{2}\sigma$, and we have the following after letting the difference in population means $\mu_{2} - \mu_{1}$ be denoted by $\delta$. \[\begin{array}{lcl} c = \Pr(Y_{2} > Y_{1}) &=& \Pr(Y_{2} - Y_{1} > 0) \\ &=& \Pr(Y_{2} - Y_{1} - \delta > - \delta)\\ &=& \Pr(\frac{Y_{2} - Y_{1} - \delta}{\sqrt{2}\sigma} > - \frac{\delta}{\sqrt{2}\sigma}) \\ &=& 1 - \Phi(- \frac{\delta}{\sqrt{2}\sigma}) = \Phi(\frac{\delta}{\sqrt{2}\sigma}) \end{array}\]

where $\Phi(x)$ is the standard normal cumultive distribution function.

When the variances are unequal, the variance of $Y_{2} - Y_{1}$ is $\sigma_{2}^{2} + \sigma_{1}^{2}$ so the SD is $\sqrt{\sigma_{2}^{2} + \sigma_{1}^{2}} = \omega = \sigma_{1}\sqrt{1 + r}$ where $r$ is the variance ratio $\frac{\sigma_{2}^{2}}{\sigma_{1}^{2}}$. The probability index $c$ is $\Phi(\frac{\delta}{\omega})$. $r$ doesn’t matter if there is no group mean difference ($\delta=0$). But when $\delta \neq 0$ the probability index depends on the variance ratio.

5.9.4 Bayesian $t$-Test

As with one-sample Bayesian $t$-test we relax the normality assumption by using a $t$ distribution for the raw data (more robust analysis)
A linear model for the two-sample $t$ test is
$Y = \mu_{0} + \delta[\textrm{group B}] + \epsilon$
where $\mu_{0}$ (the intercept) is the unknown group A mean, $\delta$ is the B-A difference in means, and $\epsilon$ is the irreducible residual (assuming we have no covariates to adjust for)
Assume:
- $\epsilon$ has a $t$-distribution with $\nu$ d.f.
- $\nu$ has a prior that allows the data distribution to be anywhere from heavy-tailed to normal
- $\mu_{0}$ has a fairly wide prior distribution (no prior knowledge may be encoded by using a flat prior)
- $\delta$ has either a prior that is informed by prior reliable research or biological knowledge, or has a skeptical prior
- residual variance $\sigma^2$ is allowed to be different for groups A and B, with a normal prior on the log of the variance ratio that favors equal variance but allows the ratio to be different from 1 (but not arbitrarily different) Note: by specifying independent priors for $\mu_{0}$ and $\delta$ we
- induce correlations in priors for the two means
- assume we know more about $\delta$ than about the individual true per-group means To specify the SD for the prior for the log variance ratio:
Let $r$ be unknown ratio of variances
Assume $P(r > 1.5) = P(r < \frac{1}{1.5}) = \gamma$
The required SD is $\frac{\log{1.5}}{-\Phi^{-1}(\gamma)}$
For $\gamma=0.15$ the required SD is:

Code

log(1.5) / -qnorm(0.15)

[1] 0.3912119

We are assuming a mean of zero for $\log(r)$ so we favor $r=1$ and give equal chances to ratios smaller or larger than 1.

5.9.5 Power and Sample Size

Consider the frequentist model
Power increases when
- $\Delta = |\mu_{1}-\mu_{2}| \uparrow$
- $n_{1} \uparrow$ or $n_{2} \uparrow$
- $n_{1}$ and $n_{2}$ are close
- $\sigma \downarrow$
- $\alpha \uparrow$
Power depends on $n_{1}, n_{2}, \mu_{1}, \mu_{2}, \sigma$ approximately through \[\frac{\Delta}{\sigma \sqrt{\frac{1}{n_{1}}+\frac{1}{n_{2}}}}\]
Note that when computing power using a program that asks for $\mu_{1}$ and $\mu_{2}$ you can just enter 0 for $\mu_{1}$ and enter $\Delta$ for $\mu_{2}$, as only the difference matters
Often we estimate $\sigma$ from pilot data, and to be honest we should make adjustments for having to estimate $\sigma$ although we usually run out of gas at this point (Bayes would help)
Use the R pwr package, or the power calculator at statpages.org or PS
Example:
Get a pooled estimate of $\sigma$ using $s$ above (17.527)
Use $\Delta=5, n_{1}=n_{2}=100, \alpha=0.05$

Code

delta <- 5
require(pwr)
pwr.t2n.test(n1=100, n2=100, d=delta / s, sig.level = 0.05)


     t test power calculation 

             n1 = 100
             n2 = 100
              d = 0.2852813
      sig.level = 0.05
          power = 0.5189751
    alternative = two.sided

Sample size depends on $k = \frac{n_{2}}{n_{1}}$, $\Delta$, power, and $\alpha$
Sample size $\downarrow$ when
- $\Delta \uparrow$
- $k \rightarrow 1.0$
- $\sigma \downarrow$
- $\alpha \uparrow$
- required power $\downarrow$
An approximate formula for required sample sizes to achieve power $=0.9$ with $\alpha=0.05$ is \[\begin{array}{ccc} n_{1} &=& \frac{10.51 \sigma^{2} (1+\frac{1}{k})}{\Delta^{2}} \\ n_{2} &=& \frac{10.51 \sigma^{2} (1+k)}{\Delta^{2}} \end{array}\]

Exact calculations assuming normality

Code

pwr.t.test(d = delta / s, sig.level = 0.05, power = 0.8)


     Two-sample t test power calculation 

              n = 193.8463
              d = 0.2852813
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

If used same total sample size of 388 but did a 2:1 randomization ratio to get 129 in one group and 259 in the other, the power is less

Code

pwr.t2n.test(n1 = 129, n2 = 259, d = delta / s, sig.level = 0.05)


     t test power calculation 

             n1 = 129
             n2 = 259
              d = 0.2852813
      sig.level = 0.05
          power = 0.7519836
    alternative = two.sided

What is the difference in means that would yield a 2-sided $P$-value of exactly 0.05 for a two-sample $t$-test with normality and equal variances when the sample sizes are both equal to $\frac{n}{2}$? We solve for $\hat{\Delta} = \bar{x}_{1} - \bar{x}_{2}$ such that $t_{n-2,1-\alpha/2} = \frac{\hat{\Delta}}{2s/\sqrt{n}}$, giving \[\hat{\Delta} = \frac{2 \times t_{n-2,1-\alpha/2} \times s}{\sqrt{n}}\] For total sample sizes of 10, 50, and 100, the “magic” values of the observed difference are the following multiples of the observed standard deviation $s$:

Code

n <- c(10, 50, 100)
tcrit <- qt(0.975, n-2)
2 * tcrit / sqrt(n)

[1] 1.4584451 0.5686934 0.3968935

Note that these thresholds are independent of the power and the effect size used in the power calculation.

5.9.6 Sample Size for a Given Precision

To design a study that will nail down the estimate of $\mu_{1}-\mu_{2}$ to within $\pm \delta$ with $1-\alpha$ confidence when $n_{1}=n_{2}=n$, and when $n$ is large enough so that the critical value $t_{2n-2,1-\alpha/2}$ may be approximated by the critical value from the normal distribution, say $z$ ($z=1.96$ when $\alpha=0.05$): \[n = 2\left[\frac{z \sigma}{\delta}\right]^{2}\] When $\alpha=0.05$, $n = 7.68 [\frac{\sigma}{\delta}]^{2}$

5.9.7 Equating Margin of Error to Detectable Difference

Suppose that a two-arm study is designed to detect a difference $\Delta$ in two means with power 0.9 at the $\alpha=0.05$ level. For large enough sample sizes, the margin of error for estimating the true difference in means for that study will be $\delta = \Delta\sqrt{\frac{7.68}{21.02}} = 0.604\Delta$.

5.9.8 Checking Assumptions of the $t$-test

Comprehensive assessment of all assumptions except independence of observations:
- Compute the two empirical cumulative distribution functions
- Transform each using the inverse normal $z$ transformation
- See of both curves are linear (checks normality assumption) and parallel (equal variance assumption)¹⁸
Box plot (one box for each of 2 groups): look for equal spread (IQR)
Informally compare $s_{1}$ and $s_{2}$¹⁹
With the Bayesian $t$-test the only important assumption to check is symmetry of the data distribution²⁰

¹⁸ There are formal tests of normality but in smaller samples these have insufficient power to detect important non-normality.

¹⁹ Rosner 8.6 shows how to make formal comparisons, but beware that the variance ratio test depends on normality, and it may not have sufficient power to detect important differences in variances.

²⁰ Since we are allowing for heavier tails than the Gaussian distribution by using a $t$ distribution for the raw data

5.10 Comprehensive Example: Two sample $t$-test

5.10.1 Study Description

Assess the effect of caffeine (vs. placebo) on muscle metabolism, measured by the respiratory exchange ratio (RER; ratio of CO$_{2}$ produced to O$_{2}$ consumed)
Treatment was randomized to 18 subjects; parallel group RCT
Goal: study effect of caffeine on RER
Must take log of RER to have a symmetric measure
- $\mu_0$ = mean log RER for placebo
- $\mu_1$ = mean log RER for caffeine
- Fold-change effect: $\exp(\mu_1 - \mu_0)$
- Estimate $\mu_{1} - \mu_{0}$
- $H_0: \mu_0 = \mu_1$
- $H_1: \mu_0 \neq \mu_1$
Note: a good statistician will take such ratios with a grain of salt; need to verify that the meaning of the ratio is independent of O$_{2}$

Dataset was from a study done by the University of Bristol UWE; web site no longer available

5.10.2 Power and Sample Size

Suppose that a pilot study or previous published research estimated $\sigma = 0.1$ for log RER
Effect size $\Delta$ is on the log RER scale
Anti-log to get effect in terms of fold change
Determine the number of subjects needed (in each group) for several value of effect size $\Delta$ ($\Delta = |\mu_1 - \mu_0|$) in order to have 0.9 power with $\alpha = 0.05$

Code

require(pwr)
s <- 0.1
fc <- c(1.1, 1.15, 1.2, 1.25, 1.5)
n <- integer(length(fc))
i <- 0
for(foldchange in fc) {
  i <- i + 1
  n[i] <- ceiling(pwr.t.test(d=log(foldchange) / s, power=0.9)$n)
}
data.frame('Fold Change'=fc, Delta=round(log(fc), 3), 'N per group'=n,
           check.names=FALSE)

  Fold Change Delta N per group
1        1.10 0.095          25
2        1.15 0.140          12
3        1.20 0.182           8
4        1.25 0.223           6
5        1.50 0.405           3

If caffeine modifies RER by a factor of 1.15, by enrolling 12 subjects in each group we will have 0.9 power to detect an effect
For $n=12$ per group the margin of error for estimating $\Delta$ at the 0.95 level is given below
See Section 5.9.6
This is anti-logged to obtain the multiplicative margin of error for estimating the caffeine:placebo ratio of RERs

Code

z <- qnorm(0.975); n <- 12
# This is approximate; use z <- qt(0.975, 12+12-2) for accuracy
moe <- z * s * sqrt(2 / n)
mmoe <- exp(moe)
c('Margin of Error'=moe, 'Multiplicative Margin of Error'=mmoe)

               Margin of Error Multiplicative Margin of Error 
                    0.08001519                     1.08330353

5.10.3 Collected Data

Code

tx <- factor(c(rep('placebo', 9), rep('caffeine', 9)), c('placebo', 'caffeine'))
rer <- c(105, 119, 100, 97, 96, 101, 94, 95, 98,
         96, 99, 94, 89, 96, 93, 88, 105, 88) / 100
d <- data.frame(subject=1:18, tx, rer, logrer=log(rer))
print(d, digits=3, row.names=FALSE)

 subject       tx  rer   logrer
       1  placebo 1.05  0.04879
       2  placebo 1.19  0.17395
       3  placebo 1.00  0.00000
       4  placebo 0.97 -0.03046
       5  placebo 0.96 -0.04082
       6  placebo 1.01  0.00995
       7  placebo 0.94 -0.06188
       8  placebo 0.95 -0.05129
       9  placebo 0.98 -0.02020
      10 caffeine 0.96 -0.04082
      11 caffeine 0.99 -0.01005
      12 caffeine 0.94 -0.06188
      13 caffeine 0.89 -0.11653
      14 caffeine 0.96 -0.04082
      15 caffeine 0.93 -0.07257
      16 caffeine 0.88 -0.12783
      17 caffeine 1.05  0.04879
      18 caffeine 0.88 -0.12783

Code

require(ggplot2)
ggplot(d, aes(x=tx, y=rer, group=tx)) +
  geom_boxplot(col='lightyellow1', alpha=.3, width=.3) +
  geom_dotplot(binaxis='y', stackdir='center', position='dodge') +
  stat_summary(fun.y=median, geom="point", col='red', shape=5, size=3) +
  xlab('') + ylab('RER') + coord_flip()

Figure 5.8: Data for two-sample RCT for effect of caffeine on respiratory exchange ratio. Diamonds depict medians.

5.10.4 Frequentist $t$-Test

To demonstrate difficulties in checking model assumptions with small $n$, consider the comprehensive approach by checking for linearity and parallelism (if equal variance assumption is used) of $z$-transformed empirical cumulative distributions.

Code

Ecdf(~ log(rer), groups=tx, fun=qnorm, data=d,
     xlab='log(RER)', ylab='Inverse Normal ECDF')

Figure 5.9: Stratified empirical cumulative distribution functions by treatment for checking all assumptions of the two-sample $t$-test. ECDFs are inverse normal transformed.

It is more accepted in practice to now use the form of the $t$-test that does not assume equal variances in the two independent groups. The unequal-variance $t$-test is used here. Note that to compute a decent approximation to the $P$-value requires the use of a “trick” d.f. when looking up against a $t$ distribution.

Code

ttest <- t.test(log(rer) ~ tx, data=d)
# Note that for the CI t.test is using caffeine as the reference group
ttest


    Welch Two Sample t-test

data:  log(rer) by tx
t = 2.0622, df = 15.337, p-value = 0.05655
alternative hypothesis: true difference in means between group placebo and group caffeine is not equal to 0
95 percent confidence interval:
 -0.002027639  0.130381419
sample estimates:
 mean in group placebo mean in group caffeine 
           0.003115688           -0.061061202

Interpretation
- Subjects given caffeine have on average a log RER that is 0.064 lower (0.95 CI: [-0.13, 0.002]) than individuals given placebo
- By anti-logging these 3 numbers we get the fold change scale:
  - Median²¹ fold change is 0.94
  - 0.95 CI for caffeine:placebo fold change: $[0.878, 1.002]$

²¹ If the log ratio has a normal distribution, the log ratio has the same mean as the median and its median anti-logged value is the anti-log of the mean (=median) log difference. The mean on the anti-logged scale is a more complex function that involves SD of log RER.

5.10.5 Bayesian $t$-Test

The R brms package makes it easy to specify an unequal variance model, because it allows one to specify a separate model for the log of the standard deviation. $\log(\sigma)$ can even depend on continuous covariates!²² brms models standard deviation parameters on the log scale. As before, analysis is on the log(RER) scale.

²² Thanks to Nathan James of the Vanderbilt Department of Biostatistics for providing R code for this section, and for providing the explanation for the output of the prior_summary function.

Code

require(brms)
# set priors

# flat (non-informative) prior for intercept
pr0 <- set_prior("", class="Intercept")

# normal(0,3) prior for difference in log ratios
pr1 <- set_prior("normal(0,0.25)", class="b", coef="txcaffeine")

# normal(0,0.3912) for log SD ratio
pr2 <- set_prior("normal(0,0.3912)", class="b", coef="txcaffeine",
                 dpar="sigma")

# fit model for two groups assuming Y follows a t distribution so less
# sensitive to outliers
# Each group has different mean and SD but same df
# The prior for the log SD for the reference group is scale t with 3 d.f.
# Prior for nu is gamma(2, 0.1)
# The bf function is used to create a compound R model formula, here
# for the mean model and the log sigma model

g <- function()
       brm(bf(log(rer) ~ tx, sigma ~ tx), data=d, family=student,
           prior=c(pr0,pr1,pr2), seed=1203)
f <- runifChanged(g, d, pr0, pr1, pr2)

There are 5 parameters in this model (2 for the regression on student-$t$ mean, 2 for the regression on student-$t$ scale ($\sigma$), and 1 for student-$t$ d.f. $\nu$). The output from prior_summary() shows all the parameters and parameter classes that can be assigned. Lines 1 & 3 in the output below are the classes of all non-intercept regression coefficients for the student-$t$ mean and student-$t$ scale, respectively. When there are multiple coefficients it is often convenient to specify a prior for all parameters in a class rather than for each individual parameter. For example, in a model with 10 covariates using set_prior("normal(0,1)", class="b") would give each of the 10 corresponding coefficient parameters a standard normal prior. For our model the class priors are superseded by the individual parameter priors.

Code

# Show priors in effect
prior_summary(f)

                prior     class       coef group resp  dpar nlpar lb ub  source
               (flat)         b                                         default
       normal(0,0.25)         b txcaffeine                                 user
               (flat)         b                       sigma             default
     normal(0,0.3912)         b txcaffeine            sigma                user
               (flat) Intercept                                            user
 student_t(3, 0, 2.5) Intercept                       sigma             default
        gamma(2, 0.1)        nu                                    1    default

Code

# model summary
# Note: sigma is on log scale, use exp(sigma) to get to natural scale
# Intercept is mean for placebo
# sigma_Intercept is log(sd) for placebo
# txcaffeine is change in mean for caffeine compared to placebo
# sigma_txcaffeine is change in log(sd) for caffeine compared to placebo
f

 Family: student 
  Links: mu = identity; sigma = log; nu = identity 
Formula: log(rer) ~ tx 
         sigma ~ tx
   Data: d (Number of observations: 18) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                 Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept           -0.00      0.02    -0.05     0.05 1.00     3726     2989
sigma_Intercept     -2.77      0.30    -3.40    -2.20 1.00     2900     2244
txcaffeine          -0.06      0.03    -0.12     0.01 1.00     3937     2703
sigma_txcaffeine    -0.06      0.29    -0.61     0.53 1.00     4037     2527

Further Distributional Parameters:
   Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
nu    18.25     13.97     2.51    54.80 1.00     2815     2043

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Code

# plot kernel density est and trace plots for posterior parameters
plot(f)

Code

# posterior predictive check
pp_check(f)

Code

# conditional effects
plot(conditional_effects(f), points = TRUE)

Code

# posterior parameter samples
p <- as.data.frame(f)
meanplacebo <- p[, 'b_Intercept']
delta       <- p[, 'b_txcaffeine']
sdplacebo   <- exp(p[, 'b_sigma_Intercept'])
sdratio     <- exp(p[, 'b_sigma_txcaffeine'])
nu          <- p[, 'nu']

# histogram of posterior distribution of difference in mean log RER
hist(delta, nclass=50, main='')

Code

# Posterior density for caffeine:placebo fold change in RER
plot(density(exp(delta)), xlab='Fold Change in RER', main='')
abline(v=1, col=gray(0.85))

Code

# Posterior density of SD ratio
plot(density(sdratio), main='', xlab='SD Ratio')
abline(v=1, col=gray(0.85))

Code

# Posterior prob that difference in means < 0
# Recall that the P operator was defined previously
# (based on mean of logical or 0/1 values = proportion positive =
#  posterior probability to within simulation error
# This is the same as P(fold change < 1)
P(delta < 0)

[1] 0.959

Code

P(exp(delta) < 1)

[1] 0.959

Code

# Prob that caffeine results in a physiologically noticeable response
P(exp(delta) < 0.95)

[1] 0.5795

Code

# Prob that caffeine and placebo have similar response
P(exp(delta) > 0.975 & exp(delta) < 1/0.975)

[1] 0.14225

Code

# Compute posterior probability of approximate normality
P(nu > 20)

[1] 0.353

Note that the 2-tailed $P$-value of 0.057 may tempt bright-line threshold advocates to conclude nothing more than insufficient evidence to reject the assumption that caffeine does not modify RER (or worse yet to just declare an “insignificant” result or even worse that caffeine does not modify RER). The Bayesian result shows that under a fairly skeptical prior for log RER one would do well to play the odds in betting on caffeine having an effect.

There is little evidence to support an assumption of normality of log RER within a treatment group.

Now compare other results results with the frequentist analysis.

Code

means <- with(d, tapply(logrer, tx, mean))
sds   <- with(d, tapply(logrer, tx, sd))
a <- c(means[1],    pmode(meanplacebo), mean(meanplacebo), median(meanplacebo))
b <- c(sds[1],      pmode(sdplacebo),   mean(sdplacebo),   median(sdplacebo))
w <- c(diff(means), pmode(delta),     mean(delta),     median(delta))
z <- c(sds[2] / sds[1], pmode(sdratio), mean(sdratio), median(sdratio))
x <- rbind(a, b, w, z)
colnames(x) <- c('Sample', 'Posterior Mode', 'Posterior Mean',
                 'Posterior Median')
rownames(x) <- c('Placebo mean', 'Placebo SD', 'Caffeine - Placebo Mean',
                 'Caffeine / Placebo SD')
round(x, 2)

                        Sample Posterior Mode Posterior Mean Posterior Median
Placebo mean              0.00          -0.01           0.00            -0.01
Placebo SD                0.07           0.06           0.07             0.06
Caffeine - Placebo Mean  -0.06          -0.06          -0.06            -0.06
Caffeine / Placebo SD     0.81           0.87           0.98             0.94

Code

# 0.95 credible interval for delta
quantile(delta, c(0.025, .975))

        2.5%        97.5% 
-0.124563952  0.008186283

Code

# 0.95 confidence interval for delta
rev(- ttest$conf.int)   # negate since t.test used different reference

[1] -0.130381419  0.002027639

For an excellent tutorial on the use of brms for a two-sample $t$-test see bit.ly/brms-t by Matti Vuorre

5.11 The Problem with Hypothesis Tests and $P$-values Revisited

5.11.1 Hypothesis Testing

Existence of ESP is a hypothesis
Assessing effects of drugs, procedures, devices involves estimation
Many studies powered to detect huge effect
If effect is not huge, no information from study

5.11.2 $P$-Values

ABD 6.2

Only provide evidence against a null hypothesis, never evidence for something
Probability of a statistic more impressive as yours if $H_0$ true
Not a probability of an effect or difference (same problem with sensitivity and specificity)
No conclusion possible from large $P$-values
Cannot conclude clinical relevance from small $P$
Adjustment of $P$-values for multiple tests is controversial and there is insufficient consensus on how to choose an adjustment method
Declaring a result as “significant” or “non-significant” is completely arbitrary and has come to mean almost nothing
- They rely on completely arbitrary $P$-value cutoffs such as 0.05
- American Statistical Association is on record recommending not using any cutoff or words like significant: bit.ly/asa-p bit.ly/asa-p2

5.11.3 How Not to Present Results

ABD 6.2

$P=0.02$ — let’s put this into clinical practice ignoring the drug’s cost or clinical effectiveness
$P=0.4$ — this drug does not kill people
$P=0.2$ but there is a trend in favor of our blockbuster drug
The observed difference was 6mmHg and we rejected $H_0$ so the true effect is 6mmHg.
The proportion of patients having adverse events was 0.01 and 0.03; the study wasn’t powered to detect adverse event differences so we present no statistical analysis
The reduction in blood pressure was 6mmHg with 0.95 C.L. of [1mmHg, 11mmHg]; the drug is just as likely to only reduce blood pressure by 1mmHg as it is by 6mmHg.
The serum pH for the 15 dogs was $7.3 \pm 0.1$ (mean $\pm$ SE instead of SD or IQR)

5.11.4 How to Present Results

ABD 6.2

Estimates should be accompanied by uncertainty intervals or posterior distributions
Confidence limits can be computed without regard to sample size or power
A computed value from a sample is only an estimate of the population value, whether or not you reject $H_0$
Best to think of an estimate from a study as a fuzz, not a point
To present variability of subjects, use SD or IQR, not SE (SE is the precision of the mean of subjects)
If you must use $P$-values, provide the $P$-value to 3 significant digits and don’t declare results as significant or no significant difference
See http://bit.ly/datamethods-freq-results for some guidelines for presenting frequentist results, and fharrell.com/post/bayes-freq-stmts for examples of Bayesian vs. frequentist summaries

5.12 Study Design Considerations

The majority of studies phrased as hypothesis testing experiments are actually estimation studies, so it is usually preferred to based sample size justifications on precision (margin of error). Whether using effect sizes in power calculations or margins of error in precision calculations, the quantity of interest should be taken on the original dependent variable scale or a transformation of it such as odds or hazard.

5.12.1 Sizing a Pilot Study

Frequently, pilot studies are used to obtain estimates of variability that allow the sample sized to be calculated for a full study. With a continuous response variable, one can think of the adequacy of the sample size in terms of the fold change or multiplicative margin of error (MMOE) in the estimate $s$ of the population standard deviation $\sigma$.

When a sample of size $n$ is drawn from a normal distribution, a $1 - \alpha$ two-sided confidence confidence interval for the unknown population variance $\sigma^2$ is given by \[ \frac{n-1}{\chi^{2}_{1-\alpha/2,n-1}} s^{2} < \sigma^{2} < \frac{n-1}{\chi^{2}_{\alpha/2,n-1}} s^{2},\] where $s^2$ is the sample variance and $\chi^{2}_{\alpha,n-1}$ is the $\alpha$ critical value of the $\chi^2$ distribution with $n-1$ degrees of freedom. The MMOE for estimating $\sigma$ is \[ \sqrt{\max(\frac{\chi^{2}_{1-\alpha/2,n-1}}{n-1}, \frac{n-1}{\chi^{2}_{\alpha/2,n-1}})}\]

Code

n    <- 10:300
low  <- sqrt((n - 1) / qchisq(.975, n - 1))
hi   <- sqrt((n - 1) / qchisq(.025, n - 1))
m    <- pmax(1 / low, hi)
ggplot(data.frame(n, m), aes(x=n, y=m)) + geom_line() +
  ylab('MMOE for s')
nmin <- min(n[m <= 1.2])

Figure 5.10: Multiplicative margin of error in estimating $\sigma$ as a function of sample size, with 0.95 confidence

From the above calculations, to achieve a MMOE of no worse than 1.2 with 0.95 confidence when estimating $\sigma$ requires a sample size of 70 subjects. A pilot study with $n=20$ will achieve a MMOE of 1.46 in estimating $\sigma$.

5.12.2 Problems with Standardized Effect Sizes

Many researchers use Cohen’s standardized effect sizes in planning a study. This has the advantage of not requiring pilot data. But such effect sizes are not biologically meaningful and may hide important issues Lenth (2001). Studies should be designed on the basis of effects that are relevant to the investigator and human subjects. If, for example, one plans a study to detect a one standard deviation (SD) difference in the means and the SD is large, one can easily miss a biologically important difference that happened to be much less than one SD in magnitude. Note that the SD is a measure of how subjects disagree with one another, not a measure of an effect (e.g., the shift in the mean). One way to see that standardized effect sizes are problematic is to note that if one were to make the measurements more noisy, the SD will increase and the purported clinically important difference to detect will increase proportionately.

5.12.3 Choice of Effect Size

If a study is designed to detect a certain effect size with a given power, the effect size should never be the observed effect from another study, which may be estimated with error and be overly optimistic. The effect size to use in planning should be the clinically or biologically relevant effect one would regret missing. Usually the only information from prior studies that is useful in sample size estimation are (in the case of a continuous response variable with a symmetric distribution) estimates of the standard deviation or the correlation between two measurements on the same subject measured at two different times, or (in the case of a binary or time to event outcome) event probabilities in control subjects. An excellent resource by Senn for understanding effect sizes in power calculations may be found at bit.ly/ssenn-effect.

Effect size for power/sample size calculation is never an observed effect in previous data
It is not the difference we believe is true
It is the difference you would not like to miss
Clinically relevant to patients or at least to physiology
Not greater than clinically relevant

5.12.4 Multiple Estimands and Hypotheses

In many experiments there are more than one estimand (what is to be estimated based on the question of interest) or hypothesis. Some frequentist statisticians and biomedical investigators believe that in such situations the family-wise error probability should be controlled²³. This probability is the probability of rejecting any null hypothesis given that all null hypotheses are true. One may accomplish this by testing every hypothesis at the $\alpha^{*}$ level, where the constant $\alpha^{*} < \alpha$ is chosen so that the overall type one error is $\alpha$, or one may elect to differentially “spend $\alpha$” by, for example, setting $\alpha=0.04$ for a primary hypothesis and $\alpha=0.01$ for a less important secondary analysis. Another alternative is closed testing procedures whereby later hypotheses can be tested at less stringent $\alpha$ levels as long as all earlier hypotheses were rejected. Unfortunately there is no unique path to deriving multiplicity adjustments, and they have the odd property of requiring one to be more stringent in assessing evidence for one hypothesis just because one had other hypotheses.

²³ As stated elsewhere, multiplicity adjustments are a byproduct of faults in frequentist inference and are completely arbitrary.

An alternative, and what we believe to be more reasonable, view is by Cook & Farewell (1996) who stated that if a study has more than one question and each question is to be answered on its own, there is no need for a multiplicity adjustment. This is especially true if a strong priority ordering for hypotheses is stated in advance. For example, an investigator may specify three hypotheses about efficacy of a treatment for the following endpoints in a cardiovascular trial, sorted from most important to least important: overall mortality, cardiovascular mortality, and cardiovascular death or myocardial infarction. As long as the researcher always reports all of the results in context, in this pre-specified order, each $P$-value can stand on its own.

Contrast this with an exploratory study in which the hypothesis is essentially that there exists an endpoint for which the treatment is effective. One should expect to have to employ a conservative multiplicity adjustment in that situation, e.g., Bonferroni’s inequality.

Consider a frequentist study with four efficacy endpoints and corresponding $P$-values in the given pre-specified priority order: all-cause mortality ($P=0.09$), stroke ($P=0.01$), myocardial infarction ($P=0.06$), hospitalization ($P=0.11$)

OK to quantify evidence against each of the 4 null hypotheses if all 4 reported in context, using the pre-specified order with separate interpretations
Reasonable conclusion: With the current sample size, there is little evidence to reject the supposition that treatment does not lower mortality. There is evidence against the supposition that treatment does not lower the rate of stroke. …
Contrast with a report that merely reports the stroke result, essentially saying “there exists an endpoint for which treatment is effective”
Example Bayesian statement: Treatment probably (0.92) lowers mortality, probably (0.995) lowers the rate of stroke, probably (0.96) lowers MI, and probably (0.96) lowers hospitalization (posterior probabilities of a treatment benefit are in parentheses)²⁴.
Perhaps better: create a 5 or more level ordinal endpoint and use the worst event as the response variable
- $Y=0, 1, 2, 3, 4$ corresponding to no event, hospitalization, MI, stroke, death
- Interpretation 1: treatment lowered the odds of an outcome or a worse outcome by a factor of 0.8
- Interpretation 2: chance of MI, stroke, or death with treatment estimated as 0.167 and for control as 0.2
  chances of stroke or death: 0.082, 0.1
- Bayesian probability of treatment benefit = $P(\textrm{OR} < 1) = 0.998$
See Section 3.6 for how properties of ordinal scales relate to power

²⁴ A Bayesian analysis would quickly add the (lower) probabilities that treatment lowers event rates by more than a trivial amount. Bayesian methods can also compute the probability that any 2 of the efficacy targets were achieved, and the expected number of targets hit—in this case 0.92+0.995+0.96+0.96=3.8 of 4 targets hit.

5.12.5 Study Design Big Picture

Choose the right question or estimand
Think hard about subject/animal selection criteria
Decide whether you are doing a pilot study or a more definitive study
- Pilot study is not used to nail down the effect of an intervention
- Is to show you can make the measurements, carry out the intervention, refine existing measurements, enroll enough subjects per month, etc.
- Power is not relevant
- May size the pilot study to be able to estimate something simple with precision (proportion, SD); adequate estimation of SD for continuous $Y$ is important for sample size calculation for the full study
Make sure the design can answer that question were the sample size 1,000,000 subjects
Decide whether you are required to have a fixed sample size
- If budgeting is flexible, use fully sequential design and stop when evidence is adequate²⁵
- For fixed budget/study duration be realistic about the effect size
If there is inadequate budget for detecting the minimal clinically important effect with high probability, be realistic about stating the study’s likely yield
- Best: compute the likely margin of error for the primary estimand
- Next best: compute the power that will be achieved with the limited sample size
Choose a response variable $Y$ that answers the question and has the greatest frequentist or Bayesian power (Section 3.6))
- Example: primary interest is patient’s inability to function physically on a 0-100 scale (100=bedridden)
- Some patients will be too sick to have their function assessed and some will die
- Define $Y$=0-100 overridden with 101 for too sick or 102 for death²⁶
- Analyze with the proportional odds model
- Interpretation:
  - Primary endpoint is degree of functional disability, penalized by death or being physically unable to be assessed
  - Proportional odds model provides an overall odds ratio for treatments B:A (ratio of odds that $Y \geq j$ for any $j$)
  - Model can also be used to estimate the median disability, where sickness or death will shift the median to the right a little
  - May also be summarized by estimating for each treatment the probability that a patient has level 50 functional disability or worse where “or worse” means 51-100, too sick, or dead, after estimating the overall odds ratio for treatment
Use multiple measurements over time to increase power/precision and to allow more questions to be answered
Greatest power comes from having a continuous $Y$ or ordinal $Y$ with many well-populated levels, where $Y$ is also measured at baseline and is adjusted for as a covariate in ANCOVA, allowing for a smooth nonlinear effect (without assuming the slope is 1.0 as is assumed by change-from-baseline analysis)
Never use change from baseline as the response variable except in a non-randomized pre-post design (the weakest of all designs)
If treatment is short-term and wears off, fully using each subject as her own control in a randomized crossover study may be ideal
For a parallel-group randomized study, accurately collect key baseline variables that explain outcome heterogeneity
For an observational study, accurately capture a host of baseline variables that are likely to result in adequate confounder adjustment
- don’t merely rationalize that variables available in an existing dataset are adequate
Use a research data management tool such as REDCap that allows for extensive data quality checking
Don’t forget the many subject selection, ethical, and good clinical practice issues
Recommended reading: Designing Clinical Research Hulley et al. (2013)

²⁵ Bayesian sequential designs require no penalty for infinitely many such data looks. See fharrell.com/post/bayes-seq.

²⁶ Ordinal analysis will not be affected by placing the clinical events at 1001 and 1002 or any other levels that are higher than the highest functional disability level.

5.13 One-Sample $t$-Test Revisited

5.13.1 Study Description

Compare the effects of two soporific drugs (optical isomers of hyoscyamine hydrobromide)
Crossover study
Each subject receives placebo run-in, then Drug 1, then Drug 2
Investigator may not have randomized order of treatments
Dependent variable: Number of hours of increased sleep when compared to a placebo run-in period (raw data not shown)
Drug 1 given to $n$ subjects, Drug 2 given to same $n$ subjects
Study question: Is Drug 1 or Drug 2 more effective at increasing sleep?
- $H_0: \mu_d = 0 ~~ \mathrm{where} ~~ \mu_d = \mu_1 - \mu_2$
- $H_1: \mu_d \neq 0$

5.13.2 Power and Sample Size

Pilot study or previous published research shows the standard deviation of the difference ($\sigma_d$) is $1.2$ hours
Determine the number of subjects needed for several value of effect size $\Delta$ ($\Delta = |\mu_1 - \mu_2|$)with 0.9 power, $\alpha = 0.05$

$\Delta$	0.5	1	1.5	2
$n$	62	16	8	5

If Drug 1 (or 2) increases sleep by 1.5 hours more than Drug 2 (or 1), by enrolling 8 subjects we will have 0.9 power to detect an association.
More powerful than the two sample test (need 10 subjects in each group for $\Delta = 3.0$ hours)

5.13.3 Collected Data

Here are the data for the 10 subjects. This is the R built-in dataset sleep.

Subject	Drug 1	Drug 2	Diff (2-1)
1	0.7	1.9	1.2
2	-1.6	0.8	2.4
3	-0.2	1.1	1.3
4	-1.2	0.1	1.3
5	-0.1	-0.1	0.0
6	3.4	4.4	1.0
7	3.7	5.5	1.8
8	0.8	1.6	0.8
9	0.0	4.6	4.6
10	2.0	3.4	1.4

Mean	0.75	2.33	1.58
SD	1.79	2.0	1.2

Code

drug1 <- c(.7, -1.6, -.2, -1.2, -.1, 3.4, 3.7, .8, 0, 2)
drug2 <- c(1.9, .8, 1.1, .1, -.1, 4.4, 5.5, 1.6, 4.6, 3.4)
d <- data.frame(Drug=c(rep('Drug 1', 10), rep('Drug 2', 10),
                  rep('Difference', 10)),
                extra=c(drug1, drug2, drug2 - drug1))
w <- data.frame(drug1, drug2, diff=drug2 - drug1)

ggplot(d, aes(x=Drug, y=extra)) +
  geom_boxplot(col='lightyellow1', alpha=.3, width=.5) +
  geom_dotplot(binaxis='y', stackdir='center', position='dodge') +
  stat_summary(fun.y=mean, geom="point", col='red', shape=18, size=5) +
  geom_segment(data=w, aes(x='Drug 1', xend='Drug 2', y=drug1, yend=drug2),
               col=gray(.8)) +
  geom_segment(data=w, aes(x='Drug 1', xend='Difference', y=drug1, yend=drug2 - drug1),
               col=gray(.8)) +
  xlab('') + ylab('Extra Hours of Sleep') + coord_flip()

Figure 5.11: Raw data and box plots for paired data and their paired differences, with lines connecting points from the same subject. Diamonds depict means.

5.13.4 Statistical Test

Code

with(d, t.test(drug1, drug2, paired=TRUE))


    Paired t-test

data:  drug1 and drug2
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean difference 
          -1.58

Interpretation
- A person who takes Drug 2 sleeps on average 1.58 hours longer (0.95 CI: [0.70, 2.46]) than a person who takes Drug 1

5.14 Comprehensive Example: Crossover Design and Analysis

In the previous example, it was not clear if the order of placebo, Drug 1, and Drug 2 was the same for every patient
In a crossover design, each patient receives both drugs
- Can serve as own control
- Effectively adjusts for all baseline variables without measuring them²⁷
- Order is randomized
Carryover effects
- Def: An effects that carries over from one experimental condition to another
- Need a washout period between drugs to remove carryover effects
- Time to remove carryover effects should be based on biology, not statistics
- Statistical tests for carryover effects are often not precise enough to make definitive conclusions (see example)
- The test for carryover is correlated with the overall test of efficacy
- Pre-testing for carryover then deciding whether to only use phase 1 data results in a huge inflation of type I error in the test for efficacy

²⁷ If there is no interaction between covariate and treatment order

5.14.1 Study Description

Compare the effects of two soporific drugs.
Each subject either (1) starts with Drug 1 and crosses over to Drug 2 or (2) starts with Drug 2 and crosses over to Drug 1
- No placebo run-in in this example
- Order randomly assigned
- Suitable period of time ($\sim 5$ half-lives) between drug crossovers to washout effects of previous drug
Dependent variable: Number of hours of sleep on each drug
Drug 1 given to $n$ subjects, Drug 2 given to same $n$ subjects
Study question: Is Drug 1 or Drug 2 more effective at increasing sleep?
- $H_0: \mu_d = 0 ~~ \mathrm{where} ~~ \mu_d = \mu_1 - \mu_2$
- $H_1: \mu_d \neq 0$

5.14.2 Power and Sample Size

Pilot study or previous published research shows the standard deviation of the difference ($\sigma_d$) is $1.2$ hours
Determine the number of subjects needed for several value of effect size $\Delta$ ($\Delta = |\mu_1 - \mu_2|$)with 0.9 power, $\alpha = 0.05$
Assume no carryover effects

$\Delta$	0.5	1	1.5	2
$n$	62	16	8	5

If Drug 1 (or 2) increases sleep by 1.5 hours more than Drug 2 (or 1), by enrolling 8 subjects we will have 0.9 power to detect an association.
Same power calculation as paired $t$-test

5.14.3 Collected Data

Subject	Drug 1	Drug 2	Diff (2-1)
1	8.7	9.9	1.2
2	6.4	8.8	2.4
3	7.8	9.1	1.3
4	6.8	8.1	1.3
5	7.9	7.9	0.0
6	11.4	12.4	1.0
7	11.7	13.5	1.8
8	8.8	9.6	0.8
9	8.0	12.6	4.6
10	10.0	11.4	1.4

Mean	8.75	10.33	1.58
SD	1.79	2.0	1.2

5.14.4 Statistical Tests

Code

drug1 <- c(87, 64, 78, 68, 79, 114, 117, 88, 80, 100)/10
drug2 <- c(99, 88, 91, 81, 79, 124, 135, 96, 126, 114)/10
t.test(drug1, drug2, paired=TRUE)


    Paired t-test

data:  drug1 and drug2
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean difference 
          -1.58

Interpretation
- A person who takes Drug 2 sleeps on average 1.58 hours longer (0.95 CI: [0.70, 2.50]) than a person who takes Drug 1

5.14.5 Carryover Effects

Is there any evidence for a carryover effect?
Assume that the first 5 subjects received Drug 1 first and the second 5 subjects received drug 2 first
If we assume there are no carryover effects, then the mean difference in sleep for subjects receiving drug 1 first should be the same as the mean difference for subjects receiving drug 2 first
Assessing carryover effect distorts the efficacy analysis inference
Null hypothesis is that there are no carryover effects
Can rearrange the difference data to clarify the structure

Subject	Drug 1 First	Drug 2 First
1	1.2
2	2.4
3	1.3
4	1.3
5	0.0
6		1.0
7		1.8
8		0.8
9		4.6
10		1.4

Mean	1.24	1.92
SD	0.85	1.55

For this design we might expect the variance of the differences to be the same for both orders, so we use the equal-variance $t$-test.

Code

# Unpaired t-test
t.test((drug2 - drug1)[1:5], (drug2 - drug1)[6:10], var.equal=TRUE)


    Two Sample t-test

data:  (drug2 - drug1)[1:5] and (drug2 - drug1)[6:10]
t = -0.86152, df = 8, p-value = 0.414
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.500137  1.140137
sample estimates:
mean of x mean of y 
     1.24      1.92

Interpretation
- Large $P$-value has no interpretation
- With 0.95 confidence, the carryover effect is between [-2.5 and 1.1] hours, which is not scientifically convincing either way
- In general, be very cautious when the null hypothesis is something you want to fail to reject in order to validate your analysis method
  - Tests of normality are sometimes used to validate using a parametric over a non-parametric test
  - There are also statistical tests for equal variance
  - Both tests may be unreliable and will distort the final inference that is conditional on pre-assessments being correct
As Stephen Senn has warned, be wary of doing anything about empirically quantified carryover effects, as the carryover effect estimate has a correlation of $\frac{1}{2}$ with the overall treatment effect estimate, causing the carryover test to ruin the operating characteristics of the treatment test

5.14.6 Bayesian Analysis

Reasonable to put prior knowledge on parameters, especially carryover event
Reasonable to restrict carryover effect to be less than the treatment effect
For related discussions and references for Bayesian crossover analysis see
bit.ly/datamethods-bbr7

	True state of \(H_0\)
Decision	\(H_0\) true	\(H_0\) false
Reject \(H_0\)	Type I error (\(\alpha\))	Correct
Do Not Reject \(H_0\)	Correct	Type II error (\(\beta\))

5.1 Overview

5.1.1 Central Limit Theorem

5.2 Hypotheses

5.2.1 Scientific Hypotheses and Questions

5.2.2 Statistical Hypotheses

5.3 Branches of Statistics

5.4 Errors in Hypothesis Testing; \(P\) Values

5.4.1 Problems With Type I Error

5.4.2 Misinterpretation of \(P\)-values

5.5 Interval Estimation

5.5.1 Frequentist Confidence Intervals

5.5.2 Bayesian Credible Intervals

5.6 One Sample Test for Mean

5.6.1 Frequentist Method

5.6.2 Bayesian Methods

Decoding the Effect of the Prior

5.6.3 Power and Sample Size

5.6.4 Confidence Interval

5.6.5 Sample Size for a Given Precision

5.7 One Sample Method for a Probability

5.7.1 Frequentist Methods

5.7.2 Bayesian

5.7.3 Power and Sample Size

5.7.4 Sample Size for Given Precision

5.8 Paired Data and One-Sample Tests

5.9 Two Sample Test for Means

5.9.1 Frequentist \(t\)-Test

5.9.2 Confidence Interval

5.9.3 Probability Index

5.9.4 Bayesian \(t\)-Test

5.9.5 Power and Sample Size

5.9.6 Sample Size for a Given Precision

5.9.7 Equating Margin of Error to Detectable Difference

5.9.8 Checking Assumptions of the \(t\)-test

5.10 Comprehensive Example: Two sample \(t\)-test

5.10.1 Study Description

5.10.2 Power and Sample Size

5.10.3 Collected Data

5.10.4 Frequentist \(t\)-Test

5.10.5 Bayesian \(t\)-Test

5.11 The Problem with Hypothesis Tests and \(P\)-values Revisited

5.11.1 Hypothesis Testing

5.11.2 \(P\)-Values

5.11.3 How Not to Present Results

5.11.4 How to Present Results

5.12 Study Design Considerations

5.12.1 Sizing a Pilot Study

5.12.2 Problems with Standardized Effect Sizes

5.12.3 Choice of Effect Size

5.12.4 Multiple Estimands and Hypotheses

5.12.5 Study Design Big Picture

5.13 One-Sample \(t\)-Test Revisited

5.13.1 Study Description

5.13.2 Power and Sample Size

5.13.3 Collected Data

5.13.4 Statistical Test

5.14 Comprehensive Example: Crossover Design and Analysis

5.14.1 Study Description

5.14.2 Power and Sample Size

5.14.3 Collected Data

5.14.4 Statistical Tests

5.14.5 Carryover Effects

5.14.6 Bayesian Analysis