Glossary of Statistical Terms

Vanderbilt University
School of Medicine
Department of Biostatistics

April 26, 2025

adjusting or controlling for a variable: Assessing the effect of one variable while accounting for the effect of another (confounding) variable. Adjustment for the other variable can be carried out by stratifying the analysis (especially if the variable is categorical) or by statistically estimating the relationship between the variable and the outcome and then subtracting out that effect to study which effects are “left over.” For example, in a non-randomized study comparing the effects of treatments $A$ and $B$ on blood pressure reduction, the patients’ ages may have been used to select the treatment. It would be advisable in that case to control for the effect of age before estimating the treatment effect. This can be done using a regression model with blood pressure as the dependent variable and treatment and age as the independent variables (controlling for age using subtraction) or crudely and approximately (with some residual confounding) by stratifying by deciles of age and averaging the treatment effects estimated within the deciles. Adjustment results in adjusted odds ratios, adjusted hazard ratios, adjusted slopes, etc.
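
As a minimal sketch of adjustment by subtraction, the Python below builds a small, noise-free, entirely hypothetical dataset in which older patients are more likely to receive treatment B and in which blood pressure reduction falls with age. The names `ages`, `treat`, and `y`, and the coefficients 10, 5, and 0.2, are assumptions made for illustration. The age effect is removed from both the outcome and the treatment indicator, and the treatment effect is then estimated from what is left over, which gives the same coefficient as a regression on treatment and age together.

```python
def mean(xs):
    return sum(xs) / len(xs)

def slope_intercept(x, y):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return b, my - b * mx

def residuals(x, y):
    """What is left of y after subtracting its linear relationship with x."""
    b, a = slope_intercept(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Hypothetical data: treatment B (coded 1) is given mostly to patients aged 55+
ages = list(range(30, 80))
treat = [1 if (age >= 55) != (age % 5 == 0) else 0 for age in ages]
# Assumed truth: B adds 5 mmHg of reduction; each year of age removes 0.2 mmHg
y = [10 + 5 * t - 0.2 * age for t, age in zip(treat, ages)]

# Unadjusted comparison: simple difference in group means (confounded by age)
unadjusted = (mean([yi for yi, t in zip(y, treat) if t == 1])
              - mean([yi for yi, t in zip(y, treat) if t == 0]))

# Adjusted comparison: remove the age effect from both variables, then relate the leftovers
adjusted, _ = slope_intercept(residuals(ages, treat), residuals(ages, y))

print(round(unadjusted, 2))  # 2.0
print(round(adjusted, 2))    # 5.0
```

In this constructed example the unadjusted difference understates the assumed treatment effect because the treated patients are 15 years older on average.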

allocation ratio: In a parallel-group randomized trial of two treatments, the ratio of the sample sizes of the two groups.

ANCOVA: Analysis of covariance is just multiple regression (i.e., a linear model) where one variable is of major interest and is categorical (e.g., treatment group). In classic ANCOVA there is a treatment variable and a continuous covariate used to reduce unexplained variation in the dependent variable, thereby increasing power.

ANOVA: Analysis of variance usually refers to an analysis of a continuous dependent variable where all the predictor variables are categorical. One-way ANOVA, where there is only one predictor variable (factor; grouping variable), is a generalization of the 2-sample $t$-test. ANOVA with 2 groups is identical to the $t$-test. Two-way ANOVA refers to two predictors, and if the two are allowed to interact in the model, two-way ANOVA involves cross-classification of observations simultaneously by both factors. It is not appropriate to refer to repeated measures within subjects as two-way ANOVA (e.g., treatment $\times$ time). An ANOVA table sometimes refers to statistics for more complex models, where explained variation from partial and total effects are displayed and continuous variables may be included.

artificial intelligence: Frequently confused with machine learning, AI is a procedure for flexibly learning from data, which may be built from elements of machine learning, but is distinguished by the underlying algorithms being created so that the “machine” can accept new inputs after the developer has completed the initial algorithm. In that way the machine can continue to update, refine, and teach itself. John McCarthy defined artificial intelligence as “the science and engineering of making intelligent machines.”

Bayes’ rule or theorem: $\Pr(A | B) = \frac{\Pr(B | A) \Pr(A)}{\Pr(B)}$, read as the probability that event $A$ happens given that event $B$ has happened equals the probability that $B$ happens given that $A$ has happened multiplied by the (unconditional) probability that $A$ happens and divided by the (unconditional) probability that $B$ happens. Bayes’ rule follows immediately from the law of conditional probability which states that $\Pr(A | B) = \frac{\Pr(A \mathrm{~and~} B)}{\Pr(B)}$.
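
A small worked example may help. The sketch below applies the rule to a hypothetical diagnostic test, where $A$ is "has the disease" and $B$ is "test is positive." The prevalence, sensitivity, and false positive rate are made-up numbers, and $\Pr(B)$ is obtained from the law of total probability.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Pr(A | B) from Bayes' rule, with Pr(B) computed by total probability."""
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    return p_b_given_a * p_a / p_b

# Hypothetical inputs
prevalence = 0.01           # Pr(A)
sensitivity = 0.90          # Pr(B | A)
false_positive_rate = 0.05  # Pr(B | not A)

p_disease_given_positive = posterior(sensitivity, prevalence, false_positive_rate)
print(round(p_disease_given_positive, 3))  # 0.154
```

Even with a fairly accurate test, the probability of disease given a positive result is low here because the assumed prevalence is low.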

Bayesian inference: A branch of statistics based on Bayes’ theorem. Bayesian inference doesn’t use $P$-values and generally does not test hypotheses. It requires one to formally specify a probability distribution encapsulating the prior knowledge about, say, a treatment effect. The state of prior knowledge can be specified as “no knowledge” by using a flat distribution, although this can lead to wild and nonsensical estimates. Once the prior distribution is specified, the data are used to modify the prior state of knowledge to obtain the post-experiment state of knowledge. Final probabilities computed in the Bayesian framework are probabilities of various treatment effects. The price of being able to compute probabilities about the data generating process is the necessity of specifying a prior distribution to anchor the calculations.
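
The move from prior to post-experiment knowledge can be sketched with a simple grid approximation. The example below is illustrative only: the unknown is a response probability `theta`, the data are a hypothetical 7 responders out of 10 patients, and two priors are compared, a flat one and a mildly skeptical one proportional to `theta * (1 - theta)`. The grid size and both priors are assumptions chosen for simplicity.

```python
n_grid = 1001
thetas = [i / (n_grid - 1) for i in range(n_grid)]

# Hypothetical data: 7 responders, 3 non-responders
successes, failures = 7, 3
likelihood = [t ** successes * (1 - t) ** failures for t in thetas]

def posterior_prob_above(prior, cutoff):
    """Posterior Pr(theta > cutoff) on the grid for a given (unnormalized) prior."""
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnorm)
    return sum(u for t, u in zip(thetas, unnorm) if t > cutoff) / total

flat_prior = [1.0] * n_grid
skeptical_prior = [t * (1 - t) for t in thetas]  # concentrates mass near 0.5

p_flat = posterior_prob_above(flat_prior, 0.5)
p_skeptical = posterior_prob_above(skeptical_prior, 0.5)
print(round(p_flat, 2), round(p_skeptical, 2))  # about 0.89 and 0.87
```

The output is a direct probability statement about the unknown response probability, and the skeptical prior pulls that probability modestly toward one half.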

bias: A systematic error. Examples: a miscalibrated machine that reports cholesterol too high by 20mg% on the average; a satisfaction questionnaire that leads patients to never report that they are dissatisfied with their medical care; using each patient’s lowest blood pressure over 24 hours to describe a drug’s antihypertensive properties. Bias typically pertains to the discrepancy between the average of many estimates over repeated sampling and the true value of a parameter. Therefore bias is more related to frequentist statistics than to Bayesian statistics.

big data: A dataset too large to fit on an ordinary workstation computer.

binary variable: A variable having only two possible values, usually zero and one.

bootstrap: A simulation technique for studying properties of statistics without the need to have the infinite population available. The most common use of the bootstrap involves taking random samples (with replacement) from the original dataset and studying how some quantity of interest varies. Each random sample has the same number of observations as the original dataset. Some of the original subjects may be omitted from the random sample and some may be sampled more than once. The bootstrap can be used to compute standard deviations and confidence limits (compatibility limits) without assuming a model. For example, if one took 200 samples with replacement from the original dataset, computed the sample median from each sample, and then computed the sample standard deviation of the 200 medians, the result would be a good estimate of the true standard deviation of the original sample median. The bootstrap can also be used to internally validate a predictive model without holding back patient data during model development.
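
The median example can be sketched in a few lines of Python. The data here are simulated and purely hypothetical (50 values drawn from a normal distribution with mean 120 and standard deviation 15), and the seed is arbitrary. The resampling steps follow the description above.

```python
import random
import statistics

rng = random.Random(2025)  # arbitrary seed so the sketch is repeatable
data = [rng.gauss(120, 15) for _ in range(50)]  # hypothetical original sample

n_boot = 200
boot_medians = []
for _ in range(n_boot):
    # Sample with replacement, same size as the original dataset
    resample = rng.choices(data, k=len(data))
    boot_medians.append(statistics.median(resample))

# Standard deviation of the 200 medians estimates the standard deviation
# (standard error) of the original sample median
boot_sd = statistics.stdev(boot_medians)
print(round(boot_sd, 2))
```

Some observations appear several times in a given resample and others not at all, which is what lets the spread of the resampled medians mimic sampling variability.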

calibration: Reliability of predicted values, i.e., extent to which predicted values agree with observed values. For a predictive model a calibration curve is constructed by relating predicted to observed values in some smooth manner. The calibration curve is judged against a $45^{\circ}$ line. Miscalibration could be called bias. Calibration error is frequently assessed for predicted event probabilities. If, for example, it rained 0.4 of the time when the predicted probability of rain was 0.4, the rain forecast is perfectly calibrated. There are specific classes of calibration. Calibration in the large refers to being accurate on the average. If the average daily rainfall probability in your region was $\frac{1}{7}$ and it rained on $\frac{1}{7}$ of the days each year, the probability estimate would be perfectly calibrated in the large. Calibration in the small refers to each level of predicted probability being accurate. On days in which the rainfall probability was $\frac{1}{5}$ did it rain $\frac{1}{5}$ of the time? One could go further and define calibration in the tiny as the extent to which a given outcome probability for a given type of subject (say a 35 year old male) is accurate. Or is a 0.4 rainfall forecast accurate in the spring?
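
The distinction between calibration in the large and in the small can be illustrated with a toy check. The forecasts and outcomes below are invented for illustration, and grouping by exact forecast value stands in for the smooth calibration curve that would be used in practice.

```python
from collections import defaultdict

def calibration_summary(forecasts, outcomes):
    """Return (mean forecast, observed frequency, observed frequency per forecast level)."""
    mean_pred = sum(forecasts) / len(forecasts)
    observed = sum(outcomes) / len(outcomes)
    by_level = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        by_level[f].append(o)
    in_small = {f: sum(v) / len(v) for f, v in by_level.items()}
    return mean_pred, observed, in_small

# Hypothetical forecaster 1: accurate at every forecast level
good_f = [0.2] * 10 + [0.4] * 10
good_y = [1] * 2 + [0] * 8 + [1] * 4 + [0] * 6

# Hypothetical forecaster 2: right on average, wrong at each level
poor_f = [0.1] * 10 + [0.5] * 10
poor_y = [1] * 3 + [0] * 7 + [1] * 3 + [0] * 7

good = calibration_summary(good_f, good_y)
poor = calibration_summary(poor_f, poor_y)
print(good[2])  # observed rain frequency at each forecast level
print(poor[2])
```

Both forecasters have a mean forecast and an observed rain frequency of 0.3, so both are calibrated in the large. Only the first is calibrated in the small: the second saw rain on 0.3 of the days at both its 0.1 and its 0.5 forecast levels.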

case-control study: A study in which subjects are selected on the basis of their outcomes, and then exposures (treatments) are ascertained. For example, to assess the association between race and operative mortality one might select all patients who died after open heart surgery in a given year and then select an equal number of patients who survived, matching on several variables other than race so as to equalize (control for) their distributions between the cases and non-cases.

categorical variable: A variable having only certain possible values for which there is no logical ordering of the values. Also called a nominal, polytomous, discrete categorical variable or factor.

causal inference: The study of how/whether outcomes vary across levels of an exposure when that exposure is manipulated. Done properly, the study of causal inference typically concerns itself with defining target parameters, precisely defining the conditions under which causality may be inferred, and evaluation of sensitivity to departures from such conditions. In a randomized and properly blinded experiment in which all experimental units adhere to the experimental manipulation called for in the design, most experimentalists are willing to make a causal interpretation of the experimental effect without further ado. In more complex situations involving observational data or imperfect adherence, things are more nuanced. See Pearl Sections 2.1-2.3 for more information (Pearl, Judea, 2009).

censoring: When the response variable is the time until an event, subjects not followed long enough for the event to have occurred have their event times censored at the time of last follow-up. This kind of censoring is right censoring. For example, in a follow-up study, patients entering the study during its last year will be followed a maximum of 1 year, so they will have their time until event censored at 1 year or less. Left censoring means that the time to the event is known to be less than some value. In interval censoring the time is known to be in a specified interval. Most statistical analyses assume that what causes a subject to be censored is independent of what would cause her to have an event. If this is not the case, informative censoring is said to be present. For example, if a subject is pulled off of a drug because of a treatment failure, the censoring time is indirectly reflecting a bad clinical outcome and the resulting analysis will be biased.

classification and classifier: When considering patterns of associations between inputs and categorical outcomes, classification is the act of assigning a predicted outcome on the basis of all the inputs. A classifier is an algorithm developed for classification. Classification is a forced choice and the result is not a probability. It could be deemed a premature decision, or a decision based on optimizing an implicit or explicit utility/loss/cost function. When the utility function is not specified by the end-user, classification may not be consistent with good decision making. Classification ignores close calls. Logistic regression is frequently mislabeled as a classifier; it is a direct probability estimator. The term classification is frequently used improperly when the outcome variable is categorical (i.e., represents classes) and a probability estimator is used to analyze the data to make probability predictions. The correct term for this situation is prediction.

clinical trial: Though almost always used to denote a randomized experiment, a clinical trial may be any type of prospective study of human subjects in which therapies or clinical strategies are compared. Treatments may be assigned to individual patients or to groups, the latter including cluster randomized trials. For a randomized clinical trial or randomized controlled trial (RCT), the choice and timing of treatments is outside of the control of the physician and patient but is (usually) set in advance by a randomization device. This may be used for traditional parallel group designs, or using a randomized crossover design. Randomization is used to remove the connection between patient characteristics and treatment assignment so that treatment selection bias due to both known and unknown (at the time of randomization) factors is avoided. RCTs do not require representative patients but do require representative treatment effects. If a patient characteristic interacts with the treatment effect, and a wide spectrum of patients over the distribution of the interacting factor is not included in the trial, the trial results may not apply to patients outside (with respect to the interacting factor) of those studied. For example, if age is an effect modifier for treatment and a trial included primarily patients aged 40-65, the relative benefit of a treatment for those older than 65 may not be estimable. RCTs may involve more than two therapies. The “controlled” in randomized controlled trial often refers to having a reference treatment arm that is a placebo or standard of care. But the comparison group can be anything including active controls (as in head-to-head comparisons of drugs). The RCT is the gold standard for establishing causality. An RCT may be mechanistic as in a pure efficacy study, a policy or strategy study, or an effectiveness study. The latter pertains to the attempt to mimic clinical practice in the field.

cohort study: A study in which all subjects meeting the entry criteria are included. Entry criteria are defined at baseline, e.g., at time of diagnosis or treatment.

comparative trial: Trials with two or more treatment groups, designed with sufficient power or precision to detect relevant clinical differences in treatment efficacy among the groups.

conditioning: Conditioning on something means to assume it is true, or in more statistical terms, to set its value to some constant or assume it belongs to some set of values. We might say that the mean systolic blood pressure conditional on the person being female is 125mmHg, which is concisely stated as “of females, the mean SBP is 125mmHg.” Conditioning statements are “if statements.” The notation used for conditioning in statistics is to place the qualifying condition after a vertical bar. See marginalization.

conditional probability: The probability of the veracity of a statement or of an event $A$ given that a specific condition $B$ holds or that an event $B$ has already occurred, denoted by $P(A|B)$. This is a probability in the presence of knowledge captured by $B$. For example, if the condition $B$ is that a person is male, the conditional probability is the probability of $A$ for males. It could be argued that there is no such thing as a completely unconditional probability. In this example one is implicitly conditioning on humans even if not considering the person’s sex.

confidence limits: To say that the 0.95 confidence limits for an unknown quantity are $[a, b]$ means that 0.95 of similarly constructed confidence limits in repeated samples from the same population would contain the unknown quantity. Very loosely speaking one could say that she is 0.95 “confident” that the unknown value is in the interval $[a, b]$, although in the frequentist school unknown parameters are constants, so they are either inside or outside intervals and there are no probabilities associated with these events. The interpretation of a single confidence interval in frequentist statistics is highly problematic, and in fact the word confidence is poorly defined and was just an attempt to gloss over this problem. Note that a confidence interval should be symmetric about a point estimate only when the distribution of the point estimate is symmetric. Many confidence intervals are asymmetric, e.g., intervals for probabilities, odds ratios, and other ratios. Another way to define a confidence interval is the set of all values that if null hypothesized would not be rejected at one minus the confidence level by a specific statistical test. For that reason, confidence intervals are better called compatibility intervals.
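
The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normal population with known standard deviation and uses the usual $z$-based interval for the mean. The population values, sample size, number of repetitions, and seed are all arbitrary choices for illustration.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(7)  # arbitrary seed
true_mean, sigma, n, reps = 50.0, 10.0, 25, 2000  # hypothetical settings

z = NormalDist().inv_cdf(0.975)      # two-sided 0.95 critical value
half_width = z * sigma / n ** 0.5    # known-sigma interval half-width

covered = 0
for _ in range(reps):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    m = mean(sample)
    # Does this sample's interval contain the true (fixed) mean?
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

coverage = covered / reps
print(coverage)  # should be close to 0.95
```

Each individual interval either contains the true mean or does not. The 0.95 describes the procedure over many repetitions, not any single interval.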

confounder: A variable measured before the exposure (treatment) that is a common cause of (or is just associated with) the response and the exposure variable. A confounder, when properly controlled for, can explain away an apparent association between the exposure and the response. A formal definition is: a “pre-exposure covariate $C$ for which there exists a set of other covariates $X$ such that effect of the exposure on the outcome is unconfounded conditional on $(X, C)$ but such that for no proper subset of $(X, C)$ is the effect of the exposure on the outcome unconfounded given the subset.”

continuous variable: A variable that can take on any number of possible values. Practically speaking, when a variable can take on at least, say, 10 values, it can be treated as a continuous variable. For example, it can be plotted on a scatterplot and certain meaningful calculations can be made using the variable.

covariate: See predictor

Cox model: The Cox proportional hazards regression model (Cox, 1972) is a model for relating a set of patient descriptor variables to time until death or other event. Cox analyses are based on the entire survival curve. The time-to-event may be censored due to loss to follow-up or by another event, as long as the censoring is independent of the risk of the event under study. Descriptor variables may be used in two ways: as part of the regression model and as stratification factors. For variables that enter as regressors, the model specifies the relative effect of a variable through its impact on the hazard or instantaneous risk of death at any given time since enrollment. For stratification factors, no assumption is made about how these factors affect survival, i.e., the proportional hazards assumption is not made. Separately shaped survival curves are allowed for these factors. The logrank test for comparing two survival distributions is a special case of the Cox model. Also see survival analysis. Cox models are used to estimate adjusted hazard ratios.

critical value: The value of a test statistic (e.g., $t$, $F$, $\chi^2$, $z$) that if exceeded by the observed test statistic would result in statistical significance at a chosen $\alpha$ level or better. For a $z$-test (normal deviate test) the critical level of $z$ is 1.96 when $\alpha=0.05$ for a two-sided test. For $t$ and $F$ tests, critical values decrease as the sample size increases, as one requires less penalty for having to estimate the population variance as $n$ gets large.
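
For the $z$-test the critical value can be computed directly from the standard normal quantile function, as in this short sketch using Python's standard library. Critical values for $t$ and $F$ need their own distributions, which are not covered here.

```python
from statistics import NormalDist

alpha = 0.05
z_two_sided = NormalDist().inv_cdf(1 - alpha / 2)  # alpha split across both tails
z_one_sided = NormalDist().inv_cdf(1 - alpha)      # all of alpha in one tail

print(round(z_two_sided, 2))  # 1.96
print(round(z_one_sided, 3))  # 1.645
```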

cross-validation: This technique involves leaving out $m$ patients at a time, fitting a model on the remaining $n-m$ patients, and obtaining an unbiased evaluation of predictive accuracy on the $m$ patients. The estimates are averaged over $\geq n/m$ repetitions. Cross-validation provides estimates that have more variation than those from bootstrapping. It may require $>200$ model fits to yield precise estimates of predictive accuracy.
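
A bare-bones sketch of the leave-out-$m$ procedure follows. To keep it self-contained the "model" is simply the mean of the training observations and accuracy is mean squared error. The data are hypothetical and are assumed to already be in random order, so consecutive blocks of $m$ serve as the held-out sets.

```python
def cross_validate(y, m):
    """Leave out m observations at a time; predict them with the training mean."""
    n = len(y)
    fold_errors = []
    for start in range(0, n, m):
        held_out = y[start:start + m]
        training = y[:start] + y[start + m:]
        prediction = sum(training) / len(training)  # "fit" on the n - m remaining
        mse = sum((v - prediction) ** 2 for v in held_out) / len(held_out)
        fold_errors.append(mse)
    return fold_errors

# Hypothetical responses for n = 20 patients
y = [12.0, 15.0, 11.0, 18.0, 14.0, 13.0, 17.0, 16.0, 10.0, 19.0,
     15.0, 14.0, 13.0, 18.0, 12.0, 16.0, 11.0, 17.0, 14.0, 15.0]

fold_errors = cross_validate(y, m=4)        # n/m = 5 repetitions
cv_mse = sum(fold_errors) / len(fold_errors)
print(len(fold_errors), round(cv_mse, 2))
```

Repeating the whole procedure over many random orderings and averaging is one way to reach the larger number of model fits mentioned above.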

cumulative incidence: For an event that can occur only once if at all, the probability of having the event by time $t$.

data science: A same-sex marriage between statistics and computer science.

degrees of freedom: The number of degrees of freedom (d.f.) has somewhat different meanings depending on the context. In general, d.f. is the number of “free floating” parameters or the number of opportunities a statistical estimator or method was given. For a continuous variable $Y$, there are two types of d.f.: numerator d.f. and denominator d.f. Denominator d.f. is also called error d.f. and is the sample size minus the number of parameters needing to be estimated. It is the denominator of a variance estimator. Numerator d.f. is more aligned with opportunities and is the number of parameters currently being considered/tested. For example, in a “chunk” test for testing whether either height or weight is associated with blood pressure, the test has 2 d.f. if linearity and absence of interaction are assumed. In a traditional ANOVA comparing 4 groups, the comparisons have 3 d.f. because any 3 differences involving the 4 means or combinations of means will uniquely define all possible differences in the 4. One can say that the d.f. for a hypothesis is the number of opportunities one gives associations to be present (relationships to be non-flat), which is the same as the number of restrictions one needs to place on parameters so that the null hypothesis of no association (flat relationships) holds. See also effective degrees of freedom.

dependent, response, outcome, endpoint variable: a binary, categorical, ordinal, continuous, or censored time-to-event variable that is considered to be the target of prediction or the target of an intervention. For the latter, the response variable is assumed to measure a response to treatment some time after the treatment has started (or after randomization to a treatment). The dependent variable may be univariate, i.e., measured at one point of time or representing time until a first event, or multivariate, representing multiple response variables (e.g., both systolic and diastolic blood pressure) or representing one response variable measured longitudinally.

More Information

The most common mistake made in formulating a response variable is failing to use all the information in the measured outcomes in an attempt to create a simpler outcome measure. This often involves a translation of longitudinal data into univariate responses, for example, by analyzing time until doubling of serum creatinine rather than analyzing the underlying continuous variable creatinine in a longitudinal analysis. Time to doubling of creatinine has different meanings for patients with different starting creatinine values and treats a transient doubling the same as a prolonged doubling. It also has a problem when there are missing creatinine measurements for some days, unlike a longitudinal analysis that can use all available data. Efficient analysis of raw data can always be efficiently translated into any needed clinical readouts. For example, estimated parameters from a full longitudinal analysis of serum creatinine can be transformed into derived parameters such as expected time until creatinine exceeds any chosen value. The result of such dichotomization is difficult to interpret, is a function of arbitrary thresholds, fails to recognize dependence on the baseline value, and loses a great deal of statistical power requiring inflation of the sample size.

Another example of information-losing dichotomization is pooling events of different severities into a “time to first event” analysis rather than respecting the ordinal nature of such data. For example, it is very common to lose power by analyzing time to first major adverse cardiovascular event where the event may be hospitalization, myocardial infarction, or death. Treating hospitalization the same as death results in loss of information and power and makes clinical interpretation difficult. It also results in ignoring deaths that occur after a hospitalization. Yet another example occurs when patient status (symptoms, hospitalization, death, etc.) is assessed daily and these rich longitudinal data are condensed into time-to-recovery (TTR). This results in a major loss of statistical power and difficulties in interpretation. Some of the problems created by not respecting the raw data include:

There are special interpretation problems when the main outcome is a “good” event that is mixed in with “bad events”. TTR must consider death as a competing risk, meaning that the cumulative incidence of TTR is interpreted as risk of recovery that precedes death and that death is considered to be no worse than non-recovery by the end of follow-up.
Patients who recover from their illness but who later had to be hospitalized have their “unrecoveries” ignored.
If recovery is defined as mild or no symptoms, patients who have prolonged mild symptoms are counted the same as those who are cured.
Patients requiring admission to an intensive care unit are counted the same as those who are able to remain at home with moderate symptoms.
TTR does not handle missing patient status on some days.

All these problems are remedied by analyzing the raw data (or at least analyzing daily summaries of raw data) longitudinally, using for example a longitudinal ordinal state transition model.

detectable difference: The value of a true population treatment effect (difference between two treatments) that if held would result in a statistical test having exactly the desired power. If the detectable difference is greater than the minimally clinically important difference, a clinical trial may well miss a clinically important difference.
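
As a rough illustration, the detectable difference for a two-group comparison of means can be approximated with the familiar normal-theory formula $(z_{1-\alpha/2} + z_{\mathrm{power}})\,\sigma\sqrt{2/n}$. The sketch assumes equal group sizes, a known common standard deviation, and a two-sided test, and it ignores the negligible chance of rejecting in the wrong direction. The function name and inputs are hypothetical.

```python
from statistics import NormalDist

def detectable_difference(n_per_group, sd, alpha=0.05, power=0.9):
    """Approximate difference in means detectable with the stated power."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sd * (2 / n_per_group) ** 0.5

# Hypothetical trial: 100 patients per group, SD of 10 mmHg
d = detectable_difference(100, 10.0)
print(round(d, 2))  # 4.58
```

If the minimally clinically important difference in this hypothetical trial were 3 mmHg, a design that can only detect about 4.6 mmHg could easily miss it.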

discrimination: A variable or model’s discrimination ability is its ability to separate subjects having low responses from subjects having high responses. One way to quantify discrimination is the ROC curve area.

dummy variable: A device used in a multivariable regression model to describe a categorical predictor without assuming a numeric scoring. Indicator variable might be a better term. For example, treatments $A, B, C$ might be described by the two dummy predictor variables $X_{1}$ and $X_{2}$, where $X_{1}$ is a binary variable taking on the value of 1 if the treatment for the subject is $B$ and 0 otherwise, and $X_{2}$ takes on the value 1 if the subject is under treatment $C$ and 0 otherwise. The two dummy variables completely define 3 categories, because when $X_{1}=X_{2}=0$ the treatment is $A$.
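
The coding described above can be written out directly. The function name `dummy_code` is hypothetical, and treatment $A$ serves as the reference category.

```python
def dummy_code(treatment):
    """Return (X1, X2) for treatments A, B, C with A as the reference category."""
    x1 = 1 if treatment == "B" else 0
    x2 = 1 if treatment == "C" else 0
    return x1, x2

for trt in ["A", "B", "C"]:
    print(trt, dummy_code(trt))
# A (0, 0)
# B (1, 0)
# C (0, 1)
```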

effective degrees of freedom: When a model is fully pre-specified and the parameters of the model fully identified, the effective degrees of freedom for the whole model or for a predictor in the model is the usual degrees of freedom, i.e., the number of parameters involved. When informal analysis such as visual inspection of relationships is involved and this informal analysis is used to narrow the number of parameters devoted to a variable, the effective degrees of freedom (edf) will be larger than the apparent d.f. in the formal analysis. The larger edf is the number such that when confidence intervals are computed or hypothesis tests are performed, inserting edf into the usual formulas will result in accurate confidence coverage or $P$-values because model uncertainty is taken into account. Edf also comes into play when penalization (shrinkage) is used in estimating regression coefficients. Shrinkage involves discounting apparent effects of predictors by making their coefficients closer to zero. This makes the edf smaller than the apparent d.f. (number of parameters). In this case the edf is the number such that certain true (smaller, because of penalization) variances are approximated by lowering d.f. to edf in certain formulas. Another way to look at it is that a likelihood ratio $\chi^2$ statistic computed for the model upon shrinkage will have approximately edf degrees of freedom. See also degrees of freedom.

effective sample size: Pertaining to outcome variables, the sample size for an analysis of a continuous response with no ties in the data such that the continuous response analysis has the same statistical power as the analysis of the original response variable using the higher apparent sample size. For a study with a sample size $n$ on a continuous response $Y$ the effective sample size (ess) is $n$. For time to event analysis for right-censored data from an exponential distribution or using the Cox proportional hazards model/logrank test, ess is the number of observed events. For ordinal $Y$, or continuous $Y$ with some ties, one way to estimate ess is by solving for the sample size that makes a Wilcoxon two-sample test based on a response with no ties have the same power as a Wilcoxon test based on the larger sample size with the magnitude of ties being as observed in the real data.

entry time: The time when a patient starts contributing to the study. In randomized studies or observational studies where all patients have come under observation before the study starts (for example, studies of survival after surgery) the entry time and time origin of the study will be identical. However, for some observational studies, the patient may not start follow-up until after the time origin of the study and these patients contribute to the study group only after their ‘late entry.’ (Bull & Spiegelhalter, 1997)

estimand: An unknown statistical parameter or a function of multiple parameters that is considered to be a target of interest in a study. Examples include treatment differences in unknown means or medians, covariate-specific risk differences, average risk differences with respect to a given covariate distribution, differences in cumulative incidence of an event by a fixed time, a hazard ratio, and differences in mean time in states 6-10 in a 10-state transition model. In clinical trials it is desirable for the estimand to be clinically relevant.

estimator: A statistical formula that transforms data into an estimate of an unknown parameter.

estimate: A statistical estimate of a parameter based on the data. See parameter. Examples include the sample mean, sample median, and estimated regression coefficients.

frequentist statistical inference: Currently the most commonly used statistical philosophy. It uses hypothesis testing, type I and II assertion probabilities, power, $P$-values, confidence limits (compatibility intervals), and adjustments of $P$-values for testing multiple hypotheses from the same study. Probabilities computed using frequentist methods, $P$-values, are probabilities of obtaining values of statistics. The frequentist approach is also called the sampling approach as it considers the distribution of statistics over hypothetical repeated samples from the same population. The frequentist approach is concerned with long-run operating characteristics of statistics and estimates. Because of this and because of the backwards time/information ordering of $P$-values, frequentist testing requires complex multiplicity adjustments but provides no guiding principles for exactly how those adjustments should be derived. Frequentist statistics involves confusion of two ideas: (1) the a priori probability that an experiment will generate misleading information (e.g., the chance of an assertion of an effect when there is no effect, i.e., type I assertion probability $\alpha$) and (2) the evidence for an assertion after the experiment is run. The latter should not involve a multiplicity adjustment, but because the former does, frequentists do not know how to interpret the latter when multiple hypotheses are tested or when a single hypothesis is tested sequentially. Frequentist statistics as typically practiced places emphasis on hypothesis testing rather than estimation.

Gaussian distribution: See normal distribution.

generalizability: See replication, reproduction, robust, generalizable

generalized linear model: A model that has the same right-hand side form as a linear regression model but whose dependent variable can be categorical or can have a continuous distribution that is not normal. Examples of GLMs include binary logistic regression, probit regression, Poisson regression, and models for continuous $Y$ having a gamma distribution, plus the Gaussian distribution special case of the linear regression model. GLMs can be fitted by maximum likelihood, quasi-likelihood, or Bayesian methods.

Gini’s mean difference: A measure of variability (dispersion) that is much more interpretable than the standard deviation and more robust to outliers, and also applies to non-symmetric distributions. It is the mean absolute difference between all possible pairs of observations. There is a fast computing formula for the index, and the index is highly statistically efficient.
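
Both the definition and a fast formula can be sketched briefly. The shortcut shown, $\frac{2}{n(n-1)}\sum_{i=1}^{n}(2i - n - 1)\,x_{(i)}$ over the sorted values $x_{(i)}$, is one well-known form and is not necessarily the formula the entry has in mind. The function names and data are hypothetical.

```python
from itertools import combinations

def gmd_pairs(x):
    """Mean absolute difference over all pairs of distinct observations."""
    pairs = list(combinations(x, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def gmd_fast(x):
    """Same quantity from the sorted values, avoiding the loop over all pairs."""
    xs = sorted(x)
    n = len(xs)
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(xs, start=1))
    return 2 * weighted / (n * (n - 1))

values = [1, 2, 4]
print(gmd_pairs(values))  # 2.0
print(gmd_fast(values))   # 2.0
```

The pairwise version needs on the order of $n^2$ operations, while the sorted version needs only a sort and a single pass.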

goodness of fit: Assessment of the agreement of the data with either a hypothesized pattern (e.g., independence of row and column factors in a contingency table or the form of a regression relationship) or a hypothesized distribution (e.g., comparing a histogram with expected frequencies from the normal distribution).

hazard rate: The instantaneous risk of a patient experiencing a particular event at each specified time (Bull & Spiegelhalter, 1997). The instantaneous rate with which an event occurs at a single point in time. It is the probability that the event occurs between time $t$ and time $t + \delta$ given that it has not yet occurred by time $t$, divided by $\delta$, as $\delta$ becomes vanishingly small. Note that rates, unlike probabilities, can exceed 1.0 because they are quotients.
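
The limiting definition can be checked numerically. The sketch assumes exponentially distributed event times with a constant rate of 2 events per year, a hypothetical choice made because the true hazard is then known to be 2 at every time.

```python
import math

rate = 2.0  # hypothetical constant hazard, events per year

def approx_hazard(t, delta):
    """Pr(event in (t, t + delta) | no event by t), divided by delta."""
    surv_t = math.exp(-rate * t)                  # Pr(no event by t)
    surv_t_delta = math.exp(-rate * (t + delta))  # Pr(no event by t + delta)
    cond_prob = (surv_t - surv_t_delta) / surv_t
    return cond_prob / delta

for delta in [0.5, 0.1, 0.001, 1e-6]:
    print(delta, round(approx_hazard(0.5, delta), 4))
```

As `delta` shrinks the quotient approaches 2, a value above 1.0 even though the conditional probability in the numerator never exceeds 1.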

hazard ratio: The ratio of hazard rates at a single time $t$, for two types of subjects. Hazard ratios are in the interval $[0, \infty)$, and they are frequently good ways to summarize the relative effects of two treatments at a specific time $t$. When the hazard ratio is independent of $t$, the ratio provides a prospective intent-to-treat causal interpretation. Like odds ratios, hazard ratios can apply to any level of risk for the reference group. Note that a hazard ratio is distinct from a risk ratio, the latter being the ratio of two simple probabilities and not the ratio of two rates.

When the hazard ratio is not constant, a statistical model that allows it to vary can still be used to estimate intent-to-treat quantities such as cumulative outcome incidence by a fixed time or restricted mean survival time.

Hawthorne effect: A change in a subject response that results from the subject knowing she is being observed.

heterogeneity of treatment effect: Variation of the effect of a treatment on a scale for which it is mathematically possible for a treatment that has a nonzero effect on the average to have the same effect for different types of subjects. HTE should not be considered on the absolute risk scale (see risk magnification) but rather on a relative scale such as log odds or log hazard. HTE is best thought of as something due to a particular combination of treatment and patient that is mechanistic and not just related to the generalized risk that sicker patients are operating under. For example, patients with more severe coronary artery disease may get more relative benefit from revascularization, and patients who are poor metabolizers of a drug may get less relative benefit of the drug. Variation in the absolute risk reduction (ARR) due to a treatment is often misstated as HTE. Since ARR must vary by subject when risk factors exist and when the overall treatment effect is nonzero, variation in ARR is a mathematical necessity. It is dominated by subjects’ baseline risk so is more accurately termed heterogeneity in subjects rather than heterogeneity of treatment effects.

intention-to-treat: Subjects in a randomized clinical trial are analyzed according to the treatment group to which they were assigned, even if they did not receive the intended treatment or received only a portion of it. If in a randomized study an analysis is done which does not classify all patients to the groups to which they were randomized, the study can no longer be strictly interpreted as a randomized trial, i.e., the randomization is “broken”. Intention-to-treat analyses are pragmatic in that they reflect real-world non-adherence to treatment.

inter-quartile range: The range between the outer quartiles ($25^\text{th}$ and $75^\text{th}$ percentiles). It is a measure of the spread of the data distribution (dispersion), i.e., a central interval containing half the sample.

Kaplan-Meier estimator: A nonparametric (distribution-free) estimator of the survival function (Kaplan & Meier, 1958) used to estimate the probability of being free of an event (or of any of a set of possible types of events) by time $t$ for $t$ running from zero until the end of follow-up. The estimator works by using partial information when follow-up time is censored for some subjects. This is done by having shrinking denominators as follow-up time $t$ increases, where the denominators represent the number of subjects still at risk of the event of interest and still being followed at least until $t$. Assumptions made by this estimator are that the sample is homogeneous (every subject has the same survival curve) and that censoring is independent of impending risk of the event. One minus the Kaplan-Meier estimator is a cumulative incidence estimator.

A violation of the independent censoring assumption would be withdrawal of a patient from follow-up if her condition worsens.
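
The shrinking-denominator idea can be sketched in a few lines of Python. This is an illustrative implementation, not a substitute for a survival analysis package; the function name and the five-subject data set are hypothetical, and subjects censored at the same time as an event are counted as still at risk at that time.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    times:  follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if follow-up was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)  # denominator: subjects still being followed
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_leaving = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_leaving  # events and censored subjects both leave the risk set
    return curve


# Hypothetical data: events at times 2, 3, 5; censoring at times 3 and 8
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 0, 1, 1, 0])
```

Censored subjects contribute to the denominators up to their censoring time and then drop out without causing a step in the curve.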

least squares estimate: The value of a regression coefficient that results in the minimum sum of squared errors, where an error is defined as the difference between an observed and a predicted dependent variable value.

likelihood function: The probability of the observed data as a function of the unknown parameters for the data distribution. Here we use “probability” in a loose sense (and call it likelihood) so that it can apply to both discrete and continuous outcome variables. When the outcome variable $Y$ can take on only discrete values (e.g., $Y$ is binary or categorical), given a statistical model one can compute the exact probability that any given possible value of $Y$ can be observed. In this case, the joint probability of a set of such occurrences can easily be computed. When the observations are independent, this joint probability is the product of all the individual probabilities. The likelihood function is then the joint probability that all the observed values of $Y$ would have occurred, as a function of the unknown parameters that create the entire distribution of an individual observation’s $Y$. When $Y$ is continuous, the probability elements making up the likelihood function are the probability density function values evaluated at the observed data. Because joint probabilities of many observations are very small, and for another reason about to be given, it is customary to state natural logs of likelihoods rather than using the original scale. The log likelihood achieved by a model, that is, the log likelihood at the maximum likelihood estimates of the unknown parameters, is a gold standard information measure and is used to compute various statistics including $R^2$, AIC, and likelihood ratio $\chi^2$ tests of association. See maximum likelihood estimate, which is the set of parameter values making the observed data most likely to have been observed.

The actual probability of a specific value for a continuous variable is zero. The probability that they actually occur is now moot since the $Y$ values have already been observed.
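
For a binary $Y$ with independent observations, the log likelihood is a sum of log probabilities. The sketch below uses hypothetical data (5 ones among 8 observations) and a simple grid search, chosen only for transparency, to show that the log likelihood is maximized at the sample proportion.

```python
import math


def bernoulli_log_likelihood(p, y):
    """Log of the joint probability of independent binary outcomes y, given P(Y=1) = p."""
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in y)


y = [1, 0, 0, 1, 1, 0, 1, 1]  # hypothetical data: 5 events in 8 observations

# Evaluate the log likelihood on a grid of candidate values for p
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=lambda p: bernoulli_log_likelihood(p, y))
max_log_lik = bernoulli_log_likelihood(p_hat, y)
```

In practice an optimizer or a closed-form solution replaces the grid; here the maximizing value coincides with the sample proportion 5/8.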

linear regression model: This is also called OLS or ordinary least squares and refers to regression for a continuous dependent variable, and usually to the case where the residuals are assumed to be Gaussian. The linear model is sometimes called general linear model, not to be confused with generalized linear model where the distribution can take on many non-Gaussian forms.

logistic regression model: A multivariable regression model relating one or more predictor variables to the probabilities of various outcomes. The most commonly used logistic model is the binary logistic model (Spanos et al., 1989; Walker & Duncan, 1967) which predicts the probability of an event as a function of several variables. There are several types of ordinal logistic models for predicting an ordinal outcome variable, and there is a polytomous logistic model for categorical responses. The binary and polytomous models generalize the $\chi^{2}$ test for testing for association between categorical variables. One commonly used ordinal model, the proportional odds model (Brazer et al., 1991), generalizes the Wilcoxon 2-sample rank test. Binary logistic models are useful for predicting events in which time is not very important. They can be used to predict events by a specified time, but this can result in a loss of information. Logistic models are used to estimate adjusted odds ratios as well as probabilities of events.

longitudinal or serial data: A response variable measured at more than one post-time-zero time on a subject. The analysis may adjust for baseline covariates or also have updated time-dependent covariates.

machine learning: An algorithmic procedure for prediction or classification that tends to be empirical, nonparametric, flexible, and does not capitalize on additivity of predictor effects. Arthur Samuel defined machine learning as “field of study that gives computers the ability to learn without being explicitly programmed.” Machine learning does not use a data model, i.e., a probability distribution for the outcome variable given the inputs, and does not place emphasis on interpretable parameters. Examples of machine learning algorithms include neural networks, support vector machines, bagging, boosting, recursive partitioning, and random forests. Ridge regression, the lasso, elastic net, and other penalized regression techniques (which have identified parameters and make heavy use of additivity assumptions) fall under statistical models rather than machine learning. By allowing high-order interactions to be potentially as important as main effects, machine learning is data hungry, as sample sizes needed to estimate interaction effects are much larger than sample sizes needed to estimate additive main effects. Machine learning is not to be confused with artificial intelligence.

marginal, marginalization: A marginal quantity or marginal estimate is a quantity that is averaged over some units or characteristics. It is a kind of weighted average of conditional quantities. In a $4\times 2$ frequency table of four regions $\times$ two outcome states, each row is used to estimate the conditional probability of a positive outcome given region. Marginal estimates over regions sum each of the two columns over all the rows to obtain non-region-specific marginal probability estimates. The act of marginalization is the act of un-conditioning on a factor. See conditioning.

More Examples

For a regression model with age and sex as predictors one obtains a predicted mean Y for age 50 equal to 10 for females and 12 for males. If one desired to not know the sex of the subject and hence to estimate the mean Y that is not sex-dependent one would have to specify the f:m ratio. The estimate of the mean for age=50 in a population that is 0.6 female and 0.4 male would be $0.6\times 10 + 0.4\times 12$.
A clinical trial provides a covariate-adjusted treatment B : treatment A odds ratio (OR) of 0.8. For unknown reasons a policy maker wants to know the OR over a distribution of patients. She has at least two ways of constructing the target distribution.
- She desires a population-averaged OR. The statistician asks her for the sampling probability weights that were used in selecting patients for the clinical trial. She responds that the trial did not use a probability sample but instead used volunteers, hence sampling probabilities are unknown. The statistician responds that there is no way to get a population-averaged OR.
- She desires a within-trial sample-averaged OR. The statistician informs the policy maker that such an average OR will only apply to a population that has the same joint distribution of the patient characteristics as those volunteering for the trial. The statistician also mentions that interpretation of the marginal OR is near impossible. The policy maker decides to push ahead for the second estimate. The statistician proceeds by estimating the average outcome risk over the sample, setting treatment to B, then doing the same setting treatment to A, then plugging the two average risks into the OR formula. The policy maker then realizes that a more interpretable estimate is to average all the conditional ORs. The statistician responds that this is what the original adjusted OR represents, since all the ORs are the same (in the absence of interactions).
A longitudinal study in which participants have many follow-up visits models the response using a hierarchical random intercepts and slopes model¹. Estimated random effects from this model are used to obtain participant-specific outcomes conditional on baseline covariates. The regression coefficients for the baseline covariates account for between-subject outcome heterogeneity. One can marginalize out the random effects to get average covariate effects over subjects.
Contrast the last example with the use of a serial correlation longitudinal model that is not useful for estimating participant-specific outcomes but focuses instead on effects of group-level variables such as treatment. Random effects are omitted. The resulting regression parameter estimates may be attenuated towards zero if there is extreme participant outcome heterogeneity. More information about the choice of hierarchical vs. marginal longitudinal models may be found here.

¹ Random intercepts alone would not induce a reasonable serial correlation pattern
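
Two of the examples above can be sketched numerically. The first part computes the sex-averaged mean from the stated weights. The second part follows the sample-averaging recipe for the odds ratio: it assumes a constant conditional B:A odds ratio of 0.8 and four hypothetical, equally weighted patient types with different risks under treatment A. All risks and names here are illustrative assumptions.

```python
# Part 1: marginal mean at age 50 for a population that is 0.6 female, 0.4 male
marginal_mean = 0.6 * 10 + 0.4 * 12


# Part 2: sample-averaged (marginal) odds ratio from a constant conditional OR
def odds(p):
    return p / (1 - p)


def risk_from_odds(o):
    return o / (1 + o)


conditional_or = 0.8
risks_a = [0.05, 0.2, 0.5, 0.8]  # hypothetical risks under treatment A
risks_b = [risk_from_odds(odds(p) * conditional_or) for p in risks_a]

# Average the risks over the sample under each treatment, then form the OR
avg_risk_a = sum(risks_a) / len(risks_a)
avg_risk_b = sum(risks_b) / len(risks_b)
marginal_or = odds(avg_risk_b) / odds(avg_risk_a)
```

With these assumed risks the marginal OR lies between 0.8 and 1, so it differs from the conditional OR even though every patient type has the same conditional OR. This is one reason the marginal OR is hard to interpret.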

masking: Preventing the subject, treating physician, patient interviewer, study director, or statistician from knowing which treatment a patient is given in a comparative study. A single-masked study is one in which the patient does not know which treatment she’s getting. A double-masked study is one in which neither the patient nor the treating physician or other personnel involved in data collection know the treatment assignment. A triple-masked study is one in which the statistician is unaware of which treatment is which. Masking is also known as blinding.

maximum likelihood estimate: An estimate of a statistical parameter (such as a regression coefficient, mean, variance, or standard deviation) that is the value of that parameter making the data most likely to have been observed. MLEs have excellent statistical properties in general, such as converging to population values as the sample size increases, and having the best precision from among all such competing estimators, when the statistical model is correctly specified. When the data are normally distributed, maximum likelihood estimates of regression coefficients and means are equivalent to least squares estimates. When the data are not normally distributed (e.g. binary outcomes, or survival times), maximum likelihood is the standard method to estimate the regression coefficients (e.g. logistic regression, Cox regression). Unlike Bayesian estimators, MLEs cannot take extra-study information into account. MLEs can be overfitted when the data’s information content does not allow reliable estimation of the number of parameters involved (see overfitting). Penalized MLEs can solve this problem, by maximizing a penalized log likelihood function. When extra-study information is not allowed to be utilized, MLE is considered a gold standard estimation technique. See likelihood function.

mean: Arithmetic average, i.e., the sum of all the values divided by the number of observations. The mean of a binary variable is equal to the proportion of ones because the sum of all the zero and one values equals the number of ones. The mean can be heavily influenced by outliers. When the tails of the distribution are not heavy, this influence of more extreme values is what gives the mean its efficiency compared to other estimators such as the median. When the data distribution is symmetric, the population mean and median are the same. The sample mean is a better estimator of the population median than is the sample median, when the data distribution is symmetric and Gaussian-like.

median: Value such that half of the observations’ values are less than and half are greater than that value. The median is also called the $50^\text{th}$ percentile or the $0.5$ quantile. The sample median is not heavily influenced by outliers so it can be more representative of “typical” subjects. When the data happen to be normally (Gaussian) distributed, the sample median is not as precise as the mean in describing the central tendency, its efficiency being $\frac{2}{\pi} \approx 0.64$.
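
The efficiency figure for Gaussian data can be illustrated by simulation. The sketch below is a rough check under assumed settings (standard normal samples of size 101, 2000 replications, a fixed seed); the estimated ratio varies with these choices and only approximates the large-sample value of $\frac{2}{\pi}$.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable
n, reps = 101, 2000  # assumed sample size and number of simulated samples

means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Relative efficiency of the median: variance of the mean / variance of the median
efficiency = statistics.pvariance(means) / statistics.pvariance(medians)
```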

minimum clinically important difference, MCID: Not to be confused with detectable difference, MCID is a proper target of a power calculation, i.e., a study may be sized so that the power to detect the MCID is reasonable (say 0.9). In randomized clinical trials, the MCID is the treatment effect that would be meaningful to patients, i.e., the effect that one would not want to miss. MCID should never be derived from observed results but instead should always be derived from clinical expertise and from what patients deem important.

Tragically many clinical trials are powered for budgetary or time reasons to detect an effect greater than the MCID, frequently resulting in missing true clinical effects.

multiple comparisons: It is common for one study to involve the calculation of more than one $P$-value. For example, the investigator may wish to test for treatment effects in 3 groups defined by disease etiology, she may test the effects on 4 different patient response variables, or she may look for a significant difference in blood pressure at each of 24 hourly measurements. When multiple statistical tests are done, the chances of at least one of them resulting in an assertion of an effect when there are no effects increases as the number of tests increases. This is called “inflation of type I assertion probability $\alpha$.” When one wishes to control the overall type I probability, individual tests can be done using a more stringent $\alpha$ level, or individual $P$-values can be adjusted upward. Such adjustments are usually dictated when using frequentist statistics, as $P$-values mean the probability of getting a result this impressive if there is really no effect, and “this impressive” can be taken to mean “this impressive given the large number of statistics examined.” Multiple comparisons and related inflation of type I probability are solely the result of chances that a frequentist gives data to be more extreme. In Bayesian inference, one deals with the (prior) chances that the true unknown multiple effects are large, and multiplicity per se does not apply.
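
The inflation can be quantified under the simplifying assumption that the tests are independent, which hourly blood pressure measurements on the same patients would not be. The sketch below computes the chance of at least one assertion of an effect across 24 tests when there are no effects, and the same quantity after a Bonferroni adjustment (testing each hypothesis at $\alpha$ divided by the number of tests), which is one common choice of more stringent level and is named here as an illustration.

```python
def familywise_error(alpha, n_tests):
    """P(at least one type I assertion) for n_tests independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** n_tests


alpha, n_tests = 0.05, 24

# No adjustment: each of the 24 tests is done at the nominal 0.05 level
unadjusted = familywise_error(alpha, n_tests)

# Bonferroni adjustment: each test is done at level alpha / n_tests
bonferroni = familywise_error(alpha / n_tests, n_tests)
```

Under the independence assumption the unadjusted overall probability is roughly 0.7, while the Bonferroni-adjusted value stays at or below 0.05.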

multistate model: A statistical model that accounts for subjects transitioning (possibly back-and-forth) between multiple states (e.g., well, ill, hospitalized, dead) over time. Such models are also called state transition models. The underlying estimands are state transition probabilities, e.g., the probability of being in hospital at time $t+1$ conditional on a patient being alive at home at time $t$. State transition probabilities may be un-conditioned using standard rules of probabilities to obtain unconditional probabilities called state occupancy probabilities, for example, the probability that a patient will be in the hospital or dead at day 3 no matter what their state on day 2. Multistate models may be specified in discrete time or continuous time, the latter focusing on intensities or hazard rates. Discrete time models are easier to interpret because they use ordinary probabilities, cumulative incidences, etc.

Multistate models are the most general ways to model multiple event types, allowing for absorbing events (death), recurrent events (hospitalization), categorical states, ordinal states, and missing data.
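
The step from state transition probabilities to state occupancy probabilities can be sketched for a discrete-time model. The example assumes three states, a patient who starts at home, and a hypothetical transition matrix that is the same every day and depends only on the current state; real multistate models usually let transition probabilities depend on time and covariates.

```python
states = ["home", "hospital", "dead"]

# Hypothetical daily transition probabilities: row = state today, column = state tomorrow
transition = [
    [0.90, 0.08, 0.02],  # from home
    [0.30, 0.60, 0.10],  # from hospital
    [0.00, 0.00, 1.00],  # death is an absorbing state
]


def next_day(occupancy, transition):
    """Un-condition on today's state: average tomorrow's probabilities over today's states."""
    n = len(occupancy)
    return [sum(occupancy[i] * transition[i][j] for i in range(n)) for j in range(n)]


occupancy = [1.0, 0.0, 0.0]  # everyone starts at home on day 0
for day in range(3):
    occupancy = next_day(occupancy, transition)

# Probability of being in the hospital or dead on day 3, whatever the state on day 2
p_hospital_or_dead_day3 = occupancy[1] + occupancy[2]
```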

multivariable model: A model relating multiple predictor variables (risk factors, treatments, etc.) to a single response or dependent variable. The predictor variables may be continuous, binary, or categorical. When a continuous variable is used, a linearity assumption is made unless the variable is expanded to include nonlinear terms. Categorical variables are modeled using dummy variables so as to not assume numeric assignments to categories.

multivariate model: A model that simultaneously predicts more than one dependent variable, e.g. a model to predict systolic and diastolic blood pressure or a model to predict systolic blood pressure 5 min. and 60 min. after drug administration.

nominal significance level: In the context of multiple comparisons involving multiple statistical tests, the apparent significance level $\alpha$ of each test is called the nominal significance level. The overall type I assertion probability for the study, the probability of at least one positive assertion when the true effect is zero, will be greater than $\alpha$.

non-inferiority study: A study designed to show that a treatment is not clinically significantly worse than another treatment. Regardless of the significance/non-significance of a traditional superiority test for comparing the two treatments (with $H_{0}$ at a zero difference), the new treatment would be accepted as non-inferior to the reference treatment if the confidence interval (compatibility interval) for the unknown true difference between treatments excludes a clinically meaningful worsening of outcome with the new treatment.

Non-inferiority studies are infamous for using non-inferiority margins that are much larger than MCIDs to reduce the “needed” sample size. It is common for example to see a clinical trial protocol where the margin is a 25% increase in mortality when a 10% decrease in mortality would have been deemed worthwhile in an efficacy study, thus allowing a drug with a potential 20% increase in mortality to be marketed.

nonparametric estimator: A method for estimating a parameter without assuming an underlying distribution for the data. Examples include sample quantiles, the empirical cumulative distribution, and the Kaplan-Meier survival curve estimator.

nonparametric tests: A test that makes minimal assumptions about the distribution of the data or about certain parameters of a statistical model. Nonparametric tests for ordinal or continuous variables are typically based on the ranks of the data values. Such tests are unaffected by any strictly monotonic transformation of the data, e.g., by taking logs. Even if the data come from a normal distribution, rank tests lose very little efficiency (they have a relative efficiency of $\frac{3}{\pi} \approx 0.955$ if the distribution is normal) compared with parametric tests such as the $t$-test and the linear correlation test. If the data are not normal, a rank test can be much more efficient than the corresponding parametric test. For these reasons, it is not very fruitful to test data for normality and then to decide between the parametric and nonparametric approaches. In addition, tests of normality are not always very powerful. Examples of nonparametric tests are the 2-sample Wilcoxon-Mann-Whitney test, the 1-sample Wilcoxon signed-rank test (usually used for paired data), and the Spearman, Kendall, or Somers’ rank correlation tests. Even though nonparametric tests do not assume a specific distribution for a group, they may assume a connection between the distributions of any two groups. For example, the logrank test assumes proportional hazards, i.e., that the survival curve for group A is a power of the survival curve for group B. The Wilcoxon test, for optimal power, assumes that the cumulative distributions are in proportional odds. Distribution-free is perhaps a better term to use for rank tests.
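
The invariance of rank-based tests to transformations such as logs follows from the fact that the ranks themselves do not change. A minimal sketch with hypothetical data, using midranks for ties:

```python
import math


def midranks(x):
    """Ranks of x (1 = smallest), with tied values sharing the average of their ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with x[order[i]]
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


values = [1.2, 150.0, 3.4, 3.4, 0.7]  # hypothetical positive data with one outlier and one tie
ranks_raw = midranks(values)
ranks_log = midranks([math.log(v) for v in values])
```

Any statistic computed only from the ranks, such as the Wilcoxon statistic, is therefore identical on the raw and log scales.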

normal distribution: A symmetric, bell-shaped distribution that is most useful for approximating the distribution of statistical estimators. Also called the Gaussian distribution. The normal distribution cannot be relied upon to approximate the distribution of raw data. The normal distribution’s bell shape follows a rigid mathematical equation of the form $e^{-x^{2}}$. For a normal distribution the probability that a measurement will fall within $\pm 1.96$ standard deviations of the mean is 0.95.
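
The 0.95 figure can be verified with the error function from Python's standard library; the function name below is an illustrative choice.

```python
import math


def prob_within(k):
    """P(|X - mean| < k standard deviations) for a normally distributed X."""
    return math.erf(k / math.sqrt(2))


p_196 = prob_within(1.96)  # approximately 0.95
p_1 = prob_within(1.0)     # approximately 0.68
```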

null hypothesis: Customarily but not necessarily a hypothesis of no effect, e.g., no reduction in mean blood pressure or no correlation between age and blood pressure. The null hypothesis, labeled $H_{0}$, is often used in the frequentist branch of statistical inference as a “straw person”; classical statistics often assumes what one hopes doesn’t happen (no effect of a treatment) and attempts to gather evidence against that assumption (i.e., tries to reject $H_{0}$). $H_{0}$ usually specifies a single point such as 0 mmHg reduction in blood pressure, but it can specify an interval, e.g., $H_{0}$: blood pressure reduction is between -1 and +1 mmHg. “Null hypotheses” can also be e.g. $H_{0}$: correlation between $X$ and $Y$ is 0.5.

number needed to treat: A quantity that applies to an extremely oversimplified and unrealistic situation where (1) there is a special time horizon $t$ and (2) all patients have the same absolute risk of having an outcome by time $t$, i.e., risk factors do not exist (otherwise NNT cannot be a single number). Specifically, NNT is the number of patients needed to be treated to prevent one bad outcome by time $t$, which is the reciprocal of the absolute outcome risk difference between two treatments. When there are risk factors, absolute risk difference varies tremendously over patient types, so an NNT may not apply to anyone in the patient population. Typically, sicker patients get more benefit of treatment, so the risk difference magnifies and NNT falls for them. There are a huge number of serious problems with NNT, detailed here. Confidence intervals for NNT are problematic but whether done correctly or incorrectly are often so wide as to cast doubt on the use of the point estimate.
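
To see why a single NNT may not apply to anyone, the sketch below assumes a treatment with a constant odds ratio of 0.8 (a hypothetical value) and computes the absolute risk difference and its reciprocal for three hypothetical baseline risks.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment when treatment multiplies the baseline odds by odds_ratio."""
    treated_odds = baseline_risk / (1 - baseline_risk) * odds_ratio
    return treated_odds / (1 + treated_odds)


def number_needed_to_treat(baseline_risk, odds_ratio):
    """Reciprocal of the absolute risk difference for one type of patient."""
    risk_difference = baseline_risk - treated_risk(baseline_risk, odds_ratio)
    return 1 / risk_difference


# Hypothetical baseline risks by the time horizon: low, moderate, and high risk patients
nnt_by_risk = {p: number_needed_to_treat(p, 0.8) for p in (0.02, 0.2, 0.5)}
```

With the same relative effect, the implied NNT is about 254 for the low-risk patients, 30 for the moderate-risk patients, and 18 for the high-risk patients.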

observational study: Study in which no experimental condition (e.g., treatment) is manipulated by the investigator, i.e., randomization is not used. Such studies are frequently used to estimate characteristics of subjects (means, proportions, etc.) and to assess associations between variables. They have known limitations for therapeutic comparisons, because of unknown confounders.

odds: The probability an event occurs divided by the probability that it doesn’t occur. An event that occurs 0.90 of the time has 9:1 odds of occurring since $\frac{0.9}{1 - 0.9} = 9$.

odds ratio: The odds ratio for comparing two groups ($A, B$) on their probabilities of an outcome occurring is the odds of the event occurring for group $A$ divided by the odds that it occurs for group $B$. If $P_{A}$ and $P_{B}$ represent the probability of the outcome for the two groups of subjects, the $A:B$ odds ratio is $\frac{P_{A}}{1 - P_{A}} / \frac{P_{B}} {1 - P_{B}}$. Odds ratios are in the interval $[0, \infty)$. An odds ratio for a treatment is a measure of relative effect of that treatment on a binary outcome. As summary measures, odds ratios have advantages over risk ratios: they don’t depend on which of two possible outcomes is labeled the “event”, and any odds ratio can apply to any probability of outcome in the reference group. Because of this, one frequently finds that odds ratios for comparing treatments are relatively constant across different types of patients. The same is not true of risk ratios or risk differences; these depend on the level of risk in the reference group.
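
The following sketch computes odds, odds ratios, and risk ratios for two hypothetical outcome probabilities, and checks the labeling property described above: relabeling the non-event as the “event” simply inverts the odds ratio, whereas the risk ratio for the non-event is not the reciprocal of the risk ratio for the event.

```python
def odds(p):
    """Probability an event occurs divided by the probability that it does not."""
    return p / (1 - p)


def odds_ratio(p_a, p_b):
    return odds(p_a) / odds(p_b)


def risk_ratio(p_a, p_b):
    return p_a / p_b


odds_of_090 = odds(0.90)  # 9:1 odds

p_a, p_b = 0.9, 0.8  # hypothetical outcome probabilities for groups A and B

or_event = odds_ratio(p_a, p_b)             # A:B odds ratio for the event
or_nonevent = odds_ratio(1 - p_a, 1 - p_b)  # A:B odds ratio for the non-event
rr_event = risk_ratio(p_a, p_b)
rr_nonevent = risk_ratio(1 - p_a, 1 - p_b)
```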

one-sided test: A test designed to test a directional hypothesis, yielding a one-sided $P$-value. For example, one might test the null hypothesis $H_{0}$ that there is no difference in mortality between two treatments, with the alternative hypothesis being that the new drug lowers mortality. See also two-sided test.

ordinal variable: A categorical variable for which there is a definite ordering of the categories. For example, severity of lower back pain could be ordered as none, mild, moderate, severe, and coded using these names or using numeric codes such as 0, 1, 2, 10. Spacings between codes are not important.

overfitting: In the context of a prediction tool developed using a statistical model or using an algorithmic procedure such as machine learning, the tendency for the predicted values to be too extreme. Too-extreme predictions make the calibration curve show symptoms of regression to the mean: a flattening of the curve to be less steep than the $45^{\circ}$ line of identity. When overfitting is present, low predicted values are too low and/or high predicted values are too high. Overfitting is synonymous with over-interpretation caused by slicing the data into pieces that do not have huge denominators. The cause of overfitting is typically having too many candidate features in a supervised learning (informed by $Y$) feature selection setting or estimating too many parameters in a pre-specified model. Each parameter estimated may be unbiased, but predictions are formed by putting all the parameters together, and unless the model is intentionally underfitted using penalization (shrinkage; regularization), the combination of parameters exhibits the “low values too low or high values too high” phenomenon. This is due in part to sorting predicted values or to selecting subjects with extreme predictions. It is possible that the overall mean predicted value be unbiased even with extreme overfitting. That is why it is important to estimate the entire calibration curve.

$P$-value: The probability of getting a result (e.g., $t$ or $\chi^2$ statistics) as or more extreme than the observed statistic had $H_{0}$ been true. An $\alpha$-level test would reject $H_{0}$ if $P \leq \alpha$. However, the $P$-value can be reported instead of choosing an arbitrary value of $\alpha$. Examples: (1) An investigator compared two randomized groups for differences in systolic blood pressure, with the two mean pressures being 134.4 mmHg and 138.2 mmHg. She obtained a two-tailed $P=0.03$. This means that if there is truly no difference in the population means, one would expect to find a difference in means exceeding 3.8 mmHg in absolute value 0.03 of the time. The investigator might conclude there is evidence for a treatment effect on mean systolic blood pressure if the statistical test’s assumptions are true. (2) An investigator obtained $P=0.23$ for testing a correlation being zero, with the sample correlation being 0.08. The probability of getting a correlation this large or larger in absolute value if the population correlation is zero is 0.23. No conclusion is possible other than (a) more data are needed and (b) there is no convincing evidence for or against a zero correlation. For both of these examples compatibility (confidence) intervals would be helpful. The $P$-value is not the probability that the null hypothesis is true, and is not the probability that the results are due to chance. $P$ is computed under the assumption that the results are due to chance.
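
As a sketch of how a two-tailed $P$-value arises from a test statistic, the code below assumes a statistic that is standard normal under $H_{0}$. The value $z = 2.17$ is hypothetical and is not taken from the blood pressure example; it is chosen only because it yields a $P$-value near 0.03.

```python
import math


def two_sided_p_from_z(z):
    """P(|Z| >= |z|) when Z is standard normal, i.e., when H0 is assumed true."""
    return math.erfc(abs(z) / math.sqrt(2))


p_value = two_sided_p_from_z(2.17)  # hypothetical test statistic
```

The calculation conditions on $H_{0}$ throughout, which is why the result cannot be read as the probability that $H_{0}$ is true.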

paired data: When each subject has two response measurements, there is a natural pairing to the data and the two responses are correlated. The correlation results from the fact that generally there is more variation between subjects than there is within subjects. Sometimes one can take the difference or log ratio of the two responses for each subject, and then analyze these “effect measures” using an unpaired one-sample approach such as the Wilcoxon signed-rank test or the paired $t$-test. One must be careful that the effect measure is properly chosen so that it is independent of the baseline value.

parameter: An unknown quantity such as the population mean, population variance, difference in two means, or regression coefficient.

parametric model: A model based on a mathematical function having a few unknown parameters. Typically the number of parameters in a parametric model does not grow with the sample size, and a specific distribution is assumed for the dependent variable $Y$, conditional on $X$. See also semiparametric model.

parametric test: A test which makes specific assumptions about the distribution of the data or specific assumptions about model parameters. Examples include the $t$-test and the Pearson product-moment linear correlation test.

percentile, quantile: The $p$-th percentile is the value such that $\frac{np}{100}$ of the observations’ values are less than that value. The $p$-th quantile is the value such that $np$ of the observations’ values are less. Sample percentiles and quantiles only work well for continuous variables, not performing well if there are many ties in the data. For example if there are many ties at the median, adding many extremely high values to the data may not budge the median in some cases, and adding one extreme value may move the median quite a lot in other cases. See also quartiles.

phase I: Studies to obtain preliminary information on dosage, absorption, metabolism, and the relationship between toxicity and the dose-schedule of treatment.

phase II: Studies to determine feasibility and estimate treatment activity and safety in diseases (or for example tumor types) for which the treatment appears promising. Generates hypotheses for later testing.

phase III: Comparative trial to determine the effectiveness and safety of a new treatment relative to standard therapy. These trials usually represent the most rigorous proof of treatment efficacy (pivotal trials) and are the last stage before product licensing.

phase IV: Post-marketing studies of licensed products.

posterior probability: In a Bayesian context, this is the probability of an event after making use of the information in the data. In other words, it is the prior probability of an event after updating it with the data. Posterior probability can also be called post-test probability if one equates a diagnostic test with “data” (see also ROC curve).

power: In a frequentist setting, the probability of rejecting the null hypothesis for a set value of the unknown effect. Power could also be called the sensitivity of the statistical test in detecting that effect. Power increases when the sample size and true unknown effect increase and when the inter-subject variability decreases. In a two-group comparison, power generally increases as the allocation ratio gets closer to 1:1. For a given experiment it is desirable to use a statistical test expected to have maximum power (sensitivity). A less powerful statistical test will have the same power as a better test that was applied after discarding some of the observations. For example, testing for differences in the proportion of patients with hypertension in a 500-patient study may yield the same power as a 350-patient study which used blood pressure as a continuous variable. See type II assertion probability. In a Bayesian paradigm, power may be defined as the probability that the posterior probability of an effect will be high. See also MCID.
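
The dependence of frequentist power on sample size, effect size, variability, and allocation ratio can be sketched with a normal approximation. This is a simplified illustration, not a recommended sample size method: it assumes a two-sided $z$-test for a difference in two means with a known common standard deviation, and the function name and arguments are hypothetical.

```python
from statistics import NormalDist

def power_two_group(delta, sigma, n_total, ratio=1.0, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means.

    delta   : assumed true difference in means
    sigma   : assumed known common standard deviation
    n_total : total sample size over both groups
    ratio   : allocation ratio n1 / n2
    """
    n2 = n_total / (1.0 + ratio)
    n1 = n_total - n2
    se = sigma * (1.0 / n1 + 1.0 / n2) ** 0.5
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / se
    # Probability that the test statistic falls in either rejection region.
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

for ratio in (1.0, 2.0, 3.0):
    print(ratio, round(power_two_group(0.5, 1.0, 128, ratio), 3))
```

With the total sample size held fixed, the printed power is highest at 1:1 allocation, and setting `delta` to zero returns the type I assertion probability $\alpha$.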

precision: Degree of absence of random error. The precision of a statistical estimator is related to the expected error that occurs when approximating the infinite-data value. In other words, when you try to estimate some measure in a population, the precision is related to the error in the estimate. So precision can be thought of as a “margin of error” in estimating some unknown value. Precision can be quantified by the width of a confidence (compatibility) interval and sometimes by a standard deviation of the estimator (standard error). For the confidence intervals, a “margin for error” is computed so that the quoted interval has a certain probability of containing the true value (e.g., population mean difference). Some authors define precision as the reciprocal of the variance of an estimate. By that definition, precision increases linearly as the sample size increases. If instead one defines precision on the original scale of measurement instead of its square (i.e., if one uses the standard error or width of a confidence interval), precision increases as the square root of the sample size.

predictor, explanatory variable, risk factor, covariate, covariable, independent variable: quantities which may be associated with better or worse outcome (Bull & Spiegelhalter, 1997). Without further information, predictors (covariates) are taken to be measured at baseline. Time-dependent covariates are updated with post-baseline measurements. An external time-dependent covariate is one whose future values were already known at baseline. For example, in a crossover study, the new treatment assignment at one month (the crossover time) was already known at the time of randomization. Effects of external time-dependent covariates are easy to interpret. Internal time-dependent covariates (e.g., updated cholesterol measurements) may reflect changing subject condition. An especially difficult-to-interpret situation is a randomized trial in which one estimates the (supposedly constant) treatment effect after adjusting for internal time-dependent covariates.

prior probability: The probability of an event as it could best be assessed before the experiment. In diagnostic testing this is called the pre-test probability. The prior probability can come from an objective model based on previously available information, or it can be based on expert opinion. In some Bayesian analyses, prior probabilities are expressed as probability distributions which are flat lines, to reflect a complete absence of knowledge about an event. Such distributions are called non-informative, flat, or reference distributions, and analyses based on them fully let the data “speak for themselves.”

probability: The probability that an event will occur, that an invisible event has already occurred, or that an assertion is true, is a number between 0 and 1 inclusive such that (1) of all possible outcomes (including non-events) the probability of some possible outcome occurring is 1, and (2) the probability of any of a set of mutually exclusive events (i.e., union of events) occurring is the sum of the individual event probabilities. The meaning attached to the metric known as a probability is up to the user; it can represent long-run relative frequency of repeatable observations, a degree of belief, or a measure of veracity or plausibility. In the frequentist school, the probability of an event denotes the limit of the long-term fraction of occurrences of the event. This notion of probability implies that the same experiment which generated the outcome of interest can be repeated infinitely often. Even a coin will change after 100,000 flips. Likewise, some may argue that a patient is “one of a kind” and that repetitions of the same experiment are not possible. One could reasonably argue that a “repetition” does not denote the same patient at the same stage of the disease, but rather any patient with the same severity of disease (measured with current technology). There are other schools of probability that do not require the notion of replication at all. For example, the school of subjective probability (associated with the Bayesian school) “considers probability as a measure of the degree of belief of a given subject in the occurrence of an event or, more generally, in the veracity of a given assertion” (see p. 55 of Kotz & Johnson (1988)). de Finetti defined subjective probability in terms of wagers and odds in betting. A risk-neutral individual would be willing to wager \$$P$ that an event will occur when the payoff is \$1 and her subjective probability is $P$ for the event. The domain of application of probability is all-important. 
We assume that the true event status (e.g., dead/alive) is unknown, and we also assume that the information the probability is conditional upon (e.g., $\Pr$(death $|$ male, age=70)) is what we would check the probability against. In other words, we do not ask whether $\Pr$(death $|$ male, age=70) is accurate when compared against $\Pr$(death $|$ male, age=70, meanbp=45, patient on downhill course). It is difficult to find a probability that is truly not conditional on anything. What is conditioned upon is all-important. Probabilities are maximally useful when, as with Bayesian inference, they condition on what is known to provide a forecast for what is unknown. These are “forward time” or “forward information flow” probabilities. Forward time probabilities can meaningfully be taken out of context more often than backward-time probabilities, as they don’t need to consider “what might have happened.” In frequentist statistics, the $P$-value is a backward information flow probability, being conditional on the unknown effect size. This is why $P$-values must be adjusted for multiple data looks (“what might have happened”) whereas the current Bayesian posterior probability merely overrides any posterior probabilities computed at earlier data looks, because they now condition on current data. As IJ Good has written, the axioms defining the “rules” under which probabilities must operate (e.g., a probability is between 0 and 1) do not define what a probability actually means. He also states that all probabilities are subjective, because they depend on the knowledge of the particular observer.

These are Kolmogorov’s axioms of probability. All other probability rules can be derived from these axioms. An excellent discussion of the epistemological interpretation of probability is found here. Some excerpts: probability is “a tool used to quantify our uncertainty about unknown things, subject to a set of coherence requirements. … a probability measure arises as an effective tool for the quantification of uncertainty if we wish to measure uncertainty using real numbers … probability is a tool created by humans to analyse the world, not an inherent part of the world itself.”

probability density function: When a random variable $Y$ is continuous, i.e., it can take on every possible number within some interval, the probability density function is a function of $y$ which is the limit, as the width $\delta$ of some interval goes to zero, of the probability that $Y$ will be within the interval $[y, y + \delta]$, divided by $\delta$. This is the first derivative (slope) of the cumulative probability distribution function for $Y$.
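
The limit in this definition can be checked numerically. The sketch below assumes a standard normal distribution purely for illustration and compares the probability of a narrow interval, divided by its width, with the density reported by Python's `statistics.NormalDist`; the helper name is hypothetical.

```python
from statistics import NormalDist

def density_from_cdf(dist, y, delta=1e-6):
    """Pr(y <= Y <= y + delta) / delta, which approximates the density at y."""
    return (dist.cdf(y + delta) - dist.cdf(y)) / delta

std_normal = NormalDist(mu=0.0, sigma=1.0)
for y in (-1.0, 0.0, 0.5):
    print(y, density_from_cdf(std_normal, y), std_normal.pdf(y))
```

As `delta` shrinks, the interval probability per unit width approaches the slope of the cumulative distribution function at $y$.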

proper accuracy scoring rule: When applied to predicting categorical outcomes, a proper probability accuracy scoring rule is a measure that is optimized when the predicted probabilities are the true outcome probabilities. Examples of proper accuracy scores include the Brier score, the logarithmic probability score, and the log-likelihood from a correct statistical model. Examples of improper scoring rules, i.e., rules that are optimized by a bogus model, are proportion classified correctly, sensitivity, specificity, precision, recall, and the $c$-index (area under the receiver operating characteristic curve).
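
A small sketch can make the distinction concrete. Assume, hypothetically, that the true probability of a binary event is 0.7 and that a forecaster always states a probability $q$. The expected Brier score and expected logarithmic score are both optimized at $q = 0.7$, whereas the expected proportion classified correctly (using an assumed cutoff of 0.5) cannot distinguish $q = 0.51$ from $q = 0.99$. The function names are illustrative.

```python
import math

def expected_brier(q, p):
    """Expected Brier score of forecast q when the true probability is p (lower is better)."""
    return p * (1 - q) ** 2 + (1 - p) * q ** 2

def expected_log_score(q, p):
    """Expected logarithmic score of forecast q (higher is better)."""
    return p * math.log(q) + (1 - p) * math.log(1 - q)

def expected_accuracy(q, p, cutoff=0.5):
    """Expected proportion classified correctly when an event is predicted if q >= cutoff."""
    return p if q >= cutoff else 1 - p

p_true = 0.7
grid = [i / 100 for i in range(1, 100)]
best_brier = min(grid, key=lambda q: expected_brier(q, p_true))
best_log = max(grid, key=lambda q: expected_log_score(q, p_true))

print(best_brier, best_log)
print(expected_accuracy(0.51, p_true), expected_accuracy(0.99, p_true))
```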

proportional hazards: This assumption is fulfilled if two categories of patient are being compared and their hazard ratio is constant over time (though the instantaneous hazards may vary) (Bull & Spiegelhalter, 1997).

prospective study: One in which the study is first designed, then the subjects are enrolled. Prospective studies are usually characterized by intentional data collection.

pseudomedian: The median of the midpoints (means) of all possible pairs of observations, including pairing an observation with itself. This is also called the Hodges-Lehmann estimator. It solves the problem with the mean in being too sensitive to extreme values, and it solves the problem with the sample median in being too imprecise (high standard error). The efficiency of the pseudomedian in estimating the mean of a normal distribution is $\frac{3}{\pi} \approx 0.955$ when compared to the full efficiency of the sample mean under normality. When the underlying distribution is symmetric, the pseudomedian estimates both the population mean and the population median. But it estimates a meaningful parameter in all cases.
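
The definition translates directly into a brute-force calculation. The following sketch enumerates all pairwise midpoints, so it is suitable only for small samples; the function name and the data are hypothetical.

```python
from itertools import combinations_with_replacement
from statistics import mean, median

def pseudomedian(values):
    """Median of the midpoints of all pairs of observations, self-pairs included."""
    midpoints = [(a + b) / 2 for a, b in combinations_with_replacement(values, 2)]
    return median(midpoints)

data = [1, 2, 3, 4, 100]   # made-up sample with one extreme value
print(mean(data), median(data), pseudomedian(data))
```

Here the extreme value pulls the mean far from the bulk of the data while the pseudomedian stays with it.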

quartiles: The $25^\text{th}$ and $75^\text{th}$ percentiles and the median. The three values divide a variable’s distribution into four intervals containing equal numbers of observations. See percentiles and quantiles.

random error: An error caused by sampling from a group rather than knowing the true value of a quantity such as the mean blood pressure for the entire group, e.g., healthy men over age 80. One can also speak of random errors in single measurements for individual subjects, e.g., the error in using a single blood pressure measurement to represent a subject’s long-term blood pressure.

random sample: A sample selected by a random device that ensures that the sample (if large enough) is representative of the infinite group. A probability sample is a kind of random sample in which each possible subject has a known probability of being sampled, but the probabilities can vary. For example, one may wish to over-sample African-Americans in a study to ensure good representation. In that case one could sample African-Americans with probability of 1.0 and others with a probability of 0.5.

randomized controlled trial: See clinical trial.

randomness: Absence of a systematic pattern. One might wish to examine whether some hormone level varies systematically over the day as opposed to having a random pattern, or whether events such as epileptic seizures tend to cluster or occur randomly in time. Sometimes the residuals in an ordinary regression model are plotted against the order in which subjects were accrued to make sure that the pattern is random (e.g., there was no learning trend for the investigators).

rate: A ratio such as a change per unit time. Rates are often limits, and shouldn’t be confused with probabilities. The latter are constrained to be between 0 and 1 whereas there are no constraints on possible values for rates. A rate may also be a ratio such as “falls per distance walked” or “bacteria per unit of surface area.” R. A. Fisher defined a rate as a ratio of two quantities having different units of measurement.

regression to the mean: Tendency for a variable that has an extreme value on its first measurement to have a more typical value on its second measurement. For example, suppose that subjects must have LDL cholesterol $> 190$ mg% to qualify for a study, and the median LDL cholesterol for qualifying subjects at the screening visit was 230 mg%. The median LDL cholesterol value at their second visit might be 200 mg%, with several of the subjects having values below 190. This is the “sophomore slump” in baseball; second-year players are watched when they have phenomenal rookie years. Regression to the mean also takes many other forms, all arising because variables or subgroups are not examined at random but rather because they appear “impressive”: (1) One might compare 5 treatments with a control and choose the treatment having the maximum difference. On a repeated study that treatment’s average response will be found to be much closer to that of the control. (2) In a randomized controlled trial the investigators may wish to estimate the effect of treatment in multiple subgroups. They find that in 40 left-handed diabetics the treatment multiplies mortality by 0.4. If the study is replicated, they would find that the mortality reduction in left-handed diabetics is much closer to the mortality reduction in the overall sample of patients. (3) Researchers study the association between 40 possible risk factors and some outcome, and find that the factor with the strongest association had a correlation of 0.5 with the response. On replication, the correlation will be much lower. This result is closely related to what happens in stepwise variable selection, where the most statistically significant variables selected will have their importance (regression coefficients) greatly overstated.
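
The screening example can be mimicked by simulation. The sketch below is illustrative only: the population mean, the between-subject and within-subject standard deviations, and the normal distributions are all assumptions chosen for the example, not values from any study.

```python
import random
from statistics import median

def simulate_screening(n=20000, true_mean=160.0, sd_between=30.0,
                       sd_within=20.0, cutoff=190.0, seed=1):
    """Median first- and second-visit values among subjects who qualify at visit 1."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n):
        long_term = rng.gauss(true_mean, sd_between)    # subject's long-term level
        visit1 = long_term + rng.gauss(0.0, sd_within)  # screening measurement
        visit2 = long_term + rng.gauss(0.0, sd_within)  # independent repeat measurement
        if visit1 > cutoff:                             # selected because visit 1 was extreme
            first.append(visit1)
            second.append(visit2)
    return median(first), median(second)

median_visit1, median_visit2 = simulate_screening()
print(median_visit1, median_visit2)
```

No treatment is applied between visits, yet the median at the second visit is lower, because subjects were selected on a measurement that included favorable random error.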

relative risk or risk ratio: The ratio of the probabilities of two events. Unlike the odds ratios and hazard ratios, risk ratios are not capable of being constant but instead must depend on the base risk (e.g., the risk for a subject who does not have a risk factor). For example, a risk ratio of 2 may apply only to subjects with base risks $< \frac{1}{2}$. Also unlike the odds ratio, the risk ratio depends greatly on which of two outcomes is labeled as the “event”; a mortality ratio does not equal the survival ratio. The term relative risk is often inappropriately used to describe an odds ratio or a hazard ratio (Bull & Spiegelhalter, 1997).
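
These properties can be verified with simple arithmetic. In the sketch below the risks of 0.2 and 0.1 are made-up numbers, and the function names are hypothetical.

```python
def risk_ratio(p1, p0):
    """Ratio of two event probabilities."""
    return p1 / p0

def odds_ratio(p1, p0):
    """Ratio of the odds corresponding to two event probabilities."""
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

def max_risk_ratio(base_risk):
    """Largest possible risk ratio, since the resulting risk cannot exceed 1."""
    return 1 / base_risk

death_exposed, death_unexposed = 0.2, 0.1   # hypothetical mortality risks

mortality_rr = risk_ratio(death_exposed, death_unexposed)
survival_rr = risk_ratio(1 - death_exposed, 1 - death_unexposed)
mortality_or = odds_ratio(death_exposed, death_unexposed)
survival_or = odds_ratio(1 - death_exposed, 1 - death_unexposed)

print(mortality_rr, survival_rr)    # the survival ratio is not 1 / mortality ratio
print(mortality_or, survival_or)    # the odds ratios are reciprocals of each other
print(max_risk_ratio(0.5))          # a risk ratio of 2 is the ceiling at a base risk of 1/2
```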

replication, reproduction, robust, generalization: Reproduction means to execute what is apparently the same data analysis used by the original authors, on their data. Replication means to do the original analysis on new data. A robust result is getting largely the same result with a different analysis on the original dataset. Generalization means to operationalize the experiment and analysis differently, use new data, and get largely the same result (e.g., using a different genetics, proteomics, or imaging platform or translating a questionnaire to a different language and doing a survey in a different country). Generalization is also taken to mean validating that a treatment works similarly on patients who are different from those in a clinical study. Potential reproducibility means that the investigators have provided data manipulation and analysis code that is fully self-contained and could be executed by another person to obtain all the analytical results obtained by the original researchers.

residual: A statistical quantity that should be unrelated to certain other variables because their effects should have already been subtracted out. In ordinary multiple regression, the most commonly used residual is the difference between predicted and observed values.

restricted mean survival time: The area under a survival curve from zero until a fixed time horizon $\tau$. The RMST represents the mean time that subjects are event-free over the interval $[0, \tau]$. This is a special case of the more general multistate model which can handle multiple types of events and recurrent events in estimating state occupancy probabilities, e.g., the mean number of days alive and not in hospital.
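
For a step-function survival curve the area can be accumulated one rectangle at a time. The sketch below assumes the curve is supplied as hypothetical drop times and the survival probabilities holding from each drop time onward (for example, Kaplan-Meier estimates computed elsewhere); the function name is illustrative.

```python
def restricted_mean_survival_time(drop_times, survival_probs, tau):
    """Area under a right-continuous step survival curve from 0 to tau.

    drop_times     : increasing times at which the curve drops
    survival_probs : value of the curve from each drop time onward
    """
    area, prev_time, prev_surv = 0.0, 0.0, 1.0
    for t, s in zip(drop_times, survival_probs):
        if t >= tau:
            break
        area += prev_surv * (t - prev_time)   # rectangle up to this drop
        prev_time, prev_surv = t, s
    area += prev_surv * (tau - prev_time)     # final rectangle up to tau
    return area

times = [1.0, 2.0, 3.0]      # made-up drop times (years)
surv = [0.8, 0.5, 0.2]       # made-up survival probabilities
print(restricted_mean_survival_time(times, surv, tau=2.5))
print(restricted_mean_survival_time(times, surv, tau=4.0))
```

With these inputs the result is the mean number of event-free years per subject over the chosen horizon.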

retrospective study: A study in which subjects were already enrolled before the study was designed, or the outcome of interest has occurred before the start of the study (as in a case-control study). Such studies often have difficulties such as absence of needed adjustment (confounder) variables and missing data.

risk: Often used as another name for probability but a more accurate definition is the probability of an adverse event $\times$ the severity of the loss that experiencing that event would entail.

risk magnification: A treatment, even one for which there are no interactions with baseline covariates, that has a nonzero effect on a relative scale, by necessity must have different absolute effects. The variation of absolute differences is risk magnification due to baseline risk. Subjects having baseline risks near 0 or 1 have nowhere to go; absolute risk differences are less restricted in the middle of the baseline risk distribution. Treatments have greater absolute benefit for sicker patients, up to a point, even if their relative effects are universal.
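
A brief numerical sketch shows the pattern. It assumes, hypothetically, that treatment multiplies the odds of the outcome by 0.5 for every patient, and then computes the absolute risk reduction at several baseline risks; the function names are illustrative.

```python
def treated_risk(base_risk, odds_ratio):
    """Risk under treatment when treatment multiplies the odds by odds_ratio."""
    odds = odds_ratio * base_risk / (1 - base_risk)
    return odds / (1 + odds)

def absolute_risk_reduction(base_risk, odds_ratio):
    """Baseline risk minus risk under treatment."""
    return base_risk - treated_risk(base_risk, odds_ratio)

common_odds_ratio = 0.5   # assumed constant relative effect
for base_risk in (0.01, 0.10, 0.50, 0.90, 0.99):
    arr = absolute_risk_reduction(base_risk, common_odds_ratio)
    print(base_risk, round(arr, 4))
```

The same relative effect yields an absolute reduction of about 0.005 at a baseline risk of 0.01 but about 0.167 at a baseline risk of 0.5.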

risk set: The set of patients in the study at a specified time (Bull & Spiegelhalter, 1997).

ROC curve: When an ordinal or continuous marker is used to diagnose a binary disease, a receiver operating characteristic or ROC curve can be drawn to study the discrimination ability of the marker. The ROC curve is a plot of sensitivity vs. one minus specificity of all possible dichotomizations of the marker as the cutpoints are varied. A major problem with the ROC curve is that it tempts the researcher to publish cutpoints to somewhat arbitrarily classify patients as “diseased” and “normal”. In fact when the diagnostic analysis is based on a cohort study, the marker’s value can be converted into a post-test probability of disease allowing different physicians to use different cutpoints when the need arises (e.g., depending on available resources). Another benefit of the latter approach is that the current probability of disease also defines the probability of an error. For example, if a physician elects not to treat when the probability of disease is 0.04, the false negative probability is 0.04. The area under the ROC curve is one way to summarize the diagnostic discrimination. This area is identical to another more intuitive and easily computed measure of discrimination, the probability that in a randomly chosen pair of patients, one with and one without disease, the one with disease is the one with a higher value of the marker or post-test probability. This is also called the probability of concordance between predicted and observed disease states. A frequently used index of rank correlation, Somers’ $D_{xy}$ equals $2 \times (c - \frac{1}{2})$ where $c$ is the concordance (discrimination) probability. It is important to note that ROC curves play no role in formal decision making, as they ignore the utility (cost; loss) function or the cost of false positives and false negatives.
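
The concordance probability can be computed by direct enumeration of diseased/non-diseased pairs. In the sketch below, tied marker values are counted as one half, which is a common convention but an assumption here; the function names and the marker values are hypothetical.

```python
def concordance_probability(marker_diseased, marker_nondiseased):
    """Proportion of (diseased, non-diseased) pairs in which the diseased
    patient has the higher marker value; ties count as one half."""
    score = 0.0
    for d in marker_diseased:
        for h in marker_nondiseased:
            if d > h:
                score += 1.0
            elif d == h:
                score += 0.5
    return score / (len(marker_diseased) * len(marker_nondiseased))

def somers_dxy(c):
    """Somers' D_xy rank correlation from the concordance probability c."""
    return 2 * (c - 0.5)

diseased = [3, 5, 7]        # made-up marker values
nondiseased = [1, 2, 5]
c = concordance_probability(diseased, nondiseased)
print(c, somers_dxy(c))
```

No cutpoint on the marker is needed at any step of this calculation.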

semiparametric model: ‘Parametric’ assumptions may be made about some aspects of a model, while other components may be estimated ‘non-parametrically’. In the Cox regression procedure, a parametric model for the relative hazard is overlaid on a nonparametric (also called distribution-free) estimate of baseline hazard (Bull & Spiegelhalter, 1997). Like the proportional odds ordinal logistic model, the Cox semiparametric (proportional hazards) model is fully parametric on the right hand side, and nonparametric on the left hand (dependent variable $Y$) side. These types of semiparametric models essentially have an intercept for each distinct value of $Y$ occurring in the data, allowing for estimation of the distribution of $Y$ in a way that is very similar to the empirical cumulative distribution function, a nonparametric distribution estimator. Semiparametric models typically have an increasing number of parameters as the sample size increases (e.g., the number of intercepts or number of steps in the underlying survival curve). These parameters are not costly because they are separate from the parameters associated with the right-hand-side of the model. In other words, adding more intercepts or steps does not alter the fact that one may be testing a single slope, and does not significantly increase the effective number of degrees of freedom in a model.

sensitivity and specificity: One way to quantify the utility of a diagnostic test when both the disease and the test are binary. The sensitivity is the probability that a patient with disease will have a positive test, and the specificity is the probability that a patient without disease will have a negative test. Because these probabilities are conditional on the outcome, they are more useful for retrospective case-control studies. In general, it is more natural and useful to study variations in post-test probabilities of disease given different test results and different patient pre-test characteristics because (1) in general both the sensitivity and specificity will vary with the type of patient being diagnosed, (2) sensitivity increases with the severity of the disease present unless the disease is all-or-nothing, (3) specificity can vary with gradations in pre-clinical amount of disease, and (4) many diagnostic tests are based on continuous rather than binary measurements (Hlatky et al., 1984). Multivariable models are very useful for estimating post-test probabilities. The calibration and discrimination of the post-test probabilities can be quantified.
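
When sensitivity and specificity are treated as constants, Bayes' rule converts them into post-test probabilities. The sketch below makes exactly that simplifying assumption, which the entry above cautions may not hold across types of patients; the function name and the numeric inputs are hypothetical.

```python
def post_test_probability(pretest, sensitivity, specificity, positive=True):
    """Probability of disease given a binary test result, by Bayes' rule."""
    if positive:
        true_result = sensitivity * pretest
        false_result = (1 - specificity) * (1 - pretest)
    else:
        true_result = (1 - sensitivity) * pretest
        false_result = specificity * (1 - pretest)
    return true_result / (true_result + false_result)

# Made-up inputs: pre-test probability 0.10, sensitivity 0.90, specificity 0.80.
prob_after_positive = post_test_probability(0.10, 0.90, 0.80, positive=True)
prob_after_negative = post_test_probability(0.10, 0.90, 0.80, positive=False)
print(prob_after_positive, prob_after_negative)
```

The post-test probability depends strongly on the pre-test probability, which is one reason modeling it directly from patient characteristics is more natural.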

significance level: A preset value of $\alpha$ against which $P$-values are judged in order to reject $H_{0}$ (see Type I assertion probability). Sometimes a $P$-value itself is called the significance level.

standard deviation: A measure of the variability (spread) of measurements across subjects. The standard deviation has a simple interpretation only if the data distribution is Gaussian (normal), and in that restrictive case the mean $\pm 1.96$ standard deviations is expected to cover 0.95 of the distribution of the measurement. Standard deviation is the square root of the variance. It does not apply very well to asymmetric (skewed) distributions, and is not robust to outliers.

standard error: The standard deviation of a statistical estimator. For example, the standard deviation of a mean is called the standard error of the mean, and it equals the standard deviation of individual measurements divided by the square root of the sample size. Standard errors describe the precision of a statistical summary, not the variability across subjects. Standard errors go to zero as the sample size $\rightarrow \infty$.
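
As a short sketch of the formula for the standard error of the mean, using made-up data and hypothetical function names:

```python
import math
from statistics import stdev

def standard_error_of_mean(values):
    """Sample standard deviation divided by the square root of the sample size."""
    return stdev(values) / math.sqrt(len(values))

def standard_error_from_sd(sd, n):
    """Standard error of a mean for an assumed standard deviation sd and sample size n."""
    return sd / math.sqrt(n)

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up measurements
print(standard_error_of_mean(sample))

# Quadrupling the sample size halves the standard error.
for n in (25, 100, 400):
    print(n, standard_error_from_sd(10.0, n))
```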

state transition model: See multistate model.

statistical model: A model with identified parameters that comprises a model for the data through a probability distribution and favors additivity of effects. Examples of statistical models include ordinary linear regression with an assumption of a Gaussian distribution for the residuals, logistic regression, Cox proportional hazards regression, longitudinal models, quantile regression, ridge regression, lasso, and elastic net.

survival analysis: A branch of statistics dealing with the analysis of the time until an event such as death. Survival analysis is distinguished by its emphasis on estimating the time course of events and in dealing with censoring. See Cox model.

survival function: The probability of being free of the event at a specified time (Bull & Spiegelhalter, 1997). The survival function (also called the survival curve) is typically estimated by a Cox model, a parametric survival model, or, if there are no covariates, Kaplan-Meier estimates possibly stratified by a purely categorical baseline variable such as treatment.

survival time: Interval between the time origin and the occurrence of the event or censoring (Bull & Spiegelhalter, 1997).

symmetric distribution: One in which values to the left of the mean by a certain amount are just as likely to be observed as values to the right of the mean by the same amount. For symmetric distributions, the population mean and median are identical and the distance between the $25^\text{th}$ and $50^\text{th}$ percentiles equals the distance between the $50^\text{th}$ and $75^\text{th}$ percentiles.

time-dependent covariate: See predictor.

time origin: The beginning of the story the study aims at telling. In observational studies, the patients may come under observation before or after the time origin of the study (Bull & Spiegelhalter, 1997), but one often attempts to define time zero as date of diagnosis, initiation of exposure, or treatment. In randomized trials, the time origin is the date of randomization.

two-sided test: A test that is non-directional and that leads to a two-sided $P$-value. If the null hypothesis $H_{0}$ is that two treatments have the same mortality outcome, a two-sided alternative is that the mortality difference is nonzero. Two-sided $P$-values are larger than one-sided $P$-values (they are double if the distribution of the test statistic is symmetric). They can be thought of as a multiplicity adjustment that would allow a claim to be made that a treatment lowers or raises mortality. See also one-sided test.
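
The doubling relationship can be sketched for a test statistic with a symmetric null distribution. The example assumes a standard normal ($z$) statistic and takes the one-sided $P$-value in the direction of the observed effect; the function name is hypothetical.

```python
from statistics import NormalDist

def z_test_p_values(z):
    """One-sided (in the observed direction) and two-sided P-values for a z statistic."""
    one_sided = 1.0 - NormalDist().cdf(abs(z))
    return one_sided, 2.0 * one_sided

one_sided, two_sided = z_test_p_values(1.96)
print(one_sided, two_sided)
```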

type I assertion probability $\alpha$: Frequently confusingly labeled as a false positive probability, this is the probability of rejecting $H_{0}$ (i.e., declaring “statistical significance” — not recommended) when the null hypothesis is assumed to be true. The type I assertion probability is often called $\alpha$ and is the probability of making an assertion of an effect when any assertion of effect is by definition false. It is usually called a rate but this is not accurate. In common use, the type I probability is the probability that the nominal $P$-value will be $< 0.05$ if there is no effect. This will be 0.05 when (1) only one $P$-value is computed, (2) all model and experimental design assumptions made by the $P$-value calculation are exactly true, and (3) the $P$-value is computed exactly. See here for a detailed discussion of the distinction between assertion probabilities and decision error probabilities.

It is valid to say that $\alpha$ is the probability of indicating an effect when there is no effect, but this is much different from the probability of being wrong in asserting that an effect is present. This probability cannot be derived from a probability of asserting an effect given that the effect is zero. The probability of being wrong in asserting an effect is computed properly by taking one minus the Bayesian posterior probability of an effect being present.

type II assertion probability $\beta$: Frequently confusingly labeled as a false negative probability, this is the probability of not asserting an effect (i.e., failing to reject $H_{0}$) when there truly is a specific magnitude of effect. The type II probability is referred to as $\beta$, which is one minus the power of the test. In other words, the power of the test is $1 - \beta$. This probability $\beta$ is often wrongly called a rate.

Type II probability may be called the probability of a false negative assertion, but this is very distinct from the probability that there is an effect when one does not assert an effect. This probability cannot be derived from a probability of failing to assert an effect given the effect is at a certain nonzero level. The Bayesian posterior probability of an effect is the unconditional (except for the data) probability of a nonzero effect.

variable: A characteristic of interest that you measure, record, and analyze. Symbols for variables are traditionally $X, Y, Z, T$ (the latter standing for event times) and such symbols may stand for unspecified data values.

variance: A measure of the spread or variability of a distribution, equaling the average value of the squared difference between measurements and the population mean measurement. From a sample of measurements, the variance is estimated by the sample variance, which is the sum of squared differences from the sample mean, divided by the number of measurements minus 1. The minus 1 is a kind of “penalty” that corrects for estimating the population mean with the sample mean. Variances are typically only useful when the measurements follow a normal or at least a symmetric distribution.
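
The sample variance formula can be written out directly and checked against Python's `statistics.variance`, which uses the same divisor of the number of measurements minus 1. The data and the function name are made up for illustration.

```python
from statistics import mean, variance

def sample_variance(values):
    """Sum of squared differences from the sample mean, divided by n - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

measurements = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up measurements
print(sample_variance(measurements), variance(measurements))
```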

Other Resources

Glossary of Statistical Terms from the UC Berkeley Statistics Department
Glossary of Probability and Statistics in Wikipedia

Acknowledgments

Richard Goldstein provided valuable additions and clarifications to the glossary and additional medical statistics citations. As noted in the glossary, several definitions came from (Bull & Spiegelhalter, 1997). Thanks to Sebastian Baumeister for the definition of confounder. Raphael Peter extended the definition of rate. Rob Zinkov and Raphael Peter provided input in the definition of clinical trials. Julia Rohrer provided the essence of the definitions of reproducibility, replicability, robustness, and generalizability. Ronan Conroy improved definitions of inter-quartile range and observational study and prompted improvements on parametric model and creation of a definition for degrees of freedom. Thanks to Bryan Shepherd for pointing out the best formal definition of confounding. Andrew Spieker provided the definition for causal inference. Variable was defined by Jim Frost.

References

Brazer, S. R., Pancotto, F. S., Long III, T. T., Harrell, F. E., Lee, K. L., Tyor, M. P., & Pryor, D. B. (1991). Using ordinal logistic regression to estimate the likelihood of colorectal neoplasia. J Clin Epi, 44, 1263–1270.

Bull, K., & Spiegelhalter, D. (1997). Survival analysis in observational studies. Stat Med, 16, 1041–1074.

Cox, D. R. (1972). Regression models and life-tables (with discussion). J Roy Stat Soc B, 34, 187–220.

Hlatky, M. A., Pryor, D. B., Harrell, F. E., Califf, R. M., Mark, D. B., & Rosati, R. A. (1984). Factors affecting the sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med, 77, 64–71. http://www.sciencedirect.com/science/article/pii/0002934384904376#

Kaplan, E. L., & Meier, P. (1958). Nonparametric estimation from incomplete observations. J Am Stat Assoc, 53, 457–481.

Kotz, S., & Johnson, N. L. (Eds.). (1988). Encyclopedia of Statistical Sciences (Vol. 9). Wiley.

Pearl, J. (2009). Causal inference in statistics: An overview (pp. 96–146). http://ftp.cs.ucla.edu/pub/stat_ser/r350.pdf https://doi.org/10.1214/09-SS057 See Sections 2.1-2.3.

Spanos, A., Harrell, F. E., & Durack, D. T. (1989). Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA, 262, 2700–2707. https://doi.org/10.1001/jama.262.19.2700

Walker, S. H., & Duncan, D. B. (1967). Estimation of the probability of an event as a function of several independent variables. Biometrika, 54, 167–178.