2 General Aspects of Fitting Regression Models

Regression modeling meets many analytic needs:

Prediction, capitalizing on efficient estimation methods such as maximum likelihood and the predominant additivity in a variety of problems
- E.g.: effects of age, smoking, and air quality add to predict lung capacity
- When effects are predominantly additive, or when there aren’t too many interactions and one knows the likely interacting variables in advance, regression can beat machine learning techniques that assume interaction effects are likely to be as strong as main effects
Isolate effects of variables (especially exposure and treatment)
Hypothesis testing
Deep understanding of uncertainties associated with all model components
- Simplest example: confidence interval for the slope of a predictor
- Confidence intervals for predicted values; simultaneous confidence intervals for a series of predicted values
  - E.g.: confidence band for \(Y\) over a series of values of \(X\)

Alternative: Stratification

Cross-classify subjects on the basis of the \(X\)s, estimate a property of \(Y\) for each stratum
Only handles a small number of \(X\)s
Does not handle continuous \(X\)

Alternative: Single Trees (recursive partitioning/CART)

Interpretable because they are over-simplified and usually wrong
Cannot separate effects
Finds spurious interactions
Require huge sample size
Do not handle continuous \(X\) effectively; results in very heterogeneous nodes because of incomplete conditioning
Tree structure is unstable so insights are fragile

Alternative: Machine Learning

E.g. random forests, bagging, boosting, support vector machines, neural networks, deep learning
Allows for high-order interactions and does not require pre-specification of interaction terms
Almost automatic; can save analyst time and do the analysis in one step (long computing time)
Uninterpretable black box
Effects of individual predictors are not separable
Interaction effects (e.g., differential treatment effect = precision medicine = personalized medicine) not available
Because of not using prior information about dominance of additivity, can require 200 events per candidate predictor when \(Y\) is binary (van der Ploeg et al. (2014))
- Logistic regression may require 20 events per candidate predictor
- Can create a demand for “big data” where additive statistical models can work on moderate-size data
- See this article in Harvard Business Review for more about regression vs. complex methods
Many claims about improved predictive discrimination from ML are exaggerated - see Kapoor & Narayanan (2023)

2.1 Notation for Multivariable Regression Models

Weighted sum of a set of independent or predictor variables
Interpret parameters and state assumptions by linearizing model with respect to regression coefficients
Analysis of variance setups, interaction effects, nonlinear effects
Examining the 2 regression assumptions

Symbol	Meaning
\(Y\)	response (dependent) variable
\(X\)	\(X_{1},X_{2},\ldots,X_{p}\) – list of predictors
\(\beta\)	\(\beta_{0},\beta_{1},\ldots,\beta_{p}\) – regression coefficients
\(\beta_0\)	intercept parameter(optional)
\(\beta_{1},\ldots,\beta_{p}\)	weights or regression coefficients
\(X\beta\)	\(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p}X_{p}, X_{0}=1\)

Model: connection between \(X\) and \(Y\)
\(C(Y|X)\) : property of distribution of \(Y\) given \(X\), e.g.
\(C(Y|X) = {\rm E}(Y|X)\) or \(\Pr(Y=1|X)\).

2.2 Model Formulations

General regression model \[C(Y|X) = g(X) .\]

General linear regression model \[C(Y|X) = g(X\beta) .\] Examples

\[\begin{array}{ccc} C(Y|X) =& E(Y|X) =& X\beta, \\ Y|X &\sim N(X\beta,\sigma^{2}) & \\ C(Y|X) =& \Pr(Y=1|X) =& (1+\exp(-X\beta))^{-1} \\ \end{array}\]

Linearize: \(h(C(Y|X))=X\beta, h(u)=g^{-1}(u)\)
Example:

\[\begin{array}{ccc} C(Y|X)=\Pr(Y=1|X)&=&(1+\exp(-X\beta))^{-1} \\ h(u)=\mathrm{logit}(u)&=&\log(\frac{u}{1-u}) \\ h(C(Y|X)) &=& C'(Y|X)\ {\rm (link)} \end{array}\]

General linear regression model:
\(C'(Y|X)=X\beta\).

2.3 Interpreting Model Parameters

Suppose that \(X_{j}\) is linear and doesn’t interact with other \(X\)’s¹.

¹ Note that it is not necessary to “hold constant” all other variables to be able to interpret the effect of one predictor. It is sufficient to hold constant the weighted sum of all the variables other than \(X_{j}\). And in many cases it is not physically possible to hold other variables constant while varying one, e.g., when a model contains \(X\) and \(X^{2}\) (David Hoaglin, personal communication).

\[\begin{array}{ccc} C'(Y|X) &=& X\beta = \beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p}X_{p} \\ \beta_{j} &=& C'(Y|X_{1}, X_{2}, \ldots, X_{j}+1, \ldots, X_{p}) \\ &-& C'(Y|X_{1}, X_{2}, \ldots, X_{j}, \ldots, X_{p}) \end{array}\]

Drop \('\) from \(C'\) and assume \(C(Y|X)\) is property of \(Y\) that is linearly related to weighted sum of \(X\)’s.

2.3.1 Nominal Predictors

Nominal (polytomous) factor with \(k\) levels : \(k-1\) indicator variables. E.g. \(T=J,K,L,M\):

\[\begin{array}{ccc} C(Y|T=J) &=& \beta_{0} \nonumber\\ C(Y|T=K) &=& \beta_{0}+\beta_{1}\\ C(Y|T=L) &=& \beta_{0}+\beta_{2}\nonumber\\ C(Y|T=M) &=& \beta_{0}+\beta_{3}\nonumber . \end{array}\]

\[C(Y|T) = X\beta= \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\beta_{3} X_{3},\]

where

\[\begin{array}{ccc} X_{1} = 1 & {\rm if} \ \ T=K, & 0 \ \ {\rm otherwise} \nonumber\\ X_{2} = 1 & {\rm if} \ \ T=L, & 0 \ \ {\rm otherwise} \\ X_{3} = 1 & {\rm if} \ \ T=M, & 0 \ \ {\rm otherwise} \nonumber. \end{array}\]

The test for any differences in the property \(C(Y)\) between treatments is \(H_{0}:\beta_{1}=\beta_{2}=\beta_{3}=0\).

2.3.2 Interactions

\(X_{1}\) and \(X_{2}\), effect of \(X_{1}\) on \(Y\) depends on level of \(X_{2}\). One way to describe interaction is to add \(X_{3}=X_{1}X_{2}\) to model: \[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2} .\]

\[\begin{array}{ccc} C(Y|X_{1}+1, X_{2})&-&C(Y|X_{1}, X_{2})\nonumber\\ &=&\beta_{0}+\beta_{1} (X_{1}+1)+\beta_{2}X_{2}\nonumber\\ &+&\beta_{3} (X_{1}+1)X_{2}\\ &-&[\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}]\nonumber\\ &=&\beta_{1}+\beta_{3}X_{2} .\nonumber \end{array}\]

One-unit increase in \(X_{2}\) on \(C(Y|X)\) : \(\beta_{2}+\beta_{3} X_{1}\).

Worse interactions:

If \(X_{1}\) is binary, the interaction may take the form of a difference in shape (and/or distribution) of \(X_{2}\) vs. \(C(Y)\) depending on whether \(X_{1}=0\) or \(X_{1}=1\) (e.g. logarithm vs. square root).

This paper describes how interaction effects can be misleading. See Interaction effects need interaction controls by Uri Simonsohn for an explanation of the possible need for controlling for other interactions when the interest is on a specific interaction.

2.3.3 Example: Inference for a Simple Model

Postulate the model \(C(Y|age,sex) = \beta_{0}+\beta_{1} age + \beta_{2} [sex=f] + \beta_{3} age [sex=f]\) where \([sex=f]\) is an indicator indicator variable for sex=female, i.e., the reference cell is sex=male².

² You can also think of the last part of the model as being \(\beta_{3} X_{3}\), where \(X_{3} = age \times [sex=f]\).

Model assumes

age is linearly related to \(C(Y)\) for males,
age is linearly related to \(C(Y)\) for females, and
interaction between age and sex is simple
whatever distribution, variance, and independence assumptions are appropriate for the model being considered.

Interpretations of parameters:

Parameter	Meaning
\(\beta_{0}\)	\(C(Y \| age=0, sex=m)\)
\(\beta_{1}\)	\(C(Y \| age=x+1, sex=m) - C(Y \| age=x, sex=m)\)
\(\beta_{2}\)	\(C(Y \| age=0, sex=f) - C(Y \| age=0, sex=m)\)
\(\beta_{3}\)	\(C(Y \| age=x+1, sex=f) - C(Y \| age=x, sex=f) -\)
	\([C(Y \| age=x+1, sex=m) - C(Y \| age=x, sex=m)]\)

\(\beta_{3}\) is the difference in slopes (female – male).

When a high-order effect such as an interaction effect is in the model, be sure to interpret low-order effects by finding out what makes the interaction effect ignorable. In our example, the interaction effect is zero when age=0 or sex is male.

Hypotheses that are usually inappropriate:

\(H_{0}: \beta_{1}=0\): This tests whether age is associated with \(Y\) for males
\(H_{0}: \beta_{2}=0\): This tests whether sex is associated with \(Y\) for zero year olds

More useful hypotheses follow. For any hypothesis need to

Write what is being tested
Translate to parameters tested
List the alternative hypothesis
Don’t forget what the test is powered to detect
- Test against nonzero slope has maximum power when linearity holds
- If true relationship is monotonic, test for non-flatness will have some but not optimal power
- Test against a quadratic (parabolic) shape will have some power to detect a logarithmic shape but not against a sine wave over many cycles
Useful to write e.g. “\(H_{a}:\) age is associated with \(C(Y)\), powered to detect a linear relationship”

Most Useful Tests for Linear age \(\times\) sex Model

Null or Alternative Hypothesis	Mathematical Statement
Effect of age is independent of sex or Effect of sex is independent of age or age and sex are additive age effects are parallel	\(H_{0}: \beta_{3}=0\)
age interacts with sex age modifies effect of sex sex modifies effect of age sex and age are non-additive (synergistic)	\(H_{a}: \beta_{3} \neq 0\)
age is not associated with \(Y\)	\(H_{0}: \beta_{1}=\beta_{3}=0\)
age is associated with \(Y\) age is associated with \(Y\) for either females or males	\(H_{a}: \beta_{1} \neq 0 \textrm{~or~} \beta_{3} \neq 0\)
sex is not associated with \(Y\)	\(H_{0}: \beta_{2}=\beta_{3}=0\)
sex is associated with \(Y\) sex is associated with \(Y\) for some value of age	\(H_{a}: \beta_{2} \neq 0 \textrm{~or~} \beta_{3} \neq 0\)
Neither age nor sex is associated with \(Y\)	\(H_{0}: \beta_{1}=\beta_{2}=\beta_{3}=0\)
Either age or sex is associated with \(Y\)	\(H_{a}: \beta_{1} \neq 0 \textrm{~or~} \beta_{2} \neq 0 \textrm{~or~} \beta_{3} \neq 0\)

Note: The last test is called the global test of no association. If an interaction effect is present, there is both an age and a sex effect. There can also be age or sex effects when the lines are parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for unequal slopes.

2.3.4 Review of Composite (Chunk) Tests

In the model

Code

y ~ age + sex + weight + waist + tricep

we may want to jointly test the association between all body measurements and response, holding age and sex constant.

This 3 d.f. test may be obtained two ways:
- Remove the 3 variables and compute the change in \(SSR\) or \(SSE\)
- Test \(H_{0}:\beta_{3}=\beta_{4}=\beta_{5}=0\) using matrix algebra (e.g., anova(fit, weight, waist, tricep) if fit is a fit object created by the R rms package)

2.4 Relaxing Linearity Assumption for Continuous Predictors

2.4.1 Avoiding Categorization

Natura non facit saltus
(Nature does not make jumps)
— Gottfried Wilhelm Leibniz

Lucy D’Agostino McGowan

Relationships seldom linear except when predicting one variable from itself measured earlier
Categorizing continuous predictors into intervals is a disaster; see Royston et al. (2006), Altman (1991), Hilsenbeck & Clark (1996), Lausen & Schumacher (1996), Altman et al. (1994), Belcher (1992), Faraggi & Simon (1996), Ragland (1992), Suissa & Blais (1995), Buettner et al. (1997), Maxwell & Delaney (1993), Schulgen et al. (1994), Altman (1998), Holländer et al. (2004), Moser & Coombs (2004), Wainer (2006), Fedorov et al. (2009), Giannoni et al. (2014), Collins et al. (2016), Bennette & Vickers (2012) and Biostatistics for Biomedical Research, Chapter 18.
Some problems caused by this approach:

Estimated values have reduced precision, and associated tests have reduced power
Categorization assumes relationship between predictor and response is flat within intervals; far less reasonable than a linearity assumption in most cases
To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required
Because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding
Categorization assumes that there is a discontinuity in response as interval boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop after bad news), there are very few examples in which such discontinuities have been shown to exist.
Categorization only appears to yield interpretable estimates. E.g. odds ratio for stroke for persons with a systolic blood pressure \(> 160\) mmHg compared to persons with a blood pressure \(\leq 160\) mmHg \(\rightarrow\) interpretation of OR depends on distribution of blood pressures in the sample (the proportion of subjects \(> 170\), \(> 180\), etc.). If blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
Categorization does not condition on full information. When, for example, the risk of stroke is being assessed for a new subject with a known blood pressure (say 162~mmHg), the subject does not report to her physician “my blood pressure exceeds 160” but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
If cutpoints are determined in a way that is not blinded to the response variable, calculation of \(P\)-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. E.g.: cutpoints chosen by trial and error utilizing \(Y\), even informally \(\rightarrow\) \(P\)-values too small and CLs not accurate³.
Categorization not blinded to \(Y\) \(\rightarrow\) biased effect estimates (Altman et al. (1994), Schulgen et al. (1994))
“Optimal” cutpoints do not replicate over studies. Holländer et al. (2004) state that “… the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the ‘optimal’ cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were in included with 12 different cutpoints Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology.” Giannoni et al. (2014) demonstrated that many claimed “optimal cutpoints” are just the observed median values in the sample, which happens to optimize statistical power for detecting a separation in outcomes.
Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) \(> 30\) with BMI \(\leq 30\), another for comparing BMI \(> 28\) with BMI \(\leq 28\). Neither of these has a good definition and the two estimates are not comparable.
Cutpoints are arbitrary and manipulable; cutpoints can be found that can result in both positive and negative associations Wainer (2006).
If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.

³ If a cutpoint is chosen that minimizes the \(P\)-value and the resulting \(P\)-value is 0.05, the true type I error can easily be above 0.5 Holländer et al. (2004).

To summarize: The use of a (single) cutpoint \(c\) makes many assumptions, including:

Relationship between \(X\) and \(Y\) is discontinuous at \(X=c\) and only \(X=c\)
\(c\) is correctly found as the cutpoint
\(X\) vs. \(Y\) is flat to the left of \(c\)
\(X\) vs. \(Y\) is flat to the right of \(c\)
The choice of \(c\) does not depend on the values of other predictors

Interactive demonstration of power loss of categorization vs. straight line and quadratic fits in OLS, with varying degree of nonlinearity and noise added to \(X\) (must run in RStudio)

Interactive demonstration of lack of fit after categorization of a continuous predictor, and comparison with spline fits, by Stefan Hansen

Code

require(Hmisc)
# Install the manipulate package to make catgNoise work
getRs('catgNoise.r')

Example⁴ of misleading results from creating intervals (here, deciles) of a continuous predictor. Final interval is extremely heterogeneous and is greatly influenced by very large glycohemoglobin values, creating the false impression of an inflection point at 5.9.

⁴ From NHANES III; Diabetes Care 32:1327-34; 2009 adapted from Diabetes Care 20:1183-1197; 1997.

See this for excellent graphical examples of the harm of categorizing predictors, especially when using quantile groups.

2.4.2 Simple Nonlinear Terms

\[C(Y|X_{1}) = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{1}^{2} .\]

\(H_{0}:\) model is linear in \(X_{1}\) vs. \(H_{a}:\) model is quadratic in \(X_{1} \equiv H_{0}: \beta_{2}=0\).
Test of linearity may be powerful if true model is not extremely non-parabolic
Predictions not accurate in general as many phenomena are non-quadratic
Can get more flexible fits by adding powers higher than 2
But polynomials do not adequately fit logarithmic functions or “threshold” effects, and have unwanted peaks and valleys.

2.4.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

Draftsman’s spline: flexible strip of metal or rubber used to trace curves.
Spline Function: piecewise polynomial
Linear Spline Function: piecewise linear function
- Bi-linear regression: model is \(\beta_{0}+\beta_{1}X\) if \(X \leq a\), \(\beta_{2}+\beta_{3}X\) if \(X > a\).
- Problem with this notation: two lines not constrained to join
- To force simple continuity: \(\beta_{0} + \beta_{1}X + \beta_{2}(X-a)\times [X>a] = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\),
  where \(X_{2} = (X_{1}-a) \times [X_{1} > a]\).
- Slope is \(\beta_{1}, X \leq a\), \(\beta_{1}+\beta_{2}, X > a\).
- \(\beta_{2}\) is the slope increment as you pass \(a\)

QSee this for a nice review and information about resources in R.

More generally: \(x\)-axis divided into intervals with endpoints \(a,b,c\) (knots).

\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}(X-a)_{+}+\beta_{3}(X-b)_{+} \nonumber\\ &+& \beta_{4}(X-c)_{+} , \end{array}\]

where

\[\begin{array}{ccc} (u)_{+}=&u,&u>0 ,\nonumber\\ &0,&u\leq0 . \end{array}\]

\[\begin{array}{ccc} f(X) & = \beta_{0}+\beta_{1}X, & X\leq a \nonumber \\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a) & a<X\leq b \\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a)+\beta_{3}(X-b) & b<X\leq c \nonumber\\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a) & \nonumber\\ & + \beta_{3}(X-b)+\beta_{4}(X-c) & c<X. \nonumber \end{array}\]

Figure 2.1: A linear spline function with knots at \(a = 1, b = 3, c = 5\).

\[C(Y|X) = f(X) = X\beta,\] where \(X\beta = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\beta_{3}X_{3}+\beta_{4} X_{4}\), and

\[\begin{array}{cc} X_{1}=X & X_{2} = (X-a)_{+}\nonumber\\ X_{3}=(X-b)_{+} & X_{4} = (X-c)_{+} . \end{array}\]

Overall linearity in \(X\) can be tested by testing \(H_{0} : \beta_{2} = \beta_{3} = \beta_{4} = 0\).

2.4.4 Cubic Spline Functions

Cubic splines are smooth at knots (function, first and second derivatives agree) — can’t see joins.

\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}\nonumber\\ &+&\beta_{4}(X-a)_{+}^{3}+ \beta_{5}(X-b)_{+}^{3}+\beta_{6}(X-c)_{+}^{3}\\ &=& X\beta\nonumber \end{array}\] \[\begin{array}{cc} X_{1}=X & X_{2}=X^{2}\nonumber\\ X_{3}=X^{3} & X_{4}=(X-a)_{+}^{3}\\ X_{5}=(X-b)_{+}^{3} & X_{6}=(X-c)_{+}^{3}\nonumber . \end{array}\]

\(k\) knots \(\rightarrow k+3\) coefficients excluding intercept.

\(X^2\) and \(X^3\) terms must be included to allow nonlinearity when \(X < a\).

stats.stackexchange.com/questions/421964 has some useful descriptions of what happens at the knots, e.g.:

Knots are where different cubic polynomials are joined, and cubic splines force there to be three levels of continuity (the function, its slope, and its acceleration or second derivative (slope of the slope) do not change) at these points. At the knots the jolt (third derivative or rate of change of acceleration) is allowed to change suddenly, meaning the jolt is allowed to be discontinuous at the knots. Between knots, jolt is constant.

The following graphs show the function and its first three derivatives (all further derivatives are zero) for the function given by \(f(x) = x + x^{2} + 2x^{3} + 10(x - 0.25)^{3}_{+} - 50(x - 0.5)^{3}_{+} -100(x - 0.75)^{3}_{+}\) for \(x\) going from 0 to 1, where there are three knots, at \(x=0.25, 0.5, 0.75\).

Code

spar(bty='l', mfrow=c(4,1), bot=-1.5)
x  <- seq(0, 1, length=500)
x1 <- pmax(x - .25, 0)
x2 <- pmax(x - .50, 0)
x3 <- pmax(x - .75, 0)
b1 <- 1; b2 <- 1; b3 <- 2; b4 <- 10; b5 <- -50; b6 <- -100
y  <- b1 * x + b2 * x^2 + b3 * x^3 + b4 * x1^3 + b5 * x2^3 + b6 * x3^3
y1 <- b1     + 2*b2*x   + 3*b3*x^2 + 3*b4*x1^2 + 3*b5*x2^2 + 3*b6*x3^2
y2 <-          2*b2     + 6*b3*x   + 6*b4*x1   + 6*b5*x2   + 6*b6*x3
y3 <-                     6*b3     + 6*b4*(x1>0)+ 6*b5*(x2>0) + 6*b6*(x3>0)

g <- function() abline(v=(1:3)/4, col=gray(.85))
plot(x, y, type='l', ylab=''); g()
text(0, 1.5, 'Function', adj=0)

plot(x, y1, type='l', ylab=''); g()
text(0, -15, 'First Derivative: Slope\nRate of Change of Function',
     adj=0)
plot(x, y2, type='l', ylab=''); g()
text(0, -125, 'Second Derivative: Acceleration\nRate of Change of Slope',
     adj=0)
plot(x, y3, type='l', ylab=''); g()
text(0, -400, 'Third Derivative: Jolt\nRate of Change of Acceleration',
     adj=0)

Figure 2.2: A regular cubic spline function with three levels of continuity that prevent the human eye from detecting the knots. Also shown is the function’s first three derivatives. Knots are located at \(x=0.25, 0.5, 0.75\). For \(x\) beyond the outer knots, the function is not restricted to be linear. Linearity would imply an acceleration of zero. Vertical lines are drawn at the knots.

2.4.5 Restricted Cubic Splines

Stone & Koo (1985): cubic splines poorly behaved in tails. Constrain function to be linear in tails.
\(k+3 \rightarrow k-1\) parameters Devlin & Weeks (1986).

To force linearity when \(X < a\): \(X^2\) and \(X^3\) terms must be omitted
To force linearity when \(X\) is beyond the last knot: last two \(\beta\) s are redundant, i.e., are just combinations of the other \(\beta\) s.

The restricted spline function with \(k\) knots \(t_{1}, \ldots, t_{k}\) is given by Devlin & Weeks (1986) \[f(X) = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\ldots+\beta_{k-1} X_{k-1},\] where \(X_{1} = X\) and for \(j=1, \ldots, k-2\),

\[\begin{array}{ccc} X_{j+1} &= &(X-t_{j})_{+}^{3}-(X-t_{k-1})_{+}^{3} (t_{k}-t_{j})/(t_{k}-t_{k-1})\nonumber\\ &+&(X-t_{k})_{+}^{3} (t_{k-1}-t_{j})/(t_{k}-t_{k-1}). \end{array} \tag{2.1}\]

\(X_{j}\) is linear in \(X\) for \(X\geq t_{k}\).

For numerical behavior and to put all basis functions for \(X\) on the same scale, R Hmisc and rms package functions by default divide the terms above by \(\tau = (t_{k} - t_{1})^{2}\).

Code

require(Hmisc)
spar(mfrow=c(1,2), left=-2)
x <- rcspline.eval(seq(0,1,.01),
                   knots=seq(.05,.95,length=5), inclx=T)
xm <- x
xm[xm > .0106] <- NA
matplot(x[,1], xm, type="l", ylim=c(0,.01),
        xlab=expression(X), ylab='', lty=1)
matplot(x[,1], x,  type="l",
        xlab=expression(X), ylab='', lty=1)

Figure 2.3: Restricted cubic spline component variables for \(k = 5\) and knots at \(X = .05, .275, .5, .725\), and \(.95\). Nonlinear basis functions are scaled by \(\tau\). The left panel is a \(y\)–magnification of the right panel. Fitted functions such as those in Figure 2.4 will be linear combinations of these basis functions as long as knots are at the same locations used here.

Code

spar(left=-2, bot=2, mfrow=c(2,2), ps=13)
x <- seq(0, 1, length=300)
for(nk in 3:6) {
  set.seed(nk)
  knots <- seq(.05, .95, length=nk)
  xx <- rcspline.eval(x, knots=knots, inclx=T)
  for(i in 1 : (nk - 1))
    xx[,i] <- (xx[,i] - min(xx[,i])) /
              (max(xx[,i]) - min(xx[,i]))
  for(i in 1 : 20) {
    beta  <- 2*runif(nk-1) - 1
    xbeta <- xx %*% beta + 2 * runif(1) - 1
    xbeta <- (xbeta - min(xbeta)) /
             (max(xbeta) - min(xbeta))
    if(i == 1) {
      plot(x, xbeta, type="l", lty=1,
           xlab=expression(X), ylab='', bty="l")
      title(sub=paste(nk,"knots"), adj=0, cex=.75)
      for(j in 1 : nk)
        arrows(knots[j], .04, knots[j], -.03,
               angle=20, length=.07, lwd=1.5)
    }
    else lines(x, xbeta, col=i)
  }
}

Figure 2.4: Some typical restricted cubic spline functions for \(k = 3, 4, 5, 6\). The \(y\)–axis is \(X\beta\). Arrows indicate knots. These curves were derived by randomly choosing values of \(\beta\) subject to standard deviations of fitted functions being normalized.

Interactive demonstration of linear and cubic spline fitting, plus ordinary \(4^{th}\) order polynomial. This can be run with RStudio or in an ordinary R session.

Code

require(Hmisc)
getRs('demoSpline.r', put='rstudio')  # to see code in RStudio window
getRs('demoSpline.r')                 # to just run the code

Paul Lambert’s excellent self-contained interactive demonstrations of continuity restrictions, cubic polynomial, linear spline, cubic spline, and restricted cubic spline fitting is at pclambert.net/interactive_graphs. Jordan Gauthier has another nice interactive demonstration at drjgauthier.shinyapps.io/spliny.

See also the excellent resources from Michael Clark here and here.

Once \(\beta_{0}, \ldots, \beta_{k-1}\) are estimated, the restricted cubic spline can be restated in the form

\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}(X-t_{1})_{+}^{3}+\beta_{3}(X-t_{2})_{+}^{3}\nonumber\\ && +\ldots+ \beta_{k+1}(X-t_{k})_{+}^{3} \end{array} \tag{2.2}\]

by dividing \(\beta_{2},\ldots,\beta_{k-1}\) by \(\tau\) and computing

\[\begin{array}{ccc} \beta_{k} &=& [\beta_{2}(t_{1}-t_{k})+\beta_{3}(t_{2}-t_{k})+\ldots\nonumber\\ && +\beta_{k-1}(t_{k-2}-t_{k})]/(t_{k}-t_{k-1})\nonumber\\ \beta_{k+1} &= & [\beta_{2}(t_{1}-t_{k-1})+\beta_{3}(t_{2}-t_{k-1})+\ldots\\ && + \beta_{k-1}(t_{k-2}-t_{k-1})]/(t_{k-1}-t_{k})\nonumber . \end{array}\]

A test of linearity in X can be obtained by testing

\[H_{0} : \beta_{2} = \beta_{3} = \ldots = \beta_{k-1} = 0.\]

Example: Selvin et al. (2010)

2.4.6 Choosing Number and Position of Knots

Knots are specified in advance in regression splines
Locations not important in most situations— Stone (1986), Durrleman & Simon (1989)
Place knots where data exist — fixed quantiles of predictor’s marginal distribution
Fit depends more on choice of \(k\)

k	Quantiles

3			.10	.5	.90
4			.05	.35	.65	.95
5		.05	.275	.5	.725	.95
6	.05	.23	.41	.59	.77	.95
7	.025	.1833	.3417	.5	.6583	.8167	.975

\(n<100\) – replace outer quantiles with 5th smallest and 5th largest \(X\) (Stone & Koo (1985)).

Choice of \(k\):

Flexibility of fit vs. \(n\) and variance
Usually \(k=3,4,5\). Often \(k=4\)
Large \(n\) (e.g. \(n\geq 100\)) – \(k=5\)
Small \(n\) (\(<30\), say) – \(k=3\)
Can use Akaike’s information criterion (AIC) (Atkinson (1980), van Houwelingen & le Cessie (1990)) to choose \(k\)
This chooses \(k\) to maximize model likelihood ratio \(\chi^{2} - 2k\).

See Govindarajulu et al. (2007) for a comparison of restricted cubic splines, fractional polynomials, and penalized splines.

2.4.7 Nonparametric Regression

Estimate tendency (mean or median) of \(Y\) as a function of \(X\)
Few assumptions
Especially handy when there is a single \(X\)
Plotted trend line may be the final result of the analysis
Simplest smoother: moving average

\(X\):	1	2	3	5	8
\(Y\):	2.1	3.8	5.7	11.1	17.2

\[\begin{array}{ccc} \hat{E}(Y | X=2) &=& \frac{2.1+3.8+5.7}{3} \\ \hat{E}(Y | X=\frac{2+3+5}{3}) &=& \frac{3.8+5.7+11.1}{3} \end{array}\]

overlap OK
problem in estimating \(E(Y)\) at outer \(X\)-values
estimates very sensitive to bin width
Moving linear regression far superior to moving avg. (moving flat line)
Cleveland (1979) moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate central tendency of \(Y\) at \(X=x\):
- take all the data having \(X\) values within a suitable interval about \(x\) (default is \(\frac{2}{3}\) of the data)
- fit weighted least squares linear regression within this neighborhood
- points near \(x\) given the most weight⁵
- points near extremes of interval receive almost no weight
- loess works much better at extremes of \(X\) than moving avg.
- provides an estimate at each observed \(X\); other estimates obtained by linear interpolation
- outlier rejection algorithm built-in
loess works for binary \(Y\) — just turn off outlier detection
Other popular smoother: Friedman’s “super smoother”
For loess or supsmu amount of smoothing can be controlled by analyst
Another alternative: smoothing splines⁶
Smoothers are very useful for estimating trends in residual plots

⁵ Weight here means something different than regression coefficient. It means how much a point is emphasized in developing the regression coefficients.

⁶ These place knots at all the observed data points but penalize coefficient estimates towards smoothness.

2.4.8 Advantages of Regression Splines over Other Methods

Regression splines have several advantages (Harrell et al. (1988)):

Parametric splines can be fitted using any existing regression program
Regression coefficients estimated using standard techniques (ML or least squares), formal tests of no overall association, linearity, and additivity, confidence limits for the estimated regression function are derived by standard theory.
The fitted function directly estimates transformation predictor should receive to yield linearity in \(C(Y|X)\).
Even when a simple transformation is obvious, spline function can be used to represent the predictor in the final model (and the d.f. will be correct). Nonparametric methods do not yield a prediction equation.
Extension to non-additive models.
Multi-dimensional nonparametric estimators often require burdensome computations.

2.5 Recursive Partitioning: Tree-Based Models

Breiman et al. (1984): CART (Classification and Regression Trees) — essentially model-free

Method:

Find predictor so that best possible binary split has maximum value of some statistic for comparing 2 groups
Within previously formed subsets, find best predictor and split maximizing criterion in the subset
Proceed in like fashion until \(<k\) obs. remain to split
Summarize \(Y\) for the terminal node (e.g., mean, modal category)
Prune tree backward until it cross-validates as well as its “apparent” accuracy, or use shrinkage

Advantages/disadvantages of recursive partitioning:

Does not require functional form for predictors
Does not assume additivity — can identify complex interactions
Can deal with missing data flexibly
Interactions detected are frequently spurious
Does not use continuous predictors effectively
Penalty for overfitting in 3 directions
Often tree doesn’t cross-validate optimally unless pruned back very conservatively
Very useful in messy situations or those in which overfitting is not as problematic (confounder adjustment using propensity scores Cook & Goldman (1988); missing value imputation)

See Austin et al. (2010).

2.5.1 New Directions in Predictive Modeling

The approaches recommended in this course are

fitting fully pre-specified models without deletion of “insignificant” predictors
using data reduction methods (masked to \(Y\)) to reduce the dimensionality of the predictors and then fitting the number of parameters the data’s information content can support
use shrinkage (penalized estimation) to fit a large model without worrying about the sample size.

The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when using a two-stage (reduction/model fitting) approach, Newer approaches are evolving, including the following. These new approach handle continuous predictors well, unlike recursive partitioning.

lasso (shrinkage using L1 norm favoring zero regression coefficients) - Tibshirani (1996), Steyerberg et al. (2000)
elastic net (combination of L1 and L2 norms that handles the \(p > n\) case better than the lasso) Zou & Hastie (2005)
adaptive lasso H. H. Zhang & Lu (2007), H. Wang & Leng (2007)
more flexible lasso to differentially penalize for variable selection and for regression coefficient estimation (Radchenko & James (2008))
group lasso to force selection of all or none of a group of related variables (e.g., indicator variables representing a polytomous predictor)
group lasso-like procedures that also allow for variables within a group to be removed (S. Wang et al. (2009))
sparse-group lasso using L1 and L2 norms to achieve spareness on groups and within groups of variables (N. Simon et al. (2013))
adaptive group lasso (Wang & Leng)
Breiman’s nonnegative garrote (Xiong (2010))
“preconditioning”, i.e., model simplification after developing a “black box” predictive model - Paul et al. (2008),Nott & Leng (2010)
sparse principal components analysis to achieve parsimony in data reduction Witten & Tibshirani (2008),Zhou et al. (2006),Leng & Wang (2009),Lee et al. (2010)
bagging, boosting, and random forests T. Hastie et al. (2008)

One problem prevents most of these methods from being ready for everyday use: they require scaling predictors before fitting the model. When a predictor is represented by nonlinear basis functions, the scaling recommendations in the literature are not sensible. There are also computational issues and difficulties obtaining hypothesis tests and confidence intervals.

When data reduction is not required, generalized additive models T. J. Hastie & Tibshirani (1990), Wood (2006) should also be considered.

2.5.2 Choosing Between Machine Learning and Statistical Modeling

Statistical models allow for complexity (nonlinearity, interaction)
Easy to allow every predictor to have nonlinear effect
Easy to handle unlimited numbers of candidate predictors if assume additivity (e.g., using ridge regression, lasso, elastic net)
Interactions should be pre-specified
Machine learning is gaining attention but is oversold in some settings
Researchers are under the mistaken impression that machine learning can be used on small samples
See A brief tour of deep learning from a statistical perspective by E Nalisnick, P Smyth, D Tran, 2023.

blogblog

Considerations in Choosing One Approach over Another

A statistical model may be the better choice if

Uncertainty is inherent and the signal:noise ratio is not large; even with identical twins, one twin may get colon cancer and the other not; model tendencies instead of doing classification
One could never have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error
One wants to isolate effects of a small number of variables
Uncertainty in an overall prediction or the effect of a predictor is sought
Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be pre-specified
The sample size isn’t huge
One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor
One wants the entire model to be interpretable

Machine learning may be the better choice if

The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an “E” or not an “E”
The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)
Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)
One is not very interested in estimating uncertainty in forecasts or in effects of select predictors
Non-additivity is expected to be strong and can’t be isolated to a few pre-specified variables (e.g., in visual pattern recognition the letter “L” must have both a dominating vertical component and a dominating horizontal component)
The sample size is huge (van der Ploeg et al. (2014))
One does not need to isolate the effect of a special variable such as treatment
One does not care that the model is a “black box”

But see this for navigating resources exposing problems in ML applications such as the following:

Weak design and methods
Poor reporting
High risk of bias
Spin
Non-open science

2.6 Multiple Degree of Freedom Tests of Association

\[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}^{2} ,\] \(H_{0}: \beta_{2}=\beta_{3}=0\) with 2 d.f. to assess association between \(X_{2}\) and outcome.

In the 5-knot restricted cubic spline model \[C(Y|X) = \beta_{0}+\beta_{1}X+\beta_{2}X'+\beta_{3}X''+\beta_{4}X''' ,\] \(H_{0}: \beta_{1}=\ldots=\beta_{4}=0\)

Test of association: 4 d.f.
Insignificant \(\rightarrow\) dangerous to interpret plot
What to do if 4 d.f. test insignificant, 3 d.f. test for linearity insig., 1 d.f. test sig. after delete nonlinear terms?

Grambsch & O’Brien (1991) elegantly described the hazards of pretesting

Studied quadratic regression
Showed 2 d.f. test of association is nearly optimal even when regression is linear if nonlinearity entertained
Considered ordinary regression model
\(E(Y|X) =\beta_{0}+\beta_{1}X+\beta_{2}X^{2}\)
Two ways to test association between \(X\) and \(Y\)
Fit quadratic model and test for linearity (\(H_{0}: \beta_{2}=0\))
\(F\)-test for linearity significant at \(\alpha=0.05\) level \(\rightarrow\) report as the final test of association the 2 d.f. \(F\) test of \(H_{0}: \beta_{1}=\beta_{2}=0\)
If the test of linearity insignificant, refit without the quadratic term and final test of association is 1 d.f. test, \(H_{0}:\beta_{1}=0 | \beta_{2}=0\)
Showed that type I error \(> \alpha\)
Fairly accurate \(P\)-value obtained by instead testing against \(F\) with 2 d.f. even at second stage
Cause: are retaining the most significant part of \(F\)
BUT if test against 2 d.f. can only lose power when compared with original \(F\) for testing both \(\beta\)s
\(SSR\) from quadratic model \(> SSR\) from linear model

2.7 Assessment of Model Fit

2.7.1 Regression Assumptions

The general linear regression model is \[C(Y|X) = X\beta =\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{k}X_{k} .\] Verify linearity and additivity. Special case: \[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2},\] where \(X_{1}\) is binary and \(X_{2}\) is continuous.

Figure 2.5: Regression assumptions for one binary and one continuous predictor

Methods for checking fit:

1. Fit simple linear additive model and check examine residual plots for patterns

For OLS: box plots of \(e\) stratified by \(X_{1}\), scatterplots of \(e\) vs. \(X_{2}\) and \(\hat{Y}\), with trend curves (want flat central tendency, constant variability)
For normality, qqnorm plots of overall and stratified residuals

Advantage: Simplicity

Disadvantages:

Can only compute standard residuals for uncensored continuous response
Subjective judgment of non-randomness
Hard to handle interaction
Hard to see patterns with large \(n\) (trend lines help)
Seeing patterns does not lead to corrective action

2. Scatterplot of \(Y\) vs. \(X_{2}\) using different symbols according to values of \(X_{1}\)

Advantages: Simplicity, can see interaction

Disadvantages:

Scatterplots cannot be drawn for binary, categorical, or censored \(Y\)
Patterns difficult to see if relationships are weak or \(n\) large

3. Stratify the sample by \(X_{1}\) and quantile groups (e.g. deciles) of \(X_{2}\); estimate \(C(Y|X_{1},X_{2})\) for each stratum

Advantages: Simplicity, can see interactions, handles censored \(Y\) (if you are careful)

Disadvantages:

Requires large \(n\)
Does not use continuous var. effectively (no interpolation)
Subgroup estimates have low precision
Dependent on binning method

4. Separately for levels of \(X_{1}\) fit a nonparametric smoother relating \(X_{2}\) to \(Y\)

Advantages: All regression aspects of the model can be summarized efficiently with minimal assumptions

Disadvantages:

Does not apply to censored \(Y\)
Hard to deal with multiple predictors

5. Fit flexible nonlinear parametric model

Advantages:

One framework for examining the model assumptions, fitting the model, drawing formal inference
d.f. defined and all aspects of statistical inference “work as advertised”

Disadvantages:

Complexity
Generally difficult to allow for interactions when assessing patterns of effects

Confidence limits, formal inference can be problematic for methods 1-4.

Restricted cubic spline works well for method 5.

\[\begin{array}{ccc} \hat{C}(Y|X) &=& \hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\hat{\beta}_{2}X_{2}+\hat{\beta}_{3}X_{2}'+\hat{\beta}_{4}X_{2}'' \nonumber\\ &=& \hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\hat{f}(X_{2}) , \end{array}\]

where \[\hat{f}(X_{2}) = \hat{\beta}_{2}X_{2}+\hat{\beta}_{3}X_{2}'+\hat{\beta}_{4}X_{2}'' ,\] \(\hat{f}(X_{2})\) spline-estimated transformation of \(X_{2}\).

Plot \(\hat{f}(X_{2})\) vs. \(X_{2}\)
\(n\) large \(\rightarrow\) can fit separate functions by \(X_{1}\)
Test of linearity: \(H_{0}:\beta_{3}=\beta_{4}=0\)
Few good reasons to do the test other than to demonstrate that linearity is not a good default assumption
Nonlinear \(\rightarrow\) use transformation suggested by spline fit or keep spline terms
Tentative transformation \(g(X_{2})\) \(\rightarrow\) check adequacy by expanding \(g(X_{2})\) in spline function and testing linearity
Can find transformations by plotting \(g(X_{2})\) vs. \(\hat{f}(X_{2})\) for variety of \(g\)
Multiple continuous predictors \(\rightarrow\) expand each using spline
Example: assess linearity of \(X_{2}, X_{3}\)

\[\begin{array}{ccc} C(Y|X) &=& \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}'+\beta_{4}X_{2}'' \nonumber\\ &+& \beta_{5}X_{3}+\beta_{6}X_{3}'+\beta_{7}X_{3}'' , \end{array}\]

Overall test of linearity \(H_{0}: \beta_{3}=\beta_{4}=\beta_{6}=\beta_{7}=0\), with 4 d.f.

2.7.2 Modeling and Testing Complex Interactions

Note: Interactions will be misleading if main effects are not properly modeled (M. Zhang et al. (2020)).

Suppose \(X_1\) binary or linear, \(X_2\) continuous:

\[\begin{array}{ccc} C(Y|X) & = & \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}'+\beta_{4}X_{2}'' \\ \nonumber &&+\beta_{5}X_{1}X_{2}+\beta_{6}X_{1}X_{2}'+\beta_{7}X_{1}X_{2}'' \end{array}\]

Simultaneous test of linearity and additivity: \(H_{0}: \beta_{3} = \ldots = \beta_{7} = 0\).

2 continuous variables: could transform separately and form simple product
But transformations depend on whether interaction terms adjusted for, so it is usually not possible to estimate transformations and interaction effects other than simultaneously
Compromise: Fit interactions of the form \(X_{1} f(X_{2})\) and \(X_{2} g(X_{1})\):

\[\begin{array}{ccc} C(Y|X) & = & \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{1}'+\beta_{3}X_{1}'' \nonumber\\ &+& \beta_{4}X_{2}+\beta_{5}X_{2}'+\beta_{6}X_{2}'' \nonumber\\ &+& \beta_{7}X_{1}X_{2}+\beta_{8}X_{1}X_{2}'+\beta_{9}X_{1}X_{2}'' \\ &+& \beta_{10}X_{2}X_{1}'+\beta_{11}X_{2}X_{1}'' \nonumber \end{array} \tag{2.3}\]

Test of additivity is \(H_{0}: \beta_{7} = \beta_{8} = \ldots = \beta_{11} = 0\) with 5 d.f.
Test of lack of fit for the simple product interaction with \(X_{2}\) is \(H_{0}: \beta_{8} = \beta_{9} = 0\)
Test of lack of fit for the simple product interaction with \(X_{1}\) is \(H_{0}: \beta_{10} = \beta_{11} = 0\)

General spline surface:

Cover \(X_{1} \times X_{2}\) plane with grid and fit patch-wise cubic polynomial in two variables
Restrict to be of form \(aX_{1}+bX_{2}+cX_{1}X_{2}\) in corners
Uses all \((k-1)^{2}\) cross-products of restricted cubic spline terms
See Gray (1992), Gray (1994) for penalized splines allowing control of effective degrees of freedom. See Berhane et al. (2008) for a good discussion of tensor splines.

Figure 2.6: Logistic regression estimate of probability of a hemorrhagic stroke for patients in the GUSTO-I trial given \(t\)-PA, using a tensor spline of two restricted cubic splines and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot) regions are higher risk.

Figure 2.6 is particularly interesting because the literature had suggested (based on approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke whereas this flexible modeling approach (based on approximately 230 strokes) suggests that mean arterial blood pressure (roughly a \(45^\circ\) line) is what is most important over a broad range of blood pressures. At the far right one can see that pulse pressure (axis perpendicular to \(45^\circ\) line) may have an impact although a non-monotonic one.

Other issues:

\(Y\) non-censored (especially continuous) \(\rightarrow\) multi-dimensional scatterplot smoother (Chambers & Hastie (1992))
Interactions of order \(>2\): more trouble
2-way interactions among \(p\) predictors: pooled tests
\(p\) tests each with \(p-1\) d.f.

Some types of interactions to pre-specify in clinical studies:

Treatment \(\times\) severity of disease being treated
Age \(\times\) risk factors
Age \(\times\) type of disease
Measurement \(\times\) state of a subject during measurement
Race \(\times\) disease
Calendar time \(\times\) treatment
Quality \(\times\) quantity of a symptom
Measurement \(\times\) amount of deterioration of the measurement

The last example is worth expanding as an example in model formulation. Consider the following study.

A sample of patients seen over several years have a blood sample taken at time of hospitalization
Blood samples are frozen
Long after the last patient was sampled, the blood samples are thawed all in the same week and a blood analysis is done
It is known that the quality of the blood analysis deteriorates roughly logarithmically by the age of the sample; blood measurements made on old samples are assumed to be less predictive of outcome
This is reflected in an interaction between a function of sample age and the blood measurement B⁷
Patients were followed for an event, and the outcome variable of interest is the time from hospitalization to that event
To not assume a perfect logarithmic relationship for sample age on the effect of the blood measurement, a restricted cubic spline model with 3 default knots will be fitted for log sample age
Sample age is assumed to not modify the effects of non-blood predictors patient age and sex
Model may be specified the following way using the R rms package to fit a Cox proportional hazards model
Test for nonlinearity of sampleAge tests the adequacy of assuming a plain logarithmic trend in sample age

⁷ For continuous \(Y\) one might need to model the residual variance of \(Y\) as increasing with sample age, in addition to modeling the mean function.

Code

f <- cph(Surv(etime, event) ~ rcs(log(sampleAge), 3) * rcs(B, 4) +
           rcs(age, 5) * sex, data=mydata)

The B \(\times\) sampleAge interaction effects have 6 d.f. and tests whether the sample deterioration affects the effect of B. By not assuming that B has the same effect for old samples as for young samples, the investigator will be able to estimate the effect of B on outcome when the blood analysis is ideal by inserting sampleAge = 1 day when requesting predicted values as a function of B.

2.7.3 Fitting Ordinal Predictors

Small no. categories (3-4) \(\rightarrow\) polytomous factor, indicator variables
Design matrix for easy test of adequacy of initial codes \(\rightarrow\) \(k\) original codes + \(k-2\) indicators
More categories \(\rightarrow\) score using data-driven trend. Later tests use \(k-1\) d.f. instead of 1 d.f.
E.g., compute logit(mortality) vs. category
Much better: used penalized maximum likelihood estimation (R ordSmooth package) or Bayesian shrinkage (R brms package).

2.7.4 Distributional Assumptions

Some models (e.g., logistic): all assumptions in \(C(Y|X)=X\beta\) (implicitly assuming no omitted variables!)
Parametric models make an absolute distributional assumption
- E.g. linear regression: \(Y \sim X\beta + \epsilon, \epsilon \sim n(0,\sigma^{2})\)
- Examine distribution of residuals
Some models (Weibull, Cox (1972)):
\(C(Y|X) = C(Y=y|X) = d(y) + X\beta\)
\(C =\) log hazard
Check form of \(d(y)\) for parametric models such as Weibull
For semiparametric models (Cox and ordinal regression) \(d(y)\) does not need to be specified
- \(d(y)\) is estimated by a step function (is link function of empirical CDF when \(\beta=0\))
Show \(d(y)\) does not interact with \(X\) (parallelism assumption)

2.8 Complex Curve Fitting Example

Restricted cubic spline
Discontinuity (interrupted time series analysis)
Cyclic trend (seasonality)
Data from academic.oup.com/ije/article/46/1/348/2622842 by Bernal et al. (2017)
Rates of hospital admissions for acute coronary events in Sicily before and after a smoking ban on 2005-01
Poisson regression on case counts, adjusted for population size as an offset variable (analyzes event rate) (see stats.stackexchange.com/questions/11182)
Classic interrupted time series puts a discontinuity at the intervention point and assesses statistical evidence for a nonzero jump height
We will do that and also fit a continuous cubic spline but with multiple knots near the intervention point
All in context of long-term and seasonal trends
Uses the rms package gTrans function documented at hbiostat.org/R/rms/gtrans.html
Can do this without gTrans but harder to plot predicted values, get contrasts, and get chunk tests
Time variable is months after 2001-12-01

Code

require(rms)  # engages rms which also engages Hmisc which provides getHdata
options(prType='html')   # applies to printing model fits
getHdata(sicily)   # fetch dataset from hbiostat.org/data
d    <- sicily
dd   <- datadist(d);  options(datadist='dd')

Start with a standard restricted cubic spline fit, 6 knots at default quantile locations. From the fitted Poisson model we estimate the number of cases per a constant population size of 100,000.

Code

require(ggplot2)
g   <- function(x) exp(x) * 100000
off <- list(stdpop=mean(d$stdpop))    # offset for prediction (383464.4)
w   <- geom_point(aes(x=time, y=rate), data=d)
v   <- geom_vline(aes(xintercept=37, col=I('red')))
yl  <- ylab('Acute Coronary Cases Per 100,000')
f   <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6),
           data=d, family='poisson')
f$aic

[1] 721.5237

Code

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

To add seasonality to the model can add sine and/or cosine terms
See pmean.com/07/CyclicalTrends.html by Steve Simon
If you knew the month at which incidence is a minimum, could just add a sine term to the model
Adding both sine and cosine terms effectively allows for a model parameter that estimates the time origin
Assume the period (cycle length) is 12m

Code

# Save knot locations
k  <- attr(rcs(d$time, 6), 'parms')
k

[1]  5.00 14.34 24.78 35.22 45.66 55.00

Code

kn <- k
# rcspline.eval is the rcs workhorse
h <- function(x) cbind(rcspline.eval(x, kn),
                       sin=sin(2*pi*x/12), cos=cos(2*pi*x/12))
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family='poisson')
f$aic

[1] 674.112

Code

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

Next add more knots near intervention to allow for sudden change

Code

kn <- sort(c(k, c(36, 37, 38)))
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family='poisson')
f$aic

[1] 661.7904

Code

ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

Now make the slow trend simpler (6 knots) but add a discontinuity at the intervention. More finely control times at which predictions are requested, to handle discontinuity.

Code

h <- function(x) {
  z <- cbind(rcspline.eval(x, k),
             sin=sin(2*pi*x/12), cos=cos(2*pi*x/12),
             jump=x >= 37)
  attr(z, 'nonlinear') <- 2 : ncol(z)
  z
    }
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family='poisson', x=TRUE, y=TRUE) # x, y for LRTs
f$aic

[1] 659.6044

Code

times <- sort(c(seq(0, 60, length=200), 36.999, 37, 37.001))
ggplot(Predict(f, time=times, fun=g, offset=off)) + w + v + yl

Look at fit statistics especially evidence for the jump

Code

General Linear Model

Glm(formula = aces ~ offset(log(stdpop)) + gTrans(time, h), family = "poisson", 
    data = d, x = TRUE, y = TRUE)

	Model Likelihood Ratio Test
Obs 59	LR χ² 169.64
Residual d.f. 51	d.f. 7
g 0.080	Pr(>χ²) <0.0001

	β	S.E.	Wald Z	Pr(>\|Z\|)
Intercept	-6.2118	0.0095	-656.01	<0.0001
time	0.0635	0.0113	5.63	<0.0001
time'	-0.1912	0.0433	-4.41	<0.0001
time''	0.2653	0.0760	3.49	0.0005
time'''	-0.2409	0.0925	-2.61	0.0092
sin	0.0343	0.0067	5.11	<0.0001
cos	0.0380	0.0065	5.86	<0.0001
jump	-0.1268	0.0313	-4.06	<0.0001

Evidence for an intervention effect (jump)
Evidence for seasonality
Could have analyzed rates using a semiparametric model

Compute likelihood ratio \(\chi^2\) test statistics for this model

Code

anova(f, test='LR')

	χ²	d.f.	P
Likelihood Ratio Statistics for `aces`
time	169.64	7	<0.0001
Nonlinear	127.03	6	<0.0001
TOTAL	169.64	7	<0.0001

Get a joint LR test of seasonality and discontinuity by omitting 3 terms from the model

Code

g <- Glm(aces ~ offset(log(stdpop)) + rcs(time, k),
         data=d, family='poisson')
lrtest(f, g)


Model 1: aces ~ offset(log(stdpop)) + gTrans(time, h)
Model 2: aces ~ offset(log(stdpop)) + rcs(time, k)

  L.R. Chisq         d.f.            P 
6.591931e+01 2.000000e+00 4.884981e-15

2.9 Contrasts and Model Reparameterization

Contrasts allow one to estimate and test any kind of effects that are napped to the model’s regression parameters. Examples in Section 2.10.2 show how far you can take this idea, including double difference contrasts for estimating interaction effects or degree of nonlinearity. The rms package contrast.rms function makes it easy to estimate contrasts to get estimates such as the effect of age changing from 30 to 50 for males. Contrasts are just differences in predicted values, so all that’s needed is to specify the design matrices for one condition vs. another and to subtract these two matrices and multiply this difference by \(\hat{\beta}\). A standard formula is used to compute standard errors of such contrasts which leads to Wald confidence intervals.

Since Wald confidence intervals for contrasts are so easy to compute, it is tempting to always compute contrasts in an after-fitting step. This is optimal for bootstrapping, or for Bayesian models where the contrast is computed separately for each posterior draw, leading to posterior samples of the contrasts, which are summarized exactly as basic parameters are summarized (posterior median, highest posterior density uncertainty interval, etc.). But in a frequentist setting outside of a standard Gaussian linear model, Wald confidence intervals are not ideal. For example the sampling distribution of the contrast may be asymmetric, so symmetric confidence intervals may have poor coverage on at least one side. In addition, Wald confidence intervals may inconsistent with gold-standard likelihood ratio \(\chi^2\) tests, whereas profile likelihood confidence intervals are consistent with them.

To compute profile likelihood confidence intervals on a contrast, one must reparameterize the model with \(p\) parameters so that the contrast is represented by a single variable, with the model’s \(p-1\) parameters reparameterized so that the contrast and the \(p-1\) parameters span the same space as the original model. This allows profile likelihood algorithms to fix the contrast “parameter” at a given constant \(c\), compute maximum likelihood estimates of the remaining new parameters, and repeat this process many times to solve for the two values of \(c\) such that the interval between them is the set of values that if hyothesized to be true would not lead to rejection of the null using the likelihood ratio \(\chi^2\) test.

How does one derive the “contrast variable” and the remaining \(p-1\) variables? How can the contrast variable be constructed so that its estimated regression coefficient is exactly equal to the estimated contrast? This is a useful exercise in understanding regression model algebra even if not using profile likelihoods.

Consider a very simple case where the model is \(E(Y | X=x) = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} = f(x_1, x_2)\), where the vector \(x\) is \([x_1, x_2]\). Suppose the contrast of interest is just \(\beta_1\), i.e., the contrast is \(f(1, a) - f(0, a)\). The setting for \(x_2\), \(a\), is irrelevant and will cancel out since \(x_1\) and \(x_2\) do not interact. Se we are not reparameterizing the model; the matrix multiplication that translates from the original \(x\) space to the new space involves a \(2\times 2\) identity matrix, and one fits the model with the original \(x\) values to get the contrast of interest as the \(\hat{\beta_1}\).

Next consider a more meaningful case there the contrast of interest is \(\beta_1 - \beta_2\). Intuition tells us that the rephrasing of parameters needed for the model’s second variable is \(\beta_1 + \beta_2\) which is orthogonal to the first combination. Let’s use a general procedure to derive this second parameter. If we augment the first new parameter with the identity matrix that represents the original model, a singular value decomposition will yield the second variable that will make the modified model span the same space as the original. The R code below demonstrates this.

Code

p <- 2                   # number of non-intercept parameters
D <- c(1, -1)            # contrast of interest
C <- rbind(D, diag(p))   # diag(p) is 2x2 identity matrix
C

  [,1] [,2]
D    1   -1
     1    0
     0    1

Code

s <- svd(C)
s

$d
[1] 1.732051 1.000000

$u
           [,1]          [,2]
[1,] -0.8164966  1.855775e-16
[2,] -0.4082483 -7.071068e-01
[3,]  0.4082483 -7.071068e-01

$v
           [,1]       [,2]
[1,] -0.7071068 -0.7071068
[2,]  0.7071068 -0.7071068

Code

v <- s$v[, 2 : p, drop=FALSE]
v

           [,1]
[1,] -0.7071068
[2,] -0.7071068

The coefficients of \(x_1\) and \(x_2\) in v are the same, so the second parameter is associated with the sum of \(x_1\) and \(x_2\), with an arbitrary scaling constant.

Now consider a more interested case. A model has a predictor \(x\) that is modeled with an ordinary cubic polynomial plus a discontinuity at \(x=3\) and includes another covariate \(z\):

\[ \begin{aligned} E(Y | x, z) = f(x, z) &= \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \\ &+ \beta_4 [x > 3] + \beta_5 z \end{aligned} \]

Here \([s]\) is a 0/1 indicator variable for the truth of assertion \(s\). Our contrast of interest is \(f(3, a) - f(1, a)\) which is \(2\beta_1 + 8\beta_2 + 26\beta_3\). \(a\) cancels out and \(\beta_4\) is not involved either since \(x \leq 3\) for both contrast predictor settings. We need to create a new variable \(u\) that, when adjusted for the correct parameterization of the other variables, will result in a regression coefficient that is exactly \(2\beta_1 + 8\beta_2 + 26\beta_3\) in the original \(\beta\)s. Try \(u = (\frac{x}{2} + \frac{x^2}{8} + \frac{x^3}{26}) / 3\). Simulate some data and use the singular value decomposition to solve for the matrix to multiply the design matrix by to reparameterize the other parts of the model. Below xc (x contrast variable) is \(u\).

Code

set.seed(1)
n   <- 100; b0=3; b1=1; b2=.2; b3=.05; b4=1; b5=2
x   <- 2 * rnorm(n)
z   <- rnorm(n)
res <- 2 * rnorm(n)
d  <- data.frame(x, z)
dd <- datadist(d); options(datadist='dd')
y  <- b0 + b1*x + b2*x^2 + b3*x^3 + b4*(x > 3) + b5*z + res
# Form a general transformation function fo x
g <- function(x) cbind(x=x, x2=x^2, x3=x^3, x4=x > 3)
f  <- orm(y ~ gTrans(x, g) + z, x=TRUE, y=TRUE)
f

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = y ~ gTrans(x, g) + z, x = TRUE, y = TRUE)

	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs 100	LR χ² 149.06	R² 0.775	ρ 0.833
ESS 100	d.f. 5	R²_5,100 0.763	D_xy 0.651
Distinct Y 100	Pr(>χ²) <0.0001	R²_5,100 0.763
Y_0.5 3.747608	Score χ² 245.33	\|Pr(Y ≥ median)-½\| 0.333
max \|∂log L/∂β\| 9×10^-13	Pr(>χ²) <0.0001

	β	S.E.	Wald Z	Pr(>\|Z\|)
x	0.4238	0.2150	1.97	0.0487
x2	0.2207	0.0800	2.76	0.0058
x3	0.1179	0.0304	3.88	0.0001
x4	1.6193	1.7626	0.92	0.3583
z	1.7016	0.2337	7.28	<0.0001

Code

# For the contrast matrix setting z arbitrarily to pi
X1 <- predict(f, data.frame(x=1, z=pi), type='x')
X2 <- predict(f, data.frame(x=3, z=pi), type='x')
rbind(X1, X2)

  gTrans(x, g)x gTrans(x, g)x2 gTrans(x, g)x3 gTrans(x, g)x4        z
1             1              1              1              0 3.141593
1             3              9             27              0 3.141593

Code

D <- X2 - X1
D

  gTrans(x, g)x gTrans(x, g)x2 gTrans(x, g)x3 gTrans(x, g)x4 z
1             2              8             26              0 0

Code

D %*% coef(f)[- (1 : num.intercepts(f))]

      [,1]
1 5.678912

Code

p <- ncol(D)
C <- rbind(D, diag(p)); C

  gTrans(x, g)x gTrans(x, g)x2 gTrans(x, g)x3 gTrans(x, g)x4 z
1             2              8             26              0 0
              1              0              0              0 0
              0              1              0              0 0
              0              0              1              0 0
              0              0              0              1 0
              0              0              0              0 1

Code

s <- svd(C); s

$d
[1] 27.29469  1.00000  1.00000  1.00000  1.00000

$u
            [,1]          [,2] [,3] [,4]          [,5]
[1,] 0.999328634 -2.499025e-18    0    0 -1.260631e-16
[2,] 0.002686367  4.998049e-18    0    0  9.973082e-01
[3,] 0.010745469 -9.557790e-01    0    0 -2.156342e-02
[4,] 0.034922775  2.940858e-01    0    0 -7.008112e-02
[5,] 0.000000000  0.000000e+00   -1    0  0.000000e+00
[6,] 0.000000000  0.000000e+00    0   -1  0.000000e+00

$v
           [,1]       [,2] [,3] [,4]        [,5]
[1,] 0.07332356  0.0000000    0    0  0.99730821
[2,] 0.29329423 -0.9557790    0    0 -0.02156342
[3,] 0.95320625  0.2940858    0    0 -0.07008112
[4,] 0.00000000  0.0000000   -1    0  0.00000000
[5,] 0.00000000  0.0000000    0   -1  0.00000000

Code

# Need to account for a scaling constant used in SVD
v <- s$v * sqrt(sum(D ^ 2))
v

     [,1]       [,2]      [,3]      [,4]       [,5]
[1,]    2   0.000000   0.00000   0.00000 27.2029410
[2,]    8 -26.070176   0.00000   0.00000 -0.5881717
[3,]   26   8.021592   0.00000   0.00000 -1.9115580
[4,]    0   0.000000 -27.27636   0.00000  0.0000000
[5,]    0   0.000000   0.00000 -27.27636  0.0000000

Code

v <- s$v / sqrt(sum(D ^ 2))   # do the reverse when solving for x
v

            [,1]        [,2]        [,3]        [,4]          [,5]
[1,] 0.002688172  0.00000000  0.00000000  0.00000000  0.0365630928
[2,] 0.010752688 -0.03504056  0.00000000  0.00000000 -0.0007905534
[3,] 0.034946237  0.01078171  0.00000000  0.00000000 -0.0025692984
[4,] 0.000000000  0.00000000 -0.03666178  0.00000000  0.0000000000
[5,] 0.000000000  0.00000000  0.00000000 -0.03666178  0.0000000000

Code

# Get original design matrix and rotate it
# Contrast of interest is in the 1st column, all other columns
# are orthogonal to that
X <- f$x
Z <- X %*% v
orm(y ~ Z)

Logistic (Proportional Odds) Ordinal Regression Model

orm(formula = y ~ Z)

	Model Likelihood Ratio Test	Discrimination Indexes	Rank Discrim. Indexes
Obs 100	LR χ² 149.06	R² 0.775	ρ 0.833
ESS 100	d.f. 5	R²_5,100 0.763	D_xy 0.651
Distinct Y 100	Pr(>χ²) <0.0001	R²_5,100 0.763
Y_0.5 3.747608	Score χ² 245.33	\|Pr(Y ≥ median)-½\| 0.333
max \|∂log L/∂β\| 9×10^-13	Pr(>χ²) <0.0001

	β	S.E.	Wald Z	Pr(>\|Z\|)
Z[1]	5.6789	1.0985	5.17	<0.0001
Z[2]	-4.8087	1.9667	-2.45	0.0145
Z[3]	-44.1680	48.0777	-0.92	0.3583
Z[4]	-46.4143	6.3740	-7.28	<0.0001
Z[5]	11.1745	5.8960	1.90	0.0581

Compute the contrast on the original variables and compare to the coefficient of xc.

Code

k <- contrast(f, list(x=3), list(x=1))
k

           z Contrast     S.E.    Lower    Upper    Z Pr(>|z|)
1 -0.1772172 5.678912 1.098531 3.525831 7.831994 5.17        0

Confidence intervals are 0.95 individual intervals

Code

p <- Predict(f, x=c(1,3))
diff(p$yhat)

[1] 5.678912

Repeat using the more accurate profile likelihood method, which makes use of the singular value decomposition like the one above.

Code

contrast(f, list(x=3), list(x=1), conf.type='profile')

           z Contrast    Lower  Upper    Χ² Pr(>Χ²)
1 -0.1772172 5.678912 3.626648 7.9426 36.37       0

Confidence intervals are 0.95 profile likelihood intervals

Note that singular value decompositions have arbitrary signs. The contrast function does any needed sign reversal so that the contrast matches the original one.

2.10 Bayesian Modeling

There are many advantages to fitting models with a Bayesian approach when compared to the frequentist / maximum likelihood approach that receives more coverage in this text. These advantages include

the ability to use outside information, e.g. the direction or magnitude of an effect, magnitude of interaction effect, degree of nonlinearity
getting exact inference (to within simulation error) regarding model parameters without using any Gaussian approximations as used so heavily in frequentist inference
getting exact inference about derived parameters that are nonlinear transformations of the original model parameters, without using the \(\delta\)-method approximation so often required in frequentist procedures
devoting less than one degree of freedom to a parameter by borrowing information
getting exact inference using ordinary Bayesian procedures when penalization/shrinkage/regularization is used to limit overfitting
allowing one to fit distributional models, e.g., using a regression model for the log standard deviation in addition to one for the mean, and getting exact inference for the joint models
obtaining automatic inference about any interval of possible parameter values instead of just using \(p\)-values to bring evidence against a null value
obtaining exact inference about unions and intersections of various assertions about parameters, e.g., the probability that a treatment reduces mortality by any amount or reduces blood pressure by \(\geq5\) mmHg
getting exact uncertainty intervals for complex derived parameters such as the \(c\)-index or Somers’ \(D_{xy}\) that quantify the model’s predictive discrimination

Bayesian approaches do not tempt analysts to mistakenly assume that the central limit theorem protects them.The \(\delta\)-method fails when the sampling distribution of the parameter is not symmetric.It is just as easy to compute the Bayesian probability that an odds ratio exceeds 2.0 as it is to calculate the probability that the odds ratio exceeds 1.0.

Uncertainies in Model Performance Metrics

As seen in example output form the blrm function below, one can automatically obtain highest posterior density uncertainty intervals for any parameter including overall model performance metrics. These are derived from the \(m\) posterior draws of the model’s parameters by computing the model performance metric for each draw. The uncertainties captured by this process relate to the ability to well-estimate model parameters, which relates also to within-training-sample model fit. So the uncertainties reflect a similar phenomenon to what \(R^{2}_\text{adj}\) measures. Adjusted \(R^2\) other than McFadden’s penalize for \(p\), the number of regression parameters estimated, other than the intercept. This is very similar to considering likelihood ratio \(\chi^2\) statistics minus the number of degrees of freedom involved in the LR test. On the other hand, AIC approximates out-of-sample model performance by using a penalty of twice the degrees of freedom (like the seldom-used McFadden \(R^{2}_\text{adj}\))

So uncertainties computed by the blrm function come solely from the spread of the posterior distributions, i.e., the inability of the analysis to precisely estimate the unknown parameters. They condition on the observed design matrix \(X\) and do not consider other samples as would be done with out-of-sample predictive accuracy assessment with AIC, the bootstrap, or cross-validation.

When \(p=1\) a rank measure of predictive discrimination such as \(D_{xy}\) will have no uncertainty unless the sign of the one regression coefficient often flips over posterior draws.

A major part of the arsenal of Bayesian modeling weapons is Stan based at Columbia University. Very general R statistical modeling packages such as brms and rstanarm are based on Stan.

RMS has several fully worked-out Bayesian modeling case studies. The purpose of the remainder of this chapter is to show the power of Bayes in general regression modeling.

2.10.1 A General Model Specification Approach

With a Bayesian approach one can include a parameter for each aspect of the model you know exists but are unsure about. This leads to results that are not overly optimistic, because uncertainty intervals will be a little wider to acknowledge what you don’t know. A good example is the Bayesian \(t\)-test, which has a parameter for the amount of non-normality and a parameter for how unequal the variances may be in the two groups being compared. Prior distributions can favor normality and equal variance, and modeling becomes more flexible as \(n \uparrow\).

Other examples of borrowing information and adding flexibility:

include a parameter for time-varying treatment effect to not assume proportional hazards
include a parameter for a \(Y\)-varying effect to not assume proportional odds
include an interaction effect for a treatment \(\times\) sex interaction and use a prior that favors the treatment effect for females being simlar to the treatment effect for males but that allows the effects to become arbitrarily different as \(n \uparrow\).

Interactions bring special problems to estimation and inference. In the best of cases, an interaction requires \(4 \times\) larger sample size to estimate and may require \(16 \times\) the sample size to achieve the same power as a main effect test. We need a way to borrow information, essentially having an interaction term “half in” and “half out” of the model. This has been elegantly described by R. Simon & Freedman (1997) who show how to put priors on interaction terms. Using a Bayesian approach to have an interaction “half in” the model is a much more rational approach than prevailing approaches that

use a sample that was sized for estimating main effects and not interaction effects
use a (low power) test for interaction to decide how to analyze the treatment effect, and ignore such pre-testing when doing a statistical test (that will not preserve \(\alpha\)) or computing a confidence interval (that will not have the nominal \(1 - \alpha\) coverage)
use the pre-specified model with interaction, resulting in low precision because of having no way to borrow information across levels of the interacting factor

2.10.2 Help in Specifying Priors

To gain the advantages of Bayesian modeling described above, doing away with binary decisions and allowing the use of outside information, one must specify prior distributions for parameters. It is often difficult to do this, especially when there are nonlinear effects (e.g., splines) and interactions in the model. We need a way to specify priors on the original \(X\) and \(Y\) scales. Fortunately Stan provides an elegant solution.

As discussed here Stan allows one to specify priors on transformations of model parameters, and these priors propagate back to the original parameters. It is easier to specify a prior for the effect of increasing age from 30 to 60 that it is to specify a prior for the age slope. It may be difficult to specify a prior for an age \(\times\) treatment interaction (especially when the age effect is nonlinear), but much easier to specify a prior for how different the treatment effect is for a 30 year old and a 60 year old. By specifying priors on one or more contrasts one can easily encode outside information / information borrowing / shrinkage.

The rms contrast function provides a general way to implement contrasts up to double differences, and more details about computations are provided in that link. The approach used for specifying priors for contrast in rmsb::blrm uses the same process but is even more general. Both contrast and blrm compute design matrices at user-specified predictor settings, and the contrast matrices (matrices multipled by \(\hat{\beta}\)) are simply differences in such design matrices. Thinking of contrasts as differences in predicted values frees the user from having to care about how parameters map to estimands, and allows an R predict(fit, type='x') function do the hard work. Examples of types of differences are below.

rmsb implements priors on contrasts starting with version 1.0-0.

no difference: compute an absolute predicted value (not implemented in blrm for priors)
single difference
- treatment main effect
- continuous predictor effects computed by subtracting predictions at two values of the predictor
double difference
- amount of nonlinearity (differences in slopes over intervals of the predictor)
- interaction effect (e.g., age slope for females minus age slope for males)
triple difference
- amount of nonlinearity in an interaction effect

For predictors modeled linearly, the slope is the regression coefficient. For nonlinear effects where \(x\) is transformed by \(f(x)\), the slope at \(x=\frac{a+b}{2}\) is proportionally approximated by \(f(b) - f(a)\), and the slope at \(x=\frac{b+c}{2}\) by \(f(c) - f(b)\). The amount of nonlinearity is reflected by the difference in the two slopes, or \(f(c) - f(b) -[f(b) - f(a)] = f(a) + f(c) - 2f(b)\). You’ll see this form specified in the contrast part of the pcontrast argument to blrm below.

This is a numerical approximation to the second derivative (slope of the slope; acceleration). It would be easy to use more accurate Lagrange interpolation derivative approximators here.

2.10.3 Examples of Priors on Contrasts

Semiparametic models are introduced in Chapter 13 but we will use one of the models—the proportional odds (PO) ordinal logistic model—in showcasing the utility of specifying priors on contrasts in order to use external information or to place restrictions on model fits. The blrm function in the rmsb package implements this semiparametric model using Stan. Because it does not depend on knowing how to transform \(Y\), I almost always use the more robust ordinal models instead of linear models. The linear predictor \(X\beta\) is on the logit (log odds) scale for the PO model. This unitless scale typically ranges from -5 to 5, corresponding to a range of probabilities of 0.007 to 0.993. Default plotting uses the intercept corresponding to the marginal median of \(Y\), so the log odds of the probability that \(Y\) exceeds or equals this level, given \(X\), is plotted. Estimates can be converted to means, quantiles, or exceedance probabilities using the Mean, Quantile, and ExProb functions in the rms and rmsb packages.

Ordinal models in the cumulative probability class such as the PO model have \(k\) intercepts for \(k+1\) distinct values of \(Y\). These intercepts encode the entire empirical distribution of \(Y\) for one covariate setting, hence the term semiparametric and freedom from having to choose a \(Y\) distribution.

Effects for the PO model are usually expressed as odds ratios (OR). For the case where the prior median for the OR is 1.0 (prior mean or median log(OR)=0.0) it is useful to solve for the prior SD \(\sigma\) so that \(\Pr(\text{OR} > r) = a = \Pr(\text{OR} < \frac{1}{r})\), leading to \(a = \frac{|\log(r)|}{\Phi^{-1}(1-a)}\), computed by the psigma function below. Another function . is defined as an abbreviation for list() for later usage.

Code

psigma <- function(r, a, inline=FALSE, pr=! inline) {
  sigma <- abs(log(r)) / qnorm(1 - a)
  dir <- if(r > 1.) '>' else '<'
  x <- if(inline) paste0('$\\Pr(\\text{OR}', dir, r, ') =', a,
                         ' \\Rightarrow \\sigma=', round(sigma, 3), '$')
  else paste0('Pr(OR ', dir, ' ', r, ') = ', a, ' ⇒ σ=', round(sigma, 3))
  if(inline) return(x)
  if(pr) {
    cat('\n', x, '\n\n', sep='')
    return(invisible(sigma))
  }
  sigma
}
. <- function(...) list(...)

Start with a simple hypothetical example:

model has a quadratic effect of age
age interacts with treatment
sex is also in the model, not interacting with anything

We wish to specify a prior on the treatment effect at age 50 so that there is only a 0.05 chance that the \(\text{OR} < 0.5\). \(\Pr(\text{OR}<0.5) =0.05 \Rightarrow \sigma=0.421\). The covariate settings specified in pcontrast below do not mention sex, so predictions are evaluated at the default sex (the mode). Since sex does not interact with anything, the treatment difference of interest makes the sex setting irrelevant anyway.

Code

require(rmsb)
f <- blrm(y ~ treatment * pol(age, 2) + sex,
          pcontrast=list(sd=psigma(0.5, 0.05),
                         c1=.(treatment='B', age=50),  # .() = list()
                         c2=.(treatment='A', age=50),
                         contrast=expression(c1 - c2) ) )

Note that the notation needed for pcontrast need not consider how age is modeled.

Consider a more complicated situation. Let’s simulate data for one continuous predictor where the true model is a sine wave. The response variable is a slightly rounded version of a conditionally normal \(Y\).

The rounding is done just to lower the number of intercepts from 199 to 52 to speed up the Bayesian PO model fits.

Code

require(rmsb)
require(ggplot2)
options(mc.cores=4,    # See https://hbiostat.org/r/examples/blrm/blrm
        rmsb.backend='cmdstan', rmsbdir='~/.rmsb',
        prType='html')
cmdstanr::set_cmdstan_path(cmdstan.loc)
# cmdstan.loc is defined in ~/.Rprofile

set.seed(3)
n <- 200
x <- rnorm(n)
y <- round(sin(2*x) + rnorm(n), 1)
dd <- datadist(x, q.display=c(.005, .995)); options(datadist='dd')
f <- blrm(y ~ rcs(x, 6))

Running MCMC with 4 parallel chains...

Chain 4 finished in 1.1 seconds.
Chain 1 finished in 1.2 seconds.
Chain 2 finished in 1.2 seconds.
Chain 3 finished in 1.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.1 seconds.
Total execution time: 1.3 seconds.

Code

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6))

	Mixed Calibration/ Discrimination Indexes	Discrimination Indexes	Rank Discrim. Indexes
Obs 200	LOO log L -784.9±14.3	g 1.422 [1.059, 1.726]	C 0.698 [0.69, 0.704]
Draws 4000	LOO IC 1569.79±28.6	g_p 0.296 [0.241, 0.341]	D_xy 0.397 [0.38, 0.408]
Chains 4	Effective p 75.86±6.43	EV 0.279 [0.177, 0.366]
Time 1.8s	B 0.198 [0.193, 0.203]	v 1.596 [0.835, 2.267]
p 5		vp 0.069 [0.045, 0.092]

	Mode β	Mean β	Median β	S.E.	Lower	Upper	Pr(β>0)	Symmetry
x	-3.1225	-3.0894	-3.0986	0.8045	-4.7037	-1.5642	0.0000	1.02
x'	17.0377	16.7293	16.7357	7.9435	1.2582	32.1183	0.9812	0.99
x''	-20.1216	-18.7757	-18.8505	37.4808	-91.2568	54.8113	0.3120	1.04
x'''	-37.4410	-39.0918	-38.6277	58.0373	-148.6275	74.7209	0.2530	0.95
x''''	50.8116	51.2582	50.5935	49.6152	-45.2430	146.7735	0.8522	1.05

Code

ggplot(Predict(f))

Code

# Plot predicted mean instead of log odds
M <- Mean(f)
ggplot(Predict(f, fun=M),
               ylab=expression(hat(E)(Y*"|"*x))) +
  geom_smooth(mapping=aes(x,y)) +
  labs(caption='Black line: posterior mean of predicted means from PO model\nBlue line: loess nonparametric smoother')

Now suppose that there is strong prior knowledge that the effect of x is linear when x is in the interval \([-1, 0]\). Let’s reflect that by putting a very sharp prior to tilt the difference in slopes within that interval towards 0.0. pcontrast= specifies two separate contrasts to pull towards zero to more finely detect nonlinearity.

The examples that follow use atypically small prior standard deviations so that constraints will be obvious.

Code

con <- list(sd=rep(psigma(1.05, 0.01), 2),
            c1=.(x=-1),   c2=.(x=-.75), c3=.(x=-.5),
            c4=.(x=-.25), c5=.(x=0),
            contrast=expression(c1 + c3 - 2 * c2, c3 + c5 - 2 * c4))


Pr(OR > 1.05) = 0.01 ⇒ σ=0.021

Code

f <- blrm(y ~ rcs(x, 6), pcontrast=con)

Running MCMC with 4 parallel chains...

Chain 2 finished in 2.3 seconds.
Chain 1 finished in 2.4 seconds.
Chain 3 finished in 2.4 seconds.
Chain 4 finished in 2.4 seconds.

All 4 chains finished successfully.
Mean chain execution time: 2.4 seconds.
Total execution time: 2.6 seconds.

Code

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6), pcontrast = con)

	Mixed Calibration/ Discrimination Indexes	Discrimination Indexes	Rank Discrim. Indexes
Obs 200	LOO log L -798.78±14.7	g 0.938 [0.681, 1.258]	C 0.658 [0.644, 0.681]
Draws 4000	LOO IC 1597.56±29.4	g_p 0.208 [0.156, 0.268]	D_xy 0.316 [0.289, 0.362]
Chains 4	Effective p 72.89±6.15	EV 0.141 [0.077, 0.215]
Time 3.1s	B 0.213 [0.208, 0.219]	v 0.739 [0.308, 1.168]
p 5		vp 0.035 [0.019, 0.053]

	Mode β	Mean β	Median β	S.E.	Lower	Upper	Pr(β>0)	Symmetry
x	0.2738	0.2759	0.2706	0.3131	-0.3287	0.8896	0.8132	1.04
x'	1.3744	1.3863	1.3947	0.9007	-0.2635	3.3110	0.9408	0.98
x''	-6.0081	-6.0086	-6.0275	3.4010	-12.5814	0.5100	0.0348	0.98
x'''	24.1545	24.1044	24.1415	10.7462	2.4869	44.8096	0.9898	1.05
x''''	-65.2064	-65.3162	-64.9237	22.4652	-111.1543	-23.3077	0.0010	0.95

Contrasts Given Priors

[1] list(sd = c(0.0209728582358081, 0.0209728582358081), c1 = list(   
[2]     x = -1), c2 = list(x = -0.75), c3 = list(x = -0.5), c4 = list(
[3]     x = -0.25), c5 = list(x = 0), contrast = expression(c1 +      
[4]     c3 - 2 * c2, c3 + c5 - 2 * c4))

Code

f$Contrast   # Print the design matrix corresponding to the two contrasts

  rcs(x, 6)x rcs(x, 6)x' rcs(x, 6)x'' rcs(x, 6)x''' rcs(x, 6)x''''
1          0  0.02911963  0.002442679    0.00000000              0
1          0  0.04851488  0.020842754    0.00300411              0

Code

ggplot(Predict(f))

What happens if we moderately limit the acceleration (second derivative; slope of the slope) at 7 equally-spaced points?

Code

con <- list(sd=rep(0.5, 7),
            c1=.(x=-2), c2=.(x=-1.5), c3=.(x=-1), c4=.(x=-.5), c5=.(x=0),
            c6=.(x=.5), c7=.(x=1), c8=.(x=1.5), c9=.(x=2),
            contrast=expression(c1 + c3 - 2 * c2,
                                c2 + c4 - 2 * c3,
                                c3 + c5 - 2 * c4,
                                c4 + c6 - 2 * c5,
                                c5 + c7 - 2 * c6,
                                c6 + c8 - 2 * c7,
                                c7 + c9 - 2 * c8) )
f <- blrm(y ~ rcs(x, 6), pcontrast=con)

Running MCMC with 4 parallel chains...

Chain 1 finished in 1.0 seconds.
Chain 4 finished in 1.0 seconds.
Chain 2 finished in 1.2 seconds.
Chain 3 finished in 1.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.1 seconds.
Total execution time: 1.3 seconds.

Code

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6), pcontrast = con)

	Mixed Calibration/ Discrimination Indexes	Discrimination Indexes	Rank Discrim. Indexes
Obs 200	LOO log L -785.22±13.89	g 1.071 [0.82, 1.325]	C 0.694 [0.683, 0.704]
Draws 4000	LOO IC 1570.44±27.79	g_p 0.239 [0.192, 0.286]	D_xy 0.388 [0.366, 0.409]
Chains 4	Effective p 73.18±6.44	EV 0.181 [0.113, 0.252]
Time 1.7s	B 0.198 [0.193, 0.204]	v 0.904 [0.506, 1.35]
p 5		vp 0.045 [0.028, 0.063]

	Mode β	Mean β	Median β	S.E.	Lower	Upper	Pr(β>0)	Symmetry
x	-2.0433	-2.0099	-2.0243	0.6198	-3.2583	-0.8814	0.0015	1.06
x'	12.6742	12.5233	12.5217	5.0400	2.6084	22.1149	0.9920	0.98
x''	-21.0768	-20.6335	-20.6586	22.1924	-63.9604	23.3569	0.1742	1.02
x'''	-9.4453	-9.8318	-9.8116	32.9628	-72.3885	56.2248	0.3890	0.99
x''''	14.4264	14.5429	14.1230	28.6168	-39.1407	70.7082	0.6825	1.05

Contrasts Given Priors

[1] list(sd = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), c1 = list(x = -2),    
[2]     c2 = list(x = -1.5), c3 = list(x = -1), c4 = list(x = -0.5),      
[3]     c5 = list(x = 0), c6 = list(x = 0.5), c7 = list(x = 1), c8 = list(
[4]         x = 1.5), c9 = list(x = 2), contrast = expression(c1 +        
[5]         c3 - 2 * c2, c2 + c4 - 2 * c3, c3 + c5 - 2 * c4, c4 +         
[6]         c6 - 2 * c5, c5 + c7 - 2 * c6, c6 + c8 - 2 * c7, c7 +         
[7]         c9 - 2 * c8))

Code

f$Contrast   # Print the design matrix corresponding to the two contrasts

  rcs(x, 6)x rcs(x, 6)x' rcs(x, 6)x'' rcs(x, 6)x''' rcs(x, 6)x''''
1          0  0.01298375  0.000000000   0.000000000    0.000000000
1          0  0.07768802  0.002453429   0.000000000    0.000000000
1          0  0.15526901  0.045575694   0.003046185    0.000000000
1          0  0.23285000  0.122161511   0.048638084    0.001537246
1          0  0.30602473  0.196347206   0.122778948    0.037926277
1          0  0.24969458  0.170741274   0.117781863    0.055476864
1          0  0.06289905  0.043179908   0.029952920    0.014391805

Code

ggplot(Predict(f))

Next simulate data with one continuous predictor x1 and a 3-level grouping variable x2. Start with almost flat priors that allow arbitrary interaction patterns as long as x1 has a smooth effect.

Code

set.seed(6)
n <- 90
x1 <- runif(n)
x2 <- sample(c('a', 'b', 'c'), n, TRUE)
y  <- round(x1 + (x1 - 0.5) ^ 2 -0.4 * (x2 == 'b') + .5 * (x2 == 'c') + runif(n), 1)
dd <- datadist(x1, x2)
f <- orm(y ~ rcs(x1, 4) * x2)
ggplot(Predict(f, x1, x2))

Code

f <- blrm(y ~ rcs(x1, 4) * x2)

Running MCMC with 4 parallel chains...

Chain 1 finished in 0.6 seconds.
Chain 2 finished in 0.5 seconds.
Chain 4 finished in 0.5 seconds.
Chain 3 finished in 0.6 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 0.7 seconds.

Code

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.105 for Intercepts

blrm(formula = y ~ rcs(x1, 4) * x2)

Frequencies of Responses

  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 
  1   1   4   1   2   1   8   5   5   5   4   3   8   3   7   8   3   4   3   3 
  2 2.1 2.2 2.3 2.7 
  4   4   1   1   1

	Mixed Calibration/ Discrimination Indexes	Discrimination Indexes	Rank Discrim. Indexes
Obs 90	LOO log L -251.24±9.6	g 3.835 [2.959, 4.655]	C 0.848 [0.834, 0.86]
Draws 4000	LOO IC 502.47±19.21	g_p 0.44 [0.409, 0.473]	D_xy 0.696 [0.667, 0.72]
Chains 4	Effective p 43.33±4.33	EV 0.648 [0.548, 0.758]
Time 1.1s	B 0.11 [0.098, 0.122]	v 11.348 [6.478, 16.163]
p 11		vp 0.16 [0.132, 0.19]

	Mode β	Mean β	Median β	S.E.	Lower	Upper	Pr(β>0)	Symmetry
x1	1.5721	1.5405	1.4134	4.8741	-7.1487	11.4742	0.6152	1.01
x1'	-5.8947	-5.3431	-5.1217	15.8510	-35.8123	25.8332	0.3705	0.97
x1''	44.1414	41.6796	40.6164	46.3358	-48.0219	133.8866	0.8235	1.02
x2=b	-2.5978	-2.6626	-2.6437	1.6172	-5.6493	0.6593	0.0505	1.01
x2=c	6.5733	6.4699	6.4529	1.9660	2.9408	10.6871	1.0000	1.05
x1 × x2=b	-0.1123	-0.1178	-0.0493	7.3867	-14.4457	14.3859	0.4975	0.96
x1' × x2=b	9.7407	9.8998	9.8749	23.2850	-37.5361	53.1591	0.6650	0.99
x1'' × x2=b	-28.6660	-28.9611	-28.6527	66.1296	-162.7945	96.6076	0.3270	1.01
x1 × x2=c	-11.4679	-11.2318	-11.1422	8.6543	-28.7246	5.2402	0.0940	0.99
x1' × x2=c	39.1883	38.0264	38.1637	26.3466	-11.4145	90.9475	0.9255	1.05
x1'' × x2=c	-119.1197	-114.8539	-114.6710	72.2872	-261.8098	18.9491	0.0545	0.95

Code

ggplot(Predict(f, x1, x2))

Put priors specifying that groups b and c have a similar x1-shape (no partial interaction between x1 and b vs. c). contrast below encodes parallelism with respect to b and c.

Code

con <- list(sd=rep(psigma(1.5, 0.05), 4),
            c1=.(x1=0,   x2='b'), c2=.(x1=0,   x2='c'),
            c3=.(x1=.25, x2='b'), c4=.(x1=.25, x2='c'),
            c5=.(x1=.5,  x2='b'), c6=.(x1=.5,  x2='c'),
            c7=.(x1=.75, x2='b'), c8=.(x1=.75, x2='c'),
            c9=.(x1=1,   x2='b'), c10=.(x1=1,  x2='c'),
            contrast=expression(c1 - c2 - (c3 - c4),  # gap between b and c curves at x1=0 vs. x1=.25
                                c1 - c2 - (c5 - c6),
                                c1 - c2 - (c7 - c8),
                                c1 - c2 - (c9 - c10) ))


Pr(OR > 1.5) = 0.05 ⇒ σ=0.247

Code

f <- blrm(y ~ rcs(x1, 4) * x2, pcontrast=con)

Running MCMC with 4 parallel chains...

Chain 1 finished in 2.0 seconds.
Chain 2 finished in 2.0 seconds.
Chain 3 finished in 1.9 seconds.
Chain 4 finished in 1.9 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.9 seconds.
Total execution time: 2.1 seconds.

Code

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.105 for Intercepts

blrm(formula = y ~ rcs(x1, 4) * x2, pcontrast = con)

Frequencies of Responses

  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 
  1   1   4   1   2   1   8   5   5   5   4   3   8   3   7   8   3   4   3   3 
  2 2.1 2.2 2.3 2.7 
  4   4   1   1   1

	Mixed Calibration/ Discrimination Indexes	Discrimination Indexes	Rank Discrim. Indexes
Obs 90	LOO log L -252.33±9.41	g 3.482 [2.59, 4.226]	C 0.844 [0.831, 0.856]
Draws 4000	LOO IC 504.66±18.82	g_p 0.422 [0.382, 0.459]	D_xy 0.689 [0.662, 0.712]
Chains 4	Effective p 40.66±4.4	EV 0.575 [0.46, 0.689]
Time 2.5s	B 0.109 [0.099, 0.12]	v 9.552 [5.529, 14.341]
p 11		vp 0.142 [0.113, 0.175]

	Mode β	Mean β	Median β	S.E.	Lower	Upper	Pr(β>0)	Symmetry
x1	1.3882	1.3920	1.3219	4.6282	-7.3210	10.5430	0.6118	1.00
x1'	-5.3210	-5.0635	-4.9193	15.0764	-35.1537	24.3668	0.3668	1.00
x1''	39.7361	38.3310	38.1331	44.1192	-48.6910	127.2686	0.8145	1.00
x2=b	-1.1477	-1.1664	-1.1714	1.3532	-3.8319	1.4338	0.1915	1.01
x2=c	4.2591	4.2182	4.2262	1.4429	1.1585	6.8475	0.9985	1.04
x1 × x2=b	-4.8033	-4.8093	-4.8477	6.1973	-16.4962	7.3213	0.2180	1.00
x1' × x2=b	23.4176	23.1932	23.3113	19.7709	-16.6630	60.4650	0.8775	1.01
x1'' × x2=b	-72.8597	-71.6596	-72.1637	56.5480	-185.4562	36.4429	0.1047	1.00
x1 × x2=c	-4.6848	-4.6889	-4.7191	6.2200	-16.1228	7.9197	0.2255	1.00
x1' × x2=c	23.0127	22.7201	22.6454	19.8062	-15.1881	61.8363	0.8715	0.98
x1'' × x2=c	-72.2134	-70.7465	-70.7977	56.5081	-189.7170	32.6563	0.1075	1.02

Contrasts Given Priors

[1] list(sd = c(0.246505282576203, 0.246505282576203, 0.246505282576203, 
[2] 0.246505282576203), c1 = list(x1 = 0, x2 = "b"), c2 = list(x1 = 0,   
[3]     x2 = "c"), c3 = list(x1 = 0.25, x2 = "b"), c4 = list(x1 = 0.25,  
[4]     x2 = "c"), c5 = list(x1 = 0.5, x2 = "b"), c6 = list(x1 = 0.5,    
[5]     x2 = "c"), c7 = list(x1 = 0.75, x2 = "b"), c8 = list(x1 = 0.75,  
[6]     x2 = "c"), c9 = list(x1 = 1, x2 = "b"), c10 = list(x1 = 1,       
[7]     x2 = "c"), contrast = expression(c1 - c2 - (c3 - c4), c1 -       
[8]     c2 - (c5 - c6), c1 - c2 - (c7 - c8), c1 - c2 - (c9 - c10)))

Code

f$Contrast

  rcs(x1, 4)x1 rcs(x1, 4)x1' rcs(x1, 4)x1'' x2b x2c rcs(x1, 4)x1:x2b
1            0             0              0   0   0            -0.25
1            0             0              0   0   0            -0.50
1            0             0              0   0   0            -0.75
1            0             0              0   0   0            -1.00
  rcs(x1, 4)x1':x2b rcs(x1, 4)x1'':x2b rcs(x1, 4)x1:x2c rcs(x1, 4)x1':x2c
1      -0.006089308        0.000000000             0.25       0.006089308
1      -0.091848739       -0.002089092             0.50       0.091848739
1      -0.372867191       -0.062281932             0.75       0.372867191
1      -0.879577838       -0.236978793             1.00       0.879577838
  rcs(x1, 4)x1'':x2c
1        0.000000000
1        0.002089092
1        0.062281932
1        0.236978793

Code

ggplot(Predict(f, x1, x2))

2.11 Study Questions

Contrast statistical models vs. machine learning in a few ways

Section 2.1-2.2

What is always linear in the linear model?

Section 2.3

Define an interaction effect in words
In a model with main effects and two-way interactions, which regression parameter is the most “universal” by virtue of being most independent of coding?
When a predictor is involved in an interaction, which test of association involving that predictor is preferred?
What are two ways of doing chunk (composite) tests?
Analysts want to use least squares multiple regression to predict follow-up blood pressure from baseline blood pressure, age, and sex. They want to use this model for estimation, prediction, and inference (statistical tests and confidence limits). List all of the assumptions they make by fitting the model \(Y = \beta_{0} + \beta_{1} \text{bp} + \beta_{2} \text{age} + \beta_{3} [\text{sex=female}]\).
List as many methods as you can for checking the assumptions you listed.

Section 2.4.1

Provide an example where categorizing a continuous predictor is a good idea
If you dichotomize a continuous predictor that has a smooth relationship with Y, how can you arbitrarily change its estimated regression coefficients?
What is a general term used for the statistical problem caused by categorizing continuous variables?
What tends to happen when data are used to find cutpoints for relationships that are in fact smooth?
What other inferential distortion is caused by categorization?
Is there an amount of noise that could be added to a continuous X that makes a categorized analysis better than a continuous one?

Section 2.4.2

What is the biggest drawback to using regular polynomials to fit smooth nonlinear relationships?
Why does a spline function require one to use a truncated term?
What advantages does a linearly tail-restricted cubic spline have?
Why are its constructed variables X’ X’’ X’’’ etc. so complex?
What do you need to do to handle a sudden curvature change when fitting a nonlinear spline?
Barring a sudden curvature change, why are knot locations not so important in a restricted cubic spline (rcs)?
What is the statistical term for the delimma that occurs when choosing the number of knots?
What is a Bayesian way of handling this?
What is the core idea behind Cleveland’s loess smoother?
Can fitted spline functions for some of the predictors be an end in themselves?

Section 2.5

What is the worst thing about recursive partitioning?
What is a major tradeoff between the modeling strategy we will largely use and ML?
What are some determinants of the best choice between SM and ML?

Section 2.6

What is the main message of the Grambsch and O’Brien paper?
What notational error do many analysts make that lead them to ignore model uncertainty?

Section 2.7.2

Why is it a good idea to pre-specify interactions, and how is this done?
Why do we not fiddle with transformations of a predictor before considering interactions?
What 3-dimensional restriction does the cross-products of two restricted cubic splines force?
What is an efficient and likely-to-fit approach to allowing for the effect of measurement degradement?

Altman, D. G. (1991). Categorising continuous covariates (letter to the editor). Brit J Cancer, 64, 975.

Altman, D. G. (1998). Suboptimal analysis using “optimal” cutpoints. Brit J Cancer, 78, 556–557.

Altman, D. G., Lausen, B., Sauerbrei, W., & Schumacher, M. (1994). Dangers of using “optimal” cutpoints in the evaluation of prognostic factors. J Nat Cancer Inst, 86, 829–835.

Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model. Biometrika, 67, 413–418.

Austin, P. C., Tu, J. V., & Lee, D. S. (2010). Logistic regression had superior performance compared with regression trees for predicting in-hospital mortality in patients hospitalized with heart failure. J Clin Epi, 63, 1145–1155.

ROC areas for logistic models varied from 0.747 to 0.775 whereas they varied from 0.620-0.651 for recursive partitioning;repeated data simulation showed large variation in tree structure

Belcher, H. (1992). The concept of residual confounding in regression models and some applications. Stat Med, 11, 1747–1758.

Bennette, C., & Vickers, A. (2012). Against quantiles: Categorization of continuous variables in epidemiologic research, and its discontents. BMC Med Res Methodol, 12(1), 21+. https://doi.org/10.1186/1471-2288-12-21

terrific graphical examples; nice display of outcome heterogeneity within quantile groups of PSA

Berhane, K., Hauptmann, M., & Langholz, B. (2008). Using tensor product splines in modeling exposure–time–response relationships: Application to the Colorado Plateau Uranium Miners cohort. Stat Med, 27, 5484–5496.

discusses taking product of all univariate spline basis functions

Bernal, J. L., Cummins, S., & Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: A tutorial. International Journal of Epidemiology, 46(1), 348–355. https://doi.org/10.1093/ije/dyw098

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks/Cole.

Buettner, P., Garbe, C., & Guggenmoos-Holzmann, I. (1997). Problems in defining cutoff points of continuous prognostic factors: Example of tumor thickness in primary cutaneous melanoma. J Clin Epi, 50, 1201–1210.

choice of cut point depends on marginal distribution of predictor

Chambers, J. M., & Hastie, T. J. (Eds.). (1992). Statistical Models in S. Wadsworth and Brooks/Cole.

Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc, 74, 829–836.

Collins, G. S., Ogundimu, E. O., Cook, J. A., Manach, Y. L., & Altman, D. G. (2016). Quantifying the impact of different approaches for handling continuous predictors on the performance of a prognostic model. Stat Med, 35(23), 4124–4135. https://doi.org/10.1002/sim.6986

used rms package hazard regression method (hare) for survival model calibration

Cook, E. F., & Goldman, L. (1988). Asymmetric stratification: An outline for an efficient method for controlling confounding in cohort studies. Am J Epi, 127, 626–639.

Cox, D. R. (1972). Regression models and life-tables (with discussion). J Roy Stat Soc B, 34, 187–220.

Devlin, T. F., & Weeks, B. J. (1986). Spline functions for logistic regression modeling. Proceedings of the Eleventh Annual SAS Users Group International Conference, 646–651.

Durrleman, S., & Simon, R. (1989). Flexible regression models with cubic splines. Stat Med, 8, 551–561.

Faraggi, D., & Simon, R. (1996). A simulation study of cross-validation for selecting an optimal cutpoint in univariate survival analysis. Stat Med, 15, 2203–2213.

bias in point estimate of effect from selecting cutpoints based on P-value; loss of information from dichotomizing continuous predictors

Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of dichotomization. Pharm Stat, 8, 50–61. https://doi.org/10.1002/pst.331

optimal cutpoint depends on unknown parameters;should only entertain dichotomization when "estimating a value of the cumulative distribution and when the assumed model is very different from the true model";nice graphics

Giannoni, A., Baruah, R., Leong, T., Rehman, M. B., Pastormerlo, L. E., Harrell, F. E., Coats, A. J., & Francis, D. P. (2014). Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. PLoS ONE, 9(1). https://doi.org/10.1371/journal.pone.0081699

Govindarajulu, U. S., Spiegelman, D., Thurston, S. W., Ganguli, B., & Eisen, E. A. (2007). Comparing smoothing techniques in Cox models for exposure-response relationships. Stat Med, 26, 3735–3752.

authors wrote a SAS macro for restricted cubic splines even though such a macro has existed since 1984; would have gotten more useful results had simulation been used so would know the true regression shape;measure of agreement of two estimated curves by computing the area between them, standardized by average of areas under the two;penalized spline and rcs were closer to each other than to fractional polynomials

Grambsch, P. M., & O’Brien, P. C. (1991). The effects of transformations and preliminary tests for non-linearity in regression. Stat Med, 10, 697–709.

Gray, R. J. (1992). Flexible methods for analyzing survival data using splines, with applications to breast cancer prognosis. J Am Stat Assoc, 87, 942–951.

Gray, R. J. (1994). Spline-based tests in survival analysis. Biometrics, 50, 640–652.

Harrell, F. E., Lee, K. L., & Pollock, B. G. (1988). Regression models in clinical studies: Determining relationships between predictors and response. J Nat Cancer Inst, 80, 1198–1202.

Hastie, T. J., & Tibshirani, R. J. (1990). Generalized Additive Models. Chapman & Hall/CRC.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2008). The Elements of Statistical Learning (second ed.). Springer.

Hilsenbeck, S. G., & Clark, G. M. (1996). Practical p-value adjustment for optimally selected cutpoints. Stat Med, 15, 103–112.

Holländer, N., Sauerbrei, W., & Schumacher, M. (2004). Confidence intervals for the effect of a prognostic factor after selection of an “optimal” cutpoint. Stat Med, 23, 1701–1713. https://doi.org/10.1002/sim.1611

true type I error can be much greater than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;"It should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the “optimal” cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were in included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub

Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804. https://doi.org/10.1016/j.patter.2023.100804

Lausen, B., & Schumacher, M. (1996). Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Comp Stat Data Analysis, 21(3), 307–326. https://doi.org/10.1016/0167-9473(95)00016-X

Lee, S., Huang, J. Z., & Hu, J. (2010). Sparse logistic principal components analysis for binary data. Ann Appl Stat, 4(3), 1579–1601.

Leng, C., & Wang, H. (2009). On general adaptive sparse principal component analysis. J Comp Graph Stat, 18(1), 201–215.

Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and spurious statistical significance. Psych Bull, 113, 181–190. https://doi.org/10.1037//0033-2909.113.1.181

Moser, B. K., & Coombs, L. P. (2004). Odds ratios for a continuous outcome variable without dichotomizing. Stat Med, 23, 1843–1860.

large loss of efficiency and power;embeds in a logistic distribution, similar to proportional odds model;categorization;dichotomization of a continuous response in order to obtain odds ratios often results in an inflation of the needed sample size by a factor greater than 1.5

Nott, D. J., & Leng, C. (2010). Bayesian projection approaches to variable selection in generalized linear models. Computational Statistics & Data Analysis, 54(12), 3227–3241. https://doi.org/10.1016/j.csda.2010.01.036

Paul, D., Bair, E., Hastie, T., & Tibshirani, R. (2008). “Preconditioning” for feature selection and regression in high-dimensional problems. Ann Stat, 36(4), 1595–1619. https://doi.org/10.1214/009053607000000578

develop consistent Y using a latent variable structure, using for example supervised principal components. Then run stepwise regression or lasso predicting Y (lasso worked better). Can run into problems when a predictor has importance in an adjusted sense but has no marginal correlation with Y;model approximation;model simplification

Radchenko, P., & James, G. M. (2008). Variable inclusion and shrinkage algorithms. J Am Stat Assoc, 103(483), 1304–1315.

solves problem caused by lasso using the same penalty parameter for variable selection and shrinkage which causes lasso to have to keep too many variables in the model to avoid overshrinking the remaining predictors;does not handle scaling issue well

Ragland, D. R. (1992). Dichotomizing continuous outcome variables: Dependence of the magnitude of association and statistical power on the cutpoint. Epi, 3, 434–440. https://doi.org/10.1097/00001648-199209000-00009

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med, 25, 127–141. https://doi.org/10.1002/sim.2331

destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when dichotomize;nice plot showing effect of categorization; PBC data

Schulgen, G., Lausen, B., Olsen, J., & Schumacher, M. (1994). Outcome-oriented cutpoints in quantitative exposure. Am J Epi, 120, 172–184.

Selvin, E., Steffes, M. W., Zhu, H., Matsushita, K., Wagenknecht, L., Pankow, J., Coresh, J., & Brancati, F. L. (2010). Glycated hemoglobin, diabetes, and cardiovascular risk in nondiabetic adults. NEJM, 362(9), 800–811. https://doi.org/10.1056/NEJMoa0908359

Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2013). A sparse-group lasso. J Comp Graph Stat, 22(2), 231–245. https://doi.org/10.1080/10618600.2012.681250

sparse effects both on a group and within group levels;can also be considered special case of group lasso allowing overlap between groups

Simon, R., & Freedman, L. S. (1997). Bayesian design and analysis of two two factorial clinical trials. Biometrics, 53, 456–464.

Steyerberg, E. W., Eijkemans, M. J. C., Harrell, F. E., & Habbema, J. D. F. (2000). Prognostic modelling with logistic regression analysis: A comparison of selection and estimation methods in small data sets. Stat Med, 19, 1059–1079.

Stone, C. J. (1986). Comment: Generalized additive models. Stat Sci, 1, 312–314.

Stone, C. J., & Koo, C. Y. (1985). Additive splines in statistics. Proceedings of the Statistical Computing Section ASA, 45–48.

Suissa, S., & Blais, L. (1995). Binary regression with continuous outcomes. Stat Med, 14, 247–255. https://doi.org/10.1002/sim.4780140303

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J Roy Stat Soc B, 58, 267–288.

van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology, 14(1), 137+. https://doi.org/10.1186/1471-2288-14-137

Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination measure. But much good practical information. Recursive partitioning fared poorly.

van Houwelingen, J. C., & le Cessie, S. (1990). Predictive value of statistical models. Stat Med, 9, 1303–1325.

Wainer, H. (2006). Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1), 49–56.

can find bins that yield either positive or negative association;especially pertinent when effects are small;"With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." - John von Neumann

Wang, H., & Leng, C. (2007). Unified LASSO estimation by least squares approximation. J Am Stat Assoc, 102, 1039–1048. https://doi.org/10.1198/016214507000000509

Wang, S., Nan, B., Zhou, N., & Zhu, J. (2009). Hierarchically penalized Cox regression with grouped variables. Biometrika, 96(2), 307–322.

Witten, D. M., & Tibshirani, R. (2008). Testing significance of features by lassoed principal components. Ann Appl Stat, 2(3), 986–1012.

reduction in false discovery rates over using a vector of t-statistics;borrowing strength across genes;"one would not expect a single gene to be associated with the outcome, since, in practice, many genes work together to effect a particular phenotype. LPC effectively down-weights individual genes that are associated with the outcome but that do not share an expression pattern with a larger group of genes, and instead favors large groups of genes that appear to be differentially-expressed.";regress principal components on outcome;sparse principal components

Wood, S. N. (2006). Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.

Xiong, S. (2010). Some notes on the nonnegative garrote. Technometrics, 52(3), 349–361.

"... to select tuning parameters, it may be unnecessary to optimize a model selectin criterion repeatedly";natural selection of penalty function

Zhang, H. H., & Lu, W. (2007). Adaptive lasso for Cox’s proportional hazards model. Biometrika, 94, 691–703.

penalty function has ratios against original MLE;scale-free lasso

Zhang, M., Yu, Y., Wang, S., Salvatore, M., Fritsche, L. G., He, Z., & Mukherjee, B. (2020). Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions. Statistics in Medicine, n/a(n/a). https://doi.org/10.1002/sim.8505

Zhou, H., Hastie, T., & Tibshirani, R. (2006). Sparse principal component analysis. J Comp Graph Stat, 15, 265–286.

principal components analysis that shrinks some loadings to zero

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. J Roy Stat Soc B, 67(2), 301–320.