2  General Aspects of Fitting Regression Models

Regression modeling meets many analytic needs. Alternatives to regression modeling include:

  • Stratification
  • Single trees (recursive partitioning/CART)
  • Machine learning

2.1 Notation for Multivariable Regression Models

  • Weighted sum of a set of independent or predictor variables
  • Interpret parameters and state assumptions by linearizing model with respect to regression coefficients
  • Analysis of variance setups, interaction effects, nonlinear effects
  • Examining the 2 regression assumptions
| Symbol | Meaning |
|---|---|
| \(Y\) | response (dependent) variable |
| \(X\) | \(X_{1},X_{2},\ldots,X_{p}\) – list of predictors |
| \(\beta\) | \(\beta_{0},\beta_{1},\ldots,\beta_{p}\) – regression coefficients |
| \(\beta_0\) | intercept parameter (optional) |
| \(\beta_{1},\ldots,\beta_{p}\) | weights or regression coefficients |
| \(X\beta\) | \(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p}X_{p}\), where \(X_{0}=1\) |

Model: connection between \(X\) and \(Y\)
\(C(Y|X)\) : property of distribution of \(Y\) given \(X\), e.g. 
\(C(Y|X) = {\rm E}(Y|X)\) or \(\Pr(Y=1|X)\).


2.2 Model Formulations

General regression model \[C(Y|X) = g(X) .\]

General linear regression model \[C(Y|X) = g(X\beta) .\] Examples

\[\begin{array}{ccc} C(Y|X) =& E(Y|X) =& X\beta, \\ Y|X &\sim N(X\beta,\sigma^{2}) & \\ C(Y|X) =& \Pr(Y=1|X) =& (1+\exp(-X\beta))^{-1} \\ \end{array}\]

Linearize: \(h(C(Y|X))=X\beta, h(u)=g^{-1}(u)\)
Example:

\[\begin{array}{ccc} C(Y|X)=\Pr(Y=1|X)&=&(1+\exp(-X\beta))^{-1} \\ h(u)=\mathrm{logit}(u)&=&\log(\frac{u}{1-u}) \\ h(C(Y|X)) &=& C'(Y|X)\ {\rm (link)} \end{array}\]

General linear regression model:
\(C'(Y|X)=X\beta\).
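A minimal R illustration of the logistic link, using hypothetical values of the linear predictor \(X\beta\): plogis is the built-in inverse logit, and qlogis is the logit (link) itself.

Code
xb <- c(-2, 0, 1.5)           # hypothetical values of the linear predictor X*beta
p1 <- 1 / (1 + exp(-xb))      # (1 + exp(-X*beta))^(-1), as in the example above
p2 <- plogis(xb)              # built-in inverse logit gives the same probabilities
all.equal(p1, p2)
qlogis(p1)                    # the logit link h recovers X*beta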

2.3 Interpreting Model Parameters

Suppose that \(X_{j}\) is linear and doesn’t interact with other \(X\)’s1.


1 Note that it is not necessary to “hold constant” all other variables to be able to interpret the effect of one predictor. It is sufficient to hold constant the weighted sum of all the variables other than \(X_{j}\). And in many cases it is not physically possible to hold other variables constant while varying one, e.g., when a model contains \(X\) and \(X^{2}\) (David Hoaglin, personal communication).

\[\begin{array}{ccc} C'(Y|X) &=& X\beta = \beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{p}X_{p} \\ \beta_{j} &=& C'(Y|X_{1}, X_{2}, \ldots, X_{j}+1, \ldots, X_{p}) \\ &-& C'(Y|X_{1}, X_{2}, \ldots, X_{j}, \ldots, X_{p}) \end{array}\]

Drop \('\) from \(C'\) and assume \(C(Y|X)\) is property of \(Y\) that is linearly related to weighted sum of \(X\)’s.

2.3.1 Nominal Predictors

Nominal (polytomous) factor with \(k\) levels : \(k-1\) indicator variables. E.g. \(T=J,K,L,M\):

\[\begin{array}{ccc} C(Y|T=J) &=& \beta_{0} \nonumber\\ C(Y|T=K) &=& \beta_{0}+\beta_{1}\\ C(Y|T=L) &=& \beta_{0}+\beta_{2}\nonumber\\ C(Y|T=M) &=& \beta_{0}+\beta_{3}\nonumber . \end{array}\]

\[C(Y|T) = X\beta= \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\beta_{3} X_{3},\]

where

\[\begin{array}{ccc} X_{1} = 1 & {\rm if} \ \ T=K, & 0 \ \ {\rm otherwise} \nonumber\\ X_{2} = 1 & {\rm if} \ \ T=L, & 0 \ \ {\rm otherwise} \\ X_{3} = 1 & {\rm if} \ \ T=M, & 0 \ \ {\rm otherwise} \nonumber. \end{array}\]

The test for any differences in the property \(C(Y)\) between treatments is \(H_{0}:\beta_{1}=\beta_{2}=\beta_{3}=0\).
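A small sketch (hypothetical data, not from the notes) of how R builds the \(k-1\) indicator variables for a 4-level factor with reference level J, and of the 3 d.f. test \(H_{0}:\beta_{1}=\beta_{2}=\beta_{3}=0\):

Code
set.seed(1)
T <- factor(sample(c('J', 'K', 'L', 'M'), 80, replace=TRUE))
y <- rnorm(80)                 # hypothetical response
head(model.matrix(~ T))        # columns TK, TL, TM are the indicators X1, X2, X3
f <- lm(y ~ T)
anova(f)                       # 3 d.f. F test of any differences among the 4 levels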

2.3.2 Interactions


\(X_{1}\) and \(X_{2}\), effect of \(X_{1}\) on \(Y\) depends on level of \(X_{2}\). One way to describe interaction is to add \(X_{3}=X_{1}X_{2}\) to model: \[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2} .\]

\[\begin{array}{ccc} C(Y|X_{1}+1, X_{2})&-&C(Y|X_{1}, X_{2})\nonumber\\ &=&\beta_{0}+\beta_{1} (X_{1}+1)+\beta_{2}X_{2}\nonumber\\ &+&\beta_{3} (X_{1}+1)X_{2}\\ &-&[\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{1}X_{2}]\nonumber\\ &=&\beta_{1}+\beta_{3}X_{2} .\nonumber \end{array}\]

One-unit increase in \(X_{2}\) on \(C(Y|X)\) : \(\beta_{2}+\beta_{3} X_{1}\).

Worse interactions:

If \(X_{1}\) is binary, the interaction may take the form of a difference in shape (and/or distribution) of \(X_{2}\) vs. \(C(Y)\) depending on whether \(X_{1}=0\) or \(X_{1}=1\) (e.g. logarithm vs. square root).

This paper describes how interaction effects can be misleading.

2.3.3 Example: Inference for a Simple Model

Postulate the model \(C(Y|age,sex) = \beta_{0}+\beta_{1} age + \beta_{2} [sex=f] + \beta_{3} age [sex=f]\) where \([sex=f]\) is an indicator variable for sex=female, i.e., the reference cell is sex=male2.

2 You can also think of the last part of the model as being \(\beta_{3} X_{3}\), where \(X_{3} = age \times [sex=f]\).

Model assumes

  1. age is linearly related to \(C(Y)\) for males,
  2. age is linearly related to \(C(Y)\) for females, and
  3. interaction between age and sex is simple
  4. whatever distribution, variance, and independence assumptions are appropriate for the model being considered.

Interpretations of parameters:

Parameter Meaning
\(\beta_{0}\) \(C(Y | age=0, sex=m)\)
\(\beta_{1}\) \(C(Y | age=x+1, sex=m) - C(Y | age=x, sex=m)\)
\(\beta_{2}\) \(C(Y | age=0, sex=f) - C(Y | age=0, sex=m)\)
\(\beta_{3}\) \(C(Y | age=x+1, sex=f) - C(Y | age=x, sex=f) -\)
\([C(Y | age=x+1, sex=m) - C(Y | age=x, sex=m)]\)

\(\beta_{3}\) is the difference in slopes (female – male).

When a high-order effect such as an interaction effect is in the model, be sure to interpret low-order effects by finding out what makes the interaction effect ignorable. In our example, the interaction effect is zero when age=0 or sex is male.

Hypotheses that are usually inappropriate:

  1. \(H_{0}: \beta_{1}=0\): This tests whether age is associated with \(Y\) for males
  2. \(H_{0}: \beta_{2}=0\): This tests whether sex is associated with \(Y\) for zero year olds

More useful hypotheses follow. For any hypothesis need to

  • Write what is being tested
  • Translate to parameters tested
  • List the alternative hypothesis
  • Not forget what the test is powered to detect
    • Test against nonzero slope has maximum power when linearity holds
    • If true relationship is monotonic, test for non-flatness will have some but not optimal power
    • Test against a quadratic (parabolic) shape will have some power to detect a logarithmic shape but not against a sine wave over many cycles
  • Useful to write e.g. “\(H_{a}:\) age is associated with \(C(Y)\), powered to detect a linear relationship”

Most Useful Tests for Linear age \(\times\) sex Model

| Null or Alternative Hypothesis | Mathematical Statement |
|---|---|
| Effect of age is independent of sex, or effect of sex is independent of age, or age and sex are additive (age effects are parallel) | \(H_{0}: \beta_{3}=0\) |
| age interacts with sex; age modifies effect of sex; sex modifies effect of age; sex and age are non-additive (synergistic) | \(H_{a}: \beta_{3} \neq 0\) |
| age is not associated with \(Y\) | \(H_{0}: \beta_{1}=\beta_{3}=0\) |
| age is associated with \(Y\), i.e., age is associated with \(Y\) for either females or males | \(H_{a}: \beta_{1} \neq 0 \textrm{ or } \beta_{3} \neq 0\) |
| sex is not associated with \(Y\) | \(H_{0}: \beta_{2}=\beta_{3}=0\) |
| sex is associated with \(Y\), i.e., sex is associated with \(Y\) for some value of age | \(H_{a}: \beta_{2} \neq 0 \textrm{ or } \beta_{3} \neq 0\) |
| Neither age nor sex is associated with \(Y\) | \(H_{0}: \beta_{1}=\beta_{2}=\beta_{3}=0\) |
| Either age or sex is associated with \(Y\) | \(H_{a}: \beta_{1} \neq 0 \textrm{ or } \beta_{2} \neq 0 \textrm{ or } \beta_{3} \neq 0\) |

Note: The last test is called the global test of no association. If an interaction effect is present, there is both an age and a sex effect. There can also be age or sex effects when the lines are parallel. The global test of association (test of total association) has 3 d.f. instead of 2 (age + sex) because it allows for unequal slopes.
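To illustrate, all of the tests in the table come from a single fit of the model with the interaction included. This is a sketch with simulated data, assuming the rms package; anova reports the age and sex chunk tests (each 2 d.f., including the interaction), the 1 d.f. interaction test, and the 3 d.f. global test.

Code
require(rms)
set.seed(2)
n   <- 200
age <- rnorm(n, 50, 10)
sex <- factor(sample(c('m', 'f'), n, replace=TRUE))
y   <- 0.02*age + 0.5*(sex == 'f') + rnorm(n)   # simulated continuous response
f   <- ols(y ~ age * sex)
anova(f)   # age and sex rows include higher-order (interaction) terms; TOTAL is the 3 d.f. global test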

2.3.4 Review of Composite (Chunk) Tests

  • In the model
Code
y ~ age + sex + weight + waist + tricep

we may want to jointly test the association between all body measurements and response, holding age and sex constant.

  • This 3 d.f. test may be obtained two ways:
    • Remove the 3 variables and compute the change in \(SSR\) or \(SSE\)
    • Test \(H_{0}:\beta_{3}=\beta_{4}=\beta_{5}=0\) using matrix algebra (e.g., anova(fit, weight, waist, tricep) if fit is a fit object created by the R rms package)
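A sketch of the two routes to this 3 d.f. chunk test, assuming a hypothetical data frame d containing the variables in the formula: the first compares nested ordinary least squares fits via the change in \(SSE\); the second is the Wald test from rms mentioned above.

Code
full    <- lm(y ~ age + sex + weight + waist + tricep, data=d)
reduced <- lm(y ~ age + sex, data=d)
anova(reduced, full)                  # F test based on the change in SSE, 3 d.f.

require(rms)
fit <- ols(y ~ age + sex + weight + waist + tricep, data=d)
anova(fit, weight, waist, tricep)     # joint 3 d.f. Wald test of the body measurements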

2.4 Relaxing Linearity Assumption for Continuous Predictors

2.4.1 Avoiding Categorization

Natura non facit saltus
(Nature does not make jumps)
— Gottfried Wilhelm Leibniz


  • Relationships seldom linear except when predicting one variable from itself measured earlier

  • Categorizing continuous predictors into intervals is a disaster; see Royston et al. (2006), D. G. Altman (1991), Hilsenbeck & Clark (1996), Lausen & Schumacher (1996), D. G. Altman et al. (1994), Belcher (1992), Faraggi & Simon (1996), Ragland (1992), Suissa & Blais (1995), Buettner et al. (1997), Maxwell & Delaney (1993), Schulgen et al. (1994), Douglas G. Altman (1998), Holländer et al. (2004), Moser & Coombs (2004), Wainer (2006), Fedorov et al. (2009), Giannoni et al. (2014), Collins et al. (2016), Bennette & Vickers (2012) and Biostatistics for Biomedical Research, Chapter 18.

  • Some problems caused by this approach:

  1. Estimated values have reduced precision, and associated tests have reduced power
  2. Categorization assumes relationship between predictor and response is flat within intervals; far less reasonable than a linearity assumption in most cases
  3. To make a continuous predictor be more accurately modeled when categorization is used, multiple intervals are required
  4. Because of sample size limitations in the very low and very high range of the variable, the outer intervals (e.g., outer quintiles) will be wide, resulting in significant heterogeneity of subjects within those intervals, and residual confounding
  5. Categorization assumes that there is a discontinuity in response as interval boundaries are crossed. Other than the effect of time (e.g., an instant stock price drop after bad news), there are very few examples in which such discontinuities have been shown to exist.
  6. Categorization only seems to yield interpretable estimates. E.g. odds ratio for stroke for persons with a systolic blood pressure \(> 160\) mmHg compared to persons with a blood pressure \(\leq 160\) mmHg \(\rightarrow\) interpretation of OR depends on distribution of blood pressures in the sample (the proportion of subjects \(> 170\), \(> 180\), etc.). If blood pressure is modeled as a continuous variable (e.g., using a regression spline, quadratic, or linear effect) one can estimate the ratio of odds for exact settings of the predictor, e.g., the odds ratio for 200 mmHg compared to 120 mmHg.
  7. Categorization does not condition on full information. When, for example, the risk of stroke is being assessed for a new subject with a known blood pressure (say 162 mmHg), the subject does not report to her physician “my blood pressure exceeds 160” but rather reports 162 mmHg. The risk for this subject will be much lower than that of a subject with a blood pressure of 200 mmHg.
  8. If cutpoints are determined in a way that is not blinded to the response variable, calculation of \(P\)-values and confidence intervals requires special simulation techniques; ordinary inferential methods are completely invalid. E.g.: cutpoints chosen by trial and error utilizing \(Y\), even informally \(\rightarrow\) \(P\)-values too small and CLs not accurate3.
  9. Categorization not blinded to \(Y\) \(\rightarrow\) biased effect estimates (D. G. Altman et al. (1994), Schulgen et al. (1994))
  10. “Optimal” cutpoints do not replicate over studies. Holländer et al. (2004) state that “… the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the ‘optimal’ cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints. Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology.” Giannoni et al. (2014) demonstrated that many claimed “optimal cutpoints” are just the observed median values in the sample, which happens to optimize statistical power for detecting a separation in outcomes.
  11. Disagreements in cutpoints (which are bound to happen whenever one searches for things that do not exist) cause severe interpretation problems. One study may provide an odds ratio for comparing body mass index (BMI) \(> 30\) with BMI \(\leq 30\), another for comparing BMI \(> 28\) with BMI \(\leq 28\). Neither of these has a good definition and the two estimates are not comparable.
  12. Cutpoints are arbitrary and manipulable; cutpoints can be found that can result in both positive and negative associations Wainer (2006).
  13. If a confounder is adjusted for by categorization, there will be residual confounding that can be explained away by inclusion of the continuous form of the predictor in the model in addition to the categories.

3 If a cutpoint is chosen that minimizes the \(P\)-value and the resulting \(P\)-value is 0.05, the true type I error can easily be above 0.5 Holländer et al. (2004).

  • To summarize: The use of a (single) cutpoint \(c\) makes many assumptions, including:
  1. Relationship between \(X\) and \(Y\) is discontinuous at \(X=c\) and only \(X=c\)
  2. \(c\) is correctly found as the cutpoint
  3. \(X\) vs. \(Y\) is flat to the left of \(c\)
  4. \(X\) vs. \(Y\) is flat to the right of \(c\)
  5. The choice of \(c\) does not depend on the values of other predictors

Interactive demonstration of power loss of categorization vs. straight line and quadratic fits in OLS, with varying degree of nonlinearity and noise added to \(X\) (must run in RStudio)

Code
require(Hmisc)
getRs('catgNoise.r')

Example4 of misleading results from creating intervals (here, deciles) of a continuous predictor. Final interval is extremely heterogeneous and is greatly influenced by very large glycohemoglobin values, creating the false impression of an inflection point at 5.9.

4 From NHANES III; Diabetes Care 32:1327-34; 2009 adapted from Diabetes Care 20:1183-1197; 1997.

See this for excellent graphical examples of the harm of categorizing predictors, especially when using quantile groups.

2.4.2 Simple Nonlinear Terms

\[C(Y|X_{1}) = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{1}^{2} .\]

  • \(H_{0}:\) model is linear in \(X_{1}\) vs. \(H_{a}:\) model is quadratic in \(X_{1} \equiv H_{0}: \beta_{2}=0\).
  • Test of linearity may be powerful if true model is not extremely non-parabolic
  • Predictions not accurate in general as many phenomena are non-quadratic
  • Can get more flexible fits by adding powers higher than 2
  • But polynomials do not adequately fit logarithmic functions or “threshold” effects, and have unwanted peaks and valleys.

2.4.3 Splines for Estimating Shape of Regression Function and Determining Predictor Transformations

  • Draftsman’s spline: flexible strip of metal or rubber used to trace curves.
  • Spline Function: piecewise polynomial
  • Linear Spline Function: piecewise linear function
    • Bi-linear regression: model is \(\beta_{0}+\beta_{1}X\) if \(X \leq a\), \(\beta_{2}+\beta_{3}X\) if \(X > a\).
    • Problem with this notation: two lines not constrained to join
    • To force simple continuity: \(\beta_{0} + \beta_{1}X + \beta_{2}(X-a)\times [X>a] = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\),
      where \(X_{2} = (X_{1}-a) \times [X_{1} > a]\).
    • Slope is \(\beta_{1}, X \leq a\), \(\beta_{1}+\beta_{2}, X > a\).
    • \(\beta_{2}\) is the slope increment as you pass \(a\)
See this for a nice review and information about resources in R.

More generally: \(x\)-axis divided into intervals with endpoints \(a,b,c\) (knots).

\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}(X-a)_{+}+\beta_{3}(X-b)_{+} \nonumber\\ &+& \beta_{4}(X-c)_{+} , \end{array}\]

where

\[\begin{array}{ccc} (u)_{+}=&u,&u>0 ,\nonumber\\ &0,&u\leq0 . \end{array}\]

\[\begin{array}{ccc} f(X) & = \beta_{0}+\beta_{1}X, & X\leq a \nonumber \\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a) & a<X\leq b \\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a)+\beta_{3}(X-b) & b<X\leq c \nonumber\\ & = \beta_{0}+\beta_{1}X+\beta_{2}(X-a) & \nonumber\\ & + \beta_{3}(X-b)+\beta_{4}(X-c) & c<X. \nonumber \end{array}\]
Figure 2.1: A linear spline function with knots at \(a = 1, b = 3, c = 5\).

\[C(Y|X) = f(X) = X\beta,\] where \(X\beta = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\beta_{3}X_{3}+\beta_{4} X_{4}\), and

\[\begin{array}{cc} X_{1}=X & X_{2} = (X-a)_{+}\nonumber\\ X_{3}=(X-b)_{+} & X_{4} = (X-c)_{+} . \end{array}\]

Overall linearity in \(X\) can be tested by testing \(H_{0} : \beta_{2} = \beta_{3} = \beta_{4} = 0\).
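A minimal sketch of fitting this linear spline by ordinary least squares, constructing the \((X-a)_{+}\) terms with pmax; the data are simulated and the knots (1, 3, 5, as in Figure 2.1) are taken as given.

Code
set.seed(3)
x  <- runif(150, 0, 6)
y  <- 2 + x - 1.5*pmax(x - 3, 0) + rnorm(150, sd=0.5)   # true slope change at x = 3
k1 <- 1; k2 <- 3; k3 <- 5                               # knots a, b, c
f  <- lm(y ~ x + pmax(x - k1, 0) + pmax(x - k2, 0) + pmax(x - k3, 0))
anova(lm(y ~ x), f)    # overall test of linearity: H0: beta2 = beta3 = beta4 = 0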

2.4.4 Cubic Spline Functions

Cubic splines are smooth at knots (function, first and second derivatives agree) — can’t see joins.

\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}X^{2}+\beta_{3}X^{3}\nonumber\\ &+&\beta_{4}(X-a)_{+}^{3}+ \beta_{5}(X-b)_{+}^{3}+\beta_{6}(X-c)_{+}^{3}\\ &=& X\beta\nonumber \end{array}\] \[\begin{array}{cc} X_{1}=X & X_{2}=X^{2}\nonumber\\ X_{3}=X^{3} & X_{4}=(X-a)_{+}^{3}\\ X_{5}=(X-b)_{+}^{3} & X_{6}=(X-c)_{+}^{3}\nonumber . \end{array}\]

\(k\) knots \(\rightarrow k+3\) coefficients excluding intercept.

\(X^2\) and \(X^3\) terms must be included to allow nonlinearity when \(X < a\).

stats.stackexchange.com/questions/421964 has some useful descriptions of what happens at the knots, e.g.:

Knots are where different cubic polynomials are joined, and cubic splines force there to be three levels of continuity (the function, its slope, and its acceleration or second derivative (slope of the slope) do not change) at these points. At the knots the jolt (third derivative or rate of change of acceleration) is allowed to change suddenly, meaning the jolt is allowed to be discontinuous at the knots. Between knots, jolt is constant.

The following graphs show the function and its first three derivatives (all further derivatives are zero) for the function given by \(f(x) = x + x^{2} + 2x^{3} + 10(x - 0.25)^{3}_{+} - 50(x - 0.5)^{3}_{+} -100(x - 0.75)^{3}_{+}\) for \(x\) going from 0 to 1, where there are three knots, at \(x=0.25, 0.5, 0.75\).

Code
spar(bty='l', mfrow=c(4,1), bot=-1.5)
x  <- seq(0, 1, length=500)
x1 <- pmax(x - .25, 0)
x2 <- pmax(x - .50, 0)
x3 <- pmax(x - .75, 0)
b1 <- 1; b2 <- 1; b3 <- 2; b4 <- 10; b5 <- -50; b6 <- -100
y  <- b1 * x + b2 * x^2 + b3 * x^3 + b4 * x1^3 + b5 * x2^3 + b6 * x3^3
y1 <- b1     + 2*b2*x   + 3*b3*x^2 + 3*b4*x1^2 + 3*b5*x2^2 + 3*b6*x3^2
y2 <-          2*b2     + 6*b3*x   + 6*b4*x1   + 6*b5*x2   + 6*b6*x3
y3 <-                     6*b3     + 6*b4*(x1>0)+ 6*b5*(x2>0) + 6*b6*(x3>0)

g <- function() abline(v=(1:3)/4, col=gray(.85))
plot(x, y, type='l', ylab=''); g()
text(0, 1.5, 'Function', adj=0)

plot(x, y1, type='l', ylab=''); g()
text(0, -15, 'First Derivative: Slope\nRate of Change of Function',
     adj=0)
plot(x, y2, type='l', ylab=''); g()
text(0, -125, 'Second Derivative: Acceleration\nRate of Change of Slope',
     adj=0)
plot(x, y3, type='l', ylab=''); g()
text(0, -400, 'Third Derivative: Jolt\nRate of Change of Acceleration',
     adj=0)
Figure 2.2: A regular cubic spline function with three levels of continuity that prevent the human eye from detecting the knots. Also shown are the function’s first three derivatives. Knots are located at \(x=0.25, 0.5, 0.75\). For \(x\) beyond the outer knots, the function is not restricted to be linear. Linearity would imply an acceleration of zero. Vertical lines are drawn at the knots.

2.4.5 Restricted Cubic Splines

Stone & Koo (1985): cubic splines poorly behaved in tails. Constrain function to be linear in tails.
\(k+3 \rightarrow k-1\) parameters Devlin & Weeks (1986).


To force linearity when \(X < a\): the \(X^2\) and \(X^3\) terms must be omitted.
To force linearity when \(X\) is beyond the last knot: the last two \(\beta\)s are redundant, i.e., are just combinations of the other \(\beta\)s.

The restricted spline function with \(k\) knots \(t_{1}, \ldots, t_{k}\) is given by Devlin & Weeks (1986) \[f(X) = \beta_{0}+\beta_{1} X_{1}+\beta_{2} X_{2}+\ldots+\beta_{k-1} X_{k-1},\] where \(X_{1} = X\) and for \(j=1, \ldots, k-2\),


\[\begin{array}{ccc} X_{j+1} &= &(X-t_{j})_{+}^{3}-(X-t_{k-1})_{+}^{3} (t_{k}-t_{j})/(t_{k}-t_{k-1})\nonumber\\ &+&(X-t_{k})_{+}^{3} (t_{k-1}-t_{j})/(t_{k}-t_{k-1}). \end{array} \tag{2.1}\]

\(X_{j}\) is linear in \(X\) for \(X\geq t_{k}\).

For numerical behavior and to put all basis functions for \(X\) on the same scale, R Hmisc and rms package functions by default divide the terms above by \(\tau = (t_{k} - t_{1})^{2}\).

Code
require(Hmisc)
spar(mfrow=c(1,2), left=-2)
x <- rcspline.eval(seq(0,1,.01),
                   knots=seq(.05,.95,length=5), inclx=T)
xm <- x
xm[xm > .0106] <- NA
matplot(x[,1], xm, type="l", ylim=c(0,.01),
        xlab=expression(X), ylab='', lty=1)
matplot(x[,1], x,  type="l",
        xlab=expression(X), ylab='', lty=1)
Figure 2.3: Restricted cubic spline component variables for \(k = 5\) and knots at \(X = .05, .275, .5, .725\), and \(.95\). Nonlinear basis functions are scaled by \(\tau\). The left panel is a \(y\)–magnification of the right panel. Fitted functions such as those in Figure 2.4 will be linear combinations of these basis functions as long as knots are at the same locations used here.
Code
spar(left=-2, bot=2, mfrow=c(2,2), ps=13)
x <- seq(0, 1, length=300)
for(nk in 3:6) {
  set.seed(nk)
  knots <- seq(.05, .95, length=nk)
  xx <- rcspline.eval(x, knots=knots, inclx=T)
  for(i in 1 : (nk - 1))
    xx[,i] <- (xx[,i] - min(xx[,i])) /
              (max(xx[,i]) - min(xx[,i]))
  for(i in 1 : 20) {
    beta  <- 2*runif(nk-1) - 1
    xbeta <- xx %*% beta + 2 * runif(1) - 1
    xbeta <- (xbeta - min(xbeta)) /
             (max(xbeta) - min(xbeta))
    if(i == 1) {
      plot(x, xbeta, type="l", lty=1,
           xlab=expression(X), ylab='', bty="l")
      title(sub=paste(nk,"knots"), adj=0, cex=.75)
      for(j in 1 : nk)
        arrows(knots[j], .04, knots[j], -.03,
               angle=20, length=.07, lwd=1.5)
    }
    else lines(x, xbeta, col=i)
  }
}
Figure 2.4: Some typical restricted cubic spline functions for \(k = 3, 4, 5, 6\). The \(y\)–axis is \(X\beta\). Arrows indicate knots. These curves were derived by randomly choosing values of \(\beta\) subject to standard deviations of fitted functions being normalized.

Interactive demonstration of linear and cubic spline fitting, plus ordinary \(4^{th}\) order polynomial. This can be run with RStudio or in an ordinary R session.

Code
require(Hmisc)
getRs('demoSpline.r', put='rstudio')  # to see code in RStudio window
getRs('demoSpline.r')                 # to just run the code

Paul Lambert’s excellent self-contained interactive demonstrations of continuity restrictions, cubic polynomial, linear spline, cubic spline, and restricted cubic spline fitting are at pclambert.net/interactivegraphs. Jordan Gauthier has another nice interactive demonstration at drjgauthier.shinyapps.io/spliny.

Once \(\beta_{0}, \ldots, \beta_{k-1}\) are estimated, the restricted cubic spline can be restated in the form


\[\begin{array}{ccc} f(X) &=& \beta_{0}+\beta_{1}X+\beta_{2}(X-t_{1})_{+}^{3}+\beta_{3}(X-t_{2})_{+}^{3}\nonumber\\ && +\ldots+ \beta_{k+1}(X-t_{k})_{+}^{3} \end{array} \tag{2.2}\]

by dividing \(\beta_{2},\ldots,\beta_{k-1}\) by \(\tau\) and computing

\[\begin{array}{ccc} \beta_{k} &=& [\beta_{2}(t_{1}-t_{k})+\beta_{3}(t_{2}-t_{k})+\ldots\nonumber\\ && +\beta_{k-1}(t_{k-2}-t_{k})]/(t_{k}-t_{k-1})\nonumber\\ \beta_{k+1} &= & [\beta_{2}(t_{1}-t_{k-1})+\beta_{3}(t_{2}-t_{k-1})+\ldots\\ && + \beta_{k-1}(t_{k-2}-t_{k-1})]/(t_{k-1}-t_{k})\nonumber . \end{array}\]

A test of linearity in X can be obtained by testing

\[H_{0} : \beta_{2} = \beta_{3} = \ldots = \beta_{k-1} = 0.\]

Example: Selvin et al. (2010)
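A sketch of the restricted cubic spline fit and the test of linearity using rms, with simulated data and default knot locations; the Nonlinear row of the anova output is the test \(H_{0}: \beta_{2} = \ldots = \beta_{k-1} = 0\).

Code
require(rms)
require(ggplot2)
set.seed(4)
x  <- runif(300)
y  <- log1p(10*x) + rnorm(300, sd=0.3)   # a logarithmic (nonlinear) truth
dd <- datadist(x);  options(datadist='dd')
f  <- ols(y ~ rcs(x, 5))
anova(f)               # 'Nonlinear' row is the 3 d.f. test of linearity
ggplot(Predict(f))     # estimated transformation of x with confidence bands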

2.4.6 Choosing Number and Position of Knots

  • Knots are specified in advance in regression splines
  • Locations not important in most situations— Stone (1986), Durrleman & Simon (1989)
  • Place knots where data exist — fixed quantiles of predictor’s marginal distribution
  • Fit depends more on choice of \(k\)
| \(k\) | Quantiles |
|---|---|
| 3 | .10 .5 .90 |
| 4 | .05 .35 .65 .95 |
| 5 | .05 .275 .5 .725 .95 |
| 6 | .05 .23 .41 .59 .77 .95 |
| 7 | .025 .1833 .3417 .5 .6583 .8167 .975 |

\(n<100\) – replace outer quantiles with 5th smallest and 5th largest \(X\) (Stone & Koo (1985)).

Choice of \(k\):

  • Flexibility of fit vs. \(n\) and variance
  • Usually \(k=3,4,5\). Often \(k=4\)
  • Large \(n\) (e.g. \(n\geq 100\)) – \(k=5\)
  • Small \(n\) (\(<30\), say) – \(k=3\)
  • Can use Akaike’s information criterion (AIC) (Atkinson (1980), van Houwelingen & le Cessie (1990)) to choose \(k\)
  • This chooses \(k\) to maximize model likelihood ratio \(\chi^{2} - 2k\).
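A sketch (simulated data, rms assumed) of using AIC to guide the choice of \(k\); because AIC is \(-2\) log likelihood plus twice the number of parameters, minimizing AIC over \(k\) essentially corresponds to maximizing the model LR \(\chi^{2} - 2k\) described above.

Code
require(rms)
set.seed(5)
x <- runif(400)
y <- sin(2*pi*x) + rnorm(400, sd=0.4)
for(k in 3:6) {
  f <- ols(y ~ rcs(x, k))
  cat('knots:', k, '  AIC:', AIC(f), '\n')
}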

See Govindarajulu et al. (2007) for a comparison of restricted cubic splines, fractional polynomials, and penalized splines.

2.4.7 Nonparametric Regression

  • Estimate tendency (mean or median) of \(Y\) as a function of \(X\)
  • Few assumptions
  • Especially handy when there is a single \(X\)
  • Plotted trend line may be the final result of the analysis
  • Simplest smoother: moving average
\(X\): 1 2 3 5 8
\(Y\): 2.1 3.8 5.7 11.1 17.2
\[\begin{array}{ccc} \hat{E}(Y | X=2) &=& \frac{2.1+3.8+5.7}{3} \\ \hat{E}(Y | X=\frac{2+3+5}{3}) &=& \frac{3.8+5.7+11.1}{3} \end{array}\]
  • overlap OK
  • problem in estimating \(E(Y)\) at outer \(X\)-values
  • estimates very sensitive to bin width
  • Moving linear regression far superior to moving avg. (moving flat line)
  • Cleveland (1979) moving linear regression smoother loess (locally weighted least squares) is the most popular smoother. To estimate central tendency of \(Y\) at \(X=x\):
    • take all the data having \(X\) values within a suitable interval about \(x\) (default is \(\frac{2}{3}\) of the data)
    • fit weighted least squares linear regression within this neighborhood
    • points near \(x\) given the most weight5
    • points near extremes of interval receive almost no weight
    • loess works much better at extremes of \(X\) than moving avg.
    • provides an estimate at each observed \(X\); other estimates obtained by linear interpolation
    • outlier rejection algorithm built-in
  • loess works for binary \(Y\) — just turn off outlier detection
  • Other popular smoother: Friedman’s “super smoother”
  • For loess or supsmu amount of smoothing can be controlled by analyst
  • Another alternative: smoothing splines6
  • Smoothers are very useful for estimating trends in residual plots

5 Weight here means something different than regression coefficient. It means how much a point is emphasized in developing the regression coefficients.


6 These place knots at all the observed data points but penalize coefficient estimates towards smoothness.
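A minimal sketch, with simulated data, comparing three of the smoothers mentioned above: lowess (Cleveland-style locally weighted regression), supsmu (Friedman's super smoother), and a smoothing spline.

Code
set.seed(6)
x <- runif(200)
y <- sin(2*pi*x) + rnorm(200, sd=0.5)
plot(x, y, col=gray(.6))
lines(lowess(x, y, f=2/3), lwd=2)          # locally weighted regression; f is the span
lines(supsmu(x, y), lwd=2, lty=2)          # Friedman's super smoother
lines(smooth.spline(x, y), lwd=2, lty=3)   # smoothing spline (knots at data, penalized)
legend('topright', c('lowess', 'supsmu', 'smooth.spline'), lty=1:3, lwd=2, bty='n')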

2.4.8 Advantages of Regression Splines over Other Methods

Regression splines have several advantages (Harrell et al. (1988)):

  • Parametric splines can be fitted using any existing regression program
  • Regression coefficients are estimated using standard techniques (ML or least squares); formal tests of no overall association, linearity, and additivity, and confidence limits for the estimated regression function are derived by standard theory.
  • The fitted function directly estimates the transformation a predictor should receive to yield linearity in \(C(Y|X)\).
  • Even when a simple transformation is obvious, spline function can be used to represent the predictor in the final model (and the d.f. will be correct). Nonparametric methods do not yield a prediction equation.
  • Extension to non-additive models.
    Multi-dimensional nonparametric estimators often require burdensome computations.

2.5 Recursive Partitioning: Tree-Based Models

Breiman et al. (1984): CART (Classification and Regression Trees) — essentially model-free

Method:

  • Find predictor so that best possible binary split has maximum value of some statistic for comparing 2 groups
  • Within previously formed subsets, find best predictor and split maximizing criterion in the subset
  • Proceed in like fashion until \(<k\) obs. remain to split
  • Summarize \(Y\) for the terminal node (e.g., mean, modal category)
  • Prune tree backward until it cross-validates as well as its “apparent” accuracy, or use shrinkage
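A sketch of this procedure using the rpart package (a CART-style implementation) on simulated data; printcp shows cross-validated error by tree size and guides the pruning step.

Code
require(rpart)
set.seed(7)
n <- 400
d <- data.frame(x1=runif(n), x2=runif(n), x3=factor(sample(letters[1:3], n, replace=TRUE)))
d$y <- with(d, 1*(x1 > .5 & x2 > .5) + rnorm(n, sd=.5))   # interaction-driven outcome
f  <- rpart(y ~ x1 + x2 + x3, data=d, control=rpart.control(minsplit=20, cp=0.01))
printcp(f)    # cross-validated error (xerror) by tree size
fp <- prune(f, cp=f$cptable[which.min(f$cptable[, 'xerror']), 'CP'])
fp            # pruned tree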

Advantages/disadvantages of recursive partitioning:

  • Does not require functional form for predictors
  • Does not assume additivity — can identify complex interactions
  • Can deal with missing data flexibly
  • Interactions detected are frequently spurious
  • Does not use continuous predictors effectively
  • Penalty for overfitting in 3 directions
  • Often tree doesn’t cross-validate optimally unless pruned back very conservatively
  • Very useful in messy situations or those in which overfitting is not as problematic (confounder adjustment using propensity scores Cook & Goldman (1988); missing value imputation)

See Austin et al. (2010).

2.5.1 New Directions in Predictive Modeling

The approaches recommended in this course are

  • fitting fully pre-specified models without deletion of “insignificant” predictors
  • using data reduction methods (masked to \(Y\)) to reduce the dimensionality of the predictors and then fitting the number of parameters the data’s information content can support
  • using shrinkage (penalized estimation) to fit a large model without worrying about the sample size.

The data reduction approach can yield very interpretable, stable models, but there are many decisions to be made when using a two-stage (reduction/model fitting) approach. Newer approaches are evolving, including the following. These newer approaches handle continuous predictors well, unlike recursive partitioning.

  • lasso (shrinkage using L1 norm favoring zero regression coefficients) - Tibshirani (1996), Steyerberg et al. (2000)
  • elastic net (combination of L1 and L2 norms that handles the \(p > n\) case better than the lasso) Zou & Hastie (2005)
  • adaptive lasso H. H. Zhang & Lu (2007), H. Wang & Leng (2007)
  • more flexible lasso to differentially penalize for variable selection and for regression coefficient estimation (Radchenko & James (2008))
  • group lasso to force selection of all or none of a group of related variables (e.g., indicator variables representing a polytomous predictor)
  • group lasso-like procedures that also allow for variables within a group to be removed (S. Wang et al. (2009))
  • sparse-group lasso using L1 and L2 norms to achieve sparseness on groups and within groups of variables (N. Simon et al. (2013))
  • adaptive group lasso (Wang & Leng)
  • Breiman’s nonnegative garrote (Xiong (2010))
  • “preconditioning”, i.e., model simplification after developing a “black box” predictive model - Paul et al. (2008), Nott & Leng (2010)
  • sparse principal components analysis to achieve parsimony in data reduction Witten & Tibshirani (2008), Zhou et al. (2006), Leng & Wang (2009), Lee et al. (2010)
  • bagging, boosting, and random forests T. Hastie et al. (2008)
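A minimal sketch of the lasso and elastic net with the glmnet package on simulated data: alpha=1 gives the lasso, 0 < alpha < 1 the elastic net, and cv.glmnet chooses the penalty by cross-validation. glmnet standardizes the columns of the design matrix by default, which relates to the scaling issue raised next.

Code
require(glmnet)
set.seed(8)
n <- 200; p <- 30
x <- matrix(rnorm(n*p), n, p)
y <- x[, 1] - 0.5*x[, 2] + rnorm(n)        # only the first two predictors matter
cv1 <- cv.glmnet(x, y, alpha=1)            # lasso (L1 penalty)
cv2 <- cv.glmnet(x, y, alpha=0.5)          # elastic net (mixture of L1 and L2)
coef(cv1, s='lambda.1se')                  # sparse coefficients at the chosen penalty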

One problem prevents most of these methods from being ready for everyday use: they require scaling predictors before fitting the model. When a predictor is represented by nonlinear basis functions, the scaling recommendations in the literature are not sensible. There are also computational issues and difficulties obtaining hypothesis tests and confidence intervals.


When data reduction is not required, generalized additive models T. J. Hastie & Tibshirani (1990), Wood (2006) should also be considered.

2.5.2 Choosing Between Machine Learning and Statistical Modeling

  • Statistical models allow for complexity (nonlinearity, interaction)
  • Easy to allow every predictor to have nonlinear effect
  • Easy to handle unlimited numbers of candidate predictors if assume additivity (e.g., using ridge regression, lasso, elastic net)
  • Interactions should be pre-specified
  • Machine learning is gaining attention but is oversold in some settings
  • Researchers are under the mistaken impression that machine learning can be used on small samples

Considerations in Choosing One Approach over Another

A statistical model may be the better choice if

  • Uncertainty is inherent and the signal:noise ratio is not large; even with identical twins, one twin may get colon cancer and the other not; model tendencies instead of doing classification
  • One could never have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error
  • One wants to isolate effects of a small number of variables
  • Uncertainty in an overall prediction or the effect of a predictor is sought
  • Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be pre-specified
  • The sample size isn’t huge
  • One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor
  • One wants the entire model to be interpretable

Machine learning may be the better choice if

  • The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an “E” or not an “E”
  • The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)
  • Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)
  • One is not very interested in estimating uncertainty in forecasts or in effects of select predictors
  • Non-additivity is expected to be strong and can’t be isolated to a few pre-specified variables (e.g., in visual pattern recognition the letter “L” must have both a dominating vertical component and a dominating horizontal component)
  • The sample size is huge (van der Ploeg et al. (2014))
  • One does not need to isolate the effect of a special variable such as treatment
  • One does not care that the model is a “black box”

2.6 Multiple Degree of Freedom Tests of Association

\[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}^{2} ,\] \(H_{0}: \beta_{2}=\beta_{3}=0\) with 2 d.f. to assess association between \(X_{2}\) and outcome.


In the 5-knot restricted cubic spline model \[C(Y|X) = \beta_{0}+\beta_{1}X+\beta_{2}X'+\beta_{3}X''+\beta_{4}X''' ,\] \(H_{0}: \beta_{1}=\ldots=\beta_{4}=0\)

  • Test of association: 4 d.f.
  • Insignificant \(\rightarrow\) dangerous to interpret plot
  • What to do if 4 d.f. test insignificant, 3 d.f. test for linearity insig., 1 d.f. test sig. after delete nonlinear terms?

Grambsch & O’Brien (1991) elegantly described the hazards of pretesting

  • Studied quadratic regression
  • Showed 2 d.f. test of association is nearly optimal even when regression is linear if nonlinearity entertained
  • Considered ordinary regression model
    \(E(Y|X) =\beta_{0}+\beta_{1}X+\beta_{2}X^{2}\)
  • Two ways to test association between \(X\) and \(Y\)
  • Fit quadratic model and test for linearity (\(H_{0}: \beta_{2}=0\))
  • \(F\)-test for linearity significant at \(\alpha=0.05\) level \(\rightarrow\) report as the final test of association the 2 d.f. \(F\) test of \(H_{0}: \beta_{1}=\beta_{2}=0\)
  • If the test of linearity insignificant, refit without the quadratic term and final test of association is 1 d.f. test, \(H_{0}:\beta_{1}=0 | \beta_{2}=0\)
  • Showed that type I error \(> \alpha\)
  • Fairly accurate \(P\)-value obtained by instead testing against \(F\) with 2 d.f. even at second stage
  • Cause: we are retaining the most significant part of \(F\)
  • BUT if we test against 2 d.f. we can only lose power when compared with the original \(F\) for testing both \(\beta\)s
  • \(SSR\) from quadratic model \(> SSR\) from linear model
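A small simulation sketch of the pretesting strategy just described, under the null of no association: report the 1 d.f. test if the test of linearity is not significant, otherwise report the 2 d.f. test. As Grambsch & O'Brien showed, the resulting type I error exceeds the nominal \(\alpha\).

Code
set.seed(9)
nsim <- 2000; n <- 100; alpha <- 0.05
reject <- logical(nsim)
for(i in 1:nsim) {
  x  <- rnorm(n); y <- rnorm(n)            # no true association
  fq <- lm(y ~ x + I(x^2))
  p.nonlin <- summary(fq)$coefficients['I(x^2)', 'Pr(>|t|)']
  if(p.nonlin < alpha)
    p.assoc <- anova(lm(y ~ 1), fq)[2, 'Pr(>F)']   # keep quadratic: 2 d.f. test
  else
    p.assoc <- summary(lm(y ~ x))$coefficients['x', 'Pr(>|t|)']  # drop x^2: 1 d.f. test
  reject[i] <- p.assoc < alpha
}
mean(reject)   # empirical type I error of the two-stage strategy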

2.7 Assessment of Model Fit

2.7.1 Regression Assumptions

The general linear regression model is \[C(Y|X) = X\beta =\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\ldots+\beta_{k}X_{k} .\] Verify linearity and additivity. Special case: \[C(Y|X) = \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2},\] where \(X_{1}\) is binary and \(X_{2}\) is continuous.

Figure 2.5: Regression assumptions for one binary and one continuous predictor

Methods for checking fit:

  1. Fit simple linear additive model and examine residual plots for patterns
  • For OLS: box plots of \(e\) stratified by \(X_{1}\), scatterplots of \(e\) vs. \(X_{2}\) and \(\hat{Y}\), with trend curves (want flat central tendency, constant variability)
  • For normality, qqnorm plots of overall and stratified residuals

Advantage: Simplicity

Disadvantages:

  • Can only compute standard residuals for uncensored continuous response
  • Subjective judgment of non-randomness
  • Hard to handle interaction
  • Hard to see patterns with large \(n\) (trend lines help)
  • Seeing patterns does not lead to corrective action
  2. Scatterplot of \(Y\) vs. \(X_{2}\) using different symbols according to values of \(X_{1}\)
    Advantages: Simplicity, can see interaction

Disadvantages:

  • Scatterplots cannot be drawn for binary, categorical, or censored \(Y\)
  • Patterns difficult to see if relationships are weak or \(n\) large
  3. Stratify the sample by \(X_{1}\) and quantile groups (e.g. deciles) of \(X_{2}\); estimate \(C(Y|X_{1},X_{2})\) for each stratum
    Advantages: Simplicity, can see interactions, handles censored \(Y\) (if you are careful)

Disadvantages:

  • Requires large \(n\)
  • Does not use continuous var. effectively (no interpolation)
  • Subgroup estimates have low precision
  • Dependent on binning method
  4. Separately for levels of \(X_{1}\) fit a nonparametric smoother relating \(X_{2}\) to \(Y\)
    Advantages: All regression aspects of the model can be summarized efficiently with minimal assumptions

Disadvantages:

  • Does not apply to censored \(Y\)
  • Hard to deal with multiple predictors
  5. Fit flexible nonlinear parametric model

Advantages:

  • One framework for examining the model assumptions, fitting the model, drawing formal inference
  • d.f. defined and all aspects of statistical inference “work as advertised”

Disadvantages:

  • Complexity
  • Generally difficult to allow for interactions when assessing patterns of effects

Confidence limits, formal inference can be problematic for methods 1-4.


Restricted cubic spline works well for method 5.

\[\begin{array}{ccc} \hat{C}(Y|X) &=& \hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\hat{\beta}_{2}X_{2}+\hat{\beta}_{3}X_{2}'+\hat{\beta}_{4}X_{2}'' \nonumber\\ &=& \hat{\beta}_{0}+\hat{\beta}_{1}X_{1}+\hat{f}(X_{2}) , \end{array}\]

where \[\hat{f}(X_{2}) = \hat{\beta}_{2}X_{2}+\hat{\beta}_{3}X_{2}'+\hat{\beta}_{4}X_{2}'' ,\] \(\hat{f}(X_{2})\) spline-estimated transformation of \(X_{2}\).

  • Plot \(\hat{f}(X_{2})\) vs. \(X_{2}\)
  • \(n\) large \(\rightarrow\) can fit separate functions by \(X_{1}\)
  • Test of linearity: \(H_{0}:\beta_{3}=\beta_{4}=0\)
  • Few good reasons to do the test other than to demonstrate that linearity is not a good default assumption
  • Nonlinear \(\rightarrow\) use transformation suggested by spline fit or keep spline terms
  • Tentative transformation \(g(X_{2})\) \(\rightarrow\) check adequacy by expanding \(g(X_{2})\) in spline function and testing linearity
  • Can find transformations by plotting \(g(X_{2})\) vs. \(\hat{f}(X_{2})\) for variety of \(g\)
  • Multiple continuous predictors \(\rightarrow\) expand each using spline
  • Example: assess linearity of \(X_{2}, X_{3}\)

\[\begin{array}{ccc} C(Y|X) &=& \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}'+\beta_{4}X_{2}'' \nonumber\\ &+& \beta_{5}X_{3}+\beta_{6}X_{3}'+\beta_{7}X_{3}'' , \end{array}\]

Overall test of linearity \(H_{0}: \beta_{3}=\beta_{4}=\beta_{6}=\beta_{7}=0\), with 4 d.f.
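A sketch of this model using rms with simulated data; \(X_{2}\) and \(X_{3}\) are each expanded in 4-knot restricted cubic splines, and anova gives the separate nonlinearity tests plus the pooled 4 d.f. test.

Code
require(rms)
set.seed(10)
n  <- 300
x1 <- rbinom(n, 1, .5)
x2 <- runif(n);  x3 <- runif(n)
y  <- x1 + sqrt(x2) + x3 + rnorm(n, sd=.4)
dd <- datadist(x1, x2, x3);  options(datadist='dd')
f  <- ols(y ~ x1 + rcs(x2, 4) + rcs(x3, 4))
anova(f)   # 'Nonlinear' rows for x2 and x3; the pooled nonlinearity test has 4 d.f.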

2.7.2 Modeling and Testing Complex Interactions


Note: Interactions will be misleading if main effects are not properly modeled (M. Zhang et al. (2020)).

Suppose \(X_1\) binary or linear, \(X_2\) continuous:

\[\begin{array}{ccc} C(Y|X) & = & \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+\beta_{3}X_{2}'+\beta_{4}X_{2}'' \\ \nonumber &&+\beta_{5}X_{1}X_{2}+\beta_{6}X_{1}X_{2}'+\beta_{7}X_{1}X_{2}'' \end{array}\]

Simultaneous test of linearity and additivity: \(H_{0}: \beta_{3} = \ldots = \beta_{7} = 0\).

  • 2 continuous variables: could transform separately and form simple product
  • But transformations depend on whether interaction terms adjusted for, so it is usually not possible to estimate transformations and interaction effects other than simultaneously
  • Compromise: Fit interactions of the form \(X_{1} f(X_{2})\) and \(X_{2} g(X_{1})\):


\[\begin{array}{ccc} C(Y|X) & = & \beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{1}'+\beta_{3}X_{1}'' \nonumber\\ &+& \beta_{4}X_{2}+\beta_{5}X_{2}'+\beta_{6}X_{2}'' \nonumber\\ &+& \beta_{7}X_{1}X_{2}+\beta_{8}X_{1}X_{2}'+\beta_{9}X_{1}X_{2}'' \\ &+& \beta_{10}X_{2}X_{1}'+\beta_{11}X_{2}X_{1}'' \nonumber \end{array} \tag{2.3}\]

  • Test of additivity is \(H_{0}: \beta_{7} = \beta_{8} = \ldots = \beta_{11} = 0\) with 5 d.f.
  • Test of lack of fit for the simple product interaction with \(X_{2}\) is \(H_{0}: \beta_{8} = \beta_{9} = 0\)
  • Test of lack of fit for the simple product interaction with \(X_{1}\) is \(H_{0}: \beta_{10} = \beta_{11} = 0\)
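In rms, this restricted form of interaction (Equation 2.3, with no doubly nonlinear terms) can be requested with the %ia% operator; a sketch, assuming hypothetical variables y, x1, and x2 in a data frame d:

Code
require(rms)
f <- ols(y ~ rcs(x1, 4) + rcs(x2, 4) + rcs(x1, 4) %ia% rcs(x2, 4), data=d)
anova(f)   # interaction rows give the 5 d.f. test of additivity and its nonlinear components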

General spline surface:

  • Cover \(X_{1} \times X_{2}\) plane with grid and fit patch-wise cubic polynomial in two variables
  • Restrict to be of form \(aX_{1}+bX_{2}+cX_{1}X_{2}\) in corners
  • Uses all \((k-1)^{2}\) cross-products of restricted cubic spline terms
  • See Gray (1992), Gray (1994) for penalized splines allowing control of effective degrees of freedom. See Berhane et al. (2008) for a good discussion of tensor splines.
Figure 2.6: Logistic regression estimate of probability of a hemorrhagic stroke for patients in the GUSTO-I trial given \(t\)-PA, using a tensor spline of two restricted cubic splines and penalization (shrinkage). Dark (cold color) regions are low risk, and bright (hot) regions are higher risk.

Figure 2.6 is particularly interesting because the literature had suggested (based on approximately 24 strokes) that pulse pressure was the main cause of hemorrhagic stroke whereas this flexible modeling approach (based on approximately 230 strokes) suggests that mean arterial blood pressure (roughly a \(45^\circ\) line) is what is most important over a broad range of blood pressures. At the far right one can see that pulse pressure (axis perpendicular to \(45^\circ\) line) may have an impact although a non-monotonic one.

Other issues:

  • \(Y\) non-censored (especially continuous) \(\rightarrow\) multi-dimensional scatterplot smoother (Chambers & Hastie (1992))
  • Interactions of order \(>2\): more trouble
  • 2-way interactions among \(p\) predictors: pooled tests
  • \(p\) tests each with \(p-1\) d.f.

Some types of interactions to pre-specify in clinical studies:

  • Treatment \(\times\) severity of disease being treated
  • Age \(\times\) risk factors
  • Age \(\times\) type of disease
  • Measurement \(\times\) state of a subject during measurement
  • Race \(\times\) disease
  • Calendar time \(\times\) treatment
  • Quality \(\times\) quantity of a symptom
  • Measurement \(\times\) amount of deterioration of the measurement

The last example is worth expanding as an example in model formulation. Consider the following study.

  • A sample of patients seen over several years have a blood sample taken at time of hospitalization
  • Blood samples are frozen
  • Long after the last patient was sampled, the blood samples are thawed all in the same week and a blood analysis is done
  • It is known that the quality of the blood analysis deteriorates roughly logarithmically by the age of the sample; blood measurements made on old samples are assumed to be less predictive of outcome
  • This is reflected in an interaction between a function of sample age and the blood measurement B7
  • Patients were followed for an event, and the outcome variable of interest is the time from hospitalization to that event
  • To not assume a perfect logarithmic relationship for sample age on the effect of the blood measurement, a restricted cubic spline model with 3 default knots will be fitted for log sample age
  • Sample age is assumed to not modify the effects of non-blood predictors patient age and sex
  • Model may be specified the following way using the R rms package to fit a Cox proportional hazards model
  • Test for nonlinearity of sampleAge tests the adequacy of assuming a plain logarithmic trend in sample age

7 For continuous \(Y\) one might need to model the residual variance of \(Y\) as increasing with sample age, in addition to modeling the mean function.

Code
f <- cph(Surv(etime, event) ~ rcs(log(sampleAge), 3) * rcs(B, 4) +
           rcs(age, 5) * sex, data=mydata)

The B \(\times\) sampleAge interaction effects have 6 d.f. and test whether sample deterioration affects the effect of B. By not assuming that B has the same effect for old samples as for young samples, the investigator will be able to estimate the effect of B on outcome when the blood analysis is ideal by inserting sampleAge = 1 day when requesting predicted values as a function of B.
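For example (a sketch continuing the hypothetical fit above), predictions as a function of B with sampleAge fixed at 1 day display the estimated effect of B under near-ideal sample conditions:

Code
dd <- datadist(mydata);  options(datadist='dd')   # needed for Predict's default ranges
ggplot(Predict(f, B, sampleAge=1))                # log relative hazard vs. B for a 1-day-old sample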

2.7.3 Fitting Ordinal Predictors

  • Small no. categories (3-4) \(\rightarrow\) polytomous factor, indicator variables
  • Design matrix for easy test of adequacy of initial codes \(\rightarrow\) \(k\) original codes + \(k-2\) indicators
  • More categories \(\rightarrow\) score using data-driven trend. Later tests use \(k-1\) d.f. instead of 1 d.f.
  • E.g., compute logit(mortality) vs. category
  • Much better: use penalized maximum likelihood estimation (R ordSmooth package) or Bayesian shrinkage (R brms package).

2.7.4 Distributional Assumptions

  • Some models (e.g., logistic): all assumptions in \(C(Y|X)=X\beta\) (implicitly assuming no omitted variables!)
  • Linear regression: \(Y \sim X\beta + \epsilon, \epsilon \sim n(0,\sigma^{2})\)
  • Examine distribution of residuals
  • Some models (Weibull, Cox (1972)):
    \(C(Y|X) = C(Y=y|X) = d(y)+X\beta\)
    \(C =\) log hazard
  • Check form of \(d(y)\)
  • Show \(d(y)\) does not interact with \(X\)

2.8 Complex Curve Fitting Example

  • Restricted cubic spline
  • Discontinuity (interrupted time series analysis)
  • Cyclic trend (seasonality)
  • Data from academic.oup.com/ije/article/46/1/348/2622842 by Bernal et al. (2017)
  • Rates of hospital admissions for acute coronary events in Sicily before and after a smoking ban on 2005-01
  • Poisson regression on case counts, adjusted for population size as an offset variable (analyzes event rate) (see stats.stackexchange.com/questions/11182)
  • Classic interrupted time series puts a discontinuity at the intervention point and assesses statistical evidence for a nonzero jump height
  • We will do that and also fit a continuous cubic spline but with multiple knots near the intervention point
  • All in context of long-term and seasonal trends
  • Uses the rms package gTrans function documented at hbiostat.org/R/rms/gtrans.html
  • Can do this without gTrans but harder to plot predicted values, get contrasts, and get chunk tests
  • Time variable is months after 2001-12-01
Code
require(rms)  # engages rms which also engages Hmisc which provides getHdata
options(prType='html')   # applies to printing model fits
getHdata(sicily)   # fetch dataset from hbiostat.org/data
d    <- sicily
dd   <- datadist(d);  options(datadist='dd')

Start with a standard restricted cubic spline fit, 6 knots at default quantile locations. From the fitted Poisson model we estimate the number of cases per a constant population size of 100,000.

Code
require(ggplot2)
g   <- function(x) exp(x) * 100000
off <- list(stdpop=mean(d$stdpop))    # offset for prediction (383464.4)
w   <- geom_point(aes(x=time, y=rate), data=d)
v   <- geom_vline(aes(xintercept=37, col=I('red')))
yl  <- ylab('Acute Coronary Cases Per 100,000')
f   <- Glm(aces ~ offset(log(stdpop)) + rcs(time, 6), data=d, family=poisson)
f$aic
[1] 721.5237
Code
ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

  • To add seasonality to the model can add sine and/or cosine terms
  • See pmean.com/07/CyclicalTrends.html by Steve Simon
  • If you knew the month at which incidence is a minimum, could just add a sine term to the model
  • Adding both sine and cosine terms effectively allows for a model parameter that estimates the time origin
  • Assume the period (cycle length) is 12m
Code
# Save knot locations
k  <- attr(rcs(d$time, 6), 'parms')
k
[1]  5.00 14.34 24.78 35.22 45.66 55.00
Code
kn <- k
# rcspline.eval is the rcs workhorse
h <- function(x) cbind(rcspline.eval(x, kn),
                       sin=sin(2*pi*x/12), cos=cos(2*pi*x/12))
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family=poisson)
f$aic
[1] 674.112
Code
ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

Next add more knots near intervention to allow for sudden change

Code
kn <- sort(c(k, c(36, 37, 38)))
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family=poisson)
f$aic
[1] 661.7904
Code
ggplot(Predict(f, fun=g, offset=off)) + w + v + yl

Now make the slow trend simpler (6 knots) but add a discontinuity at the intervention. More finely control times at which predictions are requested, to handle discontinuity.

Code
h <- function(x) {
  z <- cbind(rcspline.eval(x, k),
             sin=sin(2*pi*x/12), cos=cos(2*pi*x/12),
             jump=x >= 37)
  attr(z, 'nonlinear') <- 2 : ncol(z)
  z
    }
f <- Glm(aces ~ offset(log(stdpop)) + gTrans(time, h),
         data=d, family=poisson, x=TRUE, y=TRUE) # x, y for LRTs
f$aic
[1] 659.6044
Code
times <- sort(c(seq(0, 60, length=200), 36.999, 37, 37.001))
ggplot(Predict(f, time=times, fun=g, offset=off)) + w + v + yl

Look at fit statistics especially evidence for the jump

Code
f

General Linear Model

Glm(formula = aces ~ offset(log(stdpop)) + gTrans(time, h), family = poisson, 
    data = d, x = TRUE, y = TRUE)
Model Likelihood
Ratio Test
Obs 59 LR χ2 169.64
Residual d.f. 51 d.f. 7
g 0.080 Pr(>χ2) <0.0001
β S.E. Wald Z Pr(>|Z|)
Intercept  -6.2118  0.0095 -656.01 <0.0001
time   0.0635  0.0113 5.63 <0.0001
time'  -0.1912  0.0433 -4.41 <0.0001
time''   0.2653  0.0760 3.49 0.0005
time'''  -0.2409  0.0925 -2.61 0.0092
sin   0.0343  0.0067 5.11 <0.0001
cos   0.0380  0.0065 5.86 <0.0001
jump  -0.1268  0.0313 -4.06 <0.0001

  • Evidence for an intervention effect (jump)
  • Evidence for seasonality
  • Could have analyzed rates using a semiparametric model

Compute likelihood ratio \(\chi^2\) test statistics for this model

Code
anova(f, test='LR')
Likelihood Ratio Statistics for aces
χ2 d.f. P
time 169.64 7 <0.0001
Nonlinear 127.03 6 <0.0001
TOTAL 169.64 7 <0.0001

Get a joint LR test of seasonality and discontinuity by omitting 3 terms from the model

Code
g <- Glm(aces ~ offset(log(stdpop)) + rcs(time, k),
         data=d, family=poisson)
lrtest(f, g)

Model 1: aces ~ offset(log(stdpop)) + gTrans(time, h)
Model 2: aces ~ offset(log(stdpop)) + rcs(time, k)

  L.R. Chisq         d.f.            P 
6.591931e+01 2.000000e+00 4.884981e-15 

2.9 Bayesian Modeling

There are many advantages to fitting models with a Bayesian approach when compared to the frequentist / maximum likelihood approach that receives more coverage in this text. These advantages include

  • the ability to use outside information, e.g. the direction or magnitude of an effect, magnitude of interaction effect, degree of nonlinearity
  • getting exact inference (to within simulation error) regarding model parameters without using any Gaussian approximations as used so heavily in frequentist inference
  • getting exact inference about derived parameters that are nonlinear transformations of the original model parameters, without using the \(\delta\)-method approximation so often required in frequentist procedures
  • devoting less than one degree of freedom to a parameter by borrowing information
  • getting exact inference using ordinary Bayesian procedures when penalization/shrinkage/regularization is used to limit overfitting
  • allowing one to fit distributional models, e.g., using a regression model for the log standard deviation in addition to one for the mean, and getting exact inference for the joint models
  • obtaining automatic inference about any interval of possible parameter values instead of just using \(p\)-values to bring evidence against a null value
  • obtaining exact inference about unions and intersections of various assertions about parameters, e.g., the probability that a treatment reduces mortality by any amount or reduces blood pressure by \(\geq 5\) mmHg
It is just as easy to compute the Bayesian probability that an odds ratio exceeds 2.0 as it is to calculate the probability that the odds ratio exceeds 1.0.

A major part of the arsenal of Bayesian modeling weapons is Stan based at Columbia University. Very general R statistical modeling packages such as brms and rstanarm are based on Stan.

RMS has several fully worked-out Bayesian modeling case studies. The purpose of the remainder of this chapter is to show the power of Bayes in general regression modeling.

2.9.1 A General Model Specification Approach

With a Bayesian approach one can include a parameter for each aspect of the model you know exists but are unsure about. This leads to results that are not overly optimistic because uncertainty intervals will be a little wider to acknowledge what you don’t know. A good example is the Bayesian \(t\)-test, which has a parameter for the amount of non-normality and a parameter for how unequal the variances may be in the two groups being compared. Prior distributions can favor normality and equal variance, and modeling becomes more flexible as \(n \uparrow\).

Other examples of borrowing information and adding flexibility:

  • include a parameter for time-varying treatment effect to not assume proportional hazards
  • include a parameter for a \(Y\)-varying effect to not assume proportional odds
  • include a treatment \(\times\) sex interaction effect and use a prior that favors the treatment effect for females being similar to the treatment effect for males but that allows the effects to become arbitrarily different as \(n \uparrow\).

Interactions bring special problems to estimation and inference. In the best of cases, an interaction requires \(4 \times\) larger sample size to estimate and may require \(16 \times\) the sample size to achieve the same power as a main effect test. We need a way to borrow information, essentially having an interaction term “half in” and “half out” of the model. This has been elegantly described by R. Simon & Freedman (1997) who show how to put priors on interaction terms. Using a Bayesian approach to have an interaction “half in” the model is a much more rational approach than prevailing approaches that

  • use a sample that was sized for estimating main effects and not interaction effects
  • use a (low power) test for interaction to decide how to analyze the treatment effect, and ignore such pre-testing when doing a statistical test (that will not preserve \(\alpha\)) or computing a confidence interval (that will not have the nominal \(1 - \alpha\) coverage)
  • use the pre-specified model with interaction, resulting in low precision because of having no way to borrow information across levels of the interacting factor

2.9.2 Help in Specifying Priors

To gain the advantages of Bayesian modeling described above, doing away with binary decisions and allowing the use of outside information, one must specify prior distributions for parameters. It is often difficult to do this, especially when there are nonlinear effects (e.g., splines) and interactions in the model. We need a way to specify priors on the original \(X\) and \(Y\) scales. Fortunately Stan provides an elegant solution.

As discussed here, Stan allows one to specify priors on transformations of model parameters, and these priors propagate back to the original parameters. It is easier to specify a prior for the effect of increasing age from 30 to 60 than it is to specify a prior for the age slope. It may be difficult to specify a prior for an age \(\times\) treatment interaction, but much easier to specify a prior for how different the treatment effect is for a 30 year old and a 60 year old. By specifying priors on one or more contrasts one can easily encode outside information / information borrowing / shrinkage.

The rms contrast function provides a general way to implement contrasts up to double differences; more details about computations are provided in that link. The approach used for specifying priors for contrasts in rmsb::blrm uses the same process but is even more general. Both contrast and blrm compute design matrices at user-specified predictor settings, and the contrast matrices (matrices multiplied by \(\hat{\beta}\)) are simply differences in such design matrices. Thinking of contrasts as differences in predicted values frees the user from having to care about how parameters map to estimands, and allows an R predict(fit, type='x') function to do the hard work. Examples of types of differences are below, followed by a short sketch using contrast.

rmsb implements priors on contrasts starting with version 1.0-0.
  • no difference: compute an absolute predicted value (not implemented in blrm for priors)
  • single difference
    • treatment main effect
    • continuous predictor effects computed by subtracting predictions at two values of the predictor
  • double difference
    • amount of nonlinearity (differences in slopes over intervals of the predictor)
    • interaction effect (e.g., age slope for females minus age slope for males)
  • triple difference
    • amount of nonlinearity in an interaction effect
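To make single and double differences concrete, here is a minimal sketch of the rms contrast function, assuming a previously fitted rms model fit containing age and treatment (hypothetical variable names).

Code
require(rms)
# Single difference: effect of increasing age from 30 to 60
contrast(fit, list(age=60), list(age=30))
# Double difference (interaction contrast): the age 30 -> 60 effect under
# treatment B minus the same effect under treatment A
contrast(fit, list(age=60, treatment='B'), list(age=30, treatment='B'),
              list(age=60, treatment='A'), list(age=30, treatment='A'))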

For predictors modeled linearly, the slope is the regression coefficient. For nonlinear effects where \(x\) is transformed by \(f(x)\), consider equally spaced points \(a < b < c\). The slope at \(x=\frac{a+b}{2}\) is approximately proportional to \(f(b) - f(a)\), and the slope at \(x=\frac{b+c}{2}\) to \(f(c) - f(b)\). The amount of nonlinearity is reflected by the difference in the two slopes, \(f(c) - f(b) - [f(b) - f(a)] = f(a) + f(c) - 2f(b)\). You’ll see this form specified in the contrast part of the pcontrast argument to blrm below.

This is a numerical approximation to the second derivative (slope of the slope; acceleration). It would be easy to use more accurate Lagrange interpolation derivative approximators here.
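As a quick numerical check of the second-difference idea, here is a minimal sketch using the sine-wave truth simulated below, with hypothetical spacing \(h = 0.25\) (the same spacing used in the pcontrast example later in this section).

Code
g <- function(x) sin(2 * x)        # a smooth transformation f(x)
h <- 0.25                          # equally spaced points a, b = a + h, cc = b + h
a <- -1; b <- a + h; cc <- b + h
(g(a) + g(cc) - 2 * g(b)) / h^2    # scaled second difference evaluated at b
-4 * sin(2 * b)                    # exact second derivative of sin(2x) at b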

2.9.3 Examples of Priors on Contrasts

Semiparametric models are introduced in Chapter 13, but we will use one of the models—the proportional odds (PO) ordinal logistic model—in showcasing the utility of specifying priors on contrasts in order to use external information or to place restrictions on model fits. The blrm function in the rmsb package implements this semiparametric model using Stan. Because it does not depend on knowing how to transform \(Y\), I almost always use the more robust ordinal models instead of linear models. The linear predictor \(X\beta\) is on the logit (log odds) scale for the PO model. This unitless scale typically ranges from -5 to 5, corresponding to a range of probabilities of 0.007 to 0.993. Default plotting uses the intercept corresponding to the marginal median of \(Y\), so the log odds of the probability that \(Y\) exceeds or equals this level, given \(X\), is plotted. Estimates can be converted to means, quantiles, or exceedance probabilities using the Mean, Quantile, and ExProb functions in the rms and rmsb packages.

Ordinal models in the cumulative probability class such as the PO model have \(k\) intercepts for \(k+1\) distinct values of \(Y\). These intercepts encode the entire empirical distribution of \(Y\) for one covariate setting, hence the term semiparametric and freedom from having to choose a \(Y\) distribution.
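As a small illustration with made-up data (a minimal sketch), an intercept-only PO model reproduces the empirical exceedance probabilities of \(Y\) through its intercepts.

Code
require(rms)
y <- c(1, 1, 2, 3, 3, 3, 5)   # 4 distinct values of Y -> 3 intercepts
f <- orm(y ~ 1)               # intercept-only proportional odds model
plogis(coef(f))               # fitted Pr(Y >= 2), Pr(Y >= 3), Pr(Y >= 5)
c(5, 4, 1) / 7                # empirical exceedance probabilities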

Effects for the PO model are usually expressed as odds ratios (OR). For the case where the prior median for the OR is 1.0 (prior mean or median \(\log(\text{OR})=0\)) it is useful to solve for the prior SD \(\sigma\) so that \(\Pr(\text{OR} > r) = a = \Pr(\text{OR} < \frac{1}{r})\), leading to \(\sigma = \frac{|\log(r)|}{\Phi^{-1}(1-a)}\), computed by the psigma function below. Another function . is defined as an abbreviation for list() for later usage.

Code
psigma <- function(r, a, inline=FALSE, pr=! inline) {
  # Prior SD sigma such that Pr(OR > r) = a when r > 1 (or Pr(OR < r) = a
  # when r < 1), assuming log(OR) ~ N(0, sigma^2)
  sigma <- abs(log(r)) / qnorm(1 - a)
  dir <- if(r > 1.) '>' else '<'
  x <- if(inline) paste0('$\\Pr(\\text{OR}', dir, r, ') =', a,
                         ' \\Rightarrow \\sigma=', round(sigma, 3), '$')
  else paste0('Pr(OR ', dir, ' ', r, ') = ', a, ' ⇒ σ=', round(sigma, 3))
  if(inline) return(x)   # LaTeX string for inline use in text
  if(pr) {               # print the statement and invisibly return sigma
    cat('\n', x, '\n\n', sep='')
    return(invisible(sigma))
  }
  sigma
}
. <- function(...) list(...)   # abbreviation for list()

Start with a simple hypothetical example:

  • model has a quadratic effect of age
  • age interacts with treatment
  • sex is also in the model, not interacting with anything

We wish to specify a prior on the treatment effect at age 50 so that there is only a 0.05 chance that the \(\text{OR} < 0.5\). \(\Pr(\text{OR}<0.5) =0.05 \Rightarrow \sigma=0.421\). The covariate settings specified in pcontrast below do not mention sex, so predictions are evaluated at the default sex (the mode). Since sex does not interact with anything, the treatment difference of interest makes the sex setting irrelevant anyway.

Code
require(rmsb)
f <- blrm(y ~ treatment * pol(age, 2) + sex,
          pcontrast=list(sd=psigma(0.5, 0.05),
                         c1=.(treatment='B', age=50),  # .() = list()
                         c2=.(treatment='A', age=50),
                         contrast=expression(c1 - c2) ) )

Note that when writing pcontrast the user need not consider how age is modeled.

Consider a more complicated situation. Let’s simulate data for one continuous predictor where the true model is a sine wave. The response variable is a slightly rounded version of a conditionally normal \(Y\).

The rounding is done just to lower the number of intercepts from 199 to 52 to speed up the Bayesian PO model fits.
Code
require(rmsb)
require(ggplot2)
options(mc.cores=4,    # See https://hbiostat.org/r/examples/blrm/blrm
        rmsb.backend='cmdstan', rmsbdir='~/.rmsb',
        prType='html')
cmdstanr::set_cmdstan_path('/usr/local/bin/cmdstan-2.34.1')
set.seed(3)
n <- 200
x <- rnorm(n)
y <- round(sin(2*x) + rnorm(n), 1)
dd <- datadist(x, q.display=c(.005, .995)); options(datadist='dd')
f <- blrm(y ~ rcs(x, 6))
Initial log joint probability = -759.904 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
      29      -552.575    0.00122596    0.00234747      0.4355           1       34    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.1 seconds.
Running MCMC with 4 parallel chains...

Chain 2 finished in 1.0 seconds.
Chain 4 finished in 1.0 seconds.
Chain 1 finished in 1.1 seconds.
Chain 3 finished in 1.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.1 seconds.
Total execution time: 1.3 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6))
|            | Mixed Calibration/Discrimination Indexes | Discrimination Indexes  | Rank Discrim. Indexes    |
|------------|------------------------------------------|-------------------------|--------------------------|
| Obs 200    | LOO log L -787.71±14.88                  | g 1.428 [1.153, 1.73]   | C 0.699 [0.69, 0.704]    |
| Draws 4000 | LOO IC 1575.42±29.76                     | gp 0.297 [0.253, 0.341] | Dxy 0.397 [0.381, 0.409] |
| Chains 4   | Effective p 78.65±7.27                   | EV 0.281 [0.188, 0.357] |                          |
| Time 1.7s  | B 0.197 [0.193, 0.203]                   | v 1.604 [0.915, 2.191]  |                          |
| p 5        |                                          | vp 0.07 [0.047, 0.089]  |                          |
Mode β Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
x   -3.1222   -3.0966   -3.1071   0.7914   -4.5327   -1.4872  0.0003  1.07
x'   17.0350   16.8046   16.8815   8.0429   1.4312   33.0922  0.9792  0.99
x''  -20.1181  -19.1344  -19.4768  38.0700   -97.8284   54.5320  0.3020  1.02
x'''  -37.4296  -38.5821  -38.2656  58.8854  -158.9895   76.0041  0.2550  1.00
x''''   50.7852   50.9529   50.6317  50.5337   -50.0252  148.2597  0.8398  1.03
Code
ggplot(Predict(f))

Code
# Plot predicted mean instead of log odds
M <- Mean(f)
ggplot(Predict(f, fun=M),
               ylab=expression(hat(E)(Y*"|"*x))) +
  geom_smooth(mapping=aes(x,y)) +
  labs(caption='Black line: posterior mean of predicted means from PO model\nBlue line: loess nonparametric smoother')

Now suppose that there is strong prior knowledge that the effect of x is linear when x is in the interval \([-1, 0]\). Let’s reflect that by putting a very sharp prior on the difference in slopes within that interval, tilting it towards 0.0. pcontrast= specifies two separate contrasts to pull towards zero, so that nonlinearity is penalized at a finer resolution within the interval.

The examples that follow use atypically small prior standard deviations so that constraints will be obvious.
Code
con <- list(sd=rep(psigma(1.05, 0.01), 2),
            c1=.(x=-1),   c2=.(x=-.75), c3=.(x=-.5),
            c4=.(x=-.25), c5=.(x=0),
            contrast=expression(c1 + c3 - 2 * c2, c3 + c5 - 2 * c4))

Pr(OR > 1.05) = 0.01 ⇒ σ=0.021
Code
f <- blrm(y ~ rcs(x, 6), pcontrast=con)
Initial log joint probability = -744.826 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
      47      -564.845   0.000491529    0.00396647      0.3177           1       56    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.1 seconds.
Running MCMC with 4 parallel chains...

Chain 1 finished in 2.1 seconds.
Chain 3 finished in 2.2 seconds.
Chain 4 finished in 2.2 seconds.
Chain 2 finished in 2.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 2.2 seconds.
Total execution time: 2.4 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6), pcontrast = con)
|            | Mixed Calibration/Discrimination Indexes | Discrimination Indexes  | Rank Discrim. Indexes    |
|------------|------------------------------------------|-------------------------|--------------------------|
| Obs 200    | LOO log L -800.3±14.97                   | g 0.938 [0.636, 1.25]   | C 0.658 [0.643, 0.681]   |
| Draws 4000 | LOO IC 1600.6±29.94                      | gp 0.208 [0.151, 0.268] | Dxy 0.316 [0.287, 0.362] |
| Chains 4   | Effective p 74.32±6.47                   | EV 0.141 [0.073, 0.211] |                          |
| Time 2.8s  | B 0.214 [0.208, 0.22]                    | v 0.74 [0.331, 1.233]   |                          |
| p 5        |                                          | vp 0.035 [0.018, 0.053] |                          |
Mode β Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
x   0.2739   0.2760   0.2717   0.3082   -0.3205   0.8730  0.8162  1.01
x'   1.3744   1.3907   1.3821   0.9277   -0.3533   3.2728  0.9328  0.99
x''   -6.0056   -6.1230   -6.0893   3.4195   -13.2538   -0.0368  0.0332  0.98
x'''   24.1346   24.5147   24.3647  10.6199   4.1416   45.1982  0.9902  1.02
x''''  -65.1571  -66.0545  -65.5550  22.2431  -110.5397  -24.3161  0.0015  0.96

Contrasts Given Priors

[1] list(sd = c(0.0209728582358081, 0.0209728582358081), c1 = list(   
[2]     x = -1), c2 = list(x = -0.75), c3 = list(x = -0.5), c4 = list(
[3]     x = -0.25), c5 = list(x = 0), contrast = expression(c1 +      
[4]     c3 - 2 * c2, c3 + c5 - 2 * c4))                               
Code
f$Contrast   # Print the design matrix corresponding to the two contrasts
  rcs(x, 6)x rcs(x, 6)x' rcs(x, 6)x'' rcs(x, 6)x''' rcs(x, 6)x''''
1          0  0.02911963  0.002442679    0.00000000              0
1          0  0.04851488  0.020842754    0.00300411              0
Code
ggplot(Predict(f))

What happens if we moderately limit the acceleration (second derivative; slope of the slope) at 7 equally-spaced points?

Code
con <- list(sd=rep(0.5, 7),
            c1=.(x=-2), c2=.(x=-1.5), c3=.(x=-1), c4=.(x=-.5), c5=.(x=0),
            c6=.(x=.5), c7=.(x=1), c8=.(x=1.5), c9=.(x=2),
            contrast=expression(c1 + c3 - 2 * c2,
                                c2 + c4 - 2 * c3,
                                c3 + c5 - 2 * c4,
                                c4 + c6 - 2 * c5,
                                c5 + c7 - 2 * c6,
                                c6 + c8 - 2 * c7,
                                c7 + c9 - 2 * c8) )
f <- blrm(y ~ rcs(x, 6), pcontrast=con)
Initial log joint probability = -733.911 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
      31      -563.446   0.000882598     0.0019205      0.5105      0.5105       35    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.1 seconds.
Running MCMC with 4 parallel chains...

Chain 4 finished in 0.9 seconds.
Chain 2 finished in 1.0 seconds.
Chain 1 finished in 1.1 seconds.
Chain 3 finished in 1.1 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.0 seconds.
Total execution time: 1.2 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.052 for Intercepts

blrm(formula = y ~ rcs(x, 6), pcontrast = con)
|            | Mixed Calibration/Discrimination Indexes | Discrimination Indexes  | Rank Discrim. Indexes    |
|------------|------------------------------------------|-------------------------|--------------------------|
| Obs 200    | LOO log L -785.25±13.78                  | g 1.061 [0.772, 1.293]  | C 0.694 [0.683, 0.704]   |
| Draws 4000 | LOO IC 1570.51±27.56                     | gp 0.237 [0.19, 0.286]  | Dxy 0.388 [0.365, 0.407] |
| Chains 4   | Effective p 73.14±6.29                   | EV 0.178 [0.101, 0.241] |                          |
| Time 1.6s  | B 0.198 [0.193, 0.203]                   | v 0.885 [0.463, 1.288]  |                          |
| p 5        |                                          | vp 0.044 [0.025, 0.06]  |                          |
Mode β Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
x   -2.0426   -2.0030   -2.0174   0.6300   -3.1758  -0.6852  0.0010  1.04
x'   12.6712   12.5307   12.6299   4.9780   2.9104  22.3549  0.9900  0.98
x''  -21.0673  -20.9391  -21.1785  21.8647  -62.8108  20.5844  0.1747  1.01
x'''   -9.4571   -9.0169   -8.5099  32.9117  -75.0163  51.7359  0.3922  0.98
x''''   14.4363   13.6367   13.1019  29.1170  -44.1863  69.6172  0.6815  1.04

Contrasts Given Priors

[1] list(sd = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5), c1 = list(x = -2),    
[2]     c2 = list(x = -1.5), c3 = list(x = -1), c4 = list(x = -0.5),      
[3]     c5 = list(x = 0), c6 = list(x = 0.5), c7 = list(x = 1), c8 = list(
[4]         x = 1.5), c9 = list(x = 2), contrast = expression(c1 +        
[5]         c3 - 2 * c2, c2 + c4 - 2 * c3, c3 + c5 - 2 * c4, c4 +         
[6]         c6 - 2 * c5, c5 + c7 - 2 * c6, c6 + c8 - 2 * c7, c7 +         
[7]         c9 - 2 * c8))                                                 
Code
f$Contrast   # Print the design matrix corresponding to the seven contrasts
  rcs(x, 6)x rcs(x, 6)x' rcs(x, 6)x'' rcs(x, 6)x''' rcs(x, 6)x''''
1          0  0.01298375  0.000000000   0.000000000    0.000000000
1          0  0.07768802  0.002453429   0.000000000    0.000000000
1          0  0.15526901  0.045575694   0.003046185    0.000000000
1          0  0.23285000  0.122161511   0.048638084    0.001537246
1          0  0.30602473  0.196347206   0.122778948    0.037926277
1          0  0.24969458  0.170741274   0.117781863    0.055476864
1          0  0.06289905  0.043179908   0.029952920    0.014391805
Code
ggplot(Predict(f))

Next simulate data with one continuous predictor x1 and a 3-level grouping variable x2. Start with almost flat priors that allow arbitrary interaction patterns as long as x1 has a smooth effect.

Code
set.seed(6)
n <- 90
x1 <- runif(n)
x2 <- sample(c('a', 'b', 'c'), n, TRUE)
y  <- round(x1 + (x1 - 0.5) ^ 2 -0.4 * (x2 == 'b') + .5 * (x2 == 'c') + runif(n), 1)
dd <- datadist(x1, x2)
f <- orm(y ~ rcs(x1, 4) * x2)
ggplot(Predict(f, x1, x2))

Code
f <- blrm(y ~ rcs(x1, 4) * x2)
Initial log joint probability = -275.928 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
      47      -152.587    0.00114036    0.00143689      0.9485      0.9485       53    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.1 seconds.
Running MCMC with 4 parallel chains...

Chain 3 finished in 0.4 seconds.
Chain 1 finished in 0.5 seconds.
Chain 2 finished in 0.5 seconds.
Chain 4 finished in 0.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 0.6 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.105 for Intercepts

blrm(formula = y ~ rcs(x1, 4) * x2)

Frequencies of Responses

  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 
  1   1   4   1   2   1   8   5   5   5   4   3   8   3   7   8   3   4   3   3 
  2 2.1 2.2 2.3 2.7 
  4   4   1   1   1 
|            | Mixed Calibration/Discrimination Indexes | Discrimination Indexes   | Rank Discrim. Indexes    |
|------------|------------------------------------------|--------------------------|--------------------------|
| Obs 90     | LOO log L -251.49±9.68                   | g 3.831 [3.087, 4.721]   | C 0.847 [0.832, 0.859]   |
| Draws 4000 | LOO IC 502.98±19.35                      | gp 0.44 [0.407, 0.473]   | Dxy 0.695 [0.663, 0.718] |
| Chains 4   | Effective p 43.59±4.36                   | EV 0.648 [0.537, 0.762]  |                          |
| Time 0.9s  | B 0.11 [0.099, 0.121]                    | v 11.325 [7.196, 17.021] |                          |
| p 11       |                                          | vp 0.16 [0.128, 0.188]   |                          |
Mode β Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
x1   1.5721   1.4055   1.3174   4.7155   -7.8307   10.4658  0.6150  1.02
x1'   -5.8955   -5.0212   -5.1153  15.5477   -36.3046   24.4310  0.3675  1.00
x1''   44.1427   40.9260   41.4864  45.7927   -46.3901  131.3788  0.8092  1.01
x2=b   -2.5977   -2.6503   -2.6244   1.5806   -5.8521   0.4163  0.0450  0.97
x2=c   6.5730   6.4575   6.4376   1.9068   2.7954   10.3532  0.9998  1.04
x1 × x2=b   -0.1114   -0.1065   -0.1995   7.1523   -14.3788   13.3477  0.4938  0.98
x1' × x2=b   9.7368   9.8309   10.0249  22.9770   -34.8053   54.7470  0.6628  0.97
x1'' × x2=b   -28.6540   -28.8044   -29.6637  66.0464  -160.9128   97.6813  0.3298  1.03
x1 × x2=c   -11.4684   -11.1547   -11.2030   8.5059   -27.1265   6.0600  0.0928  0.99
x1' × x2=c   39.1911   37.8146   37.7590  26.5617   -13.1120   89.4993  0.9268  1.02
x1'' × x2=c  -119.1246  -114.2376  -114.7394  73.6667  -258.1990   26.0909  0.0583  0.98
Code
ggplot(Predict(f, x1, x2))

Put priors specifying that groups b and c have a similar x1-shape (no partial interaction between x1 and b vs. c). The contrast expressions below encode parallelism of the b and c curves.

Code
con <- list(sd=rep(psigma(1.5, 0.05), 4),
            c1=.(x1=0,   x2='b'), c2=.(x1=0,   x2='c'),
            c3=.(x1=.25, x2='b'), c4=.(x1=.25, x2='c'),
            c5=.(x1=.5,  x2='b'), c6=.(x1=.5,  x2='c'),
            c7=.(x1=.75, x2='b'), c8=.(x1=.75, x2='c'),
            c9=.(x1=1,   x2='b'), c10=.(x1=1,  x2='c'),
            contrast=expression(c1 - c2 - (c3 - c4),  # gap between b and c curves at x1=0 vs. x1=.25
                                c1 - c2 - (c5 - c6),
                                c1 - c2 - (c7 - c8),
                                c1 - c2 - (c9 - c10) ))

Pr(OR > 1.5) = 0.05 ⇒ σ=0.247
Code
f <- blrm(y ~ rcs(x1, 4) * x2, pcontrast=con)
Initial log joint probability = -332.174 
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes  
      77      -154.898    0.00197516    0.00232123           1           1       85    
Optimization terminated normally:  
  Convergence detected: relative gradient magnitude is below tolerance 
Finished in  0.1 seconds.
Running MCMC with 4 parallel chains...

Chain 2 finished in 1.5 seconds.
Chain 4 finished in 1.8 seconds.
Chain 1 finished in 1.9 seconds.
Chain 3 finished in 1.8 seconds.

All 4 chains finished successfully.
Mean chain execution time: 1.8 seconds.
Total execution time: 2.0 seconds.
Code
f

Bayesian Proportional Odds Ordinal Logistic Model

Dirichlet Priors With Concentration Parameter 0.105 for Intercepts

blrm(formula = y ~ rcs(x1, 4) * x2, pcontrast = con)

Frequencies of Responses

  0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 
  1   1   4   1   2   1   8   5   5   5   4   3   8   3   7   8   3   4   3   3 
  2 2.1 2.2 2.3 2.7 
  4   4   1   1   1 
|            | Mixed Calibration/Discrimination Indexes | Discrimination Indexes  | Rank Discrim. Indexes    |
|------------|------------------------------------------|-------------------------|--------------------------|
| Obs 90     | LOO log L -252.87±9.56                   | g 3.553 [2.879, 4.194]  | C 0.844 [0.83, 0.855]    |
| Draws 4000 | LOO IC 505.74±19.12                      | gp 0.425 [0.393, 0.457] | Dxy 0.688 [0.662, 0.711] |
| Chains 4   | Effective p 41.23±4.67                   | EV 0.584 [0.487, 0.681] |                          |
| Time 2.3s  | B 0.109 [0.098, 0.121]                   | v 9.891 [6.331, 13.406] |                          |
| p 11       |                                          | vp 0.145 [0.117, 0.168] |                          |
Mode β Mean β Median β S.E. Lower Upper Pr(β>0) Symmetry
x1   1.3872   1.2949   1.2634   4.5721   -7.2921   10.9135  0.6168  1.02
x1'   -5.3197   -4.7258   -4.5720  15.1777   -36.0190   23.1949  0.3825  0.99
x1''   39.7289   37.4367   36.2995  44.5967   -53.2062  120.8291  0.8045  1.02
x2=b   -1.1474   -1.1790   -1.2099   1.3117   -3.7379   1.4107  0.1815  1.02
x2=c   4.2584   4.2138   4.2340   1.4052   1.5111   7.0485  0.9990  1.02
x1 × x2=b   -4.8033   -4.7442   -4.6551   6.0352   -16.7051   6.7515  0.2182  1.00
x1' × x2=b   23.4146   22.8466   22.4774  19.3795   -15.9664   59.8044  0.8818  1.05
x1'' × x2=b  -72.8422  -70.5298  -68.8468  55.5136  -184.3112   32.4627  0.1010  0.97
x1 × x2=c   -4.6847   -4.6153   -4.5425   6.0636   -16.6970   7.1548  0.2255  0.98
x1' × x2=c   23.0095   22.4053   22.1281  19.5092   -17.3973   58.4469  0.8702  1.06
x1'' × x2=c  -72.1958  -69.7826  -69.8212  55.8542  -180.0213   35.2193  0.1055  0.97

Contrasts Given Priors

[1] list(sd = c(0.246505282576203, 0.246505282576203, 0.246505282576203, 
[2] 0.246505282576203), c1 = list(x1 = 0, x2 = "b"), c2 = list(x1 = 0,   
[3]     x2 = "c"), c3 = list(x1 = 0.25, x2 = "b"), c4 = list(x1 = 0.25,  
[4]     x2 = "c"), c5 = list(x1 = 0.5, x2 = "b"), c6 = list(x1 = 0.5,    
[5]     x2 = "c"), c7 = list(x1 = 0.75, x2 = "b"), c8 = list(x1 = 0.75,  
[6]     x2 = "c"), c9 = list(x1 = 1, x2 = "b"), c10 = list(x1 = 1,       
[7]     x2 = "c"), contrast = expression(c1 - c2 - (c3 - c4), c1 -       
[8]     c2 - (c5 - c6), c1 - c2 - (c7 - c8), c1 - c2 - (c9 - c10)))      
Code
f$Contrast
  rcs(x1, 4)x1 rcs(x1, 4)x1' rcs(x1, 4)x1'' x2b x2c rcs(x1, 4)x1:x2b
1            0             0              0   0   0            -0.25
1            0             0              0   0   0            -0.50
1            0             0              0   0   0            -0.75
1            0             0              0   0   0            -1.00
  rcs(x1, 4)x1':x2b rcs(x1, 4)x1'':x2b rcs(x1, 4)x1:x2c rcs(x1, 4)x1':x2c
1      -0.006089308        0.000000000             0.25       0.006089308
1      -0.091848739       -0.002089092             0.50       0.091848739
1      -0.372867191       -0.062281932             0.75       0.372867191
1      -0.879577838       -0.236978793             1.00       0.879577838
  rcs(x1, 4)x1'':x2c
1        0.000000000
1        0.002089092
1        0.062281932
1        0.236978793
Code
ggplot(Predict(f, x1, x2))

2.10 Study Questions

  1. Contrast statistical models vs. machine learning in a few ways

Section 2.1-2.2

  1. What is always linear in the linear model?

Section 2.3

  1. Define an interaction effect in words
  2. In a model with main effects and two-way interactions, which regression parameter is the most “universal” by virtue of being most independent of coding?
  3. When a predictor is involved in an interaction, which test of association involving that predictor is preferred?
  4. What are two ways of doing chunk (composite) tests?
  5. An analyst intends to use least squares multiple regression to predict follow-up blood pressure from baseline blood pressure, age, and sex. She wants to use this model for estimation, prediction, and inference (statistical tests and confidence limits). List all of the assumptions she makes by fitting the model \(Y = \beta_{0} + \beta_{1} \text{bp} + \beta_{2} \text{age} + \beta_{3} [\text{sex=female}]\).
  6. List as many methods as you can for checking the assumptions you listed.

Section 2.4.1

  1. Provide an example where categorizing a continuous predictor is a good idea
  2. If you dichotomize a continuous predictor that has a smooth relationship with Y, how can you arbitrarily change its estimated regression coefficients?
  3. What is a general term used for the statistical problem caused by categorizing continuous variables?
  4. What tends to happen when data are used to find cutpoints for relationships that are in fact smooth?
  5. What other inferential distortion is caused by categorization?
  6. Is there an amount of noise that could be added to a continuous X that makes a categorized analysis better than a continuous one?

Section 2.4.2

  1. What is the biggest drawback to using regular polynomials to fit smooth nonlinear relationships?
  2. Why does a spline function require one to use a truncated term?
  3. What advantages does a linearly tail-restricted cubic spline have?
  4. Why are its constructed variables \(X'\), \(X''\), \(X'''\), etc. so complex?
  5. What do you need to do to handle a sudden curvature change when fitting a nonlinear spline?
  6. Barring a sudden curvature change, why are knot locations not so important in a restricted cubic spline (rcs)?
  7. What is the statistical term for the dilemma that occurs when choosing the number of knots?
  8. What is a Bayesian way of handling this?
  9. What is the core idea behind Cleveland’s loess smoother?
  10. Can fitted spline functions for some of the predictors be an end in themselves?

Section 2.5

  1. What is the worst thing about recursive partitioning?
  2. What is a major tradeoff between the modeling strategy we will largely use and ML?
  3. What are some determinants of the best choice between SM and ML?

Section 2.6

  1. What is the main message of the Grambsch and O’Brien paper?
  2. What notational error do many analysts make that leads them to ignore model uncertainty?

Section 2.7.2

  1. Why is it a good idea to pre-specify interactions, and how is this done?
  2. Why do we not fiddle with transformations of a predictor before considering interactions?
  3. What 3-dimensional restriction do the cross-products of two restricted cubic splines force?
  4. What is an efficient and likely-to-fit approach to allowing for the effect of measurement degradation?