```r
require(rms)
require(ggplot2)
# knitrSet(lang='markdown', fig.path='png/')
options(prType='html')
```
General Multi-parameter Transformations in rms
As of version 6.2-0 of the `rms` package, which appeared on CRAN on 2021-03-18, there is a new transformation function `gTrans`, joining `lsp`, `rcs`, `pol`, etc. `gTrans` takes a numeric, character, or factor vector as its first argument and a function as its second argument. The latter can be any function that creates a numeric matrix. Give the columns of the matrix names if you don’t want to use default names. If you want to take advantage of automatic tests of nonlinearity when running `rms::anova.rms` (e.g., by issuing `anova(fit)`), give the returned matrix an attribute `nonlinear`, which is a vector of integers specifying which columns correspond to nonlinear terms. Columns default to being considered linear terms. If you want the matrix you generate to have a linear term, it is up to you to add a column containing the original (numeric) vector.
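As a minimal sketch of this contract (the function and column names here are hypothetical, not part of `rms`), a `gTrans`-compatible basis generator pairing a linear term with a log term could look like this:

```r
# Hypothetical gTrans basis generator: column 1 is the original (linear)
# vector; column 2 is a named nonlinear term.  The 'nonlinear' attribute
# tells anova() which columns to include in tests of nonlinearity.
logBasis <- function(x) {
  z <- cbind(x, logx = log1p(x))   # log1p(x) = log(1 + x)
  attr(z, 'nonlinear') <- 2        # column 2 is a nonlinear term
  z
}
```

A fit would then be specified as, e.g., `ols(y ~ gTrans(x, logBasis))`.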
`gTrans` is quite general. With it you can do things such as the following:

- use a linear spline basis different from the one `lsp` uses
- use a cubic spline basis different from the one `rcs` uses
- use a quadratic spline
- specify a linear, quadratic, or spline function with discontinuities
- implement non-spline bases such as fractional polynomials
- add cyclic trends to the regression
- pool certain categories of a categorical predictor
Examples where you might want to fit a discontinuous function are interrupted time series and allowing for special effects at zero. As an example of the latter, suppose that a predictor is the number of days that a patient has been in the hospital in the past year. Zero has a special meaning, so one may wish to allow for a spike at zero by using `gTrans(x, function(x) cbind(x == 0, sqrt(x), x))`. This fits the following model:

\[Y = \beta_{0} + \beta_{1}[x = 0] + \beta_{2}\sqrt{x} + \beta_{3}x\]

The reason to use `gTrans` instead of creating a separate model term for the discontinuity is that all the predictor’s basis functions remain connected. This is important for several reasons, including the following.
- When studying the effect of varying \(x\), the terms will be forced to vary together. This will force, for example, \(\beta_1\) to be ignored when \(x > 0\) where one is studying the effect of \(x\) on the continuous part of the model. This applies to partial effect plots, odds ratio charts, nomograms, etc.
- When using `anova(fit)` to do a “chunk” test for whether \(x\) is associated with \(Y\), the 1 d.f. for the discontinuous term will be added to the 2 d.f. for \(\beta_2\) and \(\beta_3\) to get the correct 3 d.f. test for flatness (no association). `anova` can also provide chunk tests of nonlinearity when you keep the terms together.
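To make the spike-at-zero idea concrete, the anonymous function from `gTrans(x, function(x) cbind(x == 0, sqrt(x), x))` can be evaluated by hand to see exactly which columns the fit receives (a sketch; the helper name `spike0` is just for illustration):

```r
# The spike-at-zero basis from the model above:
# column 1 = indicator [x = 0], column 2 = sqrt(x), column 3 = linear x
spike0 <- function(x) cbind(x == 0, sqrt(x), x)
spike0(c(0, 1, 4))
# row for x = 0 is (1, 0, 0); row for x = 4 is (0, 2, 4)
```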
Here are some examples with graphical output. Start by simulating some random data.
```r
n <- 40
set.seed(1)
y  <- runif(n)
x1 <- runif(n)
x2 <- rnorm(n)
g  <- sample(letters[1:4], n, TRUE)
# We usually plot predicted values from the 10th smallest to the 10th largest observations
# Instead for demonstrating fits we plot over the whole range
dd <- datadist(x1, x2, g, q.display=c(0,1)); options(datadist='dd')
```
Quadratic Fit Duplicating `pol(x, 2)`
```r
f <- ols(y ~ pol(x1, 2) * g + x2)

pol2 <- function(x) {
  # Give a custom name xsq for one column
  z <- cbind(x, xsq=x^2)
  attr(z, 'nonlinear') <- 2
  z
}
h <- ols(y ~ gTrans(x1, pol2) * g + x2)
specs(h, long=TRUE)
```
```
ols(formula = y ~ gTrans(x1, pol2) * g + x2)

       Assumption   Parameters                      d.f.
x1     gTrans                                       2
g      category     a b c d                         3
x2     asis                                         1
x1 * g interaction  nonlinear x linear - f(A)B      6

                       x1    g          x2
Low:effect      0.3377484 <NA> -0.59514731
Adjust to       0.4985898    c  0.05117174
High:effect     0.7787237 <NA>  0.57997131
Low:prediction  0.0233312    a -1.80495863
High:prediction 0.9606180    d  2.40161776
Low             0.0233312    a -1.80495863
High            0.9606180    d  2.40161776
```
```r
rbind(coef(f), coef(h))
```

```
     Intercept        x1     x1^2      g=b       g=c        g=d         x2
[1,] 0.7901302 -1.227246 1.002612 0.250034 -1.225043 -0.0724312 0.02494889
[2,] 0.7901302 -1.227246 1.002612 0.250034 -1.225043 -0.0724312 0.02494889
       x1 * g=b x1^2 * g=b x1 * g=c x1^2 * g=c   x1 * g=d x1^2 * g=d
[1,] -0.1810583 0.06080463  4.38774  -3.491963 -0.1317308  0.5119778
[2,] -0.1810583 0.06080463  4.38774  -3.491963 -0.1317308  0.5119778
```
```r
anova(f)
```

Analysis of Variance for y

| | d.f. | Partial SS | MS | F | P |
|---|---|---|---|---|---|
| x1 (Factor+Higher Order Factors) | 8 | 0.4691615 | 0.05864518 | 0.69 | 0.6936 |
| All Interactions | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| Nonlinear (Factor+Higher Order Factors) | 4 | 0.2503327 | 0.06258318 | 0.74 | 0.5727 |
| g (Factor+Higher Order Factors) | 9 | 0.5117214 | 0.05685793 | 0.67 | 0.7260 |
| All Interactions | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| x2 | 1 | 0.0125370 | 0.01253700 | 0.15 | 0.7031 |
| x1 × g (Factor+Higher Order Factors) | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| Nonlinear | 3 | 0.1705474 | 0.05684914 | 0.67 | 0.5763 |
| Nonlinear Interaction : f(A,B) vs. AB | 3 | 0.1705474 | 0.05684914 | 0.67 | 0.5763 |
| TOTAL NONLINEAR | 4 | 0.2503327 | 0.06258318 | 0.74 | 0.5727 |
| TOTAL NONLINEAR + INTERACTION | 7 | 0.4666141 | 0.06665916 | 0.79 | 0.6029 |
| REGRESSION | 12 | 0.7834533 | 0.06528777 | 0.77 | 0.6724 |
| ERROR | 27 | 2.2818581 | 0.08451326 | | |
```r
anova(h)
```

Analysis of Variance for y

| | d.f. | Partial SS | MS | F | P |
|---|---|---|---|---|---|
| x1 (Factor+Higher Order Factors) | 8 | 0.4691615 | 0.05864518 | 0.69 | 0.6936 |
| All Interactions | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| Nonlinear (Factor+Higher Order Factors) | 4 | 0.2503327 | 0.06258318 | 0.74 | 0.5727 |
| g (Factor+Higher Order Factors) | 9 | 0.5117214 | 0.05685793 | 0.67 | 0.7260 |
| All Interactions | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| x2 | 1 | 0.0125370 | 0.01253700 | 0.15 | 0.7031 |
| x1 × g (Factor+Higher Order Factors) | 6 | 0.2900861 | 0.04834768 | 0.57 | 0.7489 |
| Nonlinear | 3 | 0.1705474 | 0.05684914 | 0.67 | 0.5763 |
| Nonlinear Interaction : f(A,B) vs. AB | 3 | 0.1705474 | 0.05684914 | 0.67 | 0.5763 |
| TOTAL NONLINEAR | 4 | 0.2503327 | 0.06258318 | 0.74 | 0.5727 |
| TOTAL NONLINEAR + INTERACTION | 7 | 0.4666141 | 0.06665916 | 0.79 | 0.6029 |
| REGRESSION | 12 | 0.7834533 | 0.06528777 | 0.77 | 0.6724 |
| ERROR | 27 | 2.2818581 | 0.08451326 | | |
```r
summary(f)
```

Effects — Response: y

| | Low | High | Δ | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| x1 | 0.3377 | 0.7787 | 0.441 | 0.16810 | 0.16690 | -0.1743 | 0.5105 |
| x2 | -0.5951 | 0.5800 | 1.175 | 0.02932 | 0.07612 | -0.1269 | 0.1855 |
| g --- a:c | 3.0000 | 1.0000 | | -0.09457 | 0.19360 | -0.4919 | 0.3028 |
| g --- b:c | 3.0000 | 2.0000 | | 0.08031 | 0.20190 | -0.3339 | 0.4945 |
| g --- d:c | 3.0000 | 4.0000 | | -0.10540 | 0.20090 | -0.5177 | 0.3069 |
```r
summary(h)
```

Effects — Response: y

| | Low | High | Δ | Effect | S.E. | Lower 0.95 | Upper 0.95 |
|---|---|---|---|---|---|---|---|
| x1 | 0.3377 | 0.7787 | 0.441 | 0.16810 | 0.16690 | -0.1743 | 0.5105 |
| x2 | -0.5951 | 0.5800 | 1.175 | 0.02932 | 0.07612 | -0.1269 | 0.1855 |
| g --- a:c | 3.0000 | 1.0000 | | -0.09457 | 0.19360 | -0.4919 | 0.3028 |
| g --- b:c | 3.0000 | 2.0000 | | 0.08031 | 0.20190 | -0.3339 | 0.4945 |
| g --- d:c | 3.0000 | 4.0000 | | -0.10540 | 0.20090 | -0.5177 | 0.3069 |
```r
ggplot(Predict(f))
ggplot(Predict(h))
```
```r
k1 <- list(x1=c(.2, .4), g='b')
k2 <- list(x1=c(.2, .4), g='d')
contrast(f, k1)
```

```
   x1 g         x2    Contrast      S.E.      Lower     Upper     t Pr(>|t|)
1 0.2 b 0.05117174  0.01218642 0.3892210 -0.7864292 0.8108020  0.03   0.9753
2 0.4 b 0.05117174 -0.14186453 0.3835685 -0.9288821 0.6451531 -0.37   0.7144

Error d.f.= 27
Confidence intervals are 0.95 individual intervals
```

```r
contrast(h, k1)
```

```
   x1 g         x2    Contrast      S.E.      Lower     Upper     t Pr(>|t|)
1 0.2 b 0.05117174  0.01218642 0.3892210 -0.7864292 0.8108020  0.03   0.9753
2 0.4 b 0.05117174 -0.14186453 0.3835685 -0.9288821 0.6451531 -0.37   0.7144

Error d.f.= 27
Confidence intervals are 0.95 individual intervals
```

```r
contrast(f, k1, k2)
```

```
   x1         x2  Contrast      S.E.      Lower     Upper    t Pr(>|t|)
1 0.2 0.05117174 0.2945528 0.2229484 -0.1628996 0.7520051 1.32   0.1975
2 0.4 0.05117174 0.2305465 0.2028467 -0.1856605 0.6467535 1.14   0.2657

Error d.f.= 27
Confidence intervals are 0.95 individual intervals
```

```r
contrast(h, k1, k2)
```

```
   x1         x2  Contrast      S.E.      Lower     Upper    t Pr(>|t|)
1 0.2 0.05117174 0.2945528 0.2229484 -0.1628996 0.7520051 1.32   0.1975
2 0.4 0.05117174 0.2305465 0.2028467 -0.1856605 0.6467535 1.14   0.2657

Error d.f.= 27
Confidence intervals are 0.95 individual intervals
```
Linear Spline Fit Duplicating `lsp`
```r
f <- ols(y ~ lsp(x1, c(.2, .4)))

# Give custom names for columns
lspline <- function(x) {
  z <- cbind(x, x.2=pmax(x - .2, 0), x.4=pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:3
  z
}
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
```

```
     Intercept        x1     x1'      x1''
[1,]  1.051559 -3.999801 5.19094 -1.093075
[2,]  1.051559 -3.999801 5.19094 -1.093075
```
```r
ggplot(Predict(f))
ggplot(Predict(h))
```
```r
anova(f)
```

Analysis of Variance for y

| | d.f. | Partial SS | MS | F | P |
|---|---|---|---|---|---|
| x1 | 3 | 0.4889819 | 0.16299396 | 2.28 | 0.0962 |
| Nonlinear | 2 | 0.4808427 | 0.24042137 | 3.36 | 0.0459 |
| REGRESSION | 3 | 0.4889819 | 0.16299396 | 2.28 | 0.0962 |
| ERROR | 36 | 2.5763295 | 0.07156471 | | |

```r
anova(h)
```

Analysis of Variance for y

| | d.f. | Partial SS | MS | F | P |
|---|---|---|---|---|---|
| x1 | 3 | 0.4889819 | 0.16299396 | 2.28 | 0.0962 |
| Nonlinear | 2 | 0.4808427 | 0.24042137 | 3.36 | 0.0459 |
| REGRESSION | 3 | 0.4889819 | 0.16299396 | 2.28 | 0.0962 |
| ERROR | 36 | 2.5763295 | 0.07156471 | | |
Let’s use a different linear spline basis where the parameters refer to slopes and not to slope increments.
```r
curtail <- function(x, a, b=Inf) pmax(pmin(x, b) - a, 0)
lspline <- function(x) cbind(curtail(x, 0, 0.2), curtail(x, 0.2, 0.4), curtail(x, 0.4))
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
```

```
     Intercept        x1     x1'        x1''
[1,]  1.051559 -3.999801 5.19094 -1.09307531
[2,]  1.051559 -3.999801 1.19114  0.09806455
```

```r
ggplot(Predict(h))
```
More Exotic Linear Splines
Fit a straight line from `x1` = 0.1 onward, but force a flat relationship for `x1` in [0, 0.1]. First do it forcing continuity at `x1` = 0.1.
```r
h    <- ols(y ~ pmax(x1, 0.1))
xseq <- c(0, 0.099, 0.1, 0.101, seq(0.2, .8, by=0.1))
ggplot(Predict(h, x1=xseq))
```
Now allow discontinuity without a slope change.
```r
flin <- function(x) cbind(x < 0.1, x)
h  <- ols(y ~ gTrans(x1, flin))
yl <- c(-0.25, 1.25)
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```
Now have a discontinuity with a slope change.
```r
flin <- function(x) cbind(x < 0.1, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```
Discontinuity with a slope change and initial slope not restricted to be zero.
```r
flin <- function(x) cbind(x < 0.1, x, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```
Discontinuous linear spline with two knots:
```r
dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, pmax(x - .2, 0), pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)
ggplot(Predict(h, x1=c(.1, .199, .2, .201, .3, .4, 1)), ylim=yl)
```
Discontinuous linear spline with different basis functions:
```r
dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, curtail(x, 0.2, 0.4), curtail(x, 0.4))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)
```
Categorical Predictor
```r
# Try on a categorical predictor
gr <- function(x) cbind(bc=x %in% c('b','c'), d=x == 'd')
h  <- ols(y ~ gTrans(g, gr))
ggplot(Predict(h, g))
```
Different Basis Functions for Cubic Splines
Next let’s use a B-spline basis for a restricted cubic spline function (natural spline).
```r
require(splines)
f  <- ols(y ~ rcs(x1, 5))
yl <- c(0, 1.25)
ggplot(Predict(f), ylim=yl)
h <- ols(y ~ gTrans(x1, function(x) ns(x, df=4)))
ggplot(Predict(h), ylim=yl)
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
```

```
[1] 0.0354661
```
```r
# Now try to specify the same knots for ns as used for rcs
w  <- rcs(x1, 5)
kn <- attr(w, 'parms')
nspl <- function(x) ns(x, knots=kn[2:4], Boundary.knots=c(kn[1], kn[5]))
# Show the ns basis functions
i <- order(x1)
matplot(x1[i], nspl(x1)[i, ], type='l', xlab='x1', ylab='Natural Spline Bases')
h <- ols(y ~ gTrans(x1, nspl))
ggplot(Predict(h), ylim=yl)
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
```

```
[1] 1.301181e-15
```
Now use the simple truncated power basis for the natural spline, but orthogonalize it by using principal components on centered, SD-scaled data.
```r
# Use knots already computed and stored in kn above
# Compute PCs on rcs and store scaling, centering, and rotation objects
# Note that this is automatic if you specify options(rcspc=TRUE) although not certain that
# post-fitting predicted values will work as well as below
X <- rcspline.eval(x1, kn, inclx=TRUE)   # use low-level function that rcs calls
# Show usual basis functions
matplot(x1[i], X[i, ], type='l', xlab='x1', ylab='rcs Bases')
p <- prcomp(X, scale=TRUE)
matplot(x1[i], p$x[i, ], type='l', xlab='x1', ylab='PC Rotated rcs Bases')
prcs <- function(x, rotation, center, scale, knots) {
  # deparse inside gTrans does not preserve matrix arguments
  rotation <- matrix(rotation, nrow=length(center))
  X <- rcspline.eval(x, knots, inclx=TRUE)
  scale(X, center, scale) %*% rotation
}
# Define parameters inside new function so the function will be self contained
formals(prcs) <- list(x=NULL, rotation=p$rotation, center=p$center, scale=p$scale, knots=kn)
# Show fit with usual basis
f <- ols(y ~ rcs(x1, kn))
ggplot(Predict(f))
# Show fit with rotated basis having no collinearities
h <- ols(y ~ gTrans(x1, prcs))
ggplot(Predict(h))
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
```

```
[1] 2.974648e-14
```
QR-rotations of truncated power basis functions:
```r
# Use the Hmisc qrxcenter function to use the QR decomposition on X
w <- qrxcenter(X)    # also use inside gTrans as in other examples
round(w$R, 2)        # Transformation matrix; 1st column = linear = 3.83*x1
```

```
     [,1]  [,2]   [,3]    [,4]
[1,] 3.83 -7.11   9.96   -9.12
[2,] 0.00  8.19 -93.56  228.61
[3,] 0.00  0.00 142.17 -524.49
[4,] 0.00  0.00   0.00  451.46
```

```r
Xt <- w$x
round(cor(Xt), 3)
```

```
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
```

```r
# Compare with high collinearities in the original bases
round(cor(X), 3)
```

```
      x
x 1.000 0.881 0.851 0.794
  0.881 1.000 0.997 0.977
  0.851 0.997 1.000 0.990
  0.794 0.977 0.990 1.000
```

```r
matplot(x1[i], Xt[i, ], type='l', xlab='x1', ylab='QR Rotated rcs Bases')
```
Non-spline Basis
The `gTrans` framework easily allows for non-spline basis functions for modeling continuous predictors. For example, fractional polynomials can be used directly. Periodic functions may also be used if the cycle length is known. Using the data simulated above, here is an example of adding a short-cycle harmonic series to a linear spline. Here the cycle length is taken to be 0.2. By adding cosine terms one does not need to know the phase angle. A nice resource on adding cyclic trends to regression by Steve Simon is here.
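As a quick illustration of the fractional-polynomial possibility before turning to the harmonic example (a sketch only; the powers (1, ½) and the function name `fp2` are chosen arbitrarily for illustration, and `gTrans` itself provides no fractional-polynomial helper):

```r
# Hypothetical degree-2 fractional polynomial basis with powers (1, 0.5).
# Column 1 is the linear term; column 2 is flagged nonlinear for anova().
fp2 <- function(x) {
  z <- cbind(x, sqrtx = sqrt(x))
  attr(z, 'nonlinear') <- 2
  z
}
# A fit would then look like: h <- ols(y ~ gTrans(x1, fp2))
```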
To render the fitted equation in math notation using the `latex` method in `rms` requires one to specify a function that creates LaTeX notation for the columns of the matrix produced by `gTrans`. The `sprintf` function is most handy for this, as shown below.
```r
# Define a function that will be used by the latex method to customize typesetting of hrm model components
# Argument x will contain the variable's base name (here x1, but in LaTeX notation x_{1} for the subscript)
# tex works only in rms version 6.2-1 and higher
texhrm <- function(x) sprintf(c('%s', '(%s - 0.5)_{+}',
                                '\\sin(2\\pi \\frac{%s}{0.2})', '\\cos(2\\pi \\frac{%s}{0.2})'), x)
hrm <- function(x) {
  z <- cbind(x, slopeChange=pmax(x - 0.5, 0), sin=sin(2*pi*x/0.2), cos=cos(2*pi*x/0.2))
  attr(z, 'nonlinear') <- 2:4
  attr(z, 'tex') <- texhrm
  z
}
h <- ols(y ~ gTrans(x1, hrm))
h
```
Linear Regression Model

```
ols(formula = y ~ gTrans(x1, hrm))
```

| | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs 40 | LR χ2 2.45 | R2 0.059 |
| σ 0.2870 | d.f. 4 | R2adj -0.048 |
| d.f. 35 | Pr(>χ2) 0.6532 | g 0.077 |

Residuals

```
     Min       1Q   Median       3Q      Max
-0.43501 -0.19915 -0.04817  0.22900  0.56842
```

| | β | S.E. | t | Pr(>\|t\|) |
|---|---|---|---|---|
| Intercept | 0.7037 | 0.1536 | 4.58 | <0.0001 |
| x | -0.5302 | 0.4000 | -1.33 | 0.1936 |
| slopeChange | 0.8325 | 0.6596 | 1.26 | 0.2153 |
| sin | -0.0238 | 0.0623 | -0.38 | 0.7048 |
| cos | 0.0407 | 0.0728 | 0.56 | 0.5800 |
```r
latex(h)
ggplot(Predict(h)) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```
Computing Environment
```
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.7

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ggplot2_3.5.1 rms_6.9-0     Hmisc_5.2-1
```

To cite R in publications use:
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
To cite the Hmisc package in publications use: Harrell Jr F (2024). Hmisc: Harrell Miscellaneous. R package version 5.2-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use: Harrell Jr FE (2024). rms: Regression Modeling Strategies. R package version 6.9-0, https://github.com/harrelfe/rms, https://hbiostat.org/R/rms/.

To cite the ggplot2 package in publications use: Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.