General Multi-parameter Transformations in rms

Department of Biostatistics
Vanderbilt University School of Medicine

November 19, 2024

As of version 6.2-0 of the rms package, which appeared on CRAN on 2021-03-18, there is a new transformation function gTrans, joining lsp, rcs, pol, etc. gTrans takes a numeric, character, or factor vector as its first argument and a function as its second argument. The latter can be any function that creates a numeric matrix. Give the columns of the matrix names if you don’t want to use default names. To take advantage of automatic tests of nonlinearity when running rms::anova.rms (e.g., by issuing anova(fit)), give the returned matrix an attribute nonlinear: a vector of integers specifying which columns correspond to nonlinear terms. Columns default to being treated as linear terms. If you want the matrix you generate to include a linear term, it is up to you to add a column containing the original (numeric) vector.

gTrans is quite general, as the examples below illustrate.

Examples where you might want to fit a discontinuous function include interrupted time series and allowing for a special effect at zero. As an example of the latter, suppose that a predictor is the number of days a patient has been in the hospital in the past year. Zero has a special meaning, so one may wish to allow for a spike at zero by using gTrans(x, function(x) cbind(x == 0, sqrt(x), x)). This fits the following model:

\[Y = \beta_{0} + \beta_{1}[x = 0] + \beta_{2}\sqrt{x} + \beta_{3}x\] The reason to use gTrans instead of creating a separate model term for the discontinuity is that all of the predictor’s basis functions remain connected, so that anova, summary, Predict, and contrast treat them as a single predictor.
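Here is a hedged sketch of fitting this spike-at-zero model. The names days, y2, and spike0 are illustrative (not from the rms documentation), and the data are simulated only to make the example self-contained.

```r
require(rms)
set.seed(2)
n    <- 200
days <- ifelse(runif(n) < 0.3, 0, round(runif(n, 1, 365)))  # 30% never admitted
y2   <- 0.5 * (days == 0) + 0.02 * sqrt(days) + rnorm(n)    # simulated outcome
spike0 <- function(x) {
  z <- cbind(zero = x == 0, sqrtx = sqrt(x), x)  # discontinuity, sqrt, linear
  attr(z, 'nonlinear') <- 1:2                    # columns 1-2 are nonlinear terms
  z
}
dd2 <- datadist(days); options(datadist = 'dd2')
f0  <- ols(y2 ~ gTrans(days, spike0))
# anova(f0) pools the two nonlinear columns into one test because of the
# 'nonlinear' attribute; Predict(f0) plots the whole (connected) function
```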

Here are some examples with graphical output. Start by simulating some random data.

require(rms)
require(ggplot2)
# knitrSet(lang='markdown', fig.path='png/')
options(prType='html')
n <- 40
set.seed(1)
y  <- runif(n)
x1 <- runif(n)
x2 <- rnorm(n)
g  <- sample(letters[1:4], n, TRUE)
# We usually plot predicted values from the 10th smallest to the 10th largest observations
# Instead for demonstrating fits we plot over the whole range
dd <- datadist(x1, x2, g, q.display=c(0,1)); options(datadist='dd')

Quadratic Fit Duplicating pol(x, 2)

f <- ols(y ~ pol(x1, 2) * g + x2)
pol2 <- function(x) {
  # Give a custom name xsq for one column
  z <- cbind(x, xsq=x^2)
  attr(z, 'nonlinear') <- 2
  z
}
h <- ols(y ~ gTrans(x1, pol2) * g + x2)
specs(h, long=TRUE)
ols(formula = y ~ gTrans(x1, pol2) * g + x2)

       Assumption  Parameters                 d.f.
x1     gTrans                                 2   
g      category     a b c d                   3   
x2     asis                                   1   
x1 * g interaction nonlinear x linear - f(A)B 6   

                       x1    g          x2
Low:effect      0.3377484 <NA> -0.59514731
Adjust to       0.4985898    c  0.05117174
High:effect     0.7787237 <NA>  0.57997131
Low:prediction  0.0233312    a -1.80495863
High:prediction 0.9606180    d  2.40161776
Low             0.0233312    a -1.80495863
High            0.9606180    d  2.40161776
rbind(coef(f), coef(h))
     Intercept        x1     x1^2      g=b       g=c        g=d         x2
[1,] 0.7901302 -1.227246 1.002612 0.250034 -1.225043 -0.0724312 0.02494889
[2,] 0.7901302 -1.227246 1.002612 0.250034 -1.225043 -0.0724312 0.02494889
       x1 * g=b x1^2 * g=b x1 * g=c x1^2 * g=c   x1 * g=d x1^2 * g=d
[1,] -0.1810583 0.06080463  4.38774  -3.491963 -0.1317308  0.5119778
[2,] -0.1810583 0.06080463  4.38774  -3.491963 -0.1317308  0.5119778
anova(f)
Analysis of Variance for y
                                         d.f. Partial SS MS         F    P
x1 (Factor+Higher Order Factors)          8   0.4691615  0.05864518 0.69 0.6936
 All Interactions                         6   0.2900861  0.04834768 0.57 0.7489
 Nonlinear (Factor+Higher Order Factors)  4   0.2503327  0.06258318 0.74 0.5727
g (Factor+Higher Order Factors)           9   0.5117214  0.05685793 0.67 0.7260
 All Interactions                         6   0.2900861  0.04834768 0.57 0.7489
x2                                        1   0.0125370  0.01253700 0.15 0.7031
x1 × g (Factor+Higher Order Factors)      6   0.2900861  0.04834768 0.57 0.7489
 Nonlinear                                3   0.1705474  0.05684914 0.67 0.5763
 Nonlinear Interaction : f(A,B) vs. AB    3   0.1705474  0.05684914 0.67 0.5763
TOTAL NONLINEAR                           4   0.2503327  0.06258318 0.74 0.5727
TOTAL NONLINEAR + INTERACTION             7   0.4666141  0.06665916 0.79 0.6029
REGRESSION                               12   0.7834533  0.06528777 0.77 0.6724
ERROR                                    27   2.2818581  0.08451326
anova(h)
Analysis of Variance for y
                                         d.f. Partial SS MS         F    P
x1 (Factor+Higher Order Factors)          8   0.4691615  0.05864518 0.69 0.6936
 All Interactions                         6   0.2900861  0.04834768 0.57 0.7489
 Nonlinear (Factor+Higher Order Factors)  4   0.2503327  0.06258318 0.74 0.5727
g (Factor+Higher Order Factors)           9   0.5117214  0.05685793 0.67 0.7260
 All Interactions                         6   0.2900861  0.04834768 0.57 0.7489
x2                                        1   0.0125370  0.01253700 0.15 0.7031
x1 × g (Factor+Higher Order Factors)      6   0.2900861  0.04834768 0.57 0.7489
 Nonlinear                                3   0.1705474  0.05684914 0.67 0.5763
 Nonlinear Interaction : f(A,B) vs. AB    3   0.1705474  0.05684914 0.67 0.5763
TOTAL NONLINEAR                           4   0.2503327  0.06258318 0.74 0.5727
TOTAL NONLINEAR + INTERACTION             7   0.4666141  0.06665916 0.79 0.6029
REGRESSION                               12   0.7834533  0.06528777 0.77 0.6724
ERROR                                    27   2.2818581  0.08451326
summary(f)
Effects   Response: y
           Low     High   Δ     Effect   S.E.    Lower 0.95 Upper 0.95
x1          0.3377 0.7787 0.441  0.16810 0.16690 -0.1743     0.5105
x2         -0.5951 0.5800 1.175  0.02932 0.07612 -0.1269     0.1855
g --- a:c   3.0000 1.0000       -0.09457 0.19360 -0.4919     0.3028
g --- b:c   3.0000 2.0000        0.08031 0.20190 -0.3339     0.4945
g --- d:c   3.0000 4.0000       -0.10540 0.20090 -0.5177     0.3069
summary(h)
Effects   Response: y
           Low     High   Δ     Effect   S.E.    Lower 0.95 Upper 0.95
x1          0.3377 0.7787 0.441  0.16810 0.16690 -0.1743     0.5105
x2         -0.5951 0.5800 1.175  0.02932 0.07612 -0.1269     0.1855
g --- a:c   3.0000 1.0000       -0.09457 0.19360 -0.4919     0.3028
g --- b:c   3.0000 2.0000        0.08031 0.20190 -0.3339     0.4945
g --- d:c   3.0000 4.0000       -0.10540 0.20090 -0.5177     0.3069
ggplot(Predict(f))

ggplot(Predict(h))

k1 <- list(x1=c(.2, .4), g='b')
k2 <- list(x1=c(.2, .4), g='d')
contrast(f, k1)
   x1 g         x2    Contrast      S.E.      Lower     Upper     t Pr(>|t|)
1 0.2 b 0.05117174  0.01218642 0.3892210 -0.7864292 0.8108020  0.03   0.9753
2 0.4 b 0.05117174 -0.14186453 0.3835685 -0.9288821 0.6451531 -0.37   0.7144

Error d.f.= 27 

Confidence intervals are 0.95 individual intervals
contrast(h, k1)
   x1 g         x2    Contrast      S.E.      Lower     Upper     t Pr(>|t|)
1 0.2 b 0.05117174  0.01218642 0.3892210 -0.7864292 0.8108020  0.03   0.9753
2 0.4 b 0.05117174 -0.14186453 0.3835685 -0.9288821 0.6451531 -0.37   0.7144

Error d.f.= 27 

Confidence intervals are 0.95 individual intervals
contrast(f, k1, k2)
   x1         x2  Contrast      S.E.      Lower     Upper    t Pr(>|t|)
1 0.2 0.05117174 0.2945528 0.2229484 -0.1628996 0.7520051 1.32   0.1975
2 0.4 0.05117174 0.2305465 0.2028467 -0.1856605 0.6467535 1.14   0.2657

Error d.f.= 27 

Confidence intervals are 0.95 individual intervals
contrast(h, k1, k2)
   x1         x2  Contrast      S.E.      Lower     Upper    t Pr(>|t|)
1 0.2 0.05117174 0.2945528 0.2229484 -0.1628996 0.7520051 1.32   0.1975
2 0.4 0.05117174 0.2305465 0.2028467 -0.1856605 0.6467535 1.14   0.2657

Error d.f.= 27 

Confidence intervals are 0.95 individual intervals

Linear Spline Fit Duplicating lsp

f <- ols(y ~ lsp(x1, c(.2, .4)))
# Give custom names for columns
lspline <- function(x) {
  z <- cbind(x, x.2=pmax(x - .2, 0), x.4=pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:3
  z
}
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
     Intercept        x1     x1'      x1''
[1,]  1.051559 -3.999801 5.19094 -1.093075
[2,]  1.051559 -3.999801 5.19094 -1.093075
ggplot(Predict(f))

ggplot(Predict(h))

anova(f)
Analysis of Variance for y
            d.f. Partial SS MS         F    P
x1           3   0.4889819  0.16299396 2.28 0.0962
 Nonlinear   2   0.4808427  0.24042137 3.36 0.0459
REGRESSION   3   0.4889819  0.16299396 2.28 0.0962
ERROR       36   2.5763295  0.07156471
anova(h)
Analysis of Variance for y
            d.f. Partial SS MS         F    P
x1           3   0.4889819  0.16299396 2.28 0.0962
 Nonlinear   2   0.4808427  0.24042137 3.36 0.0459
REGRESSION   3   0.4889819  0.16299396 2.28 0.0962
ERROR       36   2.5763295  0.07156471

Let’s use a different linear spline basis where the parameters refer to slopes and not to slope increments.

# curtail(x, a, b) rises with unit slope between a and b and is flat outside
# that interval, so each basis column's coefficient is the slope within its own interval
curtail <- function(x, a, b=Inf) pmax(pmin(x, b) - a, 0)
lspline <- function(x) cbind(curtail(x, 0, 0.2), curtail(x, 0.2, 0.4), curtail(x, 0.4))
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
     Intercept        x1     x1'        x1''
[1,]  1.051559 -3.999801 5.19094 -1.09307531
[2,]  1.051559 -3.999801 1.19114  0.09806455
ggplot(Predict(h))

More Exotic Linear Splines

Fit a straight line from x1=0.1 on, but force a flat relationship for x1 in [0, 0.1]. First do it forcing continuity at x1=0.1.

h <- ols(y ~ pmax(x1, 0.1))
xseq <- c(0, 0.099, 0.1, 0.101, seq(0.2, 0.8, by=0.1))
ggplot(Predict(h, x1=xseq))

Now allow discontinuity without a slope change.

flin <- function(x) cbind(x < 0.1, x)
h <- ols(y ~ gTrans(x1, flin))
yl <- c(-0.25, 1.25)
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))

Now have a discontinuity with a slope change.

flin <- function(x) cbind(x < 0.1, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))

Discontinuity with a slope change and initial slope not restricted to be zero.

flin <- function(x) cbind(x < 0.1, x, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))

Discontinuous linear spline with two knots:

dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, pmax(x - .2, 0), pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)

ggplot(Predict(h, x1=c(.1, .199, .2, .201, .3, .4, 1)), ylim=yl)

Discontinuous linear spline with different basis functions:

dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, curtail(x, 0.2, 0.4), curtail(x, 0.4))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)

Categorical Predictor

# Try on a categorical predictor
gr <- function(x) cbind(bc=x %in% c('b','c'), d=x == 'd')
h <- ols(y ~ gTrans(g, gr))
ggplot(Predict(h, g))

Different Basis Functions for Cubic Splines

Next let’s use a B-spline basis for a restricted cubic spline function (natural spline).

require(splines)
f <- ols(y ~ rcs(x1, 5))
yl <- c(0, 1.25)
ggplot(Predict(f), ylim=yl)

h <- ols(y ~ gTrans(x1, function(x) ns(x, df=4)))
ggplot(Predict(h), ylim=yl)

mean(abs(Predict(f)$yhat - Predict(h)$yhat))
[1] 0.0354661
# Now try to specify the same knots for ns as used for rcs
w <- rcs(x1, 5)
kn <- attr(w, 'parms')
nspl <- function(x) ns(x, knots=kn[2:4], Boundary.knots=c(kn[1], kn[5]))
# Show the ns basis functions
i <- order(x1)
matplot(x1[i], nspl(x1)[i, ], type='l', xlab='x1', ylab='Natural Spline Bases')

h <- ols(y ~ gTrans(x1, nspl))
ggplot(Predict(h), ylim=yl)

mean(abs(Predict(f)$yhat - Predict(h)$yhat))
[1] 1.301181e-15

Now use the simple truncated power basis for the natural spline, but orthogonalize it by using principal components on centered, SD-scaled data.

# Use knots already computed and stored in kn above
# Compute PCs on rcs and store scaling, centering, and rotation objects
# Note: this is automatic if you specify options(rcspc=TRUE), although it is not
# certain that post-fitting predicted values will work as well as the explicit approach below
X <- rcspline.eval(x1, kn, inclx=TRUE)  # use low-level function that rcs calls
# Show usual basis functions
matplot(x1[i], X[i, ], type='l', xlab='x1', ylab='rcs Bases')

p <- prcomp(X, scale=TRUE)
matplot(x1[i], p$x[i, ], type='l', xlab='x1', ylab='PC Rotated rcs Bases')

prcs <- function(x, rotation, center, scale, knots) {
  # deparse inside gTrans does not preserve matrix arguments
  rotation <- matrix(rotation, nrow=length(center))
  X <- rcspline.eval(x, knots, inclx=TRUE)
  scale(X, center, scale) %*% rotation
}
# Define parameters inside new function so the function will be self contained
formals(prcs) <- list(x=NULL, rotation=p$rotation, center=p$center, scale=p$scale, knots=kn)
# Show fit with usual basis
f <- ols(y ~ rcs(x1, kn))
ggplot(Predict(f))

# Show fit with rotated basis having no collinearities
h <- ols(y ~ gTrans(x1, prcs))
ggplot(Predict(h))

mean(abs(Predict(f)$yhat - Predict(h)$yhat))
[1] 2.974648e-14

QR-rotations of truncated power basis functions:

# Use the Hmisc qrxcenter function to use the QR decomposition on X
w  <- qrxcenter(X)   # also use inside gTrans as in other examples
round(w$R,  2) # Transformation matrix; 1st column = linear = 3.83*x1
     [,1]  [,2]   [,3]    [,4]
[1,] 3.83 -7.11   9.96   -9.12
[2,] 0.00  8.19 -93.56  228.61
[3,] 0.00  0.00 142.17 -524.49
[4,] 0.00  0.00   0.00  451.46
Xt <- w$x
round(cor(Xt), 3)
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
# Compare with high collinearities in the original bases
round(cor(X), 3)
      x                  
x 1.000 0.881 0.851 0.794
  0.881 1.000 0.997 0.977
  0.851 0.997 1.000 0.990
  0.794 0.977 0.990 1.000
matplot(x1[i], Xt[i, ], type='l', xlab='x1', ylab='QR Rotated rcs Bases')

Non-spline Basis

The gTrans framework easily allows for non-spline basis functions for modeling continuous predictors. For example, fractional polynomials can be used directly. Periodic functions may also be used if the cycle length is known. Using the data simulated above, here is an example of adding a short-cycle harmonic series to a linear spline, with the cycle length taken to be 0.2. By adding cosine terms one does not need to know the phase angle. Steve Simon has written a nice resource on adding cyclic trends to regression models.
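The reason cosine terms remove the need to know the phase angle is the angle-addition identity: for cycle length \(L\),

\[\alpha \sin\left(\frac{2\pi x}{L} + \phi\right) = \underbrace{\alpha\cos\phi}_{\beta_{1}}\,\sin\left(\frac{2\pi x}{L}\right) + \underbrace{\alpha\sin\phi}_{\beta_{2}}\,\cos\left(\frac{2\pi x}{L}\right)\]

so estimating free coefficients \(\beta_{1}\) and \(\beta_{2}\) on the sine and cosine columns is equivalent to estimating the amplitude \(\alpha = \sqrt{\beta_{1}^{2} + \beta_{2}^{2}}\) and the phase \(\phi = \tan^{-1}(\beta_{2}/\beta_{1})\).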

To render the fitted equation in math notation using the latex method in rms requires one to specify a function that creates LaTeX notation for the columns of the matrix produced by gTrans. The sprintf function is most handy for this, as shown below.

# Define a function that will be used by the latex method to customize typesetting of hrm model components
# Argument x will contain the variable's base name (here x1 but in LaTeX notation x_{1} for the subscript)
# tex works only in rms version 6.2-1 and higher
texhrm <- function(x) sprintf(c('%s', '(%s - 0.5)_{+}',
                                '\\sin(2\\pi \\frac{%s}{0.2})', '\\cos(2\\pi \\frac{%s}{0.2})'), x)
hrm <- function(x) {
  z <- cbind(x, slopeChange=pmax(x - 0.5, 0), sin=sin(2*pi*x/0.2), cos=cos(2*pi*x/0.2))
  attr(z, 'nonlinear') <- 2:4
  attr(z, 'tex')       <- texhrm
  z
}
h <- ols(y ~ gTrans(x1, hrm))
h

Linear Regression Model

ols(formula = y ~ gTrans(x1, hrm))
            Model Likelihood     Discrimination
            Ratio Test           Indexes
Obs     40  LR χ2       2.45     R2       0.059
σ   0.2870  d.f.           4     R2 adj  -0.048
d.f.    35  Pr(>χ2)   0.6532     g        0.077

Residuals

     Min       1Q   Median       3Q      Max 
-0.43501 -0.19915 -0.04817  0.22900  0.56842 
             β       S.E.   t     Pr(>|t|)
Intercept     0.7037 0.1536  4.58 <0.0001
x            -0.5302 0.4000 -1.33  0.1936
slopeChange   0.8325 0.6596  1.26  0.2153
sin          -0.0238 0.0623 -0.38  0.7048
cos           0.0407 0.0728  0.56  0.5800
latex(h)
\[\mathrm{E}(\mathrm{y}) = X\beta,~~\mathrm{where}\] \[\begin{array} \lefteqn{X\hat{\beta}=}\\ & & 0.7036882 -0.5301862 \:\mathrm{x_{1}}+0.8324556 (\mathrm{x_{1}} - 0.5)_{+}-0.02380026\sin(2\pi \frac{\mathrm{x_{1}}}{0.2})+0.04066578 \cos(2\pi \frac{\mathrm{x_{1}}}{0.2}) \\ \end{array}\]
ggplot(Predict(h)) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y))

Computing Environment

 R version 4.4.0 (2024-04-24)
 Platform: aarch64-apple-darwin20
 Running under: macOS Sonoma 14.7
 
 Matrix products: default
 BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
 LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
 
 attached base packages:
 [1] splines   stats     graphics  grDevices utils     datasets  methods  
 [8] base     
 
 other attached packages:
 [1] ggplot2_3.5.1 rms_6.9-0     Hmisc_5.2-1  
 
To cite R in publications use:

R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

To cite the Hmisc package in publications use:

Harrell Jr F (2024). Hmisc: Harrell Miscellaneous. R package version 5.2-1, https://hbiostat.org/R/Hmisc/.

To cite the rms package in publications use:

Harrell Jr FE (2024). rms: Regression Modeling Strategies. R package version 6.9-0, https://github.com/harrelfe/rms, https://hbiostat.org/R/rms/.

To cite the ggplot2 package in publications use:

Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.