--- title: "General Multi-parameter Transformations in `rms`" author: - name: Frank Harrell url: https://hbiostat.org affiliation: Department of Biostatistics
Vanderbilt University School of Medicine author-title: "" affiliation-title: "" published-title: "" date: last-modified format: html: embed-resources: true anchor-sections: true code-tools: true code-fold: false fig-width: 6 fig-height: 4 code-block-bg: "#f1f3f5" code-block-border-left: "#31BAE9" mainfont: Source Sans Pro theme: journal toc: true toc-depth: 3 toc-location: left captions: true cap-location: margin table-captions: true tbl-cap-location: margin reference-location: margin comments: hypothesis: false execute: warning: false message: false --- As of version 6.2-0 of the `rms` package appearing in CRAN on 2021-03-18, there is a new transformation function `gTrans`, joining `lsp`, `rcs`, `pol`, etc. `gTrans` takes a numeric, character, or factor vector as its first argument and a function as its second argument. The latter can be any function that creates a numeric matrix. Give the columns of the matrix names if you don't want to use default names. If you want to take advantage of automatic tests of nonlinearity when running `rms::anova.rms` (e.g., by issuing `anova(fit)`), give the returned matrix an attribute `nonlinear` which is a vector of integers specifying which columns correpond to nonlinear terms. Columns default to being considered as linear terms. If you want the matrix you generate to have a linear term it is up to you to add a column containing the original (numeric) vector. `gTrans` is quite general. With it you can do things such as the following: * use a linear spline basis different from the one `lsp` uses * use a cubic spline basis different from the one `rcs` uses * use a quadratic spline * specify a linear, quadratic, or spline function with discontinuities * implement non-spline bases such as fractional polynomials * add cyclic trends to the regression * pool certain categories of a categorical predictor Examples where you might want to fit a discontinuous function are interrupted time series and allowing for special effects at zero. As an example of the latter, suppose that a predictor is the number of days that a patient has been in the hospital in the past year. Zero has a special meaning, so one may wish to allow for a spike at zero by using `gTrans(x, function(x) cbind(x == 0, sqrt(x), x))`. This fits the following model: $$Y = \beta_{0} + \beta_{1}[x = 0] + \beta_{2}\sqrt{x} + \beta_{3}x$$ The reason to use `gTrans` instead of creating a separate model term for the discontinuity is that all the predictor's basis functions remain connected. This is important for several reasons, including the following. * When studying the effect of varying $x$, the terms will be forced to vary together. This will force, for example, $\beta_1$ to be ignored when $x > 0$ where one is studying the effect of $x$ on the continuous part of the model. This applies to partial effect plots, odds ratio charts, nomograms, etc. * When using `anova(fit)` to do a "chunk" test for whether $x$ is associated with $Y$, the 1 d.f. for the discontinuous term will be added to the 2 d.f. for $\beta_2$ and $\beta_3$ to get the correct 3 d.f. test for flatness (no association). * `anova` can also provide chunk tests of nonlinearity when you keep the terms together. Here are some examples with graphical output. Start by simulating some random data. ```{r setup} require(rms) require(ggplot2) # knitrSet(lang='markdown', fig.path='png/') options(prType='html') ``` ```{r gen} n <- 40 set.seed(1) y <- runif(n) x1 <- runif(n) x2 <- rnorm(n) g <- sample(letters[1:4], n, TRUE) # We usually plot predicted values from the 10th smallest to the 10th largest observations # Instead for demonstrating fits we plot over the whole range dd <- datadist(x1, x2, g, q.display=c(0,1)); options(datadist='dd') ``` # Quadratic Fit Duplicating `pol(x, 2)` ```{r quad} f <- ols(y ~ pol(x1, 2) * g + x2) pol2 <- function(x) { # Give a custom name xsq for one column z <- cbind(x, xsq=x^2) attr(z, 'nonlinear') <- 2 z } h <- ols(y ~ gTrans(x1, pol2) * g + x2) specs(h, long=TRUE) rbind(coef(f), coef(h)) ``` ```{r quadh,results='asis'} anova(f) anova(h) summary(f) summary(h) ggplot(Predict(f)) ggplot(Predict(h)) ``` ```{r contr} k1 <- list(x1=c(.2, .4), g='b') k2 <- list(x1=c(.2, .4), g='d') contrast(f, k1) contrast(h, k1) contrast(f, k1, k2) contrast(h, k1, k2) ``` # Linear Spline Fit Duplicating `lsp` ```{r lsp} f <- ols(y ~ lsp(x1, c(.2, .4))) # Give custom names for columns lspline <- function(x) { z <- cbind(x, x.2=pmax(x - .2, 0), x.4=pmax(x - .4, 0)) attr(z, 'nonlinear') <- 2:3 z } h <- ols(y ~ gTrans(x1, lspline)) rbind(coef(f), coef(h)) ggplot(Predict(f)) ggplot(Predict(h)) ``` ```{r lspa,results='asis'} anova(f) anova(h) ``` Let's use a different linear spline basis where the parameters refer to slopes and not to slope increments. ```{r lsp2} curtail <- function(x, a, b=Inf) pmax(pmin(x, b) - a, 0) lspline <- function(x) cbind(curtail(x, 0, 0.2), curtail(x, 0.2, 0.4), curtail(x, 0.4)) h <- ols(y ~ gTrans(x1, lspline)) rbind(coef(f), coef(h)) ggplot(Predict(h)) ``` # More Exotic Linear Splines Fit a straight line from x1=0.1 on, but force a flat relationship for x1 in [0, 0.1]. First do it forcing continuity at x1=0.1 ```{r flin} h <- ols(y ~ pmax(x1, 0.1)) xseq <- c(0, 0.099, 1, 0.101, seq(0.2, .8, by=0.1)) ggplot(Predict(h, x1=xseq)) ``` Now allow discontinuity without a slope change. ```{r flin2} flin <- function(x) cbind(x < 0.1, x) h <- ols(y ~ gTrans(x1, flin)) yl <- c(-0.25, 1.25) ggplot(Predict(h, x1=xseq), ylim=yl) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y)) ``` Now have a discontinuity with a slope change. ```{r flin3} flin <- function(x) cbind(x < 0.1, pmax(x - 0.1, 0)) h <- ols(y ~ gTrans(x1, flin)) ggplot(Predict(h, x1=xseq), ylim=yl) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y)) ``` Discontinuity with a slope change and initial slope not restricted to be zero. ```{r flin3b} flin <- function(x) cbind(x < 0.1, x, pmax(x - 0.1, 0)) h <- ols(y ~ gTrans(x1, flin)) ggplot(Predict(h, x1=xseq), ylim=yl) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y)) ``` Discontinuous linear spline with two knots: ```{r flin4} dlsp <- function(x) { z <- cbind(x, x >= 0.2, pmax(x - .2, 0), pmax(x - .4, 0)) attr(z, 'nonlinear') <- 2:4 z } h <- ols(y ~ gTrans(x1, dlsp)) ggplot(Predict(h), ylim=yl) ggplot(Predict(h, x1=c(.1, .199, .2, .201, .3, .4, 1)), ylim=yl) ``` Discontinuous linear spline with different basis functions: ```{r flin5} dlsp <- function(x) { z <- cbind(x, x >= 0.2, curtail(x, 0.2, 0.4), curtail(x, 0.4)) attr(z, 'nonlinear') <- 2:4 z } h <- ols(y ~ gTrans(x1, dlsp)) ggplot(Predict(h), ylim=yl) ``` # Categorical Predictor ```{r cat} # Try on a categorical predictor gr <- function(x) cbind(bc=x %in% c('b','c'), d=x == 'd') h <- ols(y ~ gTrans(g, gr)) ggplot(Predict(h, g)) ``` # Different Basis Functions for Cubic Splines {#sec-splinebasis} Next let's use a B-spline basis for a restricted cubic spline function (natural spline). ```{r rcs} require(splines) f <- ols(y ~ rcs(x1, 5)) yl <- c(0, 1.25) ggplot(Predict(f), ylim=yl) h <- ols(y ~ gTrans(x1, function(x) ns(x, df=4))) ggplot(Predict(h), ylim=yl) mean(abs(Predict(f)$yhat - Predict(h)$yhat)) # Now try to specify the same knots for ns as used for rcs w <- rcs(x1, 5) kn <- attr(w, 'parms') nspl <- function(x) ns(x, knots=kn[2:4], Boundary.knots=c(kn[1], kn[5])) # Show the ns basis functions i <- order(x1) matplot(x1[i], nspl(x1)[i, ], type='l', xlab='x1', ylab='Natural Spline Bases') h <- ols(y ~ gTrans(x1, nspl)) ggplot(Predict(h), ylim=yl) mean(abs(Predict(f)$yhat - Predict(h)$yhat)) ``` Now use the simple truncated power basis for the natural spline, but orthogonalize it by using principal components on centered, SD-scaled data. ```{r pc} # Use knots already computed and stored in kn above # Compute PCs on rcs and store scaling, centering, and rotation objects # Note that this is automatic if you specify options(rcspc=TRUE) although not certain that # post-fitting predicted values will work as well as below X <- rcspline.eval(x1, kn, inclx=TRUE) # use low-level function that rcs calls # Show usual basis functions matplot(x1[i], X[i, ], type='l', xlab='x1', ylab='rcs Bases') p <- prcomp(X, scale=TRUE) matplot(x1[i], p$x[i, ], type='l', xlab='x1', ylab='PC Rotated rcs Bases') prcs <- function(x, rotation, center, scale, knots) { # deparse inside gTrans does not preserve matrix arguments rotation <- matrix(rotation, nrow=length(center)) X <- rcspline.eval(x, knots, inclx=TRUE) scale(X, center, scale) %*% rotation } # Define parameters inside new function so the function will be self contained formals(prcs) <- list(x=NULL, rotation=p$rotation, center=p$center, scale=p$scale, knots=kn) # Show fit with usual basis f <- ols(y ~ rcs(x1, kn)) ggplot(Predict(f)) # Show fit with rotated basis having no collinearities h <- ols(y ~ gTrans(x1, prcs)) ggplot(Predict(h)) mean(abs(Predict(f)$yhat - Predict(h)$yhat)) ``` QR-rotations of truncated power basis functions: ```{r} # Use the Hmisc qrxcenter function to use the QR decomposition on X w <- qrxcenter(X) # also use inside gTrans as in other examples round(w$R, 2) # Transformation matrix; 1st column = linear = 3.83*x1 Xt <- w$x round(cor(Xt), 3) # Compare with high collinearities in the original bases round(cor(X), 3) matplot(x1[i], Xt[i, ], type='l', xlab='x1', ylab='QR Rotated rcs Bases') ``` # Non-spline Basis The `gTrans` framework easily allows for non-spline basis functions for modeling continuous predictors. For example, [fractional polynomials](https://mfp.imbi.uni-freiburg.de/fp) can be used directly. Periodic functions may also be used if the cycle length is known. Using the data simulated above, here is an example of adding a short-cycle harmonic series to a linear spline. Here the cycle length is taken to be 0.2. By adding cosine terms one does not need to know the phase angle. A nice resource on adding cyclic trends to regression by Steve Simon is [here](http://www.pmean.com/07/CyclicalTrends.html). To render the fitted equation in math notation using the `latex` method in `rms` requires one to specify a function that creates LaTeX notation for the columns of the matrix produced by `gTrans`. The `sprintf` function is most handy for this, as shown below. ```{r harmon,results='asis'} # Define a function that will be used by the latex method to customize typesetting of hrm model components # Argument x will contain the variable's base name (here x1 but in LaTeX notation x_{1} for the subscript) # tex works only in rms version 6.2-1 and higher texhrm <- function(x) sprintf(c('%s', '(%s - 0.5)_{+}', '\\sin(2\\pi \\frac{%s}{0.2})', '\\cos(2\\pi \\frac{%s}{0.2})'), x) hrm <- function(x) { z <- cbind(x, slopeChange=pmax(x - 0.5, 0), sin=sin(2*pi*x/0.2), cos=cos(2*pi*x/0.2)) attr(z, 'nonlinear') <- 2:4 attr(z, 'tex') <- texhrm z } h <- ols(y ~ gTrans(x1, hrm)) h latex(h) ggplot(Predict(h)) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y)) ``` # Computing Environment ```{r echo=FALSE} markupSpecs$html$session() ```