---
title: "General Multi-parameter Transformations in `rms`"
author:
  - name: Frank Harrell
    url: https://hbiostat.org
    affiliation: Department of Biostatistics<br>Vanderbilt University School of Medicine
author-title: ""
affiliation-title: ""
published-title: ""
date: last-modified
format:
  html:
    embed-resources: true
    anchor-sections: true
    code-tools: true
    code-fold: false
    fig-width: 6
    fig-height: 4
    code-block-bg: "#f1f3f5"
    code-block-border-left: "#31BAE9"
    mainfont: Source Sans Pro
    theme: journal
    toc: true
    toc-depth: 3
    toc-location: left
    captions: true
    cap-location: margin
    table-captions: true
    tbl-cap-location: margin
    reference-location: margin
comments:
  hypothesis: false

execute:
  warning: false
  message: false
---

<!-- To compile: quarto render gTrans.qmd  -->
<style>
.table {
	width: 50%;
}
p.caption {
  font-size: 0.8em;
  color: DarkBlue;
}
</style>

As of version 6.2-0 of the `rms` package appearing in CRAN on 2021-03-18, there is a new transformation function `gTrans`, joining `lsp`, `rcs`, `pol`, etc.  `gTrans` takes a numeric, character, or factor vector as its first argument and a function as its second argument.  The latter can be any function that creates a numeric matrix.  Give the columns of the matrix  names if you don't want to use default names.  If you want to take advantage of automatic tests of nonlinearity when running `rms::anova.rms` (e.g., by issuing `anova(fit)`), give the returned matrix an attribute `nonlinear` which is a vector of integers specifying which columns correpond to nonlinear terms.  Columns default to being considered as linear terms.  If you want the matrix you generate to have a linear term it is up to you to add a column containing the original (numeric) vector.

`gTrans` is quite general.  With it you can do things such as the following:

* use a linear spline basis different from the one `lsp` uses
* use a cubic spline basis different from the one `rcs` uses
* use a quadratic spline
* specify a linear, quadratic, or spline function with discontinuities
* implement non-spline bases such as fractional polynomials
* add cyclic trends to the regression
* pool certain categories of a categorical predictor

Examples where you might want to fit a discontinuous function are interrupted time series and allowing for special effects at zero.  As an example of the latter, suppose that a predictor is the number of days that a patient has been in the hospital in the past year.  Zero has a special meaning, so one may wish to allow for a spike at zero by using `gTrans(x, function(x) cbind(x == 0, sqrt(x), x))`.  This fits the following model:

$$Y = \beta_{0} + \beta_{1}[x = 0] + \beta_{2}\sqrt{x} + \beta_{3}x$$
The reason to use `gTrans` instead of creating a separate model term for the discontinuity is that all the predictor's basis functions remain connected.  This is important for several reasons, including the following.

* When studying the effect of varying $x$, the terms will be forced to vary together.  This will force, for example, $\beta_1$ to be ignored when $x > 0$ where one is studying the effect of $x$ on the continuous part of the model.  This applies to partial effect plots, odds ratio charts, nomograms, etc.
* When using `anova(fit)` to do a "chunk" test for whether $x$ is associated with $Y$, the 1 d.f. for the discontinuous term will be added to the 2 d.f. for $\beta_2$ and $\beta_3$ to get the correct 3 d.f. test for flatness (no association).
* `anova` can also provide chunk tests of nonlinearity when you keep the terms together.

Here are some examples with graphical output.  Start by simulating some random data.

```{r setup}
require(rms)
require(ggplot2)
# knitrSet(lang='markdown', fig.path='png/')
options(prType='html')
```

```{r gen}
n <- 40
set.seed(1)
y  <- runif(n)
x1 <- runif(n)
x2 <- rnorm(n)
g  <- sample(letters[1:4], n, TRUE)
# We usually plot predicted values from the 10th smallest to the 10th largest observations
# Instead for demonstrating fits we plot over the whole range
dd <- datadist(x1, x2, g, q.display=c(0,1)); options(datadist='dd')
```
# Quadratic Fit Duplicating `pol(x, 2)`


```{r quad}
f <- ols(y ~ pol(x1, 2) * g + x2)
pol2 <- function(x) {
  # Give a custom name xsq for one column
  z <- cbind(x, xsq=x^2)
  attr(z, 'nonlinear') <- 2
  z
}
h <- ols(y ~ gTrans(x1, pol2) * g + x2)
specs(h, long=TRUE)
rbind(coef(f), coef(h))
```

```{r quadh,results='asis'}
anova(f)
anova(h)
summary(f)
summary(h)
ggplot(Predict(f))
ggplot(Predict(h))
```

```{r contr}
k1 <- list(x1=c(.2, .4), g='b')
k2 <- list(x1=c(.2, .4), g='d')
contrast(f, k1)
contrast(h, k1)
contrast(f, k1, k2)
contrast(h, k1, k2)
```

# Linear Spline Fit Duplicating `lsp`

```{r lsp}
f <- ols(y ~ lsp(x1, c(.2, .4)))
# Give custom names for columns
lspline <- function(x) {
  z <- cbind(x, x.2=pmax(x - .2, 0), x.4=pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:3
  z
}
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
ggplot(Predict(f))
ggplot(Predict(h))
```

```{r lspa,results='asis'}
anova(f)
anova(h)
```

Let's use a different linear spline basis where the parameters refer to slopes and not to slope increments.

```{r lsp2}
curtail <- function(x, a, b=Inf) pmax(pmin(x, b) - a, 0)
lspline <- function(x) cbind(curtail(x, 0, 0.2), curtail(x, 0.2, 0.4), curtail(x, 0.4))
h <- ols(y ~ gTrans(x1, lspline))
rbind(coef(f), coef(h))
ggplot(Predict(h))
```

# More Exotic Linear Splines

Fit a straight line from x1=0.1 on, but force a flat relationship for x1 in [0, 0.1].  First do it forcing continuity at x1=0.1

```{r flin}
h <- ols(y ~ pmax(x1, 0.1))
xseq <- c(0, 0.099, 1, 0.101, seq(0.2, .8, by=0.1))
ggplot(Predict(h, x1=xseq))
```

Now allow discontinuity without a slope change.

```{r flin2}
flin <- function(x) cbind(x < 0.1, x)
h <- ols(y ~ gTrans(x1, flin))
yl <- c(-0.25, 1.25)
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```

Now have a discontinuity with a slope change.

```{r flin3}
flin <- function(x) cbind(x < 0.1, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```

Discontinuity with a slope change and initial slope not restricted to be zero.

```{r flin3b}
flin <- function(x) cbind(x < 0.1, x, pmax(x - 0.1, 0))
h <- ols(y ~ gTrans(x1, flin))
ggplot(Predict(h, x1=xseq), ylim=yl) +
  geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```

Discontinuous linear spline with two knots:

```{r flin4}
dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, pmax(x - .2, 0), pmax(x - .4, 0))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)
ggplot(Predict(h, x1=c(.1, .199, .2, .201, .3, .4, 1)), ylim=yl)
```

Discontinuous linear spline with different basis functions:

```{r flin5}
dlsp <- function(x) {
  z <- cbind(x, x >= 0.2, curtail(x, 0.2, 0.4), curtail(x, 0.4))
  attr(z, 'nonlinear') <- 2:4
  z
}
h <- ols(y ~ gTrans(x1, dlsp))
ggplot(Predict(h), ylim=yl)
```

# Categorical Predictor

```{r cat}
# Try on a categorical predictor
gr <- function(x) cbind(bc=x %in% c('b','c'), d=x == 'd')
h <- ols(y ~ gTrans(g, gr))
ggplot(Predict(h, g))
```

# Different Basis Functions for Cubic Splines {#sec-splinebasis}

Next let's use a B-spline basis for a restricted cubic spline function (natural spline).

```{r rcs}
require(splines)
f <- ols(y ~ rcs(x1, 5))
yl <- c(0, 1.25)
ggplot(Predict(f), ylim=yl)
h <- ols(y ~ gTrans(x1, function(x) ns(x, df=4)))
ggplot(Predict(h), ylim=yl)
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
# Now try to specify the same knots for ns as used for rcs
w <- rcs(x1, 5)
kn <- attr(w, 'parms')
nspl <- function(x) ns(x, knots=kn[2:4], Boundary.knots=c(kn[1], kn[5]))
# Show the ns basis functions
i <- order(x1)
matplot(x1[i], nspl(x1)[i, ], type='l', xlab='x1', ylab='Natural Spline Bases')

h <- ols(y ~ gTrans(x1, nspl))
ggplot(Predict(h), ylim=yl)
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
```

Now use the simple truncated power basis for the natural spline, but orthogonalize it by using principal components on centered, SD-scaled data.

```{r pc}
# Use knots already computed and stored in kn above
# Compute PCs on rcs and store scaling, centering, and rotation objects
# Note that this is automatic if you specify options(rcspc=TRUE) although not certain that
# post-fitting predicted values will work as well as below
X <- rcspline.eval(x1, kn, inclx=TRUE)  # use low-level function that rcs calls
# Show usual basis functions
matplot(x1[i], X[i, ], type='l', xlab='x1', ylab='rcs Bases')
p <- prcomp(X, scale=TRUE)
matplot(x1[i], p$x[i, ], type='l', xlab='x1', ylab='PC Rotated rcs Bases')
prcs <- function(x, rotation, center, scale, knots) {
  # deparse inside gTrans does not preserve matrix arguments
  rotation <- matrix(rotation, nrow=length(center))
  X <- rcspline.eval(x, knots, inclx=TRUE)
  scale(X, center, scale) %*% rotation
}
# Define parameters inside new function so the function will be self contained
formals(prcs) <- list(x=NULL, rotation=p$rotation, center=p$center, scale=p$scale, knots=kn)
# Show fit with usual basis
f <- ols(y ~ rcs(x1, kn))
ggplot(Predict(f))
# Show fit with rotated basis having no collinearities
h <- ols(y ~ gTrans(x1, prcs))
ggplot(Predict(h))
mean(abs(Predict(f)$yhat - Predict(h)$yhat))
```

QR-rotations of truncated power basis functions:

```{r}
# Use the Hmisc qrxcenter function to use the QR decomposition on X
w  <- qrxcenter(X)   # also use inside gTrans as in other examples
round(w$R,  2) # Transformation matrix; 1st column = linear = 3.83*x1
Xt <- w$x
round(cor(Xt), 3)
# Compare with high collinearities in the original bases
round(cor(X), 3)
matplot(x1[i], Xt[i, ], type='l', xlab='x1', ylab='QR Rotated rcs Bases')
```

# Non-spline Basis

The `gTrans` framework easily allows for non-spline basis functions for modeling continuous predictors.  For example, [fractional polynomials](https://mfp.imbi.uni-freiburg.de/fp) can be used directly.  Periodic functions may also be used if the cycle length is known.   Using the data simulated above, here is an example of adding a short-cycle harmonic series to a linear spline.  Here the cycle length is taken to be 0.2.  By adding cosine terms one does not need to know the phase angle.  A nice resource on adding cyclic trends to regression by Steve Simon is [here](http://www.pmean.com/07/CyclicalTrends.html).

To render the fitted equation in math notation using the `latex` method in `rms` requires one to specify a function that creates LaTeX notation for the columns of the matrix produced by `gTrans`.  The `sprintf` function is most handy for this, as shown below.

```{r harmon,results='asis'}
# Define a function that will be used by the latex method to customize typesetting of hrm model components
# Argument x will contain the variable's base name (here x1 but in LaTeX notation x_{1} for the subscript)
# tex works only in rms version 6.2-1 and higher
texhrm <- function(x) sprintf(c('%s', '(%s - 0.5)_{+}',
                                '\\sin(2\\pi \\frac{%s}{0.2})', '\\cos(2\\pi \\frac{%s}{0.2})'), x)
hrm <- function(x) {
  z <- cbind(x, slopeChange=pmax(x - 0.5, 0), sin=sin(2*pi*x/0.2), cos=cos(2*pi*x/0.2))
  attr(z, 'nonlinear') <- 2:4
  attr(z, 'tex')       <- texhrm
  z
}
h <- ols(y ~ gTrans(x1, hrm))
h
latex(h)
ggplot(Predict(h)) + geom_point(aes(x=x1, y=y), data=data.frame(x1, y))
```

# Computing Environment

```{r echo=FALSE}
markupSpecs$html$session()
```
