Pseudo \(R^2\) Measures

Background

The goal in choosing an \(R^2\) measure that applies to all sorts of statistical models is to quantify predictability of the outcome variable \(Y\) from the predictors \(X\) in such a way as to be statistically efficient. Efficiency matters. For example, when comparing two models we want to be able to detect when one model has more predictive discrimination than the other. Rank measures such as the \(c\)-index (concordance probability; AUROC when \(Y\) is binary) are not efficient, e.g., two models can have very similar \(c\) when one model is demonstrably better than the other, e.g., it yields a better variety of predictions. Pseudo \(R^2\) measures based on the log-likelihood are efficient.

Unadjusted \(R^2\) measures will continue to increase as variables are added to the model, even if performance of the model on new data deteriorates. So unadjusted \(R^2\) measures are biased high by not accounting for overfitting. Adjusted \(R^2\) account for overfitting. If \(p\) denotes the number of non-intercept predictor degrees of freedom in a model, in choosing an adjusted \(R^2\) measure one must decide on how to penalize for \(p\). There are two overall approaches:

The \(1\times p\) adjustment used in traditional \(R^{2}_\text{adj}\) which adjusts for overfitting and is useful for (almost) unbiasedly estimating true model performance with reference to the current sample
The \(2\times p\) adjustment used in McFadden’s \(R^2\) and AIC which targets out-of-sample predictive ability. The addition \(p\) penalty can be thought of as penalizing for variation in the predictor matrix \(X\).

In ordinary linear models with maximum-information continuous \(Y\), the effective sample size is the number of observations and the formula for \(R^2\) estimates the same population quantity regardless of sample size. In pseudo \(R^2\) the choice of the sample size to insert into the formulas is less obvious. For survival analysis with right-censored data, the data’s information content is much more related to the number of uncensored observations than to the total number of observations. For general censoring, this approach provides an effective sample size in a wide family of semiparametric survival models. For univariate uncensored \(Y\), a general approach to effective sample sizes is here.

Notation

Let a = deviance of the full model and b = deviance of the intercept-only model. LR \(\chi^{2} = b - a\)

Let k = total number of parameters in the model

Let p = number of non-intercepts in the model

Then AIC = a + 2k

McFadden’s adjusted \(R^2\) is \(1 - \frac{a + 2k}{b}\)

I can’t find justification for the factor 2.

Maddala-Cox-Snell (MCS) \(R^2\): \(1 - \exp(-LR/n)\)

For some models, the MCS \(R^2\) cannot attain a value of 1.0 even with a perfect model. The Nagelkerke \(R^2\) (\(R^{2}_{N}\)) divides the MCS \(R^2\) by the maximum attainable value which is \(1 - \exp(-b/n)\). For a binary logistic example, suppose there is one binary predictor x that is balanced, and y=x. The MCS \(R^2\) is 0.75 and \(R^{2}_{N}=1.0\) for predicting y from itself. But there is controversy over whether \(R^{2}_{N}\) is properly recalibrated over its whole range and its use of \(n\) doesn’t apply to censored data, so we don’t use \(R^{2}_{N}\) below.

Adjusted \(R^2\) Measures

The idea of adjustment is to not reward \(R^2\) for overfitting. The most commonly used adjusted \(R^2\) with linear models is \(1 - (1 - R^{2})\frac{n-1}{n-p-1}\) which is obtained by replacing the effective estimate of the residual variance with the unbiased estimate \(\frac{\sum_{i=1}^{n} r^{2}_{i}}{n-p-1}\) where \(r\) is a residual.

Carrying this to MCS gives \(1 - \exp(-LR/n)\frac{n-1}{n-p-1} = 1 - \exp(-LR/n + \log\frac{n-1}{n-p-1}) = 1 - \exp(-(LR - n \log\frac{n-1}{n-p-1}) / n)\).

\(n \log\frac{n-1}{n-p-1} = n \log(1 + \frac{p}{n-p-1}) \approx \frac{np}{n-p-1} \approx p\).

So applying linear model adjusted \(R^2\) to MCS \(R^2\) is approximately \(1 - \exp(-(LR - p) / n)\). This is sensible because under the global null hypothesis of no associations between any X’s and Y the expected value of \(LR\) is \(p\). Thus \(LR - p\) is a chance correction for \(LR\).

Adjusted Modified MCS \(R^2\)

Besides \(R^{2}_{N}\), the R rms package implements 4 types of MCS \(R^2\) computed by the Hmisc package R2Measures function. Either of the two \(p\) adjustments can be used, with the \(LR - p\) method being the default. This is a slight modification of the Mittlbock & Schemper approach (see references). The first two measures use the raw sample size \(n\) and the second two use the effective sample size \(m\). The effective sample size \(m\) is taken to be the following:

For right-censored time-to-event data (survival analysis) \(m\) is the number of uncensored observations (number of events). This is exactly correct when the survival distribution is exponential or the context is the Cox-logrank two-sample test for comparing survival distributions. For front-loaded hazard functions where instantaneous event rates are very high at the beginning of follow-up, uncensored observations convey more information and \(m\) should be between \(n\) and the number of events. There is currently no guidance for exactly how to estimate \(m\) in this case. See the Benedetti reference.
For binary, ordinal, semi-continuous, or continuous uncensored response variable \(Y\) the effective sample size is taken as the sample size \(m < n\) for a continuous variable that makes the approximate variance of the log odds ratio in a proportional odds model equal to the variance from the original \(Y\) of size \(n\). This also makes the power of the Wilcoxon test for the smaller continuous \(Y\) and the larger \(Y\) with ties equivalent. This approach is due to Whitehead and is a good approximation for the binary \(Y\) case. Let \(y_{1}, y_{2}, ..., y_{k}\) be the distinct values of \(Y\) and \(p_{1}, p_{2}, .., p_{k}\) be the proportion of \(Y\) values occurring for these distinct \(Y\) values. The effective sample size is \(m = n(1 - \sum_{i}^{k} p_{i}^{3})\). The multiplier for \(n\) is what is computed as the Info information measure by the R Hmisc describe function, and this \(m\) is used in the Hmisc popower function.

There are four MCS-based \(R^2\) measures, prooduced in order by the R2Measures function.

\(R^{2}_{n}\): original MCS \(R^2\)
\(R^{2}_{p,n}\): adjusted for estimating \(p\) regression coefficients (non-intercepts)
\(R^{2}_{m}\): use effective instead of actual sample size, don’t penalize for overfitting
\(R^{2}_{p,m}\): effective sample size adjusted for \(p\) estimated regression coefficients

Note that when comparing the performance of a binary \(Y\) model with that of an ordinal \(Y\) model it is not appropriate to use a measure based on \(m\). That is because the ordinal model is charged with a more difficult prediction task but would be penalized for a higher effective sample size.

References

https://www.glmj.org/archives/articles/Smith_v39n2.pdf
https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds
https://stats.stackexchange.com/questions/82105/mcfaddens-pseudo-r2-interpretation
http://eml.berkeley.edu/~mcfadden/travel.html especially https://eml.berkeley.edu/~mcfadden/travel/ch5.pdf
http://thestatsgeek.com/2014/02/08/r-squared-in-logistic-regression
https://statisticalhorizons.com/r2logistic
https://support.sas.com/resources/papers/proceedings/proceedings/sugi25/25/st/25p256.pdf
Mittlbock and Schemper which mentions van Houwelingen’s idea of correcting log-likelihood by (p+1)/2 in the numerator and 1/2 in the denominator of McFadden’s \(R^2\). So this is not consistent with AIC but is consistent with a chance correction for the \(\chi^2\) statistic.
R Hmisc package R2Measures function
Benedetti
Whitehead
https://stats.stackexchange.com/questions/48703
https://discourse.mc-stan.org/t/r-2-calculation-for-brm-model-with-cumulative-family-type