# Pseudo $$R^2$$ Measures

Let $$a$$ = deviance of the full model and $$b$$ = deviance of the intercept-only model. Then LR $$\chi^{2} = b - a$$.

Let $$k$$ = total number of parameters in the model.

Let $$p$$ = number of non-intercept parameters in the model.

Then AIC $$= a + 2k$$.

McFadden’s $$R^2$$ is $$1 - \frac{a}{b}$$; the adjusted version substitutes AIC for $$a$$, giving $$1 - \frac{a + 2k}{b}$$.

I can’t find justification for the factor 2.
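As a concrete illustration, here is a minimal R sketch computing AIC and McFadden’s adjusted $$R^2$$ from a fitted binary logistic model (the simulated data and object names are made up for this example):

```r
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 3), n, 3)
y <- rbinom(n, 1, plogis(x[, 1]))
fit <- glm(y ~ x, family = binomial)

a <- deviance(fit)       # deviance of the full model
b <- fit$null.deviance   # deviance of the intercept-only model
k <- length(coef(fit))   # total number of parameters (intercept + 3 slopes)
a + 2 * k                # AIC (matches AIC(fit) for ungrouped binary data)
1 - (a + 2 * k) / b      # McFadden's adjusted R^2
```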

Maddala-Cox-Snell (MCS) $$R^2$$: $$1 - \exp(-LR/n)$$

For some models the MCS $$R^2$$ cannot attain a value of 1.0 even for a perfect model. The Nagelkerke $$R^2$$ ($$R^{2}_{N}$$) divides the MCS $$R^2$$ by its maximum attainable value, which is $$1 - \exp(-b/n)$$. For a binary logistic example, suppose there is a single balanced binary predictor $$x$$ and $$y = x$$. Then the MCS $$R^2$$ is 0.75 and $$R^{2}_{N} = 1.0$$ for predicting $$y$$ from itself. But there is controversy over whether $$R^{2}_{N}$$ is properly recalibrated over its whole range, and its use of $$n$$ doesn’t apply to censored data, so we don’t use $$R^{2}_{N}$$ below.
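A quick analytic check of this example (the sample size below is arbitrary; any balanced $$n$$ gives the same answer):

```r
n  <- 100                  # 50 observations with y = x = 0, 50 with y = x = 1
b  <- -2 * n * log(0.5)    # intercept-only deviance (fitted probability 0.5)
lr <- b                    # the perfect model has deviance a = 0
1 - exp(-lr / n)           # MCS R^2 = 0.75
(1 - exp(-lr / n)) / (1 - exp(-b / n))   # Nagelkerke R^2 = 1.0
```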

# Adjusted $$R^2$$ Measures

The idea of adjustment is to not reward $$R^2$$ for overfitting. The adjusted $$R^2$$ for linear models is $$1 - (1 - R^{2})\frac{n-1}{n-p-1}$$, which is obtained by replacing the biased estimate of the residual variance with the unbiased estimate $$\frac{\sum_{i=1}^{n} r^{2}_{i}}{n-p-1}$$ (and likewise the total variance with its unbiased estimate using $$n-1$$), where $$r_{i}$$ is a residual.
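A quick numerical check of this identity against R’s built-in adjusted $$R^2$$ (simulated data; all names are arbitrary):

```r
set.seed(1)
n <- 100; p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
s <- summary(lm(y ~ x))
c(builtin = s$adj.r.squared,
  manual  = 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1))  # identical values
```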

Carrying this to MCS gives $$1 - \exp(-LR/n)\frac{n-1}{n-p-1} = 1 - \exp(-LR/n + \log\frac{n-1}{n-p-1}) = 1 - \exp(-(LR - n \log\frac{n-1}{n-p-1}) / n)$$.

Since $$\log(1 + x) \approx x$$ for small $$x$$, $$n \log\frac{n-1}{n-p-1} = n \log\left(1 + \frac{p}{n-p-1}\right) \approx \frac{np}{n-p-1} \approx p$$ when $$n \gg p$$.

So applying the linear model adjusted $$R^2$$ idea to the MCS $$R^2$$ gives approximately $$1 - \exp(-(LR - p) / n)$$. This is sensible because under the global null hypothesis of no association between any of the X’s and Y, $$LR$$ has approximately a $$\chi^{2}_{p}$$ distribution, so its expected value is $$p$$. Thus $$LR - p$$ is a chance correction for $$LR$$.
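To see how close the $$LR - p$$ shortcut is to the exact penalty, here is a small numerical sketch (the values of $$LR$$, $$n$$, and $$p$$ are made up):

```r
n <- 1000; p <- 10; lr <- 50
pen <- n * log((n - 1) / (n - p - 1))   # exact penalty, about 10.06
c(exact  = 1 - exp(-(lr - pen) / n),    # classical adjustment carried over
  approx = 1 - exp(-(lr - p)   / n))    # LR - p chance correction
```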

# Adjusted Modified MCS $$R^2$$

Besides $$R^{2}_{N}$$, the R rms package implements four types of MCS $$R^2$$, computed by the Hmisc package’s R2Measures function. Either of the two $$p$$ adjustments derived above (the classical $$\frac{n-1}{n-p-1}$$ multiplier or the $$LR - p$$ subtraction) can be used, with the $$LR - p$$ method being the default. This is a slight modification of the Mittlbock & Schemper approach (see references). The first two measures use the raw sample size $$n$$ and the second two use the effective sample size $$m$$. The effective sample size $$m$$ is taken to be the following:

• For right-censored time-to-event data (survival analysis), $$m$$ is the number of uncensored observations (the number of events). This is exactly correct when the survival distribution is exponential or in the context of the Cox/log-rank two-sample test for comparing survival distributions. For front-loaded hazard functions, where instantaneous event rates are very high at the beginning of follow-up, uncensored observations convey more information and $$m$$ should lie between the number of events and $$n$$. There is currently no guidance on exactly how to estimate $$m$$ in this case. See the Benedetti reference.
• For a binary, ordinal, semi-continuous, or continuous uncensored response variable $$Y$$, the effective sample size is taken as the sample size $$m < n$$ for a continuous (tie-free) variable that makes the approximate variance of the log odds ratio in a proportional odds model equal to the variance obtained from the original $$Y$$ of size $$n$$. This also equates the power of the Wilcoxon test for the smaller continuous $$Y$$ to that for the larger $$Y$$ with ties. This approach is due to Whitehead and is a good approximation for the binary $$Y$$ case. Let $$y_{1}, y_{2}, \ldots, y_{k}$$ be the distinct values of $$Y$$ and $$p_{1}, p_{2}, \ldots, p_{k}$$ be the proportions of observations taking on these distinct values. The effective sample size is $$m = n(1 - \sum_{i=1}^{k} p_{i}^{3})$$. The multiplier of $$n$$ is what is computed as the Info information measure by the R Hmisc describe function, and this $$m$$ is used in the Hmisc popower function. A numerical sketch of this computation follows the list.
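Here is the effective sample size computed directly from the formula above (the ordinal outcome below is made up for illustration):

```r
y  <- c(rep(0, 30), rep(1, 50), rep(2, 20))   # hypothetical ordinal Y with heavy ties
pr <- as.numeric(table(y)) / length(y)        # proportions at the distinct values
m  <- length(y) * (1 - sum(pr ^ 3))           # effective sample size: 100 * 0.84 = 84
m
```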

There are four MCS-based $$R^2$$ measures, produced in this order by the R2Measures function.

• $$R^{2}_{n}$$: the original MCS $$R^2$$
• $$R^{2}_{p,n}$$: adjusted for estimating $$p$$ regression coefficients (non-intercepts)
• $$R^{2}_{m}$$: uses the effective sample size instead of the actual sample size, with no penalty for overfitting
• $$R^{2}_{p,m}$$: uses the effective sample size and adjusts for the $$p$$ estimated regression coefficients
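Putting the pieces together, here is a minimal self-contained sketch of all four measures using the formulas above (an illustration under the $$LR - p$$ default, not the Hmisc implementation; the function name and input values are made up):

```r
r2mcs <- function(lr, p, n, m) {
  c(R2.n  = 1 - exp(-lr / n),         # original MCS R^2
    R2.pn = 1 - exp(-(lr - p) / n),   # adjusted via the LR - p chance correction
    R2.m  = 1 - exp(-lr / m),         # effective sample size, unadjusted
    R2.pm = 1 - exp(-(lr - p) / m))   # effective sample size, adjusted
}
r2mcs(lr = 50, p = 10, n = 1000, m = 840)   # hypothetical values
```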

Note that when comparing the performance of a binary $$Y$$ model with that of an ordinal $$Y$$ model it is not appropriate to use a measure based on $$m$$. That is because the ordinal model is charged with a more difficult prediction task but would be penalized for a higher effective sample size.