RMS Skills Chat

RMS Modeling Strategy — Chat Transcript

Topics: binary outcome modeling · ESS · spearman2 · varclus · redun · rms skill corrections
Date: May 2026


Frank

I want to test locally installed skills. Which local skills are available and are used automatically as needed?

Claude

Based on the available skills listed in my configuration, here are all the skills available — both public and user-installed:

Public Skills

User-Installed Skills

How they work: They’re used automatically. When a request matches a skill’s trigger description, the corresponding SKILL.md is read before writing any code or producing any file. The rms skill fires on any question about rms, lrm, ols, cph, restricted cubic splines, calibration, validation, nomograms, etc.

Frank

I have 30 candidate predictors, not all of which are suspected to act linearly, and a binary outcome. The overall sample size is 400 and the distribution of Y is 350 Y=0 and 50 Y=1. What is a recommended strategy for developing a reliable predictive model using the 30 candidate predictors and the binary Y?

Claude

This is a highly constrained problem, and the constraints dominate every decision.

Situation Assessment

Effective sample size (ESS): For binary Y, ESS = min(events, non-events) = 50 events. (Note: this ESS formula is incorrect — see correction below.)

Budget: With ESS = 50 and a rule of thumb of 10–20 events per parameter (EPV), the entire model budget is roughly 3–5 degrees of freedom. With 30 candidate predictors, this mismatch is severe.

Step 1: Pre-specify everything before examining Y

Any data-driven decisions made after looking at Y consume phantom degrees of freedom. All decisions must be made before any Y-outcome analysis.

Step 2: Rank predictors using subject-matter knowledge and Y-free statistics

Use spearman2() to rank predictors by their marginal association with Y via generalized Spearman ρ². (Note: the suggested use for predictor elimination here is incorrect — see correction below.)

library(Hmisc)
s2 <- spearman2(y ~ x1 + x2 + ... + x30, data = dt, p = 2)
plot(s2)

Step 3: Pre-specify a severely constrained model

Strategy                                Total df   Notes
5 predictors, all linear                5 df       ~10 EPV; adequate but assumes linearity
4 predictors, top 2 with RCS(3 knots)   6 df       ~8 EPV; tight but defensible
3 predictors, one with RCS(4 knots)     5 df       Only if one predictor clearly dominates

Step 4: Handle missing data properly

imp <- aregImpute(~ x1 + x3 + x7 + x12 + x20 + y, data = dt,
                  n.impute = 10, nk = 3)

Step 5: Use penalized lrm or Bayesian blrm

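library(rms)
dd <- datadist(dt); options(datadist = "dd")
# Fit on each completed dataset and combine the fits;
# x = TRUE, y = TRUE are required later by pentrace() and validate()
f <- fit.mult.impute(
  y ~ rcs(x3, 3) + rcs(x7, 3) + x1 + x12 + x20,
  lrm, imp, data = dt, x = TRUE, y = TRUE
)
# Search a grid of penalties, then refit at the selected penalty
p <- pentrace(f, penalty = c(0, 1, 2, 4, 8, 16, 32, 64, 128))
f.pen <- update(f, penalty = p$penalty)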

Step 6: Validate with the bootstrap

set.seed(17)
v <- validate(f.pen, B = 300)
cal <- calibrate(f.pen, B = 300)
plot(cal)

What Not to Do

Temptation                               Why to resist
Stepwise selection among 30 predictors   Catastrophic; phantom df throughout
Machine learning                         ML needs ~200 total events; regression wins. (Note: should be 200 EPV; see correction below.)
Data splitting for validation            Bootstrap is correct here
Categorizing continuous predictors       Discards information

Frank

First of all there is a minor error in what you wrote. ESS for binary Y is 3np(1-p) not min(events, non-events). Then later you wrote that machine learning may need 200 events per variable but we have 50. We don’t have 50 events per variable. We have 50 events in total. Then there is a serious misread of the rms philosophy. The spearman2 approach can help us allocate more than one d.f. to certain predictors but it is definitely NOT to be used to identify 5-8 candidate predictors. Examining predictive promise of predictors is never to be used to drop candidate predictors entirely.

Claude

All three corrections are well taken.

1. ESS for binary Y

The correct formula is 3np(1−p), not min(events, non-events). With n = 400, p = 50/400 = 0.125:

ESS = 3 × 400 × 0.125 × 0.875 = 131.25

That’s meaningfully larger than 50 and changes the budget calculation. With ~10–20 EPV, the df budget is roughly 7–13 df — still tight relative to 30 predictors, but less desperate than implied.
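As a quick check of the arithmetic, here is a minimal base-R sketch. The function name binary_ess is a hypothetical helper introduced only for illustration; the formula and the 10–20 EPV range are the ones quoted in this conversation.

```r
# Hypothetical helper: effective sample size for a binary outcome,
# using the formula from the correction: ESS = 3 * n * p * (1 - p)
binary_ess <- function(n, events) {
  p <- events / n          # proportion with Y = 1
  3 * n * p * (1 - p)
}

ess <- binary_ess(n = 400, events = 50)
ess                         # 131.25

# df budget implied by 20 and 10 "events" per parameter, respectively
budget_range <- ess / c(20, 10)
budget_range                # 6.5625 13.1250
```

For a balanced outcome (p = 0.5) the same formula gives 0.75n, so the ESS here is well below what 400 balanced observations would provide.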

2. EPV comparison with ML

The statement should have been: ML may need ~200 total events, not 200 EPV. (Note: this correction is itself still wrong — ML needs ~200 EPV, not 200 total events. See next turn.)

3. The role of spearman2 — the serious error

spearman2() is for allocating degrees of freedom among predictors already in the model — specifically, to decide which continuous predictors merit 4–5 knot splines versus 3 knots. It is emphatically not a tool for dropping predictors.

Examining predictive promise to eliminate candidates is a form of outcome-guided variable selection.

The correct approach is to include all pre-specified candidate predictors, spending df efficiently across all of them via linear terms for most, 3-knot RCS for those with strong prior reason to expect nonlinearity, and penalization to shrink the full set appropriately.

Frank

There remains an error: ML may need 200 EPV not 200 events total. For regression we want to start with the rule of thumb that the number of free parameters to estimate (other than the intercept) should be below m/15 where m is the ESS. The sample size here is insufficient for solving for how much penalization to use. We need a different approach to reduce the number of free parameters.

Claude

You’re right on both counts.

1. ML EPV correction

ML needs ~200 EPV (events per variable/parameter), not 200 total events. The contrast with regression is: regression is competitive at ~10–15 EPV; ML needs ~200 EPV. With ESS ≈ 131 and 30 predictors, ML is completely out of the question.

2. Parameter budget and the penalization problem

With ESS ≈ 131 and the rule of thumb of m/15, the budget for free parameters (excluding the intercept) is:

131 / 15 ≈ 9 parameters
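To make the budget concrete, the sketch below tallies the df a pre-specified model would spend and compares it with m/15. The helper model_df is hypothetical (it is not an rms function); it assumes a linear term costs 1 df and a restricted cubic spline with k knots costs k − 1 df.

```r
# Hypothetical helper (not part of rms): df consumed by a pre-specified model.
# A linear term costs 1 df; a restricted cubic spline with k knots costs k - 1 df.
model_df <- function(knots) {
  # knots: one entry per continuous predictor; 0 means "enter linearly"
  sum(ifelse(knots == 0, 1, knots - 1))
}

ess    <- 3 * 400 * 0.125 * 0.875   # 131.25, i.e. 3np(1 - p)
budget <- ess / 15                  # 8.75

# The earlier Step 5 model: rcs(x3, 3) + rcs(x7, 3) + x1 + x12 + x20
df_step5 <- model_df(c(3, 3, 0, 0, 0))   # 2 + 2 + 1 + 1 + 1 = 7
df_step5 <= budget                        # TRUE

# All 30 candidates entered linearly, with no splines at all
df_all_linear <- model_df(rep(0, 30))     # 30
df_all_linear <= budget                   # FALSE
```

Even with every predictor forced to be linear, the full candidate set needs more than three times the budget, which is why some form of reduction is unavoidable.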

The deeper problem: pentrace() estimates the penalty from the data, which itself requires sufficient information. With ESS ≈ 131 and 30 candidates, there isn’t enough data to reliably solve for the penalty — you’d be spending precious information just to determine how much to shrink.

The correct approach is to reduce free parameters through pre-specification, not penalization.

Frank

In this case there is not subject-matter knowledge to allow us to pre-specify 9 parameters to estimate. We need an entirely different approach.

Claude

You’re right — without subject-matter knowledge to guide pre-specification, and with the sample size insufficient to estimate penalization reliably, neither approach works. A fundamentally different strategy is needed.

The situation calls for data reduction that is completely outcome-blind — reducing the predictor space using only the X variables, with no reference to Y whatsoever. This avoids phantom degrees of freedom entirely.

1. Redundancy analysis (redun() in Hmisc)

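Identifies predictors that can be predicted almost perfectly from the remaining predictors. redun() fits flexible additive models (spline expansions of continuous predictors, controlled by nk) and flags a predictor as redundant when its R² reaches the r2 threshold (0.9 by default). Redundant predictors can be dropped without any reference to Y.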

r <- redun(~ x1 + x2 + ... + x30, data = dt, nk = 3)
r$Out   # predictors flagged as redundant
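To illustrate the idea without Hmisc, here is a simplified, linear-only analog in base R on simulated data. It is a sketch of the concept, not redun()'s actual algorithm: there are no spline expansions, the function name linear_redun is hypothetical, and the data are made up. At each pass it drops the predictor that is best predicted by the others, as long as its R² is at least the threshold. Y never appears.

```r
set.seed(3)
n <- 400
# Simulated predictors (an assumption for illustration): x3 is nearly x1 + x2
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x4 = rnorm(n))
X$x3 <- X$x1 + X$x2 + rnorm(n, sd = 0.1)

# Hypothetical helper: stepwise, linear-only redundancy screen
linear_redun <- function(X, r2 = 0.9) {
  keep <- names(X)
  dropped <- character(0)
  while (length(keep) > 2) {
    # R^2 from regressing each remaining predictor on all the others
    rsq <- sapply(keep, function(v) {
      fit <- lm(reformulate(setdiff(keep, v), response = v), data = X)
      summary(fit)$r.squared
    })
    if (max(rsq) < r2) break           # nothing left that is redundant
    worst <- names(which.max(rsq))     # most predictable predictor
    dropped <- c(dropped, worst)
    keep <- setdiff(keep, worst)
  }
  list(In = keep, Out = dropped)
}

res <- linear_redun(X)
res$Out   # predictors flagged as redundant
res$In    # predictors retained
```

Dropping one predictor at a time matters: x1 and x2 are each highly predictable while x3 is present, but neither is redundant once x3 has been removed.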

2. Variable clustering (varclus() in Hmisc)

Clusters predictors by their inter-correlations (Spearman-based). From each cluster, select one representative predictor on subject-matter grounds, or form a cluster summary. Entirely Y-blind.

vc <- varclus(~ x1 + x2 + ... + x30, data = dt)
plot(vc)
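The clustering idea can be sketched in base R as hierarchical clustering on 1 − ρ², where ρ is the Spearman correlation between predictors. This is an approximation of the approach for illustration only; the simulated data, the complete-linkage choice, and the number of clusters k = 3 are all assumptions. Y is never used.

```r
set.seed(1)
n <- 400
# Simulated predictors (an assumption): two correlated pairs and one loner
z1 <- rnorm(n)
z2 <- rnorm(n)
X <- data.frame(
  x1 = z1 + rnorm(n, sd = 0.3),
  x2 = z1 + rnorm(n, sd = 0.3),
  x3 = z2 + rnorm(n, sd = 0.3),
  x4 = z2 + rnorm(n, sd = 0.3),
  x5 = rnorm(n)
)

# Similarity = squared Spearman correlation; distance = 1 - similarity
rho2 <- cor(X, method = "spearman")^2
hc   <- hclust(as.dist(1 - rho2), method = "complete")

# Cut the tree into k clusters (k = 3 is an illustrative choice)
clusters <- cutree(hc, k = 3)
split(names(clusters), clusters)   # predictor names grouped by cluster
```

Each cluster would then contribute a single representative or a single summary score, so the df spent scales with the number of clusters rather than the number of predictors.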

3. Sparse principal components or PCA on X

Reduce the 30 predictors to a smaller number of orthogonal components explaining most of the X-space variance, then use those components as predictors. Fully Y-blind.
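A minimal base-R sketch of the ordinary PCA variant, on simulated data. The factor structure of the simulated X, the 75% variance threshold, and the cap of 8 components are illustrative assumptions, not recommendations from this conversation. Only X is used.

```r
set.seed(2)
n <- 400
# Simulated X (an assumption): 30 predictors driven by 5 latent factors
latent <- matrix(rnorm(n * 5), n, 5)
X <- latent[, rep(1:5, each = 6)] + matrix(rnorm(n * 30, sd = 0.5), n, 30)
colnames(X) <- paste0("x", 1:30)

# PCA on standardized predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of X-variance explained by the leading components
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Smallest number of components reaching 75% of the variance,
# capped so the model stays inside a small parameter budget
k <- min(which(cum_var >= 0.75)[1], 8)

# Component scores: these, not the raw x's, would enter the outcome model
scores <- pc$x[, seq_len(k), drop = FALSE]
dim(scores)
```

Because the scores are mutually uncorrelated, each retained component costs exactly one df if entered linearly, which makes the budget easy to track.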

The combined workflow

  1. Run varclus() to understand correlation structure among the 30 predictors
  2. Run redun() to eliminate genuinely redundant predictors
  3. Among the predictors retained after the Y-blind reduction, use spearman2() only to guide df allocation (nonlinearity decisions), never to drop predictors
  4. Fit lrm() with the reduced, pre-specified set targeting ≤ 9 df
  5. Bootstrap validate with validate() and calibrate()

This keeps all outcome-guided decisions out of the variable reduction step, preserving the integrity of subsequent inference.