RMS Skills Chat

Topics: binary outcome modeling · ESS · spearman2 · varclus · redun · rms skill corrections
Date: May 2026

Frank

I want to test locally installed skills. Which local skills are available and are used automatically as needed?

Claude

My configuration lists the following skills, both public and user-installed:

Public Skills

docx — Create/read/edit Word documents
pdf — Read, merge, split, create PDFs
pptx — Create/read/edit PowerPoint presentations
xlsx — Create/read/edit spreadsheets
product-self-knowledge — Facts about Anthropic products (Claude API, Claude.ai, Claude Code)
frontend-design — Production-grade web UI / React components
file-reading — Router skill for reading uploaded files of various types
pdf-reading — Extract/inspect content from PDFs

User-Installed Skills

rms — Custom skill covering the rms R package, multivariable regression modeling, Hmisc, rmsb, and Frank Harrell’s modeling philosophy

How they work: They’re used automatically. When a request matches a skill’s trigger description, the corresponding SKILL.md is read before writing any code or producing any file. The rms skill fires on any question about rms, lrm, ols, cph, restricted cubic splines, calibration, validation, nomograms, etc.

Frank

I have 30 candidate predictors, not all of which are suspected to act linearly, and a binary outcome. The overall sample size is 400 and the distribution of Y is 350 Y=0 and 50 Y=1. What is a recommended strategy for developing a reliable predictive model using the 30 candidate predictors and the binary Y?

Claude

This is a highly constrained problem, and the constraints dominate every decision.

Situation Assessment

Effective sample size (ESS): For binary Y, ESS = min(events, non-events) = 50 events. (Note: this ESS formula is incorrect — see correction below.)

Budget: With ESS = 50 and a rule of thumb of 10–20 events per parameter (EPV), the entire model budget is roughly 3–5 degrees of freedom. With 30 candidate predictors, this mismatch is severe.

Recommended Strategy

Step 1: Pre-specify everything before examining Y

Any data-driven decisions made after looking at Y consume phantom degrees of freedom. All decisions must be made before any Y-outcome analysis.

Step 2: Rank predictors using subject-matter knowledge and Y-free statistics

Use spearman2() to rank predictors by their marginal association with Y via generalized Spearman ρ². (Note: the suggested use for predictor elimination here is incorrect — see correction below.)

library(Hmisc)
s2 <- spearman2(y ~ x1 + x2 + ... + x30, data = dt, p = 2)
plot(s2)

Step 3: Pre-specify a severely constrained model

Strategy	Total df	Notes
5 predictors, all linear	5 df	~10 EPV; adequate but assumes linearity
4 predictors, top 2 with RCS(3 knots)	6 df	~8 EPV; tight but defensible
3 predictors, one with RCS(4 knots)	5 df	Only if one predictor clearly dominates

Step 4: Handle missing data properly

imp <- aregImpute(~ x1 + x3 + x7 + x12 + x20 + y, data = dt,
                  n.impute = 10, nk = 3)

Step 5: Use penalized lrm or Bayesian blrm

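library(rms)
dd <- datadist(dt); options(datadist = "dd")   # predictor summaries used by rms
# Fit the pre-specified model on each completed dataset and combine the fits
f <- fit.mult.impute(
  y ~ rcs(x3, 3) + rcs(x7, 3) + x1 + x12 + x20,
  lrm, imp, data = dt, x = TRUE, y = TRUE
)
# Evaluate a grid of penalties, then refit using the penalty pentrace selected
p <- pentrace(f, penalty = c(0, 1, 2, 4, 8, 16, 32, 64, 128))
f.pen <- update(f, penalty = p$penalty)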

Step 6: Validate with the bootstrap

set.seed(17)
v <- validate(f.pen, B = 300)
cal <- calibrate(f.pen, B = 300)
plot(cal)

What Not to Do

Temptation	Why to resist
Stepwise selection among 30 predictors	Catastrophic; phantom df throughout
Machine learning	ML needs ~200 total events; regression wins. (Note: should be 200 EPV — see correction below.)
Data splitting for validation	Bootstrap is correct here
Categorizing continuous predictors	Discards information

Frank

First of all there is a minor error in what you wrote. ESS for binary Y is 3np(1-p) not min(events, non-events). Then later you wrote that machine learning may need 200 events per variable but we have 50. We don’t have 50 events per variable. We have 50 events in total. Then there is a serious misread of the rms philosophy. The spearman2 approach can help us allocate more than one d.f. to certain predictors but it is definitely NOT to be used to identify 5-8 candidate predictors. Examining predictive promise of predictors is never to be used to drop candidate predictors entirely.

Claude

All three corrections are well taken.

1. ESS for binary Y

The correct formula is 3np(1−p), not min(events, non-events). With n = 400, p = 50/400 = 0.125:

ESS = 3 × 400 × 0.125 × 0.875 = 131.25

That’s meaningfully larger than 50 and changes the budget calculation. At 10–20 units of ESS per parameter, the df budget is roughly 7–13 df (131/20 ≈ 6.6, 131/10 ≈ 13.1). That is still tight relative to 30 predictors, but less desperate than implied.
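As a quick check of the arithmetic, here is a minimal R sketch. The variable names are illustrative only; the numbers come from the question (n = 400, 50 events).

```r
# Effective sample size for a binary outcome: 3 * n * p * (1 - p)
n      <- 400            # total sample size
events <- 50             # number of Y = 1
p      <- events / n     # proportion of events, 0.125

ess     <- 3 * n * p * (1 - p)      # 131.25
ess_old <- min(events, n - events)  # 50, the formula used in the first answer

# Rough df range at 20 and at 10 units of ESS per parameter
df_range <- ess / c(20, 10)
print(c(ess = ess, ess_old = ess_old))
print(df_range)
```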

2. EPV comparison with ML

The statement should have been: ML may need ~200 total events, not 200 EPV. (Note: this correction is itself still wrong — ML needs ~200 EPV, not 200 total events. See next turn.)

3. The role of `spearman2` — the serious error

spearman2() is for allocating degrees of freedom among predictors already in the model — specifically, to decide which continuous predictors merit 4–5 knot splines versus 3 knots. It is emphatically not a tool for dropping predictors.

Examining predictive promise to eliminate candidates is a form of outcome-guided variable selection. It:

Consumes phantom degrees of freedom
Biases coefficient estimates for retained predictors
Produces overconfident inference
Is essentially a disguised form of stepwise selection
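
A small simulation can make the overconfidence concrete. The sketch below is an illustration under stated assumptions, not an analysis of the data in question: it uses pure-noise predictors, base R only (glm() rather than lrm()), and an arbitrary screen that keeps the 5 predictors most correlated with Y. Because Y is generated independently of every predictor, the true c-index is 0.5, yet the apparent c-index of the screened model comes out above that.

```r
set.seed(3)
n <- 400
k <- 30
# 30 noise predictors; the names x1..x30 are placeholders
X <- matrix(rnorm(n * k), n, k, dimnames = list(NULL, paste0("x", 1:k)))
y <- rbinom(n, 1, 0.125)   # outcome generated independently of X

# Outcome-guided screen: keep the 5 predictors with the largest
# absolute Spearman correlation with Y
score <- abs(cor(X, y, method = "spearman"))[, 1]
keep  <- names(sort(score, decreasing = TRUE))[1:5]

# Fit a logistic model on the "winners" only
d   <- data.frame(y = y, X[, keep])
fit <- glm(y ~ ., data = d, family = binomial)

# Apparent c-index (AUC) from the Mann-Whitney rank formula
phat <- fitted(fit)
r    <- rank(phat)
n1   <- sum(y == 1)
n0   <- sum(y == 0)
auc  <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
print(auc)   # exceeds 0.5 even though no predictor is related to Y
```

The apparent discrimination here is produced entirely by the screening step plus fitting, which is the sense in which the selection consumes degrees of freedom that the final model does not show.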

The correct approach is to include all pre-specified candidate predictors, spending df efficiently across all of them via linear terms for most, 3-knot RCS for those with strong prior reason to expect nonlinearity, and penalization to shrink the full set appropriately.

Frank

There remains an error: ML may need 200 EPV not 200 events total. For regression we want to start with the rule of thumb that the number of free parameters to estimate (other than the intercept) should be below m/15 where m is the ESS. The sample size here is insufficient for solving for how much penalization to use. We need a different approach to reduce the number of free parameters.

Claude

You’re right on both counts.

1. ML EPV correction

ML needs ~200 EPV (events per variable/parameter), not 200 total events. By contrast, regression is competitive at ~10–15 EPV. With ESS ≈ 131 spread over 30 predictors, there are only about 4 units of ESS per candidate predictor (131/30 ≈ 4.4), so ML is completely out of the question.

2. Parameter budget and the penalization problem

With ESS ≈ 131 and the rule of thumb of m/15, the budget for free parameters (excluding the intercept) is:

131 / 15 ≈ 9 parameters
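
The same calculation as a minimal R sketch. The helper name param_budget is hypothetical, introduced here only for illustration; the divisor 15 is the m/15 rule of thumb stated above.

```r
# Budget of free parameters (excluding the intercept) under the m/15 rule
param_budget <- function(ess, per_param = 15) ess / per_param

ess    <- 3 * 400 * 0.125 * 0.875   # 131.25, ESS for n = 400 with 50 events
budget <- param_budget(ess)         # 8.75, i.e. about 9 parameters

# ESS that would be needed to spend just 1 df on each of 30 predictors
needed_ess <- 30 * 15               # 450

print(c(budget = budget, needed_ess = needed_ess))
```

Even a purely linear model in all 30 predictors would need roughly 3.4 times the available ESS (450 versus 131).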

The deeper problem: pentrace() estimates the penalty from the data, which itself requires sufficient information. With ESS ≈ 131 and 30 candidates, there isn’t enough data to reliably solve for the penalty — you’d be spending precious information just to determine how much to shrink.

The correct approach is to reduce free parameters through pre-specification, not penalization:

Use subject-matter knowledge to assign each of the 30 predictors a complexity budget a priori — mostly linear terms (1 df), with 3-knot RCS (2 df) reserved for those with strong prior expectation of nonlinearity
Use spearman2() only to decide among pre-included predictors which ones warrant the extra df for nonlinearity
Keep total df at or below ~9 across all 30 predictors
No data-driven predictor elimination; all 30 stay in, but most contribute only 1 df

Frank

In this case there is not subject-matter knowledge to allow us to pre-specify 9 parameters to estimate. We need an entirely different approach.

Claude

You’re right — without subject-matter knowledge to guide pre-specification, and with the sample size insufficient to estimate penalization reliably, neither approach works. A fundamentally different strategy is needed.

The situation calls for data reduction that is completely outcome-blind — reducing the predictor space using only the X variables, with no reference to Y whatsoever. This avoids phantom degrees of freedom entirely.

1. Redundancy analysis (`redun()` in Hmisc)

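Identifies predictors that can be predicted almost perfectly from the remaining predictors, using flexible additive models in which continuous predictors are expanded with splines (nk sets the number of knots). Redundant predictors can be dropped without any reference to Y.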

r <- redun(~ x1 + x2 + ... + x30, data = dt, nk = 3)
r$Out   # predictors flagged as redundant

2. Variable clustering (`varclus()` in Hmisc)

Clusters predictors by their inter-correlations (Spearman-based). From each cluster, select one representative predictor on subject-matter grounds, or form a cluster summary. Entirely Y-blind.

vc <- varclus(~ x1 + x2 + ... + x30, data = dt)
plot(vc)
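
For intuition about what the clustering step does, here is a rough base-R analogue on simulated data: hierarchical clustering on 1 minus the squared Spearman correlation. It is an illustration only and not a substitute for varclus(); the data and the names z1, z2, a1, a2, b1, b2 are made up.

```r
set.seed(2)
n  <- 400
z1 <- rnorm(n)   # latent factor behind a1 and a2
z2 <- rnorm(n)   # latent factor behind b1 and b2
X  <- cbind(a1 = z1 + rnorm(n, sd = 0.3),
            a2 = z1 + rnorm(n, sd = 0.3),
            b1 = z2 + rnorm(n, sd = 0.3),
            b2 = z2 + rnorm(n, sd = 0.3))

# Similarity = squared Spearman correlation; no outcome is involved
sim <- cor(X, method = "spearman")^2

# Cluster on dissimilarity 1 - similarity, then cut into 2 groups
hc     <- hclust(as.dist(1 - sim), method = "complete")
groups <- cutree(hc, k = 2)
print(groups)
```

In this toy example a1 and a2 land in one cluster and b1 and b2 in the other, so each pair could be represented by a single variable or a summary score, costing 1 df per cluster instead of 1 df per variable.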

3. Sparse principal components or PCA on X

Reduce the 30 predictors to a smaller number of orthogonal components explaining most of the X-space variance, then use those components as predictors. Fully Y-blind.
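
A minimal sketch of the ordinary PCA variant, using base R's prcomp() on placeholder data. The simulated matrix X and the choice of k = 5 components are assumptions made for illustration; in practice k would be fixed from the budget or from the X-space variance alone, never from Y.

```r
set.seed(1)
n <- 400
# Placeholder predictor matrix standing in for x1..x30
X <- matrix(rnorm(n * 30), nrow = n, ncol = 30,
            dimnames = list(NULL, paste0("x", 1:30)))

# PCA on centered, scaled predictors; Y is never touched
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of X-space variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var       <- cumsum(var_explained)

# Keep the first k component scores as the new predictors
k   <- 5   # hypothetical choice
pcs <- pca$x[, 1:k, drop = FALSE]
print(round(cum_var[1:k], 3))
```

Because the placeholder columns are independent noise, the variance here is spread almost evenly across components; with real, correlated predictors the first few components would typically capture much more. The columns of pcs could then be used as ordinary predictors in the logistic model, each costing 1 df.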

The combined workflow

Run varclus() to understand correlation structure among the 30 predictors
Run redun() to eliminate genuinely redundant predictors
Among the predictors retained after the Y-blind reduction, use spearman2() only to guide df allocation (nonlinearity decisions), never to drop predictors
Fit lrm() with the reduced, pre-specified set targeting ≤ 9 df
Bootstrap validate with validate() and calibrate()

This keeps all outcome-guided decisions out of the variable reduction step, preserving the integrity of subsequent inference.
