Chat Transcript

Predictive Modeling Strategy with rms: Corrections & Methodology

Topics: binary outcome modeling · ESS · spearman2 · varclus · redun · rms skill corrections  ·  May 2026


Frank

I want to test locally installed skills. Which local skills are available and are used automatically as needed?


Claude

Based on the available skills listed in my configuration, here are all the skills available — both public and user-installed:

Public Skills

  • docx — Create/read/edit Word documents
  • pdf — Read, merge, split, create PDFs
  • pptx — Create/read/edit PowerPoint presentations
  • xlsx — Create/read/edit spreadsheets
  • product-self-knowledge — Facts about Anthropic products (Claude API, Claude.ai, Claude Code)
  • frontend-design — Production-grade web UI / React components
  • file-reading — Router skill for reading uploaded files of various types
  • pdf-reading — Extract/inspect content from PDFs

User-Installed Skills

  • rms — Your custom skill covering the rms R package, multivariable regression modeling, Hmisc, rmsb, and Frank Harrell's modeling philosophy

How they work: Yes, they're used automatically. When your request matches a skill's trigger description, I read the corresponding SKILL.md before writing any code or producing any file. The rms skill in particular will fire on any question about rms, lrm, ols, cph, restricted cubic splines, calibration, validation, nomograms, etc.

To test a skill, just ask something that would naturally trigger it — for example, an rms modeling question, a request to create a .docx file, or asking me to read an uploaded PDF. I'll read the skill first and follow its guidance.


Frank

I have 30 candidate predictors, not all of which are suspected to act linearly, and a binary outcome. The overall sample size is 400 and the distribution of Y is 350 Y=0 and 50 Y=1. What is a recommended strategy for developing a reliable predictive model using the 30 candidate predictors and the binary Y?


Claude

This is a highly constrained problem, and the constraints dominate every decision. Here's the full recommended strategy.


Situation Assessment

Effective sample size (ESS): For binary Y, ESS = min(events, non-events) = 50 events. This is the binding constraint — not n = 400. [Note: this ESS formula is incorrect — see correction below.]

Budget: With ESS = 50 and a rule of thumb of 10–20 events per parameter (EPV), your entire model budget is roughly 3–5 degrees of freedom before overfitting becomes serious. You have 30 candidate predictors. This mismatch is severe.


Recommended Strategy

Step 1: Pre-specify everything before examining Y

Any data-driven decisions made after looking at Y consume phantom degrees of freedom. All decisions in Steps 2–4 must be made before any Y-outcome analysis.

Step 2: Rank predictors using subject-matter knowledge and Y-free statistics

Use spearman2() to rank predictors by their marginal association with Y via generalized Spearman ρ². [Note: the suggested use for elimination here is incorrect — see correction below.]

library(Hmisc)
s2 <- spearman2(y ~ x1 + x2 + ... + x30, data = dt, p = 2)
plot(s2)

Step 3: Pre-specify a severely constrained model

Strategy                                 Total df   Notes
5 predictors, all linear                 5 df       ~10 EPV; adequate but assumes linearity
4 predictors, top 2 with RCS(3 knots)    6 df       ~8 EPV; tight but defensible
3 predictors, one with RCS(4 knots)      5 df       Only if one predictor clearly dominates
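
The df totals in the table follow from simple bookkeeping: a linear term costs 1 df, and a restricted cubic spline with k knots costs k - 1 df. A minimal R sketch of that arithmetic, assuming all predictors are continuous (the helper name total_df is hypothetical and is not part of rms or Hmisc):

```r
# Hypothetical helper: total regression df for a set of continuous predictors.
# knots[i] = 0 means predictor i enters linearly (1 df);
# knots[i] = k (k >= 3) means a restricted cubic spline with k knots (k - 1 df).
total_df <- function(knots) {
  sum(ifelse(knots == 0, 1, knots - 1))
}

strategies <- list(
  "5 predictors, all linear"              = c(0, 0, 0, 0, 0),
  "4 predictors, top 2 with RCS(3 knots)" = c(3, 3, 0, 0),
  "3 predictors, one with RCS(4 knots)"   = c(4, 0, 0)
)

df_by_strategy <- sapply(strategies, total_df)
print(df_by_strategy)   # 5, 6 and 5 df, matching the table
```

A categorical predictor with c levels would cost c - 1 df instead, which this sketch does not handle.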

Step 4: Handle missing data properly

imp <- aregImpute(~ x1 + x3 + x7 + x12 + x20 + y, data = dt,
                  n.impute = 10, nk = 3)

Step 5: Use penalized lrm or Bayesian blrm

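library(rms)
dd <- datadist(dt); options(datadist = "dd")
# Fit the same pre-specified model on each completed dataset and combine the fits
f <- fit.mult.impute(
  y ~ rcs(x3, 3) + rcs(x7, 3) + x1 + x12 + x20,
  lrm, imp, data = dt, x = TRUE, y = TRUE
)
# Search a grid of penalties, then refit with the selected penalty
p <- pentrace(f, penalty = c(0, 1, 2, 4, 8, 16, 32, 64, 128))
f.pen <- update(f, penalty = p$penalty)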

Step 6: Validate with the bootstrap

set.seed(17)
v <- validate(f.pen, B = 300)
cal <- calibrate(f.pen, B = 300)
plot(cal)
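
For readers who want to see what bootstrap validation is doing, here is a minimal base-R sketch of optimism correction for the c-index. It illustrates the general idea only and does not reproduce validate(). The data are simulated, glm() stands in for lrm(), and the names c_index and fit_fun are hypothetical.

```r
set.seed(17)
n <- 400
# Simulated data, purely illustrative (roughly 12-15% events)
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- rbinom(n, 1, plogis(-2 + 0.7 * d$x1 + 0.4 * d$x2))

# Concordance probability (c-index) via the rank-sum (Mann-Whitney) identity
c_index <- function(pred, y) {
  r  <- rank(pred)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Every modeling step must be repeated inside each bootstrap repetition
fit_fun <- function(data) glm(y ~ x1 + x2 + x3, family = binomial, data = data)

full     <- fit_fun(d)
apparent <- c_index(predict(full), d$y)

B <- 200
optimism <- replicate(B, {
  b     <- d[sample(n, replace = TRUE), ]
  fb    <- fit_fun(b)
  train <- c_index(predict(fb), b$y)                # bootstrap model on its own sample
  test  <- c_index(predict(fb, newdata = d), d$y)   # same model on the original data
  train - test
})

corrected <- apparent - mean(optimism)
round(c(apparent = apparent, optimism = mean(optimism), corrected = corrected), 3)
```

The corrected value estimates how the model would discriminate on new data from the same population, without holding out any observations.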

What Not to Do

Temptation                                Why to resist
Stepwise selection among 30 predictors    Catastrophic with ESS = 50; phantom df throughout
Machine learning                          ML needs ~200 total events; regression wins here. [Note: should be 200 EPV — see correction below.]
Data splitting for validation             Bootstrap is correct here
Categorizing continuous predictors        Discards information

Frank

First of all there is a minor error in what you wrote. ESS for binary Y is 3np(1-p) not min(events, non-events). Then later you wrote that machine learning may need 200 events per variable but we have 50. We don't have 50 events per variable. We have 50 events in total. Then there is a serious misread of the rms philosophy. The spearman2 approach can help us allocate more than one d.f. to certain predictors but it is definitely NOT to be used to identify 5-8 candidate predictors. Examining predictive promise of predictors is never to be used to drop candidate predictors entirely.


Claude

All three corrections are well taken. Let me address each:


1. ESS for binary Y

The correct formula is 3np(1−p), not min(events, non-events). With n = 400, p = 50/400 = 0.125:

ESS = 3 × 400 × 0.125 × 0.875 = 131.25

That's meaningfully larger than 50 and changes the budget calculation. With ~10–20 EPV, the df budget is roughly 7–13 df — still tight relative to 30 predictors, but less desperate than I implied.
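
The ESS arithmetic can be checked with a few lines of R. This is a small sketch of the formula quoted above; the function name ess_binary is hypothetical.

```r
# Effective sample size for a binary outcome: ESS = 3 * n * p * (1 - p),
# where p is the proportion of observations with Y = 1.
ess_binary <- function(n_events, n_total) {
  p <- n_events / n_total
  3 * n_total * p * (1 - p)
}

ess <- ess_binary(n_events = 50, n_total = 400)
ess                      # 131.25

# The formula is largest at p = 0.5, where it equals 0.75 * n
ess_binary(200, 400)     # 300
```

The second call shows why imbalance is costly: with the same n = 400 but a 50/50 split, the ESS would be 300 instead of about 131.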


2. EPV comparison with ML

The statement should have been: ML may need ~200 total events, not 200 EPV. With only 50 total events, the comparison is even more lopsided against ML — regression is the clear choice. [Note: this correction is itself still wrong — ML needs ~200 EPV, not 200 total events. See next correction.]


3. The role of spearman2 — the serious error

spearman2() is for allocating degrees of freedom among predictors already in the model — specifically, to decide which continuous predictors merit 4–5 knot splines versus 3 knots. It is emphatically not a tool for dropping predictors.

Examining predictive promise to eliminate candidates is a form of outcome-guided variable selection. It:

  • Consumes phantom degrees of freedom
  • Biases coefficient estimates for retained predictors
  • Produces overconfident inference
  • Is essentially a disguised form of stepwise selection

The correct approach is to include all pre-specified candidate predictors, spending df efficiently across all of them via linear terms for most predictors, 3-knot RCS for those with strong prior reason to expect nonlinearity, and penalization to shrink the full set appropriately.


Frank

There remains an error: ML may need 200 EPV not 200 events total. For regression we want to start with the rule of thumb that the number of free parameters to estimate (other than the intercept) should be below m/15 where m is the ESS. The sample size here is insufficient for solving for how much penalization to use. We need a different approach to reduce the number of free parameters.


Claude

You're right on both counts.


1. ML EPV correction

ML needs ~200 EPV (events per variable/parameter), not 200 total events. So the contrast with regression is: regression is competitive at ~10–15 EPV; ML needs ~200 EPV. With ESS ≈ 131 and 30 predictors, ML is completely out of the question.
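
To make the size of the gap concrete, the arithmetic below takes "events" to mean the 50 observed Y=1 cases and counts each candidate predictor as one variable. Both are simplifying assumptions for illustration.

```r
n_events     <- 50    # Y = 1 cases in this example
n_candidates <- 30    # candidate predictors, counted as one variable each (assumption)
epv_ml       <- 200   # rough EPV figure quoted above for machine learning

epv_available   <- n_events / n_candidates   # events per candidate variable
events_required <- epv_ml * n_candidates     # events needed at 200 EPV

round(epv_available, 2)   # 1.67
events_required           # 6000
```

Under these assumptions the data supply under 2 events per variable, against a requirement on the order of 6000 events.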


2. Parameter budget and the penalization problem

With ESS ≈ 131 and the rule of thumb of m/15, the budget for free parameters (excluding the intercept) is:

131 / 15 ≈ 9 parameters
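
The same budget in R, compared with the minimum cost of keeping every candidate in the model. The comparison assumes each of the 30 candidates is continuous or binary, so that a single linear term costs exactly 1 df.

```r
n <- 400
p <- 50 / 400
m <- 3 * n * p * (1 - p)     # ESS for binary Y: 131.25
budget <- m / 15             # 8.75, i.e. about 9 free parameters

n_candidates  <- 30
min_df_all_in <- n_candidates * 1    # every candidate as one linear term

c(budget = budget, min_df_all_in = min_df_all_in)
round(min_df_all_in / budget, 1)     # 3.4 times the budget
```

Even with no nonlinear terms at all, fitting all 30 candidates would need more than three times the m/15 budget, so the number of free parameters has to come down by some other route.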

The deeper problem: pentrace() estimates the penalty from the data, which itself requires sufficient information. With ESS ≈ 131 and 30 candidates, there isn't enough data to reliably solve for the penalty.

The correct approach is to reduce free parameters through pre-specification, not penalization:

  • Use subject-matter knowledge to assign each of the 30 predictors a complexity budget a priori — mostly linear terms (1 df), with 3-knot RCS (2 df) reserved for the small number of continuous predictors with strong prior expectation of nonlinearity
  • Use spearman2() only to decide among pre-included predictors which ones warrant the extra df for nonlinearity
  • Keep total df at or below ~9 across all 30 predictors, meaning the vast majority must enter as linear terms
  • No data-driven predictor elimination; all 30 stay in, but most contribute only 1 df

Frank

In this case there is not subject-matter knowledge to allow us to pre-specify 9 parameters to estimate. We need an entirely different approach.


Claude

You're right — without subject-matter knowledge to guide pre-specification, and with the sample size insufficient to estimate penalization reliably, neither of those approaches works. A fundamentally different strategy is needed.

The situation calls for data reduction that is completely outcome-blind — reducing the predictor space using only the X variables, with no reference to Y whatsoever. This avoids phantom degrees of freedom entirely. The key tools:


1. Redundancy analysis (redun() in Hmisc)

Identifies predictors that are nearly perfectly predictable from linear combinations of the others. Redundant predictors can be dropped without any reference to Y — they contribute no new information regardless of outcome.

r <- redun(~ x1 + x2 + ... + x30, data = dt, nk = 3)
r$Out   # predictors flagged as redundant

2. Variable clustering (varclus() in Hmisc)

Clusters predictors by their inter-correlations (Spearman-based). From each cluster, either select one representative predictor on subject-matter grounds, or form a cluster summary. Again entirely Y-blind.

vc <- varclus(~ x1 + x2 + ... + x30, data = dt)
plot(vc)
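
To show the idea without Hmisc, here is a base-R approximation of Spearman-based variable clustering on simulated predictors. It is a sketch of the general technique (hierarchical clustering on 1 minus squared Spearman correlation) and is not guaranteed to match varclus() output; all variable names and the simulated structure are made up for illustration.

```r
set.seed(1)
n <- 400
# Simulated predictors: x1-x3 share one latent factor, x4-x5 share another,
# and x6 is independent noise
z1 <- rnorm(n); z2 <- rnorm(n)
X <- data.frame(
  x1 = z1 + rnorm(n, sd = 0.3),
  x2 = z1 + rnorm(n, sd = 0.3),
  x3 = z1 + rnorm(n, sd = 0.3),
  x4 = z2 + rnorm(n, sd = 0.3),
  x5 = z2 + rnorm(n, sd = 0.3),
  x6 = rnorm(n)
)

# Similarity = squared Spearman correlation among the X's; Y is never used
rho2 <- cor(X, method = "spearman")^2
hc   <- hclust(as.dist(1 - rho2), method = "complete")
plot(hc, main = "Base-R sketch of variable clustering")

# Cut the tree into 3 groups and list the members of each
clusters <- cutree(hc, k = 3)
split(names(clusters), clusters)
```

Each resulting group could then be represented by a single predictor or a single summary score, which is how clustering reduces the parameter count.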

3. Sparse principal components or PCA on X

Reduce the 30 predictors to a smaller number of orthogonal components explaining most of the X-space variance, then use those components as predictors. Fully Y-blind.
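
A minimal sketch of the PCA route using only base R, with simulated data standing in for the 30 candidates. Here prcomp() performs ordinary (not sparse) PCA and glm() stands in for lrm(); the choice of k = 5 components is an assumption made for illustration, fixed without reference to y.

```r
set.seed(2)
n <- 400
# Simulated stand-in for 30 candidates: 5 latent factors, 6 predictors each
Z <- matrix(rnorm(n * 5), n, 5)
X <- Z[, rep(1:5, each = 6)] + matrix(rnorm(n * 30, sd = 0.5), n, 30)
colnames(X) <- paste0("x", 1:30)
# Simulated binary outcome with a low event rate
y <- rbinom(n, 1, plogis(-2.3 + 0.8 * Z[, 1] - 0.6 * Z[, 2]))

# Y-blind step: PCA on the standardized predictors
pc      <- prcomp(X, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
k <- 5                    # number of components, chosen without looking at y
round(cum_var[k], 2)      # share of X variance carried by the first k components

# Outcome model on k component scores: k df instead of 30
scores <- as.data.frame(pc$x[, 1:k])
fit    <- glm(y ~ ., data = data.frame(y = y, scores), family = binomial)
length(coef(fit)) - 1     # free parameters excluding the intercept
```

The trade-off is interpretability: the fitted coefficients refer to components, not to the original predictors.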


The combined workflow

  1. Run varclus() to understand correlation structure among the 30 predictors
  2. Run redun() to eliminate genuinely redundant predictors
  3. From remaining predictors, use spearman2() to guide df allocation (nonlinearity decisions only) among those retained after the Y-blind reduction
  4. Fit lrm() with the reduced, pre-specified set targeting ≤ 9 df
  5. Bootstrap validate with validate() and calibrate()

This keeps all outcome-guided decisions out of the variable reduction step, preserving the integrity of subsequent inference.


Frank

What is the best way to update the master rms skills files to take into account the corrections I've given you in this chat?


Claude

The skill file lives at /mnt/skills/user/rms/SKILL.md, which is mounted read-only — I can't edit it directly. You'd need to copy it to your working directory, make the edits, then replace the original yourself.

Here's what I'd suggest:

Option A — I draft the specific edits now: I identify the exact passages in SKILL.md that need updating and write the corrected replacement text for each. You then apply them yourself (copy the file from iCloud Drive, patch, replace). This is precise and reviewable.

Option B — I produce a corrected copy: I copy the skill file to /home/claude, apply all edits, and present the updated file for you to download and replace the original at its location in your skills GitHub repo.

The corrections to incorporate are:

  1. ESS for binary Y: Replace min(events, non-events) with 3np(1−p)
  2. ML EPV threshold: Clarify that ML needs ~200 EPV, not 200 total events
  3. spearman2() role: Sharpen the language to make explicit that it guides df allocation among already-included predictors only — never predictor elimination
  4. Outcome-blind data reduction: Add redun() and varclus() as the recommended approach when subject-matter knowledge is insufficient to pre-specify a parsimonious model — this is currently absent from the skill entirely
  5. Penalization caveat: Note that pentrace() itself requires sufficient ESS to estimate reliably, and is not a substitute for pre-specification when sample size is limited

Which option would you prefer?


Frank

Is there a permanent link to this chat?


Claude

I searched recent chats but the current conversation didn't appear in the results (the most recent chats shown are from earlier today on different topics). The current chat URL should be visible in your browser's address bar — that is the permanent link. Claude doesn't have a reliable way to identify the current conversation's own ID from within it.

A few suggestions:

  1. From claude.ai in a browser: Log in, find this conversation in the left sidebar, open it, and the URL in the address bar will be the permanent link in the form https://claude.ai/chat/{uuid}.
  2. From the Claude app: The "Copy URL" feature should produce a valid https://claude.ai/chat/{uuid} link. If it's not opening in a browser, it may be a bug — worth trying pasting it manually into Safari or Chrome rather than tapping it directly.
  3. Title search: This conversation has a fairly unique topic. If you need to relocate it later, searching for something like "spearman2 variable clustering rms skill" in your chat history should surface it quickly.

If the Copy URL bug persists, it's worth reporting via the feedback button in the app.


Frank

Format this entire chat into an html file I can download


Claude

Creating a formatted HTML transcript of this entire conversation for download.