Topics: binary outcome modeling · ESS · spearman2 · varclus · redun · rms skill corrections
Date: May 2026
Frank
I want to test locally installed skills. Which local skills are available and are used automatically as needed?
Claude
Based on the available skills listed in my configuration, here are all the skills available — both public and user-installed:
Public Skills
User-Installed Skills
rms R package, multivariable regression modeling, Hmisc, rmsb, and Frank Harrell’s modeling philosophy
How they work: They’re used automatically. When a request matches a skill’s trigger description, the corresponding SKILL.md is read before writing any code or producing any file. The rms skill fires on any question about rms, lrm, ols, cph, restricted cubic splines, calibration, validation, nomograms, etc.
Frank
I have 30 candidate predictors, not all of which are suspected to act linearly, and a binary outcome. The overall sample size is 400 and the distribution of Y is 350 Y=0 and 50 Y=1. What is a recommended strategy for developing a reliable predictive model using the 30 candidate predictors and the binary Y?
Claude
This is a highly constrained problem, and the constraints dominate every decision.
Effective sample size (ESS): For binary Y, ESS = min(events, non-events) = 50 events. (Note: this ESS formula is incorrect — see correction below.)
Budget: With ESS = 50 and a rule of thumb of 10–20 events per parameter (EPV), the entire model budget is roughly 3–5 degrees of freedom. With 30 candidate predictors, this mismatch is severe.
Step 1: Pre-specify everything before examining Y
Any data-driven decisions made after looking at Y consume phantom degrees of freedom. All decisions must be made before any Y-outcome analysis.
Step 2: Rank predictors using subject-matter knowledge and Y-free statistics
Use spearman2() to rank predictors by their marginal association with Y via generalized Spearman ρ². (Note: the suggested use for predictor elimination here is incorrect; see correction below.)
    library(Hmisc)
    s2 <- spearman2(y ~ x1 + x2 + ... + x30, data = dt, p = 2)
    plot(s2)
Step 3: Pre-specify a severely constrained model
| Strategy | Total df | Notes |
|---|---|---|
| 5 predictors, all linear | 5 df | ~10 EPV; adequate but assumes linearity |
| 4 predictors, top 2 with RCS(3 knots) | 6 df | ~8 EPV; tight but defensible |
| 3 predictors, one with RCS(4 knots) | 5 df | Only if one predictor clearly dominates |
Step 4: Handle missing data properly
    imp <- aregImpute(~ x1 + x3 + x7 + x12 + x20 + y, data = dt,
                      n.impute = 10, nk = 3)
Step 5: Use penalized lrm or Bayesian blrm
    library(rms)  # provides lrm, rcs, datadist, pentrace, validate, calibrate
    dd <- datadist(dt); options(datadist = "dd")
    f <- fit.mult.impute(
      y ~ rcs(x3, 3) + rcs(x7, 3) + x1 + x12 + x20,
      lrm, imp, data = dt, x = TRUE, y = TRUE
    )
    p <- pentrace(f, penalty = c(0, 1, 2, 4, 8, 16, 32, 64, 128))
    f.pen <- update(f, penalty = p$penalty)
Step 6: Validate with the bootstrap
    set.seed(17)
    v <- validate(f.pen, B = 300)
    cal <- calibrate(f.pen, B = 300)
    plot(cal)
| Temptation | Why to resist |
|---|---|
| Stepwise selection among 30 predictors | Catastrophic; phantom df throughout |
| Machine learning | ML needs ~200 total events; regression wins. (Note: should be 200 EPV — see correction below.) |
| Data splitting for validation | Bootstrap is correct here |
| Categorizing continuous predictors | Discards information |
Frank
First of all there is a minor error in what you wrote. ESS for binary Y is 3np(1-p) not min(events, non-events). Then later you wrote that machine learning may need 200 events per variable but we have 50. We don’t have 50 events per variable. We have 50 events in total. Then there is a serious misread of the rms philosophy. The spearman2 approach can help us allocate more than one d.f. to certain predictors but it is definitely NOT to be used to identify 5-8 candidate predictors. Examining predictive promise of predictors is never to be used to drop candidate predictors entirely.
Claude
All three corrections are well taken.
The correct formula is 3np(1−p), not min(events, non-events). With n = 400, p = 50/400 = 0.125:
ESS = 3 × 400 × 0.125 × 0.875 = 131.25
That’s meaningfully larger than 50 and changes the budget calculation. With ~10–20 EPV, the df budget is roughly 7–13 df — still tight relative to 30 predictors, but less desperate than implied.
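The arithmetic above can be checked with a short R sketch. The helper name `ess_binary` is made up for illustration and is not an Hmisc or rms function; the formula 3np(1−p) and the counts are the ones given in this exchange.

```r
# Effective sample size for a binary outcome: 3 * n * p * (1 - p),
# where p is the proportion of events.
# ess_binary is a hypothetical helper name, not part of Hmisc or rms.
ess_binary <- function(events, nonevents) {
  n <- events + nonevents
  p <- events / n
  3 * n * p * (1 - p)
}

ess <- ess_binary(events = 50, nonevents = 350)
ess                # 131.25
min(50, 350)       # 50, the figure used earlier by mistake

# Parameter budget at 20 and at 10 ESS units per parameter
ess / c(20, 10)    # 6.5625 13.1250
```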
The statement should have been: ML may need ~200 total events, not 200 EPV. (Note: this correction is itself still wrong — ML needs ~200 EPV, not 200 total events. See next turn.)
spearman2: the serious error
spearman2() is for allocating degrees of freedom among predictors already in the model, specifically to decide which continuous predictors merit 4–5 knot splines versus 3 knots. It is emphatically not a tool for dropping predictors.
Examining predictive promise to eliminate candidates is a form of outcome-guided variable selection.
The correct approach is to include all pre-specified candidate predictors, spending df efficiently across all of them via linear terms for most, 3-knot RCS for those with strong prior reason to expect nonlinearity, and penalization to shrink the full set appropriately.
Frank
There remains an error: ML may need 200 EPV not 200 events total. For regression we want to start with the rule of thumb that the number of free parameters to estimate (other than the intercept) should be below m/15 where m is the ESS. The sample size here is insufficient for solving for how much penalization to use. We need a different approach to reduce the number of free parameters.
Claude
You’re right on both counts.
ML needs ~200 EPV (events per variable/parameter), not 200 total events. The contrast with regression is: regression is competitive at ~10–15 EPV; ML needs ~200 EPV. With ESS ≈ 131 and 30 predictors, ML is completely out of the question.
With ESS ≈ 131 and the rule of thumb of m/15, the budget for free parameters (excluding the intercept) is:
131 / 15 ≈ 9 parameters
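To see how quickly that budget is used up, here is a rough df tally in R. It assumes the usual convention that rcs(x, k) with k knots spends k − 1 parameters and that a linear continuous term spends 1. The helper `spec_df` and the two example specifications are illustrative only.

```r
m <- 3 * 400 * 0.125 * 0.875   # ESS for n = 400, p = 0.125
budget <- m / 15               # rule-of-thumb ceiling on free parameters
budget                         # 8.75

# Hypothetical helper: total parameters for a set of continuous predictors,
# given the knot count for each (0 = linear term, k >= 3 = rcs with k knots).
spec_df <- function(knots) sum(ifelse(knots == 0, 1, knots - 1))

all_linear <- rep(0, 30)          # all 30 candidates entered linearly
spec_df(all_linear)               # 30, far above the budget

small_model <- c(3, 3, 0, 0, 0)   # two 3-knot splines plus three linear terms
spec_df(small_model)              # 7, within the budget
```

Even the all-linear specification overshoots the budget more than threefold, which is why some reduction of the candidate set is unavoidable here.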
The deeper problem: pentrace() estimates the penalty from the data, which itself requires sufficient information. With ESS ≈ 131 and 30 candidates, there isn’t enough data to reliably solve for the penalty; you’d be spending precious information just to determine how much to shrink.
The correct approach is to reduce free parameters through pre-specification, not penalization:
Use spearman2() only to decide, among pre-included predictors, which ones warrant the extra df for nonlinearity
Frank
In this case there is not subject-matter knowledge to allow us to pre-specify 9 parameters to estimate. We need an entirely different approach.
Claude
You’re right — without subject-matter knowledge to guide pre-specification, and with the sample size insufficient to estimate penalization reliably, neither approach works. A fundamentally different strategy is needed.
The situation calls for data reduction that is completely outcome-blind — reducing the predictor space using only the X variables, with no reference to Y whatsoever. This avoids phantom degrees of freedom entirely.
Redundancy analysis (redun() in Hmisc)
Identifies predictors that are nearly perfectly predictable from the others; redun() fits flexible additive models for this, here with 3-knot spline expansions. Redundant predictors can be dropped without any reference to Y.
    r <- redun(~ x1 + x2 + ... + x30, data = dt, nk = 3)
    r$Out   # predictors flagged as redundant
Variable clustering (varclus() in Hmisc)
Clusters predictors by their inter-correlations (Spearman-based). From each cluster, select one representative predictor on subject-matter grounds, or form a cluster summary. Entirely Y-blind.
    vc <- varclus(~ x1 + x2 + ... + x30, data = dt)
    plot(vc)
Principal components
Reduce the 30 predictors to a smaller number of orthogonal components explaining most of the X-space variance, then use those components as predictors. Fully Y-blind.
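A minimal sketch of the principal-components route in base R, run on simulated data because the actual predictors are not shown here. The data frame `X`, the 80% variance threshold, and the choice of prcomp() are assumptions for illustration; Y is never touched.

```r
set.seed(1)
n <- 400

# Simulated stand-in for the 30 predictors:
# 5 latent factors, each measured by 6 noisy indicators
z <- matrix(rnorm(n * 5), n, 5)
X <- as.data.frame(z[, rep(1:5, each = 6)] +
                   matrix(rnorm(n * 30, sd = 0.5), n, 30))
names(X) <- paste0("x", 1:30)

# PCA on standardized predictors; the outcome is not used anywhere
pc <- prcomp(X, center = TRUE, scale. = TRUE)
cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# Keep the fewest components reaching an (arbitrary) 80% of X-space variance
k <- which(cumvar >= 0.80)[1]
scores <- as.data.frame(pc$x[, seq_len(k), drop = FALSE])

k             # number of components retained
dim(scores)   # n rows, k columns: candidate predictors for the outcome model
```

The retained component scores would then enter the outcome model in place of the raw predictors, provided k stays within the parameter budget.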
1. varclus() to understand correlation structure among the 30 predictors
2. redun() to eliminate genuinely redundant predictors
3. spearman2() to guide df allocation (nonlinearity decisions only) among those retained after the Y-blind reduction
4. lrm() with the reduced, pre-specified set targeting ≤ 9 df
5. validate() and calibrate()
This keeps all outcome-guided decisions out of the variable reduction step, preserving the integrity of subsequent inference.
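As a rough end-to-end illustration of this workflow, the following R sketch runs the Y-blind steps on simulated data and then fits and validates a small lrm. Everything about the data is an assumption made for the example: the 13 simulated predictors, the cluster count of 4, and the shortcut of taking the first member of each cluster as its representative. In a real analysis the representative would be chosen on subject-matter grounds or replaced by a cluster summary. The spearman2() step for allocating nonlinear df is omitted here.

```r
library(Hmisc)   # varclus, redun
library(rms)     # lrm, validate

set.seed(2)
n <- 400

# Simulated stand-in data: 4 latent factors with 3 noisy indicators each,
# plus x13, which is almost exactly x1 + x4.
z <- matrix(rnorm(n * 4), n, 4)
X <- as.data.frame(z[, rep(1:4, each = 3)] +
                   matrix(rnorm(n * 12, sd = 0.6), n, 12))
names(X) <- paste0("x", 1:12)
X$x13 <- X$x1 + X$x4 + rnorm(n, sd = 0.05)
y <- rbinom(n, 1, plogis(-2.3 + 0.8 * z[, 1] - 0.6 * z[, 3]))

# 1. Correlation structure among all predictors (Y-blind)
vc <- varclus(as.matrix(X), similarity = "spearman")

# 2. Redundancy analysis (Y-blind); r$In holds the retained variable names
r <- redun(reformulate(names(X)), data = X, nk = 3)
kept <- r$In

# One representative per cluster among the retained predictors.
# k = 4 is known only because the data are simulated; taking the first
# member of each cluster is a placeholder for a subject-matter choice.
vc2 <- varclus(as.matrix(X[, kept]), similarity = "spearman")
cl <- cutree(vc2$hclust, k = 4)
reps <- names(cl)[!duplicated(cl)]

# 3. Fit the reduced model: 4 linear terms, within the ~9-parameter budget
d <- cbind(X, y = y)
form <- reformulate(reps, response = "y")
f <- lrm(form, data = d, x = TRUE, y = TRUE)

# 4. Bootstrap validation of the reduced model
v <- validate(f, B = 100)
```

Because the variables to keep were chosen without looking at y, the bootstrap in the last step only has to account for the fitting of the 4-parameter model.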