Challenges of High-Dimensional Data Analysis

Frank E Harrell Jr

Department of Biostatistics
Vanderbilt University School of Medicine
Nashville Tennessee USA

Department of Biomedical Informatics Grand Rounds
2025-09-10

This talk will discuss challenges and hazards in learning from data, especially when analyzing a large number of patient features. Points to be emphasized include sample size requirements for stable and reliable results, the hopelessness of feature selection, the importance of computing uncertainties about feature importance, the importance of attempting to answer only those questions for which the sample size is adequate, and how the latter relates to unsupervised learning (data reduction).

To Be Able to Do Feature Selection You Must Be Able to Estimate Feature Importance

Side Notes

  • Even if feature importance were estimable, estimating it doesn’t make ML interpretable
  • False discovery rates pretend that false negative rates don’t exist
  • Using SMOTE requires a very deep misunderstanding of statistics

Don’t Trust Any Method That

  • Doesn’t have a sample size formula or simulation
  • Doesn’t produce an unbiasedly-estimated calibration curve that is close to the line of identity
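
As a concrete illustration, here is a minimal sketch of a smooth (loess-type) calibration check of predicted risks against binary outcomes, on simulated stand-in data. The variable names and the use of statsmodels’ lowess smoother are assumptions, not part of the talk; an unbiased version would also apply a resampling correction such as the bootstrap sketch near the end.

```python
# Minimal sketch: smoothed calibration curve for predicted risks vs. binary outcomes.
# All data below are simulated stand-ins; names are illustrative only.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, 2000)                            # simulated true risks
y = rng.binomial(1, true_p)                                       # binary outcomes
p_hat = np.clip(true_p + rng.normal(0, 0.05, 2000), 0.01, 0.99)   # stand-in predicted risks

# Loess-type estimate of P(Y = 1 | predicted risk); a well-calibrated model
# tracks the 45-degree line of identity.
smoothed = lowess(y, p_hat, frac=0.4)
for q in (0.1, 0.25, 0.5, 0.75, 0.9):
    x = np.quantile(p_hat, q)
    cal = np.interp(x, smoothed[:, 0], smoothed[:, 1])
    print(f"predicted {x:.2f}   smoothed observed {cal:.2f}")
```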

If \(n\) Is Too Small For …

  • \(p = 1\) it is too small for \(p > 1\) (\(p\) = number of candidate features)
  • handling \(p\) features without penalization it is usually too small for determining a good penalty factor (shrinkage; regularization)
  • estimating predictive performance it is too small for developing a prediction model or forced-choice classifier
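
A rough sketch of the \(p = 1\) case: even a single binary feature’s odds ratio carries enormous uncertainty at small \(n\), which is why adding more candidate features cannot help. The data-generating settings below are illustrative assumptions.

```python
# Sketch (assumed setup, not from the talk): uncertainty in a single odds ratio
# as a function of n, estimated from a 2x2 table with a continuity correction.
import numpy as np

rng = np.random.default_rng(2)
true_or = 2.0
for n in (30, 100, 1000, 10000):
    x = rng.binomial(1, 0.5, n)
    p = 1 / (1 + np.exp(-(-0.5 + np.log(true_or) * x)))   # logit(P) = -0.5 + log(OR) * x
    y = rng.binomial(1, p)
    a = np.sum((x == 1) & (y == 1)) + 0.5
    b = np.sum((x == 1) & (y == 0)) + 0.5
    c = np.sum((x == 0) & (y == 1)) + 0.5
    d = np.sum((x == 0) & (y == 0)) + 0.5
    log_or = np.log(a * d / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo, hi = np.exp(log_or - 1.96 * se), np.exp(log_or + 1.96 * se)
    print(f"n={n:6d}  OR={np.exp(log_or):5.2f}  95% CI [{lo:5.2f}, {hi:5.2f}]")
```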

Required Sample Size as a Function of \(p\)

  • Binary \(Y\), binary candidate features
  • Association between \(X\) and \(Y\) quantified by odds ratio (OR)
  • When \(n < \frac{p}{2}\) there is almost no relationship between true and estimated ORs
  • See this
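
The following is a scaled-down, hedged sketch of that kind of simulation: \(p\) binary candidate features, binary \(Y\), and marginal (one-at-a-time) odds ratio estimates, with the correlation between true and estimated log ORs printed for \(n = p/2\) and \(n = 15p\). The effect sizes and the use of marginal estimation are assumptions, not the original simulation's settings.

```python
# Sketch: how well do estimated per-feature odds ratios track the true ones
# when n is much smaller vs. much larger than p? (Settings are assumptions.)
import numpy as np

rng = np.random.default_rng(3)
p = 200
beta = rng.normal(0, 0.15, p)                 # true per-feature log odds ratios
for n in (p // 2, 15 * p):
    X = rng.binomial(1, 0.5, (n, p))
    logit = -0.5 + X @ beta
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
    est = np.empty(p)
    for j in range(p):                        # marginal 2x2-table log OR per feature
        a = np.sum((X[:, j] == 1) & (y == 1)) + 0.5
        b = np.sum((X[:, j] == 1) & (y == 0)) + 0.5
        c = np.sum((X[:, j] == 0) & (y == 1)) + 0.5
        d = np.sum((X[:, j] == 0) & (y == 0)) + 0.5
        est[j] = np.log(a * d / (b * c))
    r = np.corrcoef(beta, est)[0, 1]
    print(f"n = {n:5d}, p = {p}: corr(true, estimated log OR) = {r:.2f}")
```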

What To Do If \(n\) is Too Low for \(p\)

  • If \(n > 15p\) (very roughly) the usual analyses are likely to be reliable
  • If \(n < p\) things are pretty hopeless unless you’re in a high signal:noise ratio setting such as visual pattern recognition
  • Recourse: compute the effective sample size, then estimate the complexity of the question that this sample size will allow you to ask
  • Ask a simpler question than your original one
    • aggregate data to a coarser level
    • use data reduction (unsupervised learning) to find themes in the predictors, which are then related to outcomes (see the sketch after this list)
      • principal components analysis (PCA)
      • variable clustering followed by PCA - see this and this
      • sparse PCA
      • machine learning autoencoders
  • Screen candidate features
    • remove genetic variants with low minor allele frequencies
    • keep only genes having bimodal expression distributions (because they are thought to represent a mixture of good and bad ultimate outcomes; idea of Baggerly & Coombes)
    • redundancy analysis
  • Bayesian recourses
  • Avoid machine learning
    • machine learning other than autoencoders will not help
    • reason: ML requires much higher sample sizes than statistical models
      • ML does not capitalize on additivity assumptions
      • allows for all possible interactions
      • interaction effects are very hard to estimate
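
Below is a minimal sketch of the data-reduction route: plain PCA followed by an outcome model on a few component scores. Variable clustering, sparse PCA, or an autoencoder would slot into the same place. The data, dimensions, and the choice of 10 components are illustrative assumptions.

```python
# Sketch of data reduction (unsupervised learning) before modeling: summarize
# the predictors without looking at y, then spend the limited effective sample
# size on a few component scores. Data and settings are stand-ins.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p = 300, 1000                        # n << p
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, :5].sum(axis=1))))

Z = StandardScaler().fit_transform(X)
k = 10                                  # ask a simpler question: 10 themes, not 1000 features
scores = PCA(n_components=k, random_state=0).fit_transform(Z)
model = LogisticRegression(max_iter=1000).fit(scores, y)
print("PC scores used in the outcome model:", scores.shape)
```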

Feature Selection

  • Has a very low chance of finding most of the truly important features
  • Has a very low chance of not finding irrelevant features
  • lasso example (link at bottom of page)
    • \(n = p = 500\), binary \(X\)’s and \(Y\)
    • Sample \(\beta\)’s from a Laplace distribution (optimal for lasso)
    • 2000 simulations
    • For each true \(\beta\) compute fraction of 2000 for which that feature was selected
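
A scaled-down sketch of that simulation (smaller \(n\), \(p\), and number of repeats so that it runs quickly; effect sizes and tuning details are assumptions) is:

```python
# Scaled-down sketch of the lasso selection simulation described above.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(5)
n = p = 100                      # scaled down from n = p = 500
n_sim = 100                      # scaled down from 2000 simulations
beta = rng.laplace(0, 0.3, p)    # true log odds ratios drawn from a Laplace distribution
selected = np.zeros(p)

for _ in range(n_sim):
    X = rng.binomial(1, 0.5, (n, p)).astype(float)
    eta = X @ beta - 0.5 * beta.sum()              # roughly centered linear predictor
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))
    fit = LogisticRegressionCV(penalty="l1", solver="liblinear",
                               Cs=10, cv=5, max_iter=2000).fit(X, y)
    selected += fit.coef_.ravel() != 0

# Fraction of simulations in which each feature entered the model, shown for
# the five features with the largest true effects.
for j in np.argsort(-np.abs(beta))[:5]:
    print(f"|beta| = {abs(beta[j]):.2f}: selected in {selected[j] / n_sim:.0%} of runs")
```

The printed selection fractions typically show the pattern described above: strong features are often missed while irrelevant ones are often selected.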

It Gets Worse

  • Results shown here apply to unbiased data with
    • No missings
    • No measurement error
    • Strong study design including randomization of sample processing order
  • Need to strongly adjust for readily available data
    • E.g. flexible nonlinear adjustment for risk factors, extent of disease, age, symptoms, clinical chemistry, and hematology variables when examining the predictive information of imaging or molecular data
  • Need to quit making linearity assumptions of the kind the lasso makes

Accuracy Measures

  • Don’t use measures such as sensitivity, specificity, and ROC curves that are designed for retrospective case-control studies
  • Don’t use measures such as precision, recall, and ROC curves that invite the use of arbitrary cutoffs on markers or predicted risks
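
As one hedged illustration of cutoff-free assessment, the sketch below scores predicted risks directly with proper accuracy measures (Brier score and log loss) and contrasts them with a threshold-based sensitivity that cannot distinguish full risk predictions from dichotomized ones. The specific measures and data are assumptions, not prescribed by the slide.

```python
# Sketch: proper accuracy scores applied directly to predicted risks, vs. a
# cutoff-based measure. Data are simulated stand-ins.
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

rng = np.random.default_rng(7)
true_p = rng.uniform(0.02, 0.98, 1000)
y = rng.binomial(1, true_p)
p_full = np.clip(true_p + rng.normal(0, 0.05, 1000), 0.001, 0.999)   # full risk predictions
p_cut = np.where(p_full > 0.5, 0.9, 0.1)                             # risks collapsed by a cutoff

for name, pred in [("full predicted risks", p_full), ("cutoff-collapsed risks", p_cut)]:
    sens = np.mean(pred[y == 1] > 0.5)       # a threshold-based measure, for contrast
    print(f"{name}: Brier {brier_score_loss(y, pred):.3f}, "
          f"log loss {log_loss(y, pred):.3f}, sensitivity at 0.5 {sens:.2f}")
```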

What I’d Like to See Reported

  • “We analyzed a 100,000 patient sample from the EHR. The effective sample size is estimated to be 23,000 accounting for missing values and measurement errors.”
  • Figure with uncertainty intervals for feature importance
  • “To select important genetic variants from the 500,000 candidates our criterion was a Bayesian posterior probability of OR outside \([0.9, \frac{1}{0.9}]\) exceeding 0.9, and to conclude a feature is unimportant the probability must be < 0.1. Unfortunately 470,000 variants had a probability in (0.1, 0.9) so we concluded that the data lacked information needed to select variants.”
  • Avoidance of binary decisions, e.g. feature selection
  • No use of false discovery probabilities unaccompanied by false negative probabilities
  • “Feature selection found 50 important proteins. We fitted a non-parsimonious penalized model on all the 2000 non-selected proteins and obtained higher \(R^2\) with the ‘losers’. So we decided to abandon the 50-protein parsimonious result.”
  • 100 repeats of 10-fold CV or several hundred bootstrap resamples to estimate overfitting-corrected model performance (see the sketch after this list)
  • No use of temporal or geographic validation
  • More modeling driven by medical/biological knowledge
  • Recognition that
    • more complex questions need larger sample sizes and less biased data
    • unstructured analysis of \(p > \frac{n}{2}\) features is nearly futile
    • unstructured analysis of \(p > \frac{n}{15} \rightarrow\) overstated effects
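
For the resampling item above, here is a minimal sketch of bootstrap optimism correction for one performance index (the c-index), on simulated stand-in data; the model, index, and number of resamples are illustrative assumptions.

```python
# Sketch of bootstrap optimism correction for the c-index (AUROC):
# corrected = apparent - mean(resample performance - performance of the
# resample-fitted model on the original data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
n, p = 200, 20
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

def cindex(model, X, y):
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

apparent = cindex(LogisticRegression(max_iter=1000).fit(X, y), X, y)

B = 300                                     # several hundred resamples, per the slide
optimism = []
for _ in range(B):
    idx = rng.integers(0, n, n)             # bootstrap resample of the rows
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    optimism.append(cindex(m, X[idx], y[idx]) - cindex(m, X, y))

print(f"apparent c-index {apparent:.3f}, "
      f"overfitting-corrected {apparent - np.mean(optimism):.3f}")
```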

Resources