Challenges of High-Dimensional Data Analysis
Department of Biomedical Informatics Grand Rounds
2025-09-10
Abstract
This talk will discuss challenges and hazards in learning from data, especially when analyzing a large number of patient features. Among the points emphasized are sample size requirements for stable and reliable results, the hopelessness of feature selection, the importance of computing uncertainties about feature importance, the importance of attempting to answer only those questions for which the sample size is adequate, and how the latter relates to unsupervised learning (data reduction).
To Be Able to Do Feature Selection You Must Be Able to Estimate Feature Importance
- Sample size required to estimate a single correlation coefficient well is \(n=400\)
- Don’t believe feature importance when unaccompanied by uncertainty intervals
- Don’t believe feature importance when accompanied by uncertainty intervals
- Because most uncertainty intervals are too wide to learn from
- This just reflects the difficulty of the task, especially if features are correlated
- Related to ridge regression outperforming lasso
- Low-dimensional relative explained variation example
- Low-dimensional importance ranks example
- High-dimensional importance ranks example (a simulation sketch of rank uncertainty follows this list)
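The \(n = 400\) figure is in line with Fisher's \(z\): a 0.95 confidence interval half-width of 0.1 near \(\rho = 0\) requires roughly \((1.96/0.1)^2 + 3 \approx 387\) observations. Below is a minimal simulation sketch, not taken from the talk, of why importance ranks need uncertainty intervals: it bootstraps the rank of each feature's absolute correlation with \(Y\) and reports 0.95 intervals for those ranks. All settings (20 features, \(n = 300\), correlation as the importance measure) are illustrative.

```python
# Minimal sketch (illustrative, not from the talk): bootstrap 0.95 intervals
# for the rank of each feature's importance, measured here by |corr(X_j, Y)|.
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 20                       # modest n, low-dimensional p
beta = np.linspace(0, 0.5, p)        # true effects range from none to moderate
X = rng.standard_normal((n, p))
y = X @ beta + rng.standard_normal(n)

def importance_ranks(X, y):
    """Rank features by |correlation with y|; rank p = most important."""
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return r.argsort().argsort() + 1

apparent = importance_ranks(X, y)
B = 1000
ranks = np.empty((B, p), dtype=int)
for b in range(B):
    i = rng.integers(0, n, n)        # bootstrap resample of rows
    ranks[b] = importance_ranks(X[i], y[i])

lo, hi = np.percentile(ranks, [2.5, 97.5], axis=0)
for j in range(p):
    print(f"feature {j:2d}: apparent rank {apparent[j]:2d}, "
          f"0.95 interval [{lo[j]:.0f}, {hi[j]:.0f}]")
```

In runs of this sketch the intervals for mid-ranked features tend to span a large share of the possible ranks, even in this friendly low-dimensional setting; the high-dimensional case is far worse.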
Side Notes
- Even if feature importance were estimable, estimating it doesn’t make ML interpretable
- False discovery rates pretend that false negative rates don’t exist
- Using SMOTE requires a very deep misunderstanding of statistics
Don’t Trust Any Method That
- Doesn’t have a sample size formula or simulation
- Doesn’t produce an unbiasedly-estimated calibration curve that is close to the line of identity (a minimal simulation sketch follows)
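As a sketch of what satisfying both bullets could look like (all assumptions here are mine: 10 standard-normal features, a plain logistic model), one can pick the sample size by simulation, judging adequacy by the expected out-of-sample calibration slope; a slope near 1 means the calibration curve will hug the line of identity.

```python
# Minimal sketch (assumptions mine): choose n by simulation, judging adequacy
# by the out-of-sample calibration slope of an (essentially) unpenalized
# logistic model.  A slope well below 1 signals overfitting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
p = 10                                    # hypothetical number of features
beta = rng.normal(0, 0.5, p)              # assumed true log odds ratios

def simulate(n):
    X = rng.standard_normal((n, p))
    return X, rng.binomial(1, 1 / (1 + np.exp(-(X @ beta))))

def mean_calibration_slope(n, n_test=10000, reps=20):
    slopes = []
    for _ in range(reps):
        Xtr, ytr = simulate(n)
        Xte, yte = simulate(n_test)
        fit = LogisticRegression(C=1e6, max_iter=1000).fit(Xtr, ytr)
        lp = Xte @ fit.coef_.ravel() + fit.intercept_[0]       # linear predictor
        # logistic refit of the test outcome on lp; its slope is the calibration slope
        recal = LogisticRegression(C=1e6, max_iter=1000).fit(lp.reshape(-1, 1), yte)
        slopes.append(recal.coef_[0, 0])
    return float(np.mean(slopes))

for n in (100, 200, 400, 800):
    print(f"n={n:4d}  mean out-of-sample calibration slope ~ {mean_calibration_slope(n):.2f}")
```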
If \(n\) Is Too Small For …
- \(p = 1\) it is too small for \(p > 1\) (\(p\) = number of candidate features)
- handling \(p\) features without penalization it is usually too small for determining a good penalty factor (shrinkage; regularization)
- estimating predictive performance it is too small for developing a prediction model or forced-choice classifier
Required Sample Size as a Function of \(p\)
- Binary \(Y\), binary candidate features
- Association between \(X\) and \(Y\) quantified by odds ratio (OR)
- When \(n < \frac{p}{2}\) there is almost no relationship between true and estimated ORs (a rough simulation sketch follows this list)
- See this
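A rough sketch in the spirit of the simulation referenced above; the data-generating details (binary features with prevalence 0.1, true log odds ratios drawn from a normal distribution, univariable odds-ratio estimates from 2×2 tables) are my assumptions, not the talk's. The printed correlation between true and estimated log ORs deteriorates as \(n\) shrinks toward \(\frac{p}{2}\).

```python
# Rough sketch (my assumptions): how well do estimated odds ratios track the
# true ones as n shrinks relative to p?  Binary Y, binary candidate features.
import numpy as np

rng = np.random.default_rng(3)

def true_vs_estimated(n, p):
    true_logor = rng.normal(0, 0.3, p)                # true per-feature log ORs
    X = rng.binomial(1, 0.1, (n, p))                  # low-prevalence binary features
    y = rng.binomial(1, 1 / (1 + np.exp(-(X - 0.1) @ true_logor)))
    est = np.empty(p)
    for j in range(p):
        # univariable 2x2-table log OR with a 0.5 continuity correction
        a = np.sum((X[:, j] == 1) & (y == 1)) + 0.5
        b = np.sum((X[:, j] == 1) & (y == 0)) + 0.5
        c = np.sum((X[:, j] == 0) & (y == 1)) + 0.5
        d = np.sum((X[:, j] == 0) & (y == 0)) + 0.5
        est[j] = np.log(a * d / (b * c))
    return np.corrcoef(true_logor, est)[0, 1]

p = 100
for n in (5000, 500, 50):                             # n = 50 is p/2
    print(f"n={n:5d} p={p}: corr(true, estimated log OR) = {true_vs_estimated(n, p):.2f}")
```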
What To Do If \(n\) is Too Low for \(p\)
- If \(n > 15p\) (very roughly) the usual analyses are likely to be reliable
- If \(n < p\) things are pretty hopeless unless you’re in a high signal:noise ratio setting such as visual pattern recognition
- Recourse: compute effective sample size then estimate the complexity of the question that this sample size will allow you to ask
- Ask a simpler question than your original one
- aggregate data to a coarser level
- analyze activation in 12 brain regions instead of 10,000 voxels in fMRI
- impose genetic pathways, project high-dimensional genetic variants onto lower-dimensional gene expressions, or use methods with a small number of hyperparameters
- use data reduction (unsupervised learning) to find themes in the predictors to then use against outcomes (see the sketch after this list)
- principal components analysis (PCA)
- variable clustering followed by PCA - see this and this
- sparse PCA
- machine learning autoencoders
- Screen candidate features
- remove genetic variants with low minor allele frequencies
- keep only genes having bimodal expression distributions (because they are thought to represent a mixture of good and bad ultimate outcomes; idea of Baggerly & Coombes)
- redundancy analysis
- Bayesian recourses
- carefully specify a prior distribution such as the horseshoe prior that models the family of effects of all candidate features
- put a prior on the overall \(R^2\) and let that filter down to put priors on all the features’ parameters
- Avoid machine learning
- machine learning other than autoencoders will not help
- reason: ML requires much higher sample sizes than statistical models
- ML does not capitalize on additivity assumptions
- allows for all possible interactions
- interaction effects are very hard to estimate
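A minimal sketch of the data-reduction recourse (my construction; the latent-theme data generation and feature counts are illustrative): summarize many correlated predictors with a few principal components computed without looking at \(Y\), then estimate only a handful of parameters against the outcome. Variable clustering followed by PCA, or sparse PCA, would replace the single PCA step, but the pattern of reducing first and modeling second is the same.

```python
# Minimal data-reduction sketch (my construction): PCA finds "themes" in the
# predictors without using Y, then only a few parameters face the outcome.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n, p, k = 200, 1000, 5                     # n much smaller than p; keep k components

# correlated predictors driven by a few latent themes
latent = rng.standard_normal((n, k))
X = latent @ rng.standard_normal((k, p)) + rng.standard_normal((n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-latent[:, 0])))   # outcome tied to one theme

# unsupervised step: the reduction never looks at y, so it does not overfit y
Z = PCA(n_components=k).fit_transform(StandardScaler().fit_transform(X))

# supervised step: only k + 1 parameters are estimated against the outcome
fit = LogisticRegression(max_iter=1000).fit(Z, y)
print("parameters estimated against Y:", fit.coef_.size + 1)   # 6, not 1001
```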
Feature Selection
- Has a very low chance of finding most of the truly important features
- Has a very low chance of not finding irrelevant features
- lasso example (link at bottom of page; a scaled-down simulation sketch follows this list)
- \(n = p = 500\), binary \(X\)’s and \(Y\)
- Sample \(\beta\)’s from a Laplace distribution (optimal for lasso)
- 2000 simulations
- For each true \(\beta\) compute fraction of 2000 for which that feature was selected
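A scaled-down sketch of the simulation just described, shrunk to 100 features and 100 simulations so it runs quickly; the broad design follows the bullets, but the specific settings (feature prevalence 0.5, Laplace scale 0.3, a cross-validated penalty) are my choices.

```python
# Scaled-down sketch of the lasso selection simulation (100 features and 100
# simulations rather than 500 and 2000); settings here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(5)
n = p = 100
beta = rng.laplace(0, 0.3, p)            # true coefficients from a Laplace distribution
nsim = 100
selected = np.zeros(p)                   # times each feature gets a nonzero coefficient

for _ in range(nsim):
    X = rng.binomial(1, 0.5, (n, p)).astype(float)
    y = rng.binomial(1, 1 / (1 + np.exp(-(X - 0.5) @ beta)))
    fit = LogisticRegressionCV(penalty="l1", solver="liblinear",
                               Cs=10, cv=5, max_iter=1000).fit(X, y)
    selected += fit.coef_.ravel() != 0

# selection frequency for the five features with the largest true effects
for j in np.argsort(np.abs(beta))[-5:]:
    print(f"|beta| = {abs(beta[j]):.2f}: selected in {selected[j] / nsim:.0%} of simulations")
```

Comparing the selection frequencies with the true coefficients gives a direct picture of how reliably (or not) the procedure recovers the important features.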
It Gets Worse
- Results shown here apply to unbiased data with
- No missings
- No measurement error
- Strong study design including randomization of sample processing order
- Need to strongly adjust for readily available data
- E.g. flexible nonlinear adjustment for risk factors, extent of disease, age, symptoms, clinical chemistry, and hematology variables when examining the predictive information of imaging or molecular data
- Need to quit making the linearity assumptions that the lasso makes
Accuracy Measures
- Don’t use measures such as sensitivity, specificity, and ROC curves that are designed for retrospective case-control studies
- Don’t use measures such as precision, recall, and ROC curves that invite the use of arbitrary cutoffs on markers or predicted risks
What I’d Like to See Reported
- “We analyzed a 100,000 patient sample from the EHR. The effective sample size is estimated to be 23,000 accounting for missing values and measurement errors.”
- Figure with uncertainty intervals for feature importance
- “To select important genetic variants from the 500,000 candidates our criterion was a Bayesian posterior probability of OR outside \([0.9, \frac{1}{0.9}]\) exceeding 0.9, and to conclude a feature is unimportant the probability must be < 0.1. Unfortunately 470,000 variants had a probability in (0.1, 0.9) so we concluded that the data lacked information needed to select variants.”
- Avoidance of binary decisions, e.g. feature selection
- No use of false discovery probabilities unaccompanied by false negative probabilities
- “Feature selection found 50 important proteins. We fitted a non-parsimonious penalized model on all the 2000 non-selected proteins and obtained higher \(R^2\) with the ‘losers’. So we decided to abandon the 50-protein parsimonious result.”
- 100 repeats of 10-fold CV or several hundred bootstrap resamples (a sketch follows this list) to estimate overfitting-corrected
- smooth nonparametric calibration curve
- predictive performance measures (\(R^2\), Brier score, etc.)
- confidence intervals for both
- No use of temporal or geographic validation
- More modeling driven by medical/biological knowledge
- Recognition that
- more complex questions need larger sample sizes and less biased data
- unstructured analysis of \(p > \frac{n}{2}\) features is nearly futile
- unstructured analysis of \(p > \frac{n}{15} \rightarrow\) overstated effects
Resources
- Challenges of High Dimensional Data Analysis
- Regression Modeling Strategies
- Variable Selection section of RMS (includes links to other resources)
- Projection Predictive Variable Selection by Aki Vehtari
- Explainable AI in Healthcare: to Explain, to Predict, or to Describe?
- Statistical Thinking Blog Articles
- How To Do Bad Biomarker Research
- Classification is a Bad Idea
- Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements
- Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules
- Clinicians’ Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error
- In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion
- Simple Bootstrap and Simulation Approaches to Quantifying Reliability of High-Dimensional Feature Selection
- How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?
- Controversies in Predictive Modeling, Machine Learning, and Validation
- Is Medicine Mesmerized by Machine Learning?