• The effect of dichotomization of skewed adjustment covariates in the analysis of clinical trials

    Item Type Journal Article
    Author Alan Herschtal
    URL https://doi.org/10.1186/s12874-023-01878-9
    Volume 23
    Issue 1
    Pages 60
    Publication BMC Medical Research Methodology
    ISSN 1471-2288
    Date 2023-03-13
    Journal Abbr BMC Medical Research Methodology
    DOI 10.1186/s12874-023-01878-9
    Accessed 9/11/2023, 12:08:26 PM
    Library Catalog BioMed Central
    Abstract Baseline imbalance in covariates associated with the primary outcome in clinical trials leads to bias in the reporting of results. Standard practice is to mitigate that bias by stratifying by those covariates in the randomization. Additionally, for continuously valued outcome variables, precision of estimates can be (and should be) improved by controlling for those covariates in analysis. Continuously valued covariates are commonly thresholded for the purpose of performing stratified randomization, with participants being allocated to arms such that balance between arms is achieved within each stratum. Often the thresholding consists of a simple dichotomization. For simplicity, it is also common practice to dichotomize the covariate when controlling for it at the analysis stage. This latter dichotomization is unnecessary, and has been shown in the literature to result in a loss of precision when compared with controlling for the covariate in its raw, continuous form. Analytic approaches to quantifying the magnitude of the loss of precision are generally confined to the most convenient case of a normally distributed covariate. This work generalises earlier findings, examining the effect on treatment effect estimation of dichotomizing skew-normal covariates, which are characteristic of a far wider range of real-world scenarios than their normal equivalents.
    Date Added 9/11/2023, 12:08:26 PM
    Modified 9/11/2023, 12:09:00 PM

    Tags:

    • ancova
    • randomization
    • rct
    • stratification
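
    Notes:

    • A minimal simulation sketch (mine, not the paper's code) of the central point: adjusting for a skewed covariate in its continuous form versus dichotomized at the median. Sample size, effect size, and the gamma covariate are illustrative assumptions.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(1)
      n = 400
      x = rng.gamma(shape=2.0, scale=1.0, size=n)    # skewed baseline covariate
      arm = rng.integers(0, 2, size=n)               # 1:1 randomization
      y = 0.5 * arm + 1.0 * x + rng.normal(0, 1, n)  # true treatment effect 0.5

      X_cont = sm.add_constant(np.column_stack([arm, x]))
      X_dich = sm.add_constant(np.column_stack([arm, (x > np.median(x)).astype(float)]))

      se_cont = sm.OLS(y, X_cont).fit().bse[1]  # SE of treatment effect, continuous adjustment
      se_dich = sm.OLS(y, X_dich).fit().bse[1]  # SE under dichotomized adjustment (larger)
      print(f"SE continuous: {se_cont:.3f}  SE dichotomized: {se_dich:.3f}")
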
  • Guidelines for Designing and Evaluating Feasibility Pilot Studies

    Item Type Journal Article
    Author Jeanne A. Teresi
    Author Xiaoying Yu
    Author Anita L. Stewart
    Author Ron D. Hays
    URL https://journals.lww.com/lww-medicalcare/abstract/2022/01000/guidelines_for_designing_and_evaluating.14.aspx
    Volume 60
    Issue 1
    Pages 95
    Publication Medical Care
    ISSN 0025-7079
    Date January 2022
    DOI 10.1097/MLR.0000000000001664
    Accessed 8/26/2023, 3:53:21 PM
    Library Catalog journals.lww.com
    Language en-US
    Abstract Background:  Pilot studies test the feasibility of methods and procedures to be used in larger-scale studies. Although numerous articles describe guidelines for the conduct of pilot studies, few have included specific feasibility indicators or strategies for evaluating multiple aspects of feasibility. In addition, using pilot studies to estimate effect sizes to plan sample sizes for subsequent randomized controlled trials has been challenged; however, there has been little consensus on alternative strategies. Methods:  In Section 1, specific indicators (recruitment, retention, intervention fidelity, acceptability, adherence, and engagement) are presented for feasibility assessment of data collection methods and intervention implementation. Section 1 also highlights the importance of examining feasibility when adapting an intervention tested in mainstream populations to a new more diverse group. In Section 2, statistical and design issues are presented, including sample sizes for pilot studies, estimates of minimally important differences, design effects, confidence intervals (CI) and nonparametric statistics. An in-depth treatment of the limits of effect size estimation as well as process variables is presented. Tables showing CI around parameters are provided. With small samples, effect size, completion and adherence rate estimates will have large CI. Conclusion:  This commentary offers examples of indicators for evaluating feasibility, and of the limits of effect size estimation in pilot studies. As demonstrated, most pilot studies should not be used to estimate effect sizes, provide power calculations for statistical tests or perform exploratory analyses of efficacy. It is hoped that these guidelines will be useful to those planning pilot/feasibility studies before a larger-scale study.
    Date Added 8/26/2023, 3:53:21 PM
    Modified 8/26/2023, 3:53:54 PM

    Tags:

    • study-design
    • pilot-study
    • feasibility
    • pilot-designs
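
    Notes:

    • Hedged numeric illustration of the paper's warning that small pilot samples give wide interval estimates: a Wilson 95% CI for a hypothetical adherence rate of 16/20.

      from statsmodels.stats.proportion import proportion_confint

      # adherence observed in 16 of 20 pilot participants (made-up numbers)
      lo, hi = proportion_confint(count=16, nobs=20, alpha=0.05, method="wilson")
      print(f"observed 0.80, 95% CI ({lo:.2f}, {hi:.2f})")  # approx. (0.58, 0.92)
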
  • Simulation-Based Prior Knowledge Elicitation for Parametric Bayesian Models

    Item Type Preprint
    Author Florence Bockting
    Author Stefan T. Radev
    Author Paul-Christian Bürkner
    URL http://arxiv.org/abs/2308.11672
    Date 2023-08-22
    Extra arXiv:2308.11672 [stat]
    DOI 10.48550/arXiv.2308.11672
    Accessed 8/25/2023, 5:57:22 PM
    Library Catalog arXiv.org
    Abstract A central characteristic of Bayesian statistics is the ability to consistently incorporate prior knowledge into various modeling processes. In this paper, we focus on translating domain expert knowledge into corresponding prior distributions over model parameters, a process known as prior elicitation. Expert knowledge can manifest itself in diverse formats, including information about raw data, summary statistics, or model parameters. A major challenge for existing elicitation methods is how to effectively utilize all of these different formats in order to formulate prior distributions that align with the expert's expectations, regardless of the model structure. To address these challenges, we develop a simulation-based elicitation method that can learn the hyperparameters of potentially any parametric prior distribution from a wide spectrum of expert knowledge using stochastic gradient descent. We validate the effectiveness and robustness of our elicitation method in four representative case studies covering linear models, generalized linear models, and hierarchical models. Our results support the claim that our method is largely independent of the underlying model structure and adaptable to various elicitation techniques, including quantile-based, moment-based, and histogram-based methods.
    Repository arXiv
    Archive ID arXiv:2308.11672
    Date Added 8/25/2023, 5:57:22 PM
    Modified 8/25/2023, 5:57:51 PM

    Tags:

    • bayes
    • simulation
    • prior
    • prior-elicitation
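
    Notes:

    • Simplified stand-in for the paper's approach (the authors learn hyperparameters by stochastic gradient descent over simulations): quantile-based elicitation of a normal prior via a generic optimizer. The expert quantiles below are hypothetical.

      import numpy as np
      from scipy import optimize, stats

      expert_q = {0.1: 2.0, 0.5: 5.0, 0.9: 8.0}  # hypothetical expert statements

      def loss(theta):
          # squared mismatch between prior quantiles and expert quantiles
          mu, log_sigma = theta
          q = stats.norm.ppf(list(expert_q), loc=mu, scale=np.exp(log_sigma))
          return np.sum((q - np.array(list(expert_q.values()))) ** 2)

      res = optimize.minimize(loss, x0=[0.0, 0.0])
      print("elicited prior: mu=%.2f, sigma=%.2f" % (res.x[0], np.exp(res.x[1])))
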
  • Clustering of trajectories with mixed effects classification model: Inference taking into account classification uncertainties

    Item Type Journal Article
    Author Charlotte Dugourd
    Author Amna Abichou-Klich
    Author René Ecochard
    Author Fabien Subtil
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9876
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9876
    DOI 10.1002/sim.9876
    Abstract Classifying patient biomarker trajectories into groups has become frequent in clinical research. Mixed effects classification models can be used to model the heterogeneity of longitudinal data. The estimated parameters of typical trajectories and the partition can be provided by the classification version of the expectation maximization algorithm, named CEM. However, the variance of the parameter estimates obtained underestimates the true variance because classification uncertainties are not taken into account. This article takes into account these uncertainties by using the stochastic EM algorithm (SEM), a stochastic version of the CEM algorithm, after convergence of the CEM algorithm. The simulations showed correct coverage probabilities of the 95% confidence intervals (close to 95% except for scenarios with high bias in typical trajectories). The method was applied on a trial, called low-cyclo, that compared the effects of low vs standard cyclosporine A doses on creatinine levels after cardiac transplantation. It identified groups of patients for whom low-dose cyclosporine may be relevant, but with high uncertainty on the dose-effect estimate.
    Date Added 8/17/2023, 2:19:58 PM
    Modified 8/17/2023, 2:21:28 PM

    Tags:

    • serial
    • classification
    • uncertainty
    • cluster
    • trajectory
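
    Notes:

    • Toy sketch of the CEM-then-SEM idea on a two-class univariate Gaussian mixture (equal variances and weights assumed for brevity; not the article's trajectory model). The spread of the SEM draws reflects classification uncertainty.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(2)
      x = np.concatenate([rng.normal(0, 1, 150), rng.normal(3, 1, 150)])
      mu = np.array([x.min(), x.max()])  # crude initialization

      for _ in range(30):  # CEM: hard classification step, then re-estimate
          z = np.argmax(stats.norm.logpdf(x[:, None], loc=mu), axis=1)
          mu = np.array([x[z == k].mean() for k in range(2)])

      draws = []
      for _ in range(200):  # SEM after CEM: draw classes from their posterior
          logp = stats.norm.logpdf(x[:, None], loc=mu)
          p1 = 1.0 / (1.0 + np.exp(logp[:, 0] - logp[:, 1]))  # P(class 1 | x)
          z = rng.binomial(1, p1)
          mu = np.array([x[z == k].mean() for k in range(2)])
          draws.append(mu.copy())
      print(np.std(draws, axis=0))  # variability due to classification uncertainty
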
  • Confidence intervals for the Cox model test error from cross-validation

    Item Type Journal Article
    Author Min Woo Sun
    Author Robert Tibshirani
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9873
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9873
    DOI 10.1002/sim.9873
    Abstract Cross-validation (CV) is one of the most widely used techniques in statistical learning for estimating the test error of a model, but its behavior is not yet fully understood. It has been shown that standard confidence intervals for test error using estimates from CV may have coverage below nominal levels. This phenomenon occurs because each sample is used in both the training and testing procedures during CV and as a result, the CV estimates of the errors become correlated. Without accounting for this correlation, the estimate of the variance is smaller than it should be. One way to mitigate this issue is by estimating the mean squared error of the prediction error instead using nested CV. This approach has been shown to achieve superior coverage compared to intervals derived from standard CV. In this work, we generalize the nested CV idea to the Cox proportional hazards model and explore various choices of test error for this setting.
    Date Added 8/17/2023, 2:17:56 PM
    Modified 8/17/2023, 2:20:52 PM

    Tags:

    • validation
    • cox-model
    • cross-validation
    • confidence-interval
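
    Notes:

    • Sketch of plain K-fold cross-validated test error (1 - concordance) for a Cox model with lifelines; the paper's nested-CV interval construction is not reproduced. Data and column names are made up.

      import numpy as np
      import pandas as pd
      from lifelines import CoxPHFitter
      from lifelines.utils import concordance_index
      from sklearn.model_selection import KFold

      rng = np.random.default_rng(3)
      n = 300
      x = rng.normal(size=n)
      df = pd.DataFrame({
          "x": x,
          "T": rng.exponential(scale=np.exp(-0.7 * x)),  # hazard rises with x
          "E": (rng.uniform(size=n) < 0.8).astype(int),  # ~20% marked censored
      })

      errs = []
      for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
          cph = CoxPHFitter().fit(df.iloc[tr], duration_col="T", event_col="E")
          risk = cph.predict_partial_hazard(df.iloc[te])
          c = concordance_index(df["T"].iloc[te], -risk, df["E"].iloc[te])
          errs.append(1 - c)
      print(f"CV test error: {np.mean(errs):.3f} (per-fold sd {np.std(errs):.3f})")
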
  • A guide to regression discontinuity designs in medical applications

    Item Type Journal Article
    Author Matias D. Cattaneo
    Author Luke Keele
    Author Rocío Titiunik
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9861
    Rights © 2023 John Wiley & Sons Ltd.
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    ISSN 1097-0258
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9861
    DOI 10.1002/sim.9861
    Accessed 8/2/2023, 5:33:06 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract We present a practical guide for the analysis of regression discontinuity (RD) designs in biomedical contexts. We begin by introducing key concepts, assumptions, and estimands within both the continuity-based framework and the local randomization framework. We then discuss modern estimation and inference methods within both frameworks, including approaches for bandwidth or local neighborhood selection, optimal treatment effect point estimation, and robust bias-corrected inference methods for uncertainty quantification. We also overview empirical falsification tests that can be used to support key assumptions. Our discussion focuses on two particular features that are relevant in biomedical research: (i) fuzzy RD designs, which often arise when therapeutic treatments are based on clinical guidelines, but patients with scores near the cutoff are treated contrary to the assignment rule; and (ii) RD designs with discrete scores, which are ubiquitous in biomedical applications. We illustrate our discussion with three empirical applications: the effect of CD4 guidelines for anti-retroviral therapy on retention of HIV patients in South Africa, the effect of genetic guidelines for chemotherapy on breast cancer recurrence in the United States, and the effects of age-based patient cost-sharing on healthcare utilization in Taiwan. Complete replication materials employing publicly available data and statistical software in Python, R and Stata are provided, offering researchers all necessary tools to conduct an RD analysis.
    Date Added 8/2/2023, 5:33:06 PM
    Modified 8/2/2023, 5:33:35 PM

    Tags:

    • regression-discontinuity
    • interrupted-time-series
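
    Notes:

    • Bare-bones continuity-based RD sketch: local linear fits on each side of the cutoff within a hand-picked bandwidth. The data-driven bandwidth selection and robust bias-corrected inference the guide covers (e.g., via rdrobust) are omitted; all numbers are illustrative.

      import numpy as np
      import statsmodels.api as sm

      rng = np.random.default_rng(4)
      n, cutoff, h = 2000, 0.0, 0.5
      score = rng.uniform(-1, 1, n)
      treated = (score >= cutoff).astype(float)
      y = 1.0 * score + 0.4 * treated + rng.normal(0, 1, n)  # true effect 0.4

      def intercept_at_cutoff(side):
          m = side & (np.abs(score - cutoff) <= h)  # points within bandwidth
          X = sm.add_constant(score[m] - cutoff)
          return sm.OLS(y[m], X).fit().params[0]

      tau = intercept_at_cutoff(score >= cutoff) - intercept_at_cutoff(score < cutoff)
      print(f"estimated RD effect: {tau:.3f}")
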
  • Type-I-error rate inflation in mixed models for repeated measures caused by ambiguous or incomplete model specifications

    Item Type Journal Article
    Author Sebastian Häckl
    Author Armin Koch
    Author Florian Lasch
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.2328
    Rights © 2023 The Authors. Pharmaceutical Statistics published by John Wiley & Sons Ltd.
    Volume n/a
    Issue n/a
    Publication Pharmaceutical Statistics
    ISSN 1539-1612
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/pst.2328
    DOI 10.1002/pst.2328
    Accessed 7/31/2023, 2:19:19 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract Pre-specification of the primary analysis model is a pre-requisite to control the family-wise type-I-error rate (T1E) at the intended level in confirmatory clinical trials. However, mixed models for repeated measures (MMRM) have been shown to be poorly specified in study protocols. The magnitude of a resulting T1E rate inflation is still unknown. This investigation aims to quantify the magnitude of the T1E rate inflation depending on the type and number of unspecified model items as well as different trial characteristics. We simulated a randomized, double-blind, parallel group, phase III clinical trial under the assumption that there is no treatment effect at any time point. The simulated data was analysed using different clusters, each including several MMRMs that are compatible with the imprecise pre-specification of the MMRM. T1E rates for each cluster were estimated. A significant T1E rate inflation could be shown for ambiguous model specifications with a maximum T1E rate of 7.6% [7.1%; 8.1%]. The results show that the magnitude of the T1E rate inflation depends on the type and number of unspecified model items as well as the sample size and allocation ratio. The imprecise specification of nuisance parameters may not lead to a significant T1E rate inflation. However, the results of this simulation study rather underestimate the true T1E rate inflation. In conclusion, imprecise MMRM specifications may lead to a substantial inflation of the T1E rate and can damage the ability to generate confirmatory evidence in pivotal clinical trials.
    Date Added 7/31/2023, 2:19:19 PM
    Modified 7/31/2023, 2:20:15 PM

    Tags:

    • robustness
    • type-i-error
    • longitudinal
    • serial
    • operating-characteristics
    • mixed-effects
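
    Notes:

    • Skeleton of a type-I-error simulation under the null in the spirit of the paper, but heavily simplified: a random-intercept model stands in for the MMRM (statsmodels has no unstructured-covariance MMRM out of the box), and only 100 replications are run.

      import numpy as np
      import pandas as pd
      import statsmodels.formula.api as smf

      rng = np.random.default_rng(5)
      n_subj, n_visits, n_sims, rejections = 60, 4, 100, 0
      for _ in range(n_sims):
          arm = np.repeat(rng.integers(0, 2, n_subj), n_visits)
          b = np.repeat(rng.normal(0, 1, n_subj), n_visits)  # subject effects
          d = pd.DataFrame({
              "y": b + rng.normal(0, 1, n_subj * n_visits),  # no treatment effect
              "arm": arm,
              "subj": np.repeat(np.arange(n_subj), n_visits),
              "visit": np.tile(np.arange(n_visits), n_subj),
          })
          fit = smf.mixedlm("y ~ arm + C(visit)", d, groups=d["subj"]).fit()
          rejections += fit.pvalues["arm"] < 0.05
      print(f"empirical T1E rate: {rejections / n_sims:.3f}")  # near 0.05 here
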
  • Stability of clinical prediction models developed using statistical or machine learning methods

    Item Type Journal Article
    Author Richard D. Riley
    Author Gary S. Collins
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202200302
    Rights © 2023 The Authors. Biometrical Journal published by Wiley-VCH GmbH.
    Volume n/a
    Issue n/a
    Pages 2200302
    Publication Biometrical Journal
    ISSN 1521-4036
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202200302
    DOI 10.1002/bimj.202200302
    Accessed 7/20/2023, 2:44:28 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract Clinical prediction models estimate an individual's risk of a particular health outcome. A developed model is a consequence of the development dataset and model-building strategy, including the sample size, number of predictors, and analysis method (e.g., regression or machine learning). We raise the concern that many models are developed using small datasets that lead to instability in the model and its predictions (estimated risks). We define four levels of model stability in estimated risks moving from the overall mean to the individual level. Through simulation and case studies of statistical and machine learning approaches, we show instability in a model's estimated risks is often considerable, and ultimately manifests itself as miscalibration of predictions in new data. Therefore, we recommend researchers always examine instability at the model development stage and propose instability plots and measures to do so. This entails repeating the model-building steps (those used to develop the original prediction model) in each of multiple (e.g., 1000) bootstrap samples, to produce multiple bootstrap models, and deriving (i) a prediction instability plot of bootstrap model versus original model predictions; (ii) the mean absolute prediction error (mean absolute difference between individuals’ original and bootstrap model predictions), and (iii) calibration, classification, and decision curve instability plots of bootstrap models applied in the original sample. A case study illustrates how these instability assessments help reassure (or not) whether model predictions are likely to be reliable (or not), while informing a model's critical appraisal (risk of bias rating), fairness, and further validation requirements.
    Date Added 7/20/2023, 2:44:28 PM
    Modified 7/20/2023, 2:45:05 PM

    Tags:

    • variable-selection
    • rms
    • stability
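
    Notes:

    • Sketch of the proposed bootstrap instability assessment with a logistic model (sklearn stands in for whatever model-building strategy was used; 200 replicates instead of the suggested 1000).

      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(6)
      X, y = make_classification(n_samples=150, n_features=8, random_state=0)
      orig = LogisticRegression(max_iter=1000).fit(X, y).predict_proba(X)[:, 1]

      boot_preds = []
      for _ in range(200):  # repeat the model-building steps per bootstrap sample
          idx = rng.integers(0, len(y), len(y))
          m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
          boot_preds.append(m.predict_proba(X)[:, 1])  # applied to original sample

      # mean absolute difference between original and bootstrap-model predictions
      mape = np.mean(np.abs(np.array(boot_preds) - orig))
      print(f"mean absolute prediction error: {mape:.3f}")
      # plotting each bootstrap prediction against orig gives the instability plot
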
  • Describing complex disease progression using joint latent class models for multivariate longitudinal markers and clinical endpoints

    Item Type Journal Article
    Author Cécile Proust-Lima
    Author Tiphaine Saulnier
    Author Viviane Philipps
    Author Anne Pavy-Le Traon
    Author Patrice Péran
    Author Olivier Rascol
    Author Wassilios G. Meissner
    Author Alexandra Foubert-Samier
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9844
    Rights © 2023 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    ISSN 1097-0258
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9844
    DOI 10.1002/sim.9844
    Accessed 7/20/2023, 2:43:27 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract Neurodegenerative diseases are characterized by numerous markers of progression and clinical endpoints. For instance, multiple system atrophy (MSA), a rare neurodegenerative synucleinopathy, is characterized by various combinations of progressive autonomic failure and motor dysfunction, and a very poor prognosis. Describing the progression of such complex and multi-dimensional diseases is particularly difficult. One has to simultaneously account for the assessment of multivariate markers over time, the occurrence of clinical endpoints, and a highly suspected heterogeneity between patients. Yet, such description is crucial for understanding the natural history of the disease, staging patients diagnosed with the disease, unravelling subphenotypes, and predicting the prognosis. Through the example of MSA progression, we show how a latent class approach modeling multiple repeated markers and clinical endpoints can help describe complex disease progression and identify subphenotypes for exploring new pathological hypotheses. The proposed joint latent class model includes class-specific multivariate mixed models to handle multivariate repeated biomarkers possibly summarized into latent dimensions and class-and-cause-specific proportional hazard models to handle time-to-event data. Maximum likelihood estimation procedure, validated through simulations is available in the lcmm R package. In the French MSA cohort comprising data of 598 patients during up to 13 years, five subphenotypes of MSA were identified that differ by the sequence and shape of biomarkers degradation, and the associated risk of death. In posterior analyses, the five subphenotypes were used to explore the association between clinical progression and external imaging and fluid biomarkers, while properly accounting for the uncertainty in the subphenotypes membership.
    Date Added 7/20/2023, 2:43:27 PM
    Modified 7/20/2023, 2:44:07 PM

    Tags:

    • longitudinal
    • biomarker
    • latent-class-model
    • trajectory
    • phenotype
  • A Graphical Catalog of Threats to Validity: Linking Social Science with Epidemiology

    Item Type Journal Article
    Author Ellicott C. Matthay
    Author M. Maria Glymour
    URL https://journals.lww.com/epidem/Fulltext/2020/05000/A_Graphical_Catalog_of_Threats_to_Validity_.11.aspx
    Volume 31
    Issue 3
    Pages 376
    Publication Epidemiology
    ISSN 1044-3983
    Date May 2020
    DOI 10.1097/EDE.0000000000001161
    Accessed 7/18/2023, 6:08:06 AM
    Library Catalog journals.lww.com
    Language en-US
    Abstract Directed acyclic graphs (DAGs), a prominent tool for expressing assumptions in epidemiologic research, are most useful when the hypothetical data generating structure is correctly encoded. Understanding a study’s data generating structure and translating that data structure into a DAG can be challenging, but these skills are often glossed over in training. Campbell and Stanley’s framework for causal inference has been extraordinarily influential in social science training programs but has received less attention in epidemiology. Their work, along with subsequent revisions and enhancements based on practical experience conducting empirical studies, presents a catalog of 37 threats to validity describing reasons empirical studies may fail to deliver causal effects. We interpret most of these threats to study validity as suggestions for common causal structures. Threats are organized into issues of statistical conclusion validity, internal validity, construct validity, or external validity. To assist epidemiologists in drawing the correct DAG for their application, we map the correspondence between threats to validity and epidemiologic concepts that can be represented with DAGs. Representing these threats as DAGs makes them amenable to formal analysis with d-separation rules and breaks down cross-disciplinary language barriers in communicating methodologic issues.
    Short Title A Graphical Catalog of Threats to Validity
    Date Added 7/18/2023, 6:10:59 AM
    Modified 7/18/2023, 6:10:59 AM
  • Leading beyond regulatory approval: Opportunities for statisticians to optimize evidence generation and impact clinical practice

    Item Type Journal Article
    Author Jenny Devenport
    Author Alexander Schacht
    Author The Launch & Lifecycle Special Interest Group within PSI
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.2325
    Rights © 2023 John Wiley & Sons Ltd.
    Volume n/a
    Issue n/a
    Publication Pharmaceutical Statistics
    ISSN 1539-1612
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/pst.2325
    DOI 10.1002/pst.2325
    Accessed 7/12/2023, 12:48:50 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract The role and value of statistical contributions in drug development up to the point of health authority approval are well understood. But health authority approval is only a true ‘win’ if the evidence enables access and adoption into clinical practice. In today's complex and evolving healthcare environment, there is additional strategic evidence generation, communication, and decision support that can benefit from statistical contributions. In this article, we describe the history of medical affairs in the context of drug development, the factors driving post-approval evidence generation needs, and the opportunities for statisticians to optimize evidence generation for stakeholders beyond health authorities in order to ensure that new medicines reach appropriate patients.
    Short Title Leading beyond regulatory approval
    Date Added 7/12/2023, 12:48:50 PM
    Modified 7/12/2023, 12:49:09 PM

    Tags:

    • drug-development
    • leadership
  • A scalable approach for continuous time Markov models with covariates

    Item Type Journal Article
    Author Farhad Hatami
    Author Alex Ocampo
    Author Gordon Graham
    Author Thomas E Nichols
    Author Habib Ganjgahi
    URL https://doi.org/10.1093/biostatistics/kxad012
    Pages kxad012
    Publication Biostatistics
    ISSN 1465-4644
    Date 2023-07-11
    Journal Abbr Biostatistics
    DOI 10.1093/biostatistics/kxad012
    Accessed 7/12/2023, 12:46:39 PM
    Library Catalog Silverchair
    Abstract Existing methods for fitting continuous time Markov models (CTMM) in the presence of covariates suffer from scalability issues due to high computational cost of matrix exponentials calculated for each observation. In this article, we propose an optimization technique for CTMM which uses a stochastic gradient descent algorithm combined with differentiation of the matrix exponential using a Padé approximation. This approach makes fitting large scale data feasible. We present two methods for computing standard errors, one novel approach using the Padé expansion and the other using power series expansion of the matrix exponential. Through simulations, we find improved performance relative to existing CTMM methods, and we demonstrate the method on the large-scale multiple sclerosis NO.MS data set.
    Date Added 7/12/2023, 12:46:39 PM
    Modified 7/12/2023, 12:47:13 PM

    Tags:

    • computational
    • markov-model
    • continuous-time-markov-chain
    • markov

    Notes:

    • Develops an approximation to matrix exponential derivative
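
    • Sketch of the key ingredient, not the authors' code: SciPy already exposes a Padé-based Fréchet derivative of the matrix exponential, which can be cross-checked against the classic block-matrix identity.

      import numpy as np
      from scipy.linalg import expm, expm_frechet

      rng = np.random.default_rng(7)
      A = 0.3 * rng.normal(size=(4, 4))    # illustrative matrix
      E = np.zeros((4, 4)); E[0, 1] = 1.0  # perturbation direction dA/dtheta

      eA, dA = expm_frechet(A, E)          # d/dt expm(A + t*E) at t = 0
      # expm([[A, E], [0, A]]) carries the same derivative in its upper-right block
      block = expm(np.block([[A, E], [np.zeros((4, 4)), A]]))
      assert np.allclose(dA, block[:4, 4:])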

  • Bayesian Nonparametric Estimation for Dynamic Treatment Regimes With Sequential Transition Times

    Item Type Journal Article
    Author Yanxun Xu
    Author Peter Müller
    Author Abdus S. Wahed
    Author Peter F. Thall
    URL https://doi.org/10.1080/01621459.2015.1086353
    Volume 111
    Issue 515
    Pages 921-950
    Publication Journal of the American Statistical Association
    ISSN 0162-1459
    Date 2016-07-02
    Extra Publisher: Taylor & Francis _eprint: https://doi.org/10.1080/01621459.2015.1086353 PMID: 28018015
    DOI 10.1080/01621459.2015.1086353
    Accessed 7/11/2023, 9:52:33 AM
    Library Catalog Taylor and Francis+NEJM
    Abstract We analyze a dataset arising from a clinical trial involving multi-stage chemotherapy regimes for acute leukemia. The trial design was a 2 × 2 factorial for frontline therapies only. Motivated by the idea that subsequent salvage treatments affect survival time, we model therapy as a dynamic treatment regime (DTR), that is, an alternating sequence of adaptive treatments or other actions and transition times between disease states. These sequences may vary substantially between patients, depending on how the regime plays out. To evaluate the regimes, mean overall survival time is expressed as a weighted average of the means of all possible sums of successive transition times. We assume a Bayesian nonparametric survival regression model for each transition time, with a dependent Dirichlet process prior and Gaussian process base measure (DDP-GP). Posterior simulation is implemented by Markov chain Monte Carlo (MCMC) sampling. We provide general guidelines for constructing a prior using empirical Bayes methods. The proposed approach is compared with inverse probability of treatment weighting, including a doubly robust augmented version of this approach, for both single-stage and multi-stage regimes with treatment assignment depending on baseline covariates. The simulations show that the proposed nonparametric Bayesian approach can substantially improve inference compared to existing methods. An R program for implementing the DDP-GP-based Bayesian nonparametric analysis is freely available at www.ams.jhu.edu/yxu70. Supplementary materials for this article are available online.
    Date Added 7/11/2023, 9:52:33 AM
    Modified 7/11/2023, 9:53:23 AM

    Tags:

    • bayes
    • double-robustness
    • inverse-probability-weight
    • dynamic-treatment
  • Out-of-Sample R2: Estimation and Inference

    Item Type Journal Article
    Author Stijn Hawinkel
    Author Willem Waegeman
    Author Steven Maere
    URL https://doi.org/10.1080/00031305.2023.2216252
    Volume 0
    Issue 0
    Pages 1-11
    Publication The American Statistician
    ISSN 0003-1305
    Date 2023-05-25
    Extra Publisher: Taylor & Francis _eprint: https://doi.org/10.1080/00031305.2023.2216252
    DOI 10.1080/00031305.2023.2216252
    Accessed 7/6/2023, 2:04:35 PM
    Library Catalog Taylor and Francis+NEJM
    Abstract Out-of-sample prediction is the acid test of predictive models, yet an independent test dataset is often not available for assessment of the prediction error. For this reason, out-of-sample performance is commonly estimated using data splitting algorithms such as cross-validation or the bootstrap. For quantitative outcomes, the ratio of variance explained to total variance can be summarized by the coefficient of determination or in-sample R2, which is easy to interpret and to compare across different outcome variables. As opposed to in-sample R2, out-of-sample R2 has not been well defined and the variability on out-of-sample R̂2 has been largely ignored. Usually only its point estimate is reported, hampering formal comparison of predictability of different outcome variables. Here we explicitly define out-of-sample R2 as a comparison of two predictive models, provide an unbiased estimator and exploit recent theoretical advances on uncertainty of data splitting estimates to provide a standard error for R̂2. The performance of the estimators for R2 and its standard error are investigated in a simulation study. We demonstrate our new method by constructing confidence intervals and comparing models for prediction of quantitative Brassica napus and Zea mays phenotypes based on gene expression data. Our method is available in the R-package oosse.
    Short Title Out-of-Sample R2
    Date Added 7/6/2023, 2:04:35 PM
    Modified 7/6/2023, 2:05:00 PM

    Tags:

    • predictive-accuracy
    • accuracy
    • r2
    • predictive-ability
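
    Notes:

    • Minimal sketch of the definition used in the article: out-of-sample R2 compares a model's held-out prediction error with that of the mean-only benchmark (the standard-error machinery in the authors' oosse package is not reproduced).

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(8)
      X = rng.normal(size=(200, 5))
      y = X @ np.array([1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(0, 1, 200)
      Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

      mse_model = np.mean((LinearRegression().fit(Xtr, ytr).predict(Xte) - yte) ** 2)
      mse_bench = np.mean((ytr.mean() - yte) ** 2)  # benchmark: training-set mean
      print(f"out-of-sample R2: {1 - mse_model / mse_bench:.3f}")
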
  • Noninferiority Margins Exceed Superiority Effect Estimates for Mortality in Cardiovascular Trials in High-Impact Journals

    Item Type Journal Article
    Author Sandra Ofori
    Author Teresa Cafaro
    Author P. J. Devereaux
    Author Maura Marcucci
    Author Lawrence Mbuagbaw
    Author Lehana Thabane
    Author Gordon Guyatt
    URL https://www.jclinepi.com/article/S0895-4356(23)00169-5/fulltext?rss=yes
    Volume 0
    Issue 0
    Publication Journal of Clinical Epidemiology
    ISSN 0895-4356, 1878-5921
    Date 2023-07-06
    Extra Publisher: Elsevier
    Journal Abbr Journal of Clinical Epidemiology
    DOI 10.1016/j.jclinepi.2023.06.022
    Accessed 7/6/2023, 2:03:24 PM
    Library Catalog www.jclinepi.com
    Language English
    Date Added 7/6/2023, 2:03:24 PM
    Modified 7/6/2023, 2:03:41 PM

    Tags:

    • rct
    • non-inferiority
  • Multiple imputation of missing data under missing at random: compatible imputation models are not sufficient to avoid bias if they are mis-specified

    Item Type Journal Article
    Author Elinor Curnow
    Author James R. Carpenter
    Author Jon E. Heron
    Author Rosie P. Cornish
    Author Stefan Rach
    Author Vanessa Didelez
    Author Malte Langeheine
    Author Kate Tilling
    URL https://www.jclinepi.com/article/S0895-4356(23)00158-0/fulltext?rss=yes
    Volume 0
    Issue 0
    Publication Journal of Clinical Epidemiology
    ISSN 0895-4356, 1878-5921
    Date 2023-06-19
    Extra Publisher: Elsevier
    Journal Abbr Journal of Clinical Epidemiology
    DOI 10.1016/j.jclinepi.2023.06.011
    Accessed 6/19/2023, 1:38:19 PM
    Library Catalog www.jclinepi.com
    Language English
    Short Title Multiple imputation of missing data under missing at random
    Date Added 6/19/2023, 1:38:19 PM
    Modified 6/19/2023, 1:39:09 PM

    Tags:

    • imputation
    • missing
    • mi
    • nonlinear-model

    Notes:

    • If relationships between variables are nonlinear and the imputation model assumes they are linear, multiple imputation may not work well.
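
    • Quick numeric illustration of the note above (my construction, not the authors'): the outcome depends on x quadratically, y is set missing at random given x, and a linear imputation model biases the imputed mean; a single stochastic imputation pass stands in for proper multiple imputation.

      import numpy as np
      from sklearn.experimental import enable_iterative_imputer  # noqa: F401
      from sklearn.impute import IterativeImputer

      rng = np.random.default_rng(9)
      n = 1000
      x = rng.normal(size=n)
      y = x ** 2 + rng.normal(0, 0.5, n)
      y_miss = y.copy()
      y_miss[rng.uniform(size=n) < 1 / (1 + np.exp(-x))] = np.nan  # MAR given x

      linear = IterativeImputer().fit_transform(np.column_stack([x, y_miss]))
      correct = IterativeImputer().fit_transform(np.column_stack([x, x ** 2, y_miss]))
      print("true mean of y          %.2f" % y.mean())
      print("linear imputation model %.2f" % linear[:, 1].mean())   # biased
      print("model including x^2     %.2f" % correct[:, 2].mean())  # near truth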

  • Multivariate disease progression modeling with longitudinal ordinal data

    Item Type Journal Article
    Author Pierre-Emmanuel Poulet
    Author Stanley Durrleman
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9770
    Rights © 2023 John Wiley & Sons Ltd.
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    ISSN 1097-0258
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9770
    DOI 10.1002/sim.9770
    Accessed 5/30/2023, 7:02:49 AM
    Library Catalog Wiley Online Library
    Language en
    Abstract Disease modeling is an essential tool to describe disease progression and its heterogeneity across patients. Usual approaches use continuous data such as biomarkers to assess progression. Nevertheless, categorical or ordinal data such as item responses in questionnaires also provide insightful information about disease progression. In this work, we propose a disease progression model for ordinal and categorical data. We built it on the principles of disease course mapping, a technique that uniquely describes the variability in both the dynamics of progression and disease heterogeneity from multivariate longitudinal data. This extension can also be seen as an attempt to bridge the gap between longitudinal multivariate models and the field of item response theory. Application to the Parkinson's progression markers initiative cohort illustrates the benefits of our approach: a fine-grained description of disease progression at the item level, as compared to the aggregated total score, together with improved predictions of the patient's future visits. The analysis of the heterogeneity across individual trajectories highlights known disease trends such as tremor dominant or postural instability and gait difficulties subtypes of Parkinson's disease.
    Date Added 5/30/2023, 7:02:49 AM
    Modified 5/30/2023, 7:03:47 AM

    Tags:

    • rct
    • multiple-endpoints
    • longitudinal
    • serial
    • ordinal
  • Robust Transformations for Multiple Regression via Additivity and Variance Stabilization

    Item Type Journal Article
    Author Marco Riani
    Author Anthony C. Atkinson
    Author Aldo Corbellini
    URL https://doi.org/10.1080/10618600.2023.2205447
    Pages 1-16
    Publication Journal of Computational and Graphical Statistics
    ISSN 1061-8600
    Date 2023-04-20
    Extra Publisher: Taylor & Francis
    Journal Abbr Journal of Computational and Graphical Statistics
    DOI 10.1080/10618600.2023.2205447
    Date Added 5/30/2023, 7:00:38 AM
    Modified 5/30/2023, 7:01:17 AM

    Tags:

    • transform-both-sides
    • transformation
    • avas
    • robust-estimation

  • Trials using composite outcomes neglect the presence of competing risks: A methodological survey of cardiovascular studies

    Item Type Journal Article
    Author Hyunwoo Kim
    Author Hamad Shahbal
    Author Sameer Parpia
    Author Tauben Averbuch
    Author Harriette G. C. Van Spall
    Author Lehana Thabane
    Author Jinhui Ma
    URL https://www.jclinepi.com/article/S0895-4356(23)00128-2/fulltext?rss=yes
    Volume 0
    Issue 0
    Publication Journal of Clinical Epidemiology
    ISSN 0895-4356, 1878-5921
    Date 2023-05-26
    Extra Publisher: Elsevier PMID: 37245700
    Journal Abbr Journal of Clinical Epidemiology
    DOI 10.1016/j.jclinepi.2023.05.015
    Accessed 5/30/2023, 6:48:52 AM
    Library Catalog www.jclinepi.com
    Language English
    Short Title Trials using composite outcomes neglect the presence of competing risks
    Date Added 5/30/2023, 6:48:52 AM
    Modified 5/30/2023, 6:50:17 AM

    Tags:

    • multiple-endpoints
    • composite-endpoint
    • competing-risk

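    Notes:

    • Sketch of the issue the survey raises (simulated data, no censoring): with a competing event, one minus the Kaplan-Meier estimate for a single component overstates its cumulative incidence, whereas the Aalen-Johansen estimator accounts for the competing risk.

      import numpy as np
      from lifelines import AalenJohansenFitter, KaplanMeierFitter

      rng = np.random.default_rng(10)
      n = 2000
      t1 = rng.exponential(1.0, n)      # time to event of interest (cause 1)
      t2 = rng.exponential(1.0, n)      # time to competing event (cause 2)
      t = np.minimum(t1, t2)
      cause = np.where(t1 <= t2, 1, 2)  # no censoring, for simplicity

      naive = 1 - KaplanMeierFitter().fit(t, event_observed=(cause == 1)).predict(1.0)
      aj = AalenJohansenFitter().fit(t, cause, event_of_interest=1)
      cif = aj.cumulative_density_.loc[:1.0].iloc[-1, 0]
      print(f"naive 1-KM at t=1: {naive:.3f}  Aalen-Johansen CIF: {cif:.3f}")
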
  • Developing prediction models to estimate the risk of two survival outcomes both occurring: A comparison of techniques

    Item Type Journal Article
    Author Alexander Pate
    Author Matthew Sperrin
    Author Richard D. Riley
    Author Jamie C. Sergeant
    Author Tjeerd Van Staa
    Author Niels Peek
    Author Mamas A. Mamas
    Author Gregory Y. H. Lip
    Author Martin O'Flaherty
    Author Iain Buchan
    Author Glen P. Martin
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9771
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    ISSN 1097-0258
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/sim.9771
    DOI 10.1002/sim.9771
    Accessed 5/24/2023, 6:55:14 AM
    Library Catalog Wiley Online Library
    Language en
    Abstract Introduction This study considers the prediction of the time until two survival outcomes have both occurred. We compared a variety of analytical methods motivated by a typical clinical problem of multimorbidity prognosis. Methods We considered five methods: product (multiply marginal risks), dual-outcome (directly model the time until both events occur), multistate models (msm), and a range of copula and frailty models. We assessed calibration and discrimination under a variety of simulated data scenarios, varying outcome prevalence, and the amount of residual correlation. The simulation focused on model misspecification and statistical power. Using data from the Clinical Practice Research Datalink, we compared model performance when predicting the risk of cardiovascular disease and type 2 diabetes both occurring. Results Discrimination was similar for all methods. The product method was poorly calibrated in the presence of residual correlation. The msm and dual-outcome models were the most robust to model misspecification but suffered a drop in performance at small sample sizes due to overfitting, which the copula and frailty model were less susceptible to. The copula and frailty model's performance were highly dependent on the underlying data structure. In the clinical example, the product method was poorly calibrated when adjusting for 8 major cardiovascular risk factors. Discussion We recommend the dual-outcome method for predicting the risk of two survival outcomes both occurring. It was the most robust to model misspecification, although was also the most prone to overfitting. The clinical example motivates the use of the methods considered in this study.
    Short Title Developing prediction models to estimate the risk of two survival outcomes both occurring
    Date Added 5/24/2023, 6:55:14 AM
    Modified 5/24/2023, 6:56:15 AM

    Tags:

    • rct
    • multiple-endpoints
    • multistate-model
    • copula
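
    Notes:

    • Numeric sketch of why the product method miscalibrates under residual correlation: with positively correlated event times (Gaussian copula, exponential margins), the joint risk exceeds the product of the marginal risks. Parameters are illustrative.

      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(11)
      n, t, rho = 200_000, 1.0, 0.6
      z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
      u = stats.norm.cdf(z)                                # correlated uniforms
      t1, t2 = -np.log(1 - u[:, 0]), -np.log(1 - u[:, 1])  # exponential(1) margins

      joint = np.mean((t1 <= t) & (t2 <= t))
      product = np.mean(t1 <= t) * np.mean(t2 <= t)
      print(f"joint {joint:.3f} vs product {product:.3f}")  # joint > product
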
  • Comparison of likelihood penalization and variance decomposition approaches for clinical prediction models: A simulation study

    Item Type Journal Article
    Author Anna Lohmann
    Author Rolf H. H. Groenwold
    Author Maarten van Smeden
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202200108
    Volume n/a
    Issue n/a
    Pages 2200108
    Publication Biometrical Journal
    ISSN 1521-4036
    Date 2023
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1002/bimj.202200108
    DOI 10.1002/bimj.202200108
    Accessed 5/19/2023, 5:39:06 PM
    Library Catalog Wiley Online Library
    Language en
    Abstract Logistic regression is one of the most commonly used approaches to develop clinical risk prediction models. Developers of such models often rely on approaches that aim to minimize the risk of overfitting and improve predictive performance of the logistic model, such as through likelihood penalization and variance decomposition techniques. We present an extensive simulation study that compares the out-of-sample predictive performance of risk prediction models derived using the elastic net, with Lasso and ridge as special cases, and variance decomposition techniques, namely, incomplete principal component regression and incomplete partial least squares regression. We varied the expected events per variable, event fraction, number of candidate predictors, presence of noise predictors, and the presence of sparse predictors in a full-factorial design. Predictive performance was compared on measures of discrimination, calibration, and prediction error. Simulation metamodels were derived to explain the performance differences within model derivation approaches. Our results indicate that, on average, prediction models developed using penalization and variance decomposition approaches outperform models developed using ordinary maximum likelihood estimation, with penalization approaches being consistently superior over the variance decomposition approaches. Differences in performance were most pronounced on the calibration of the model. Performance differences regarding prediction error and concordance statistic outcomes were often small between approaches. The use of likelihood penalization and variance decomposition techniques was illustrated in the context of peripheral arterial disease.
    Short Title Comparison of likelihood penalization and variance decomposition approaches for clinical prediction models
    Date Added 5/19/2023, 5:39:06 PM
    Modified 5/19/2023, 5:40:48 PM

    Tags:

    • shrinkage
    • penalization
    • data-reduction
    • incomplete-principal-components-regression
    • unsupervised-learning
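
    Notes:

    • Side-by-side sketch of the two families compared in the study: elastic-net logistic regression versus incomplete principal component regression (PCA keeping a few components, then unpenalized logistic regression). Simulated data; tuning is fixed rather than optimized.

      from sklearn.datasets import make_classification
      from sklearn.decomposition import PCA
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline

      X, y = make_classification(n_samples=300, n_features=30, n_informative=5,
                                 random_state=0)
      enet = LogisticRegression(penalty="elasticnet", solver="saga",
                                l1_ratio=0.5, C=1.0, max_iter=5000)
      ipcr = make_pipeline(PCA(n_components=5), LogisticRegression(max_iter=5000))
      for name, model in [("elastic net", enet), ("incomplete PCR", ipcr)]:
          auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          print(f"{name}: CV AUC {auc:.3f}")
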
  • Cross-Validation: What Does It Estimate and How Well Does It Do It?

    Item Type Journal Article
    Author Stephen Bates
    Author Trevor Hastie
    Author Robert Tibshirani
    URL https://doi.org/10.1080/01621459.2023.2197686
    Volume 0
    Issue 0
    Pages 1-12
    Publication Journal of the American Statistical Association
    ISSN 0162-1459
    Date 2023-04-03
    Extra Publisher: Taylor & Francis _eprint: https://doi.org/10.1080/01621459.2023.2197686
    DOI 10.1080/01621459.2023.2197686
    Accessed 5/16/2023, 4:30:05 PM
    Library Catalog Taylor and Francis+NEJM
    Abstract Cross-validation is a widely used technique to estimate prediction error, but its behavior is complex and not fully understood. Ideally, one would like to think that cross-validation estimates the prediction error for the model at hand, fit to the training data. We prove that this is not the case for the linear model fit by ordinary least squares; rather it estimates the average prediction error of models fit on other unseen training sets drawn from the same population. We further show that this phenomenon occurs for most popular estimates of prediction error, including data splitting, bootstrapping, and Mallow’s Cp. Next, the standard confidence intervals for prediction error derived from cross-validation may have coverage far below the desired level. Because each data point is used for both training and testing, there are correlations among the measured accuracies for each fold, and so the usual estimate of variance is too small. We introduce a nested cross-validation scheme to estimate this variance more accurately, and show empirically that this modification leads to intervals with approximately correct coverage in many examples where traditional cross-validation intervals fail. Lastly, our analysis also shows that when producing confidence intervals for prediction accuracy with simple data splitting, one should not refit the model on the combined data, since this invalidates the confidence intervals. Supplementary materials for this article are available online.
    Short Title Cross-Validation
    Date Added 5/16/2023, 4:30:05 PM
    Modified 5/16/2023, 4:30:24 PM

    Tags:

    • validation
    • cross-validation
    • confidence-interval
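
    Notes:

    • Sketch of the naive interval the paper critiques: per-fold errors are correlated, so mean ± 1.96·sd/√K tends to be too narrow. The authors' nested-CV remedy is not reproduced here; data are simulated.

      import numpy as np
      from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import KFold

      rng = np.random.default_rng(12)
      X = rng.normal(size=(100, 10))
      y = X[:, 0] + rng.normal(0, 1, 100)

      fold_errs = []
      for tr, te in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
          pred = LinearRegression().fit(X[tr], y[tr]).predict(X[te])
          fold_errs.append(np.mean((pred - y[te]) ** 2))
      m = np.mean(fold_errs)
      se = np.std(fold_errs, ddof=1) / np.sqrt(len(fold_errs))
      print(f"naive 95% CI for test MSE: ({m - 1.96*se:.2f}, {m + 1.96*se:.2f})")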