  • Coherent Tests for Interval Null Hypotheses

    Type Journal Article
    Author Spencer Hansen
    Author Ken Rice
    URL https://doi.org/10.1080/00031305.2022.2050299
    Volume 0
    Issue 0
    Pages 1-9
    Publication The American Statistician
    ISSN 0003-1305
    Date 2022-03-08
    Extra Publisher: Taylor & Francis
    DOI 10.1080/00031305.2022.2050299
    Accessed 4/9/2022, 11:06:10 AM
    Library Catalog Taylor and Francis+NEJM
    Abstract In a celebrated 1996 article, Schervish showed that, for testing interval null hypotheses, tests typically viewed as optimal can be logically incoherent. Specifically, one may fail to reject a specific interval null, but nevertheless—testing at the same level with the same data—reject a larger null, in which the original one is nested. This result has been used to argue against the widespread practice of viewing p-values as measures of evidence. In the current work we approach tests of interval nulls using simple Bayesian decision theory, and establish straightforward conditions that ensure coherence in Schervish’s sense. From these, we go on to establish novel frequentist criteria—different to Type I error rate—that, when controlled at fixed levels, give tests that are coherent in Schervish’s sense. The results suggest that exploring frequentist properties beyond the familiar Neyman–Pearson framework may ameliorate some of statistical testing’s well-known problems.
    Date Added 4/9/2022, 11:06:10 AM
    Modified 4/9/2022, 11:07:21 AM

    Tags:

    • bayes
    • coherent
    • hypothesis-testing
    • inference
    • interval-null

    Notes:

    • New way of looking at Schervish's interval-null hypothesis-testing incoherence example; a numerical sketch follows below
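
    • A minimal numerical sketch of the incoherence (Python with NumPy/SciPy; the unbiased-test
      form is standard, but the observed value x = 2.18 and the interval [-0.82, 0.52] are
      recalled from memory of Schervish's 1996 example, so treat the exact numbers as
      approximate). The test of an interval null about a normal mean rejects when X falls
      outside cutoffs (c1, c2) chosen to give size alpha at both endpoints of the null; the
      p-value is the smallest alpha at which the observation is rejected:

      import numpy as np
      from scipy.stats import norm
      from scipy.optimize import fsolve, brentq

      def cutoffs(lo, hi, alpha):
          # Cutoffs for H0: theta in [lo, hi] with X ~ N(theta, 1): reject when
          # x < c1 or x > c2, with size alpha at both endpoints of the null.
          start = [lo - norm.isf(alpha / 2), hi + norm.isf(alpha / 2)]
          def eqs(c):
              c1, c2 = c
              return [norm.cdf(c1 - lo) + norm.sf(c2 - lo) - alpha,
                      norm.cdf(c1 - hi) + norm.sf(c2 - hi) - alpha]
          return fsolve(eqs, start)

      def p_value(x, lo, hi):
          # Smallest alpha whose rejection region contains x.
          def on_boundary(alpha):
              c1, c2 = cutoffs(lo, hi, alpha)
              return max(x - c2, c1 - x)  # > 0 exactly when x is rejected
          return brentq(on_boundary, 1e-8, 0.5)

      x = 2.18                        # observed value in Schervish's example
      print(2 * norm.sf(x - 0.5))     # ~0.093: point null theta = 0.5 not rejected
      print(p_value(x, -0.82, 0.52))  # ~0.0498: the LARGER nesting null is rejected

      At level 0.05 the point null theta = 0.5 survives while the wider interval null
      containing it is rejected, which is exactly the incoherence the paper addresses.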

  • Testing Precise Hypotheses

    Type Journal Article
    Author James O. Berger
    Author Mohan Delampady
    URL https://projecteuclid.org/journals/statistical-science/volume-2/issue-3/Testing-Precise-Hypotheses/10.1214/ss/1177013238.full
    Volume 2
    Issue 3
    Pages 317-335
    Publication Statistical Science
    ISSN 0883-4237, 2168-8745
    Date 1987-08
    Extra Publisher: Institute of Mathematical Statistics
    DOI 10.1214/ss/1177013238
    Accessed 4/6/2022, 7:37:31 AM
    Library Catalog Project Euclid
    Abstract Testing of precise (point or small interval) hypotheses is reviewed, with special emphasis placed on exploring the dramatic conflict between conditional measures (Bayes factors and posterior probabilities) and the classical P-value (or observed significance level). This conflict is highlighted by finding lower bounds on the conditional measures over wide classes of priors, in normal and binomial situations; the lower bounds are much larger than the P-value, and this leads to the recommendation of several alternatives to P-values. Results are also given concerning the validity of approximating an interval null by a point null. The overall discussion features critical examination of issues such as the possibility of objective testing and of testing from confidence sets.
    Date Added 4/6/2022, 7:37:31 AM
    Modified 4/6/2022, 7:38:03 AM

    Tags:

    • bayes
    • p-value

    Notes:

    • Quote from Section 4.6:

      Some statisticians argue that the implied logic concerning a small P-value is compelling: “Either H0 is true and a rare event has occurred, or H0 is false.” One could again argue against this reasoning as addressing the wrong question, but there is a more obvious major flaw: the “rare event” whose probability is being calculated under H0 is not the event of observing the actual data x0, but the event E = {possible data x: |T(x)| >= |T(x0)|}. The inclusion of all data “more extreme” than the actual x0 is a curious step, and one for which we have seen no remotely convincing justification. … the “logic of surprise” cannot differentiate between x0 and E …
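
    • The “lower bounds on the conditional measures” take a simple closed form in the normal
      point-null case: over all priors on the alternative, the marginal likelihood under H1 is
      at most the likelihood at the MLE, so the Bayes factor in favor of H0 satisfies
      B >= exp(-t^2/2) for observed z-statistic t, and with prior Pr(H0) = 1/2 the posterior
      probability is at least B/(1 + B). A quick check of the familiar numbers (Python/SciPy):

      import numpy as np
      from scipy.stats import norm

      def bayes_bounds(p):
          # Two-sided P-value p -> z-statistic t, then the all-priors lower bound
          # B = f(x | theta0) / sup_theta f(x | theta) = exp(-t^2 / 2), and the
          # matching bound on Pr(H0 | x) when the prior puts 1/2 on H0.
          t = norm.isf(p / 2)
          b = np.exp(-t**2 / 2)
          return b, b / (1 + b)

      for p in (0.05, 0.01):
          b, post = bayes_bounds(p)
          print(f"p = {p}: Bayes factor >= {b:.3f}, Pr(H0 | x) >= {post:.3f}")

      A P-value of 0.05 thus coexists with Pr(H0 | x) >= 0.128, the conflict the abstract
      highlights.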

  • Why optional stopping can be a problem for Bayesians

    Type Journal Article
    Author Rianne de Heide
    Author Peter D. Grünwald
    URL https://doi.org/10.3758/s13423-020-01803-x
    Volume 28
    Issue 3
    Pages 795-812
    Publication Psychonomic Bulletin & Review
    ISSN 1531-5320
    Date 2021-06-01
    Journal Abbr Psychon Bull Rev
    DOI 10.3758/s13423-020-01803-x
    Accessed 3/24/2022, 11:11:58 AM
    Library Catalog Springer Link
    Language en
    Abstract Recently, optional stopping has been a subject of debate in the Bayesian psychology community. Rouder (Psychonomic Bulletin & Review 21(2), 301–308, 2014) argues that optional stopping is no problem for Bayesians, and even recommends the use of optional stopping in practice, as do Wagenmakers, Wetzels, Borsboom, van der Maas, and Kievit (Perspectives on Psychological Science 7, 627–633, 2012). This article addresses the question of whether optional stopping is problematic for Bayesian methods, and specifies under which circumstances and in which sense it is and is not. By slightly varying and extending Rouder’s (Psychonomic Bulletin & Review 21(2), 301–308, 2014) experiments, we illustrate that, as soon as the parameters of interest are equipped with default or pragmatic priors—which means, in most practical applications of Bayes factor hypothesis testing—resilience to optional stopping can break down. We distinguish between three types of default priors, each having their own specific issues with optional stopping, ranging from no-problem-at-all (type 0 priors) to quite severe (type II priors).
    Date Added 3/24/2022, 11:12:03 AM
    Modified 3/24/2022, 11:12:56 AM

    Tags:

    • sequential-monitoring
    • sequential
    • bayes
    • prior
    • bayes-factor
    • stopping

    Notes:

    • Interesting taxonomy of priors
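
    • A minimal sketch of the optional-stopping setup the paper probes (my construction, with a
      simple proper N(0,1) prior on the mean standing in for the default priors discussed
      there): under H0, sample one observation at a time, monitor the Bayes factor, and stop as
      soon as it signals strong evidence for H1:

      import numpy as np

      rng = np.random.default_rng(0)

      def bf10(n, xbar, s0=1.0):
          # BF for H1: mu ~ N(0, s0^2) vs H0: mu = 0, with X_i ~ N(mu, 1); this is
          # the ratio of the two marginal densities of the sufficient statistic xbar.
          return np.sqrt(1 / (1 + n * s0**2)) * np.exp(
              n**2 * xbar**2 * s0**2 / (2 * (1 + n * s0**2)))

      def stops_early(n_max=1000, threshold=10.0):
          # One run under H0 (true mu = 0) with optional stopping.
          total = 0.0
          for n in range(1, n_max + 1):
              total += rng.normal()
              if bf10(n, total / n) >= threshold:
                  return True
          return False

      runs = 2000
      hits = sum(stops_early() for _ in range(runs))
      print(f"Pr(BF10 ever >= 10 under H0) ~ {hits / runs:.3f}")

      With a fixed proper prior the Bayes factor is a nonnegative martingale under H0, so the
      chance of ever crossing 10 stays below 1/10 whatever the stopping rule; the paper's point
      is that this calibration can break down for default or pragmatic priors.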

  • Effective sample size for tests of censored survival data

    Type Journal Article
    Author Jacqueline K. Benedetti
    Author Ping-Yu Liu
    Author Harland N. Sather
    Author Jack Seinfeld
    Author Michael A. Epton
    URL https://doi.org/10.1093/biomet/69.2.343
    Volume 69
    Issue 2
    Pages 343-349
    Publication Biometrika
    ISSN 0006-3444
    Date 1982-08-01
    Journal Abbr Biometrika
    DOI 10.1093/biomet/69.2.343
    Accessed 3/17/2022, 11:19:46 AM
    Library Catalog Silverchair
    Abstract When survival experience of two groups is compared in the presence of arbitrary right censoring, the effective sample size for determining the power of the test used is usually taken to be the number of uncensored observations. This convention is examined through a Monte Carlo study. Empirical powers of the generalized Savage test and generalized Wilcoxon test with uncensored data are compared to those with censored data containing approximately the same number of uncensored observations. Large sample relative efficiencies are calculated for a Lehmann family of alternatives. It is shown that, depending on the underlying distribution and censoring mechanism, censored observations can add appreciably to the power of either test.
    Date Added 3/17/2022, 11:19:46 AM
    Modified 3/17/2022, 11:20:03 AM

    Tags:

    • survival
    • effective-sample-size
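
    Notes:

    • The events-driven convention this paper examines is also what the standard
      Schoenfeld-type approximation encodes (a textbook formula, not taken from this paper):
      for a 1:1 randomized log-rank comparison, power is governed by the number of events,
      d = 4 (z_{1-alpha/2} + z_{1-beta})^2 / (log HR)^2. A quick sketch (Python/SciPy):

      import numpy as np
      from scipy.stats import norm

      def required_events(hr, alpha=0.05, power=0.80):
          # Number of EVENTS (not patients) for a 1:1 log-rank comparison.
          z = norm.isf(alpha / 2) + norm.isf(1 - power)
          return 4 * z**2 / np.log(hr)**2

      print(round(required_events(0.75)))  # ~380 events for HR = 0.75

      The paper's simulations qualify exactly this convention: depending on the underlying
      distribution and censoring mechanism, censored observations can add power too.
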
  • Integrating mortality and morbidity outcomes: using quality-adjusted life years in critical care trials

    Type Journal Article
    Author Niall D. Ferguson
    Author Damon C. Scales
    Author Ruxandra Pinto
    Author M. Elizabeth Wilcox
    Author Deborah J. Cook
    Author Gordon H. Guyatt
    Author Holger J. Schünemann
    Author John C. Marshall
    Author Margaret S. Herridge
    Author Maureen O. Meade
    Author Canadian Critical Care Trials Group
    Volume 187
    Issue 3
    Pages 256-261
    Publication American Journal of Respiratory and Critical Care Medicine
    ISSN 1535-4970
    Date 2013-02-01
    Extra PMID: 23204250
    Journal Abbr Am J Respir Crit Care Med
    DOI 10.1164/rccm.201206-1057OC
    Library Catalog PubMed
    Language eng
    Abstract RATIONALE: Outcome measures that integrate mortality and morbidity, like quality-adjusted life years (QALYs), have been proposed for critical care clinical trials. OBJECTIVES: We sought to describe the distribution of QALYs in critically ill patients and estimate sample size requirements for a hypothetical trial using QALYs as the primary outcome. METHODS: We used data from a prospective cohort study of survivors of acute respiratory distress syndrome to generate utility values and calculate QALYs at 6 and 12 months. Using multiple simulations, we estimated the required sample sizes for multiple outcome scenarios in a hypothetical trial, including a base-case wherein the intervention improved both mortality and QALYs among survivors. MEASUREMENTS AND MAIN RESULTS: From 195 enrolled patients, follow-up was sufficient to generate QALY outcomes for 168 (86.2%) at 6 months and 159 (81.5%) at 1 year. For a hypothetical intervention that reduced mortality from 48 to 44% and improved QALYs by 0.025 in survivors at 6 months, the required per-group sample size was 571 (80% power; two-sided α = 0.05), compared with 2,436 patients needed for a comparison focusing on mortality alone. When only mortality or QALY in survivors (but not both) showed improvement by these amounts, 3,426 and 1,827 patients per group were needed, respectively. When mortality and morbidity effects moved in opposite directions, simulation results became impossible to interpret. CONCLUSIONS: QALYs may be a feasible outcome in critical care trials yielding a patient-centered result and major gains in statistical power under certain conditions, but this approach is susceptible to several threats, including loss to follow-up.
    Short Title Integrating mortality and morbidity outcomes
    Date Added 3/14/2022, 12:45:35 PM
    Modified 3/14/2022, 12:46:57 PM

    Tags:

    • rct
    • qol
    • multiple-endpoints
    • critical-illness
    • utility
    • qaly

    Attachments

    • PubMed entry
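
    Notes:

    • The mortality-alone figure quoted in the abstract (2,436 per group for 48% vs 44%) is
      reproduced, up to rounding and the exact formula variant used, by the standard two-sample
      comparison of proportions; a quick check (Python/SciPy):

      from scipy.stats import norm

      def n_per_group(p1, p2, alpha=0.05, power=0.80):
          # Per-group n for a two-sided two-proportion z-test with 1:1 allocation.
          z = norm.isf(alpha / 2) + norm.isf(1 - power)
          return z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)**2

      print(round(n_per_group(0.48, 0.44)))  # ~2433, vs the ~2,436 quoted above
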
  • Causal Directed Acyclic Graphs

    Type Journal Article
    Author Ari M. Lipsky
    Author Sander Greenland
    URL https://doi.org/10.1001/jama.2022.1816
    Publication JAMA
    ISSN 0098-7484
    Date 2022-02-28
    Journal Abbr JAMA
    DOI 10.1001/jama.2022.1816
    Accessed 3/1/2022, 7:32:10 AM
    Library Catalog Silverchair
    Abstract The design and interpretation of clinical studies requires consideration of variables beyond the exposure or treatment of interest and patient outcomes, including decisions about which variables to capture and, of those, which to control for in statistical analyses to minimize bias in estimating treatment effects. Causal directed acyclic graphs (DAGs) are a useful tool for communicating researchers’ understanding of the potential interplay among variables and are commonly used for mediation analysis. Assumptions are presented visually in a causal DAG and, based on this visual representation, researchers can deduce which variables require control to minimize bias and which variables could introduce bias if controlled in the analysis.
    Date Added 3/1/2022, 8:55:47 AM
    Modified 3/1/2022, 8:57:34 AM

    Tags:

    • teaching-mds
    • causal-effects
    • causality
    • directed-graph
    • causal-analysis
    • dag
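
    Notes:

    • The abstract's warning that some variables "could introduce bias if controlled" is the
      collider problem; a minimal simulation (my toy example, Python/NumPy) of the DAG
      X -> C <- Y, in which X and Y are independent but become associated once the collider C
      is adjusted for:

      import numpy as np

      rng = np.random.default_rng(0)
      n = 100_000

      x = rng.normal(size=n)
      y = rng.normal(size=n)
      c = x + y + rng.normal(size=n)  # collider: common effect of X and Y

      def resid(v, given):
          # Residual of v after linear adjustment for the conditioning variable.
          slope, intercept = np.polyfit(given, v, 1)
          return v - (slope * given + intercept)

      print(np.corrcoef(x, y)[0, 1])                      # ~0: no marginal association
      print(np.corrcoef(resid(x, c), resid(y, c))[0, 1])  # ~-0.5: collider bias
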
  • Sample size formula for a win ratio endpoint

    Type Journal Article
    Author Ron Xiaolong Yu
    Author Jitendra Ganju
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.9297
    Volume n/a
    Issue n/a
    Publication Statistics in Medicine
    ISSN 1097-0258
    Date 2022
    DOI 10.1002/sim.9297
    Accessed 1/28/2022, 8:48:13 AM
    Library Catalog Wiley Online Library
    Language en
    Abstract The win ratio composite endpoint, which organizes the components of the composite hierarchically, is becoming popular in late-stage clinical trials. The method involves comparing data in a pair-wise manner starting with the endpoint highest in priority (eg, cardiovascular death). If the comparison is a tie, the endpoint next highest in priority (eg, hospitalizations for heart failure) is compared, and so on. Its sample size is usually calculated through complex simulations because there does not exist in the literature a simple sample size formula. This article provides a formula that depends on the probability that a randomly selected patient from one group does better than a randomly selected patient from another group, and on the probability of a tie. We compare the published 95% confidence intervals, which require patient-level data, with that calculated from the formula, requiring only summary-level data, for 17 composite or single win ratio endpoints. The two sets of results are similar. Simulations show the sample size formula performs well. The formula provides important insights. It shows when adding an endpoint to the hierarchy can increase power even if the added endpoint has low power by itself. It provides relevant information to modify an on-going blinded trial if necessary. The formula allows a non-specialist to quickly determine the size of the trial with a win ratio endpoint whose use is expected to increase over time.
    Date Added 3/1/2022, 8:55:59 AM
    Modified 3/1/2022, 8:56:50 AM

    Tags:

    • sample-size
    • multiple-endpoints
    • win-ratio
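
    Notes:

    • A minimal sketch (my construction; the paper's sample size formula itself is not
      reproduced here) of the hierarchical pairwise comparison the abstract describes, which
      also produces the two summary inputs the formula needs: the probability that a random
      treatment patient does better than a random control, and the probability of a tie:

      import numpy as np

      rng = np.random.default_rng(0)

      def win_ratio(trt, ctl):
          # Rows are patients; columns are endpoints in priority order, larger is
          # better. A tie on one endpoint falls through to the next in the hierarchy.
          wins = losses = ties = 0
          for a in trt:
              for b in ctl:
                  for k in range(a.size):
                      if a[k] > b[k]:
                          wins += 1
                          break
                      if a[k] < b[k]:
                          losses += 1
                          break
                  else:
                      ties += 1
          pairs = wins + losses + ties
          return wins / losses, wins / pairs, ties / pairs

      # Toy data: priority 1 = survival (binary, 1 is better), priority 2 = a
      # continuous score, so survival ties are broken by the score.
      trt = np.column_stack([rng.binomial(1, 0.60, 100), rng.normal(0.2, 1, 100)])
      ctl = np.column_stack([rng.binomial(1, 0.50, 100), rng.normal(0.0, 1, 100)])
      wr, p_win, p_tie = win_ratio(trt, ctl)
      print(f"win ratio {wr:.2f}, P(win) {p_win:.2f}, P(tie) {p_tie:.2f}")
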
  • Improving adaptive seamless designs through Bayesian optimization

    Type Journal Article
    Author Jakob Richter
    Author Tim Friede
    Author Jörg Rahnenführer
    URL https://onlinelibrary.wiley.com/doi/abs/10.1002/bimj.202000389
    Volume n/a
    Issue n/a
    Publication Biometrical Journal
    ISSN 1521-4036
    Date 2021
    DOI 10.1002/bimj.202000389
    Accessed 2/28/2022, 11:57:27 AM
    Library Catalog Wiley Online Library
    Language en
    Abstract We propose to use Bayesian optimization (BO) to improve the efficiency of the design selection process in clinical trials. BO is a method to optimize expensive black-box functions, by using a regression as a surrogate to guide the search. In clinical trials, planning test procedures and sample sizes is a crucial task. A common goal is to maximize the test power, given a set of treatments, corresponding effect sizes, and a total number of samples. From a wide range of possible designs, we aim to select the best one in a short time to allow quick decisions. The standard approach to simulate the power for each single design can become too time consuming. When the number of possible designs becomes very large, either large computational resources are required or an exhaustive exploration of all possible designs takes too long. Here, we propose to use BO to quickly find a clinical trial design with high power from a large number of candidate designs. We demonstrate the effectiveness of our approach by optimizing the power of adaptive seamless designs for different sets of treatment effect sizes. Comparing BO with an exhaustive evaluation of all candidate designs shows that BO finds competitive designs in a fraction of the time.
    Date Added 2/28/2022, 11:57:27 AM
    Modified 2/28/2022, 11:58:36 AM

    Tags:

    • bayes
    • adaptive-design
    • design
    • optimal-design
    • adaptive-clinical-trials
    • optimality
    • seamless-designs
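
    Notes:

    • A minimal sketch of the idea (my toy stand-in, not the paper's adaptive seamless
      setting): treat Monte Carlo power as the expensive black box and let scikit-optimize's
      gp_minimize (assumed available) search the design space, here just the allocation
      fraction of a two-arm trial with unequal variances, for which the known optimum (Neyman
      allocation) is SD1 / (SD1 + SD2) = 1/3:

      import numpy as np
      from scipy.stats import norm
      from skopt import gp_minimize

      rng = np.random.default_rng(0)
      N, DELTA, SD1, SD2 = 200, 0.5, 1.0, 2.0  # total n, true effect, group SDs

      def neg_power(params, n_sim=400):
          # Black box: simulated power of a two-sample z-test when a fraction f of
          # the N patients goes to group 1; the optimizer minimizes minus the power.
          f = params[0]
          n1 = max(2, int(round(f * N)))
          n2 = N - n1
          se = np.sqrt(SD1**2 / n1 + SD2**2 / n2)
          z = (rng.normal(DELTA, SD1 / np.sqrt(n1), n_sim)
               - rng.normal(0.0, SD2 / np.sqrt(n2), n_sim)) / se
          return -np.mean(np.abs(z) > norm.isf(0.025))

      res = gp_minimize(neg_power, [(0.1, 0.9)], n_calls=30, random_state=0)
      print(f"best allocation fraction {res.x[0]:.2f}, power {-res.fun:.2f}")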