Item Type | Journal Article |
---|---|
Author | Insha Ullah |
Author | Kerrie Mengersen |
Author | Anthony N. Pettitt |
Author | Benoit Liquet |
Abstract | High-dimensional datasets, where the number of variables 'p' is much larger than the number of samples 'n', are ubiquitous and often render standard classification techniques unreliable due to overfitting. An important research problem is feature selection, which ranks candidate variables based on their relevance to the outcome variable and retains those that satisfy a chosen criterion. This article proposes a computationally efficient variable selection method based on principal component analysis tailored to a binary classification problem or case-control study. This method is accessible and is suitable for the analysis of high-dimensional datasets. We demonstrate the superior performance of our method through extensive simulations. A semi-real gene expression dataset, a challenging childhood acute lymphoblastic leukemia gene expression study, and a GWAS that attempts to identify single-nucleotide polymorphisms (SNPs) associated with rice grain length further demonstrate the usefulness of our method in genomic applications. We expect our method to accurately identify important features and reduce the false discovery rate (FDR) by accounting for the correlation between variables and by de-noising data in the training phase, which also makes it robust to mild outliers in the training data. Our method is almost as fast as univariate filters, so it allows valid statistical inference. The ability to make such inferences sets this method apart from most current multivariate statistical tools designed for today's high-dimensional data. |
Date | 06/2025 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.70110 |
Accessed | 6/7/2025, 8:30:02 AM |
Volume | 44 |
Pages | e70110 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.70110 |
Issue | 13-14 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 6/7/2025, 8:30:02 AM |
Modified | 6/7/2025, 8:30:02 AM |
Seems to be reinventing sliced inverse regression without attribution
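As a rough illustration of the general idea only (not necessarily the authors' exact algorithm, which the abstract does not spell out): a PCA-based screen for a binary outcome can rank variables by their loadings on the components whose scores best separate the two classes. A minimal Python sketch, with all function names, thresholds, and the aggregation rule my own assumptions:

```python
import numpy as np
from scipy import stats

def pca_screen(X, y, n_components=10, n_keep=50):
    """Rank columns of X (n x p, p >> n) by loadings on class-separating PCs."""
    Xc = X - X.mean(axis=0)                            # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # PCA via SVD (cheap when p >> n)
    scores = U[:, :n_components] * s[:n_components]    # n x k component scores
    # two-sample t statistic of each component's scores between the classes
    t = np.array([stats.ttest_ind(scores[y == 1, j], scores[y == 0, j]).statistic
                  for j in range(n_components)])
    importance = np.abs(Vt[:n_components].T) @ np.abs(t)  # aggregate |loading| x |t|
    return np.argsort(importance)[::-1][:n_keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5000))
y = rng.integers(0, 2, size=80)
selected = pca_screen(X, y)          # indices of the 50 top-ranked variables
```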
Item Type | Book |
---|---|
Author | Trevor Hastie |
Author | Robert Tibshirani |
Author | Jerome H. Friedman |
Date | 2008 |
Extra | Citation Key: has08ele; ISBN-10: 0387848576; ISBN-13: 978-0387848570
Place | New York |
Publisher | Springer |
Edition | 2nd
Date Added | 7/7/2018, 8:38:33 PM |
Modified | 5/4/2025, 10:34:59 PM |
Item Type | Journal Article |
---|---|
Author | Markus Neuhäuser |
Author | Graeme D. Ruxton |
Abstract | Pearson's asymptotic χ² test is often used to compare binary data between two groups. However, when the sample sizes or expected frequencies are small, the test is usually replaced by Fisher's exact test. Several alternative rules of thumb exist for defining "small" in this context. Replacing one test with another based on the obtained data is unusual in statistical practice. Moreover, this commonly used switch is unnecessary because Pearson's χ² test can easily be carried out as an exact test for any sample sizes. Therefore, we recommend routinely using an exact test regardless of the obtained data. This change of approach allows a particular test to be prespecified, giving a much less ambiguous and more reliable analysis.
Date | 05/2025 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/pst.70012 |
Accessed | 3/30/2025, 3:06:57 PM |
Volume | 24 |
Pages | e70012 |
Publication | Pharmaceutical Statistics |
DOI | 10.1002/pst.70012 |
Issue | 3 |
Journal Abbr | Pharmaceutical Statistics |
ISSN | 1539-1604, 1539-1612 |
Date Added | 3/30/2025, 3:06:57 PM |
Modified | 3/30/2025, 3:07:24 PM |
Getting an exact test based on Pearson chi-square
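For a 2×2 table, the exact version can be carried out by conditioning on both margins, so the null distribution of one cell is hypergeometric, and summing the probabilities of all tables whose Pearson statistic is at least the observed one. A minimal sketch of this conditional version (the paper may also treat unconditional exact tests):

```python
import numpy as np
from scipy.stats import hypergeom

def exact_pearson_2x2(a, b, c, d):
    """Exact conditional p-value for the table [[a, b], [c, d]],
    ordering outcomes by Pearson's chi-square statistic."""
    n = a + b + c + d
    r1, c1 = a + b, a + c                         # margins are held fixed
    expected = np.outer([r1, n - r1], [c1, n - c1]) / n

    def pearson(x):                               # statistic when cell (1,1) equals x
        obs = np.array([[x, r1 - x], [c1 - x, n - r1 - c1 + x]])
        return ((obs - expected) ** 2 / expected).sum()

    support = np.arange(max(0, r1 + c1 - n), min(r1, c1) + 1)
    probs = hypergeom.pmf(support, n, c1, r1)     # null distribution of cell (1,1)
    t_obs = pearson(a)
    as_extreme = np.array([pearson(x) >= t_obs - 1e-12 for x in support])
    return float(probs[as_extreme].sum())

print(exact_pearson_2x2(3, 7, 8, 2))              # compare with Fisher's exact test
```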
Item Type | Journal Article |
---|---|
Author | Bryan E. Shepherd |
Author | Chun Li |
Author | Qi Liu |
Abstract | We describe a new residual for general regression models defined as r(y, F̂) = P(Y* < y) − P(Y* > y), where y is the observed outcome and Y* is a random variable from the fitted distribution F̂. This probability-scale residual (PSR) can be written as E{sign(y, Y*)}, whereas the popular observed-minus-expected residual can be thought of as E(y − Y*). Therefore the PSR is useful in settings where differences are not meaningful or where the expectation of the fitted distribution cannot be calculated. We present several desirable properties of the PSR that make it useful for diagnostics and measuring residual correlation, especially across different outcome types. We demonstrate its utility for continuous, ordered discrete, and censored outcomes, including current status data, and with various models including Cox regression, quantile regression, and ordinal cumulative probability models, for which fully specified distributions are not desirable or needed, and in some cases suitable residuals are not available. The residual is illustrated with simulated data and real data sets from HIV-infected patients on therapy in the southeastern United States and Latin America. The Canadian Journal of Statistics 44: 463–479; 2016 © 2016 Statistical Society of Canada
Date | 12/2016 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/cjs.11302 |
Accessed | 3/13/2025, 1:17:59 PM |
Rights | http://onlinelibrary.wiley.com/termsAndConditions#vor |
Volume | 44 |
Pages | 463-479 |
Publication | Canadian Journal of Statistics |
DOI | 10.1002/cjs.11302 |
Issue | 4 |
Journal Abbr | Can J Statistics |
ISSN | 0319-5724, 1708-945X |
Date Added | 3/13/2025, 1:17:59 PM |
Modified | 3/13/2025, 1:18:31 PM |
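For a continuous fitted distribution F̂ the PSR reduces to 2F̂(y) − 1, which is roughly Uniform(−1, 1) under a correctly specified model. A minimal sketch for an ordinary linear model (variable names are mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=1.5, size=200)

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS fit
sigma = (y - X @ beta).std(ddof=2)             # residual SD (2 fitted parameters)

# PSR = 2 * F_hat(y) - 1, lives in (-1, 1) on the probability scale
psr = 2 * norm.cdf(y, loc=X @ beta, scale=sigma) - 1
```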
Item Type | Journal Article |
---|---|
Author | Maximilian Scholz |
Author | Paul-Christian Bürkner |
Date | 2025-04-13 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://www.tandfonline.com/doi/full/10.1080/00949655.2024.2449534 |
Accessed | 5/22/2025, 9:02:00 PM |
Volume | 95 |
Pages | 1226-1249 |
Publication | Journal of Statistical Computation and Simulation |
DOI | 10.1080/00949655.2024.2449534 |
Issue | 6 |
Journal Abbr | Journal of Statistical Computation and Simulation |
ISSN | 0094-9655, 1563-5163 |
Date Added | 5/22/2025, 9:02:00 PM |
Modified | 5/22/2025, 9:02:57 PM |
Item Type | Journal Article |
---|---|
Author | Richard D. Riley
Author | Kym I.E. Snell
Author | Joie Ensor
Author | Danielle L. Burke
Author | Frank E. Harrell Jr.
Author | Karel G.M. Moons
Author | Gary S. Collins
Abstract | When designing a study to develop a new prediction model with binary or time-to-event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥ 0.9, (ii) small absolute difference of ≤ 0.05 in the model's apparent and adjusted Nagelkerke's R², and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox–Snell R², which we show can be obtained from previous studies. The values of n and E that meet all three criteria provide the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (e.g., 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.
Date | 2019-03-30 |
Language | en |
Short Title | Minimum sample size for developing a multivariable prediction model |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.7992 |
Accessed | 5/4/2025, 12:31:03 AM |
Volume | 38 |
Pages | 1276-1296 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.7992 |
Issue | 7 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 5/4/2025, 12:31:03 AM |
Modified | 5/4/2025, 12:32:43 AM |
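Criterion (i) for binary outcomes corresponds, as I read the paper, to the closed-form minimum n below; the authors' R package pmsampsize implements the full three-criterion calculation, so treat this as a hedged sketch rather than the reference implementation:

```python
import math

def n_criterion1_binary(p, r2_cs, S=0.9):
    """Minimum n so that the expected global shrinkage factor is at least S,
    given p predictor parameters and an anticipated Cox-Snell R^2."""
    return math.ceil(p / ((S - 1) * math.log(1 - r2_cs / S)))

# e.g. 24 candidate predictor parameters, anticipated Cox-Snell R^2 = 0.12
print(n_criterion1_binary(24, 0.12))
```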
Item Type | Journal Article |
---|---|
Author | Richard D. Riley |
Author | Kym I.E. Snell |
Author | Joie Ensor |
Author | Danielle L. Burke |
Author | Frank E. Harrell |
Author | Karel G.M. Moons |
Author | Gary S. Collins |
Abstract | In the medical literature, hundreds of prediction models are being developed to predict health outcomes in individuals. For continuous outcomes, typically a linear regression model is developed to predict an individual's outcome value conditional on values of multiple predictors (covariates). To improve model development and reduce the potential for overfitting, a suitable sample size is required in terms of the number of subjects (n) relative to the number of predictor parameters (p) for potential inclusion. We propose that the minimum value of n should meet the following four key criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥ 0.9; (ii) small absolute difference of ≤ 0.05 in the apparent and adjusted R²; (iii) precise estimation (a margin of error ≤ 10% of the true value) of the model's residual standard deviation; and similarly, (iv) precise estimation of the mean predicted outcome value (model intercept). The criteria require prespecification of the user's chosen p and the model's anticipated R² as informed by previous studies. The value of n that meets all four criteria provides the minimum sample size required for model development. In an applied example, a new model to predict lung function in African-American women using 25 predictor parameters requires at least 918 subjects to meet all criteria, corresponding to at least 36.7 subjects per predictor parameter. Even larger sample sizes may be needed to additionally ensure precise estimates of key predictor effects, especially when important categorical predictors have low prevalence in certain categories.
Date | 2019-03-30 |
Language | en |
Short Title | Minimum sample size for developing a multivariable prediction model |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.7993 |
Accessed | 5/4/2025, 12:29:50 AM |
Volume | 38 |
Pages | 1262-1275 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.7993 |
Issue | 7 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 5/4/2025, 12:29:50 AM |
Modified | 5/4/2025, 12:33:02 AM |
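For the continuous case, criterion (i) appears to rest on the Van Houwelingen heuristic shrinkage estimate; setting S ≥ 0.9 and solving for n gives

$$
S_{\text{VH}} \;=\; 1 - \frac{p}{n - p - 1}\cdot\frac{1 - R^2_{\text{adj}}}{R^2_{\text{adj}}} \;\ge\; 0.9
\quad\Longrightarrow\quad
n \;\ge\; p + 1 + \frac{p\,\bigl(1 - R^2_{\text{adj}}\bigr)}{0.1\, R^2_{\text{adj}}}.
$$

Plugging in p = 25 and an anticipated R²_adj of about 0.22 gives n ≈ 918, matching the applied example quoted in the abstract, which supports this reading.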
Item Type | Journal Article |
---|---|
Author | Liangcai Zhang |
Author | George Capuano |
Author | Vladimir Dragalin |
Author | John Jezorwski |
Author | Kim Hung Lo |
Author | Fei Chen |
Date | 2025-04-20 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://www.tandfonline.com/doi/full/10.1080/10543406.2025.2489280 |
Accessed | 4/25/2025, 12:57:16 AM |
Pages | 1-15 |
Publication | Journal of Biopharmaceutical Statistics |
DOI | 10.1080/10543406.2025.2489280 |
Journal Abbr | Journal of Biopharmaceutical Statistics |
ISSN | 1054-3406, 1520-5711 |
Date Added | 4/25/2025, 12:57:16 AM |
Modified | 4/25/2025, 12:57:44 AM |
Item Type | Journal Article |
---|---|
Author | Manuel Galea |
Author | Mónica Catalán |
Author | Alejandra Tapia |
Author | Viviana Giampaoli |
Author | Víctor Leiva |
Abstract | Binary regression models utilizing logit or probit link functions have been extensively employed for examining the relationship between binary responses and covariates, particularly in medicine. Nonetheless, an erroneous specification of the link function may result in poor model fitting and compromise the statistical significance of covariate effects. In this study, we introduce a diagnostic method associated with a novel family of link functions enabling the assessment of sensitivity for symmetric links in relation to their asymmetric counterparts. This new family offers a comprehensive model encompassing nested symmetric cases. Our method proves beneficial in modeling medical data, especially when evaluating the sensitivity of the commonly used logit link function, prized for its interpretability via the odds ratio. Moreover, our method advocates a general link based on the logit function when a standard link is unsatisfactory. We employ likelihood-based methods to estimate parameters of the general model and conduct local influence analysis under the case-weight perturbation scheme. Regarding local influence, we emphasize the relevance of employing appropriate perturbations to avoid misleading outcomes. Additionally, we introduce a diagnostic method for local influence, assessing the sensitivity of the odds ratio under two perturbation schemes. Monte Carlo simulations are conducted to evaluate both the diagnostic method performance and parameter estimation of the general model, supplemented by illustrations using medical data related to menstruation and respiratory problems. The results confirm the efficacy of our proposal, highlighting the critical role of statistical diagnostics in modeling.
Date | 06/2025 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.70073 |
Accessed | 6/7/2025, 8:00:38 AM |
Volume | 44 |
Pages | e70073 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.70073 |
Issue | 13-14 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 6/7/2025, 8:00:38 AM |
Modified | 6/7/2025, 8:01:04 AM |
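The abstract does not spell out the new link family. For concreteness, a classical one-parameter asymmetric family of this general type is the Aranda-Ordaz family (not necessarily the authors' proposal), which nests the logit at λ = 1 and tends to the complementary log-log link as λ → 0:

$$
F_\lambda(\eta) \;=\; 1 - \bigl(1 + \lambda e^{\eta}\bigr)^{-1/\lambda}, \qquad \lambda > 0.
$$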
Item Type | Journal Article |
---|---|
Author | Jan Beyersmann |
Author | Claudia Schmoor |
Author | Martin Schumacher |
Abstract | Censoring makes time-to-event data special and requires customized statistical techniques. Survival and event history analysis therefore builds on hazards as the identifiable quantities in the presence of rather general censoring schemes. The reason is that hazards are conditional quantities, given previous survival, which enables estimation based on the current risk set—those still alive and under observation. But it is precisely their conditional nature that has made hazards subject of critique from a causal perspective: A beneficial treatment will help patients survive longer than had they remained untreated. Hence, in a randomized trial, randomization is broken in later risk sets, which, however, are the basis for statistical inference. We survey this dilemma—after all, mapping analyses of hazards onto probabilities in randomized trials is viewed as still having a causal interpretation—and argue that a causal interpretation is possible taking a functional point of view. We illustrate matters with examples from benefit–risk assessment: Prolonged survival may lead to more adverse events, but this need not imply a worse safety profile of the novel treatment. These examples illustrate that the situation at hand is conveniently parameterized using hazards, that the need to use survival techniques is not always fully appreciated, and that censoring does not necessarily lead to the question of "what, if no censoring?" The discussion should concentrate on how to correctly interpret causal hazard contrasts, and analyses of hazards should routinely be translated onto probabilities.
Date | 06/2025 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/bimj.70057 |
Accessed | 6/7/2025, 7:55:03 AM |
Volume | 67 |
Pages | e70057 |
Publication | Biometrical Journal |
DOI | 10.1002/bimj.70057 |
Issue | 3 |
Journal Abbr | Biometrical J |
ISSN | 0323-3847, 1521-4036 |
Date Added | 6/7/2025, 7:55:03 AM |
Modified | 6/7/2025, 7:55:57 AM |
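The translation from hazards onto probabilities that the authors call for is the usual one: with all-cause hazard h, the survival function is

$$
S(t) \;=\; \exp\!\Bigl(-\int_0^t h(u)\,du\Bigr),
$$

so an arm-wise contrast of S(t) (or of cumulative incidence functions in the competing-risks case) retains the randomization-based interpretation that hazard ratios at later risk sets lose.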
The authors wrote a SAS macro for restricted cubic splines even though such a macro has existed since 1984. They would have obtained more useful results had they used simulation, so the true regression shape would be known. They measure the agreement of two estimated curves by computing the area between them, standardized by the average of the areas under the two curves. Penalized splines and restricted cubic splines were closer to each other than either was to fractional polynomials.
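The agreement measure in the note above is straightforward to compute on a common evaluation grid; a minimal sketch (with the trapezoid rule written out explicitly, and assuming nonnegative curves as in the standardization described):

```python
import numpy as np

def trapezoid(y, x):                               # avoids numpy-version differences
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def curve_agreement(x, f1, f2):
    """Area between two curves, standardized by the mean of their areas."""
    between = trapezoid(np.abs(f1 - f2), x)
    avg_area = 0.5 * (trapezoid(np.abs(f1), x) + trapezoid(np.abs(f2), x))
    return between / avg_area                      # 0 means identical curves

x = np.linspace(0, 1, 101)
print(curve_agreement(x, np.sin(np.pi * x), np.sin(np.pi * x) ** 1.2))
```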
Item Type | Book |
---|---|
Author | Ewout W. Steyerberg |
Date | 2019 |
Place | New York |
Publisher | Springer |
ISBN | 3-030-16398-9 |
Edition | 2nd |
Date Added | 7/7/2018, 8:38:33 PM |
Modified | 5/3/2025, 11:30:51 PM |
Item Type | Journal Article |
---|---|
Author | Pedro Miranda Afonso |
Author | Dimitris Rizopoulos |
Author | Anushka K. Palipana |
Author | Emrah Gecili |
Author | Cole Brokamp |
Author | John P. Clancy |
Author | Rhonda D. Szczesniak |
Author | Eleni‐Rosalina Andrinopoulou |
Abstract | Joint models for longitudinal and survival data have become a popular framework for studying the association between repeatedly measured biomarkers and clinical events. Nevertheless, addressing complex survival data structures, especially handling both recurrent and competing event times within a single model, remains a challenge. This causes important information to be disregarded. Moreover, existing frameworks rely on a Gaussian distribution for continuous markers, which may be unsuitable for bounded biomarkers, resulting in biased estimates of associations. To address these limitations, we propose a Bayesian shared-parameter joint model that simultaneously accommodates multiple (possibly bounded) longitudinal markers, a recurrent event process, and competing risks. We use the beta distribution to model responses bounded within any interval without sacrificing the interpretability of the association. The model offers various forms of association, discontinuous risk intervals, and both gap and calendar timescales. A simulation study shows that it outperforms simpler joint models. We utilize the US Cystic Fibrosis Foundation Patient Registry to study the associations between changes in lung function and body mass index, and the risk of recurrent pulmonary exacerbations, while accounting for the competing risks of death and lung transplantation. Our efficient implementation allows fast fitting of the model despite its complexity and the large sample size from this patient registry. Our comprehensive approach provides new insights into cystic fibrosis disease progression by quantifying the relationship between the most important clinical markers and events more precisely than has been possible before. The model implementation is available in the R package JMbayes2.
Date | 04/2025 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.70057 |
Accessed | 4/29/2025, 2:49:11 PM |
Volume | 44 |
Pages | e70057 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.70057 |
Issue | 8-9 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 4/29/2025, 2:49:11 PM |
Modified | 4/29/2025, 2:50:05 PM |
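For the bounded-marker piece, a common mean-precision parameterization of the beta distribution (the paper's exact parameterization may differ) keeps the regression coefficients interpretable on the mean scale:

$$
y \mid \mu, \phi \;\sim\; \mathrm{Beta}\bigl(\mu\phi,\,(1-\mu)\phi\bigr),
\qquad \operatorname{logit}(\mu) = \mathbf{x}^{\top}\boldsymbol{\beta},
\qquad \mathbb{E}[y] = \mu,\;\; \operatorname{Var}(y) = \frac{\mu(1-\mu)}{1+\phi}.
$$

A marker bounded on an arbitrary interval (a, b) is first rescaled to (0, 1).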
Item Type | Journal Article |
---|---|
Author | Florian Klinglmüller |
Author | Tobias Fellinger |
Author | Franz König |
Author | Tim Friede |
Author | Andrew C. Hooker |
Author | Harald Heinzl |
Author | Martina Mittlböck |
Author | Jonas Brugger |
Author | Maximilian Bardo |
Author | Cynthia Huber |
Author | Norbert Benda |
Author | Martin Posch |
Author | Robin Ristl |
Abstract | While well-established methods for time-to-event data are available when the proportional hazards assumption holds, there is no consensus on the best inferential approach under non-proportional hazards (NPH). However, a wide range of parametric and non-parametric methods for testing and estimation in this scenario have been proposed. To provide recommendations on the statistical analysis of clinical trials where non-proportional hazards are expected, we conducted a simulation study under different scenarios of non-proportional hazards, including delayed onset of treatment effect, crossing hazard curves, subgroups with different treatment effects, and changing hazards after disease progression. We assessed type I error rate control, power, and confidence interval coverage, where applicable, for a wide range of methods, including weighted log-rank tests, the MaxCombo test, summary measures such as the restricted mean survival time (RMST), average hazard ratios, and milestone survival probabilities, as well as accelerated failure time regression models. We found a trade-off between interpretability and power when choosing an analysis strategy under NPH scenarios. While analysis methods based on weighted log-rank tests typically were favorable in terms of power, they do not provide an easily interpretable treatment effect estimate. Also, depending on the weight function, they test a narrow null hypothesis of equal hazard functions, and rejection of this null hypothesis may not allow for a direct conclusion of treatment benefit in terms of the survival function. In contrast, non-parametric procedures based on well-interpretable measures like the RMST difference had lower power in most scenarios. Model-based methods based on specific survival distributions had larger power; however, they often gave biased estimates and lower than nominal confidence interval coverage. The application of the studied methods is illustrated in a case study with reconstructed data from a phase III oncologic trial.
Date | 2025-02-28 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://onlinelibrary.wiley.com/doi/10.1002/sim.70019 |
Accessed | 3/6/2025, 9:09:37 PM |
Volume | 44 |
Pages | e70019 |
Publication | Statistics in Medicine |
DOI | 10.1002/sim.70019 |
Issue | 5 |
Journal Abbr | Statistics in Medicine |
ISSN | 0277-6715, 1097-0258 |
Date Added | 3/6/2025, 9:09:37 PM |
Modified | 3/6/2025, 9:10:59 PM |
Differences in nonparametric RMST had low power; parametric estimates had greater power but require their assumptions to hold.
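The nonparametric RMST estimand discussed here is just the area under the Kaplan-Meier curve up to a horizon τ; the RMST difference between arms is the corresponding treatment-effect summary. A minimal sketch (variable names mine):

```python
import numpy as np

def km_rmst(time, event, tau):
    """Kaplan-Meier RMST over [0, tau] for right-censored data
    (event = 1 for an observed event, 0 for censoring)."""
    time, event = np.asarray(time, float), np.asarray(event, int)
    order = np.lexsort((1 - event, time))       # events before censorings at ties
    time, event = time[order], event[order]
    n = len(time)
    s, t_prev, area = 1.0, 0.0, 0.0
    for i in range(n):
        if time[i] >= tau:
            break
        area += s * (time[i] - t_prev)          # rectangle under the current step
        t_prev = time[i]
        if event[i]:
            s *= 1.0 - 1.0 / (n - i)            # KM factor; n - i subjects at risk
    return area + s * (tau - t_prev)            # last piece up to the horizon

# RMST difference between arms: km_rmst(t1, e1, tau) - km_rmst(t0, e0, tau)
```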