Annotated References

Adcock, C. J. (1997). Sample size determination: A review. The Statistician, 46, 261–283.

Akazawa, K., Nakamura, T., & Palesch, Y. (1997). Power of logrank test and Cox regression model in clinical trials with heterogeneous samples. Stat Med, 16, 583–597.

Aliferis, C. F., Statnikov, A., Tsamardinos, I., Schildcrout, J. S., Shepherd, B. E., & Harrell, F. E. (2009). Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS ONE, 4(3).

refutation of mic05pre

Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ, 311, 485.

Ambler, G., Brady, A. R., & Royston, P. (2002). Simplifying a prognostic model: A simulation study based on clinical data. Stat Med, 21(24), 3803–3822. https://doi.org/10.1002/sim.1422

ordinary backward stepdown worked well when there was a large fraction of truly irrelevant predictors

Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10), 6562–6566. https://doi.org/10.1073/pnas.102102699

Relied on an improper accuracy score (proportion classified correct) so had to use the .632 bootstrap unnecessarily

Andersen, P. K., Klein, J. P., & Zhang, M.-J. (1999). Testing for centre effects in multi-centre survival studies: A monte carlo comparison of fixed and random effects tests. Stat Med, 18, 1489–1500.

Andersson, P. G. (2023). The Wald Confidence Interval for a Binomial p as an Illuminating “Bad” Example. The American Statistician, 0(0), 1–6. https://doi.org/10.1080/00031305.2023.2183257

Best, W. R., Becktel, J. M., Singleton, J. W., & Kern, F. (1976). Development of a Crohn’s disease activity index. Gastroent, 70, 439–444.

development of CDAI

Bland, J. M., & Altman, D. G. (2011). Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials, 12(1), 264. https://doi.org/10.1186/1745-6215-12-264

Bordley, R. (2007). Statistical decisionmaking without math. Chance, 20(3), 39–44.

Brazer, S. R., Pancotto, F. S., Long III, T. T., Harrell, F. E., Lee, K. L., Tyor, M. P., & Pryor, D. B. (1991). Using ordinal logistic regression to estimate the likelihood of colorectal neoplasia. J Clin Epi, 44, 1263–1270.

Briggs, W. M., & Zaretzki, R. (2008). The skill plot: A graphical technique for evaluating continuous diagnostic tests (with discussion). Biometrics, 64, 250–261.

"statistics such as the AUC are not especially relevant to someone who must make a decision about a particular x_c. ... ROC curves lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio receivers (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some x_c, and is not especially interested in how well he would have done had he used some different cutoff."; in the discussion David Hand states "when integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassifications."; see Lin, Kvam, Lu Stat in Med 28:798-813;2009

Califf, R. M., Harrell, F. E., Lee, K. L., Rankin, J. S., & Others. (1989). The evolution of medical and surgical therapy for coronary artery disease. JAMA, 261, 2077–2086.

Califf, R. M., McKinnis, R. A., Burks, J., Lee, K. L., Harrell, F. E., Behar, V. S., Pryor, D. B., Wagner, G. S., & Rosati, R. A. (1982). Prognostic implications of ventricular arrhythmias during 24 hour ambulatory monitoring in patients undergoing cardiac catheterization for coronary artery disease. Am J Card, 50, 23–31.

Chang, M. (2016). Principles of Scientific Methods. Chapman and Hall/CRC. https://doi.org/10.1201/b17167

Chen, Q., Nian, H., Zhu, Y., Talbot, H. K., Griffin, M. R., & Harrell, F. E. (2016). Too many covariates and too few cases? - a comparative study. Stat Med, 35(25), 4546–4558. https://doi.org/10.1002/sim.7021

Choi, L., Blume, J. D., & Dupont, W. D. (2015). Elucidating the Foundations of Statistical Inference with 2 x 2 Tables. PLoS ONE, 10(4), e0121263. https://doi.org/10.1371/journal.pone.0121263

Chotai, S., Devin, C. J., Archer, K. R., Bydon, M., McGirt, M. J., Nian, H., Harrell, F. E., Dittus, R. S., Asher, A. L., & QOD Vanguard Sites. (2017). Effect of patients’ functional status on satisfaction with outcomes 12 months after elective spine surgery for lumbar degenerative disease. Spine J, 17(12), 1783–1793. https://doi.org/10.1016/j.spinee.2017.05.027

Cleveland, W. S. (1984). Graphs in scientific publications. Am Statistician, 38, 261–269.

Cleveland, W. S. (1994). The Elements of Graphing Data. Hobart Press.

Committee for Proprietary Medicinal Products. (2004). Points to consider on adjustment for baseline covariates. Stat Med, 23, 701–709.

Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations in the design and analysis of clinical trials. J Roy Stat Soc A, 159, 93–110.

argues that if results are intended to be interpreted marginally, there may be no need for controlling experimentwise error rate. FH phrasing: Cook and Farewell point out that when a strong priority order is pre-specified for separate clinical questions, and that same order is also the reporting order (no cherry picking), there is no need for multiplicity adjustment. This is in contrast with a study whose aim is to find an endpoint or find a patient subgroup that is benefited by treatment, a situation requiring conservative multiplicity adjustment.

Davis, C. S. (2002). Statistical Methods for the Analysis of Repeated Measurements. Springer.

Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. L. (2002). Analysis of Longitudinal Data (second). Oxford University Press.

Dupont, W. D. (2008). Statistical Modeling for Biomedical Researchers (second). Cambridge University Press.

Edwards, D. (1999). On model pre-specification in confirmatory randomized studies. Stat Med, 18, 771–785.

Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Sci Am, 236(5), 119–127.

Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of dichotomization. Pharm Stat, 8, 50–61. https://doi.org/10.1002/pst.331

optimal cutpoint depends on unknown parameters;should only entertain dichotomization when "estimating a value of the cumulative distribution and when the assumed model is very different from the true model";nice graphics

Ford, I., Norrie, J., & Ahmadi, S. (1995). Model inconsistency, illustrated by the Cox proportional hazards model. Stat Med, 14, 735–746.

Friedman, J. H. (1984). A variable span smoother (Technical Report No. 5). Laboratory for Computational Statistics, Department of Statistics, Stanford University.

Gail, M. H. (1986). Adjusting for covariates that have the same distribution in exposed and unexposed cohorts. In S. H. Moolgavkar & R. L. Prentice (Eds.), Modern Statistical Methods in Chronic Disease Epidemiology (pp. 3–18). Wiley.

unadjusted test can have larger type I error than nominal

Gail, M. H., Wieand, S., & Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika, 71, 431–444.

bias if omitted covariables and model is nonlinear

Gamalo-Siebers, M., Savic, J., Basu, C., Zhao, X., Gopalakrishnan, M., Gao, A., Song, G., Baygani, S., Thompson, L., Xia, H. A., Price, K., Tiwari, R., & Carlin, B. P. (2017). Statistical modeling for Bayesian extrapolation of adult clinical trial information in pediatric drug evaluation. Pharm Stat. https://doi.org/10.1002/pst.1807

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.

Giannoni, A., Baruah, R., Leong, T., Rehman, M. B., Pastormerlo, L. E., Harrell, F. E., Coats, A. J., & Francis, D. P. (2014). Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. PLoS ONE, 9(1). https://doi.org/10.1371/journal.pone.0081699

Glass, D. J. (2014). Experimental Design for Biologists (2nd ed.). Cold Spring Harbor Laboratory Press.

Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc, 102, 359–378.

wonderful review article except missing references from Scandinavian and German medical decision making literature

Govers, T. M., Rovers, M. M., Brands, M. T., Dronkers, E. A. C., Jong, R. J. B. de, Merkx, M. A. W., Takes, R. P., & Grutters, J. P. C. (2018). Integrated prediction and decision models are valuable in informing personalized decision making. Journal of Clinical Epidemiology, 0(0). https://doi.org/10.1016/j.jclinepi.2018.08.016

Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56, 915–921. https://doi.org/10.1111/j.0006-341X.2000.00915.x

use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important confounders uncontrolled;composition matrix;example with far too many significant predictors with many regression coefficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage instead of variable selection;larger effect on confidence interval width than on point estimates with variable selection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of variance is pointless;instead the analysis should be repeated using different values;"if one feels compelled to estimate $\tau^{2}$, I would recommend giving it a proper prior concentrated around contextually reasonable values";claim about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained;shrinkage towards compositional model;"models need to be complex to capture uncertainty about the relations...an honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an elephant (see Draper, 1995)’". See also gus06per.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epi, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

Best article on misinterpretation of p-values. Pithy summaries.

Hackam, D. G., & Redelmeier, D. A. (2006). Translation of research evidence from animals to humans. JAMA, 296, 1731–1732.

review of basic science literature that documents systemic methodologic shortcomings. In a personal communication on 20Oct06 the authors reported that they found a few more biostatistical problems that could not make it into the JAMA article (for space constraints);none of the articles contained a sample size calculation;none of the articles identified a primary outcome measure;none of the articles mentioned whether they tested assumptions or did distributional testing (though a few used non-parametric tests);most articles had more than 30 endpoints (but few adjusted for multiplicity, as noted in the article)

Harrell, F. E. (2020a). Hmisc: A package of miscellaneous R functions. https://hbiostat.org/R/Hmisc

Harrell, F. E. (2020b). rms: R functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. https://hbiostat.org/R/rms

Harrell, F. E., Margolis, P. A., Gove, S., Mason, K. E., Mulholland, E. K., Lehmann, D., Muhe, L., Gatchalian, S., & Eichenwald, H. F. (1998). Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat Med, 17, 909–944. http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract

Hauck, W. W., Anderson, S., & Marcus, S. M. (1998). Should we adjust for covariates in nonlinear regression analyses of randomized trials? Controlled Clin Trials, 19, 249–256. https://doi.org/10.1016/S0197-2456(97)00147-5

"For use in a clinician-patient context, there is only a single person, that patient, of interest. The subject-specific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population-averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates."

Herschtal, A. (2023). The effect of dichotomization of skewed adjustment covariates in the analysis of clinical trials. BMC Medical Research Methodology, 23(1), 60. https://doi.org/10.1186/s12874-023-01878-9

Hlatky, M. A., Greenland, P., Arnett, D. K., Ballantyne, C. M., Criqui, M. H., Elkind, M. S., Go, A. S., Harrell, F. E., Hong, Y., Howard, B. V., Howard, V. J., Hsue, P. Y., Kramer, C. M., McConnell, J. P., Normand, S. L., O’Donnell, C. J., Smith, S. C., & Wilson, P. W. (2009). Criteria for evaluation of novel markers of cardiovascular risk: A scientific statement from the American Heart Association. Circ, 119(17), 2408–2416.

graph with different symbols for diseased and non-diseased

Hlatky, M. A., Pryor, D. B., Harrell, F. E., Califf, R. M., Mark, D. B., & Rosati, R. A. (1984). Factors affecting the sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med, 77, 64–71. http://www.sciencedirect.com/science/article/pii/0002934384904376

Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Ann Math Stat, 19, 293–325.

Partially reprinted in: Kotz, S., Johnson, N.L. (1992) Breakthroughs in Statistics, Vol I, pp 308-334. Springer-Verlag. ISBN 0-387-94037-5

Holländer, N., Sauerbrei, W., & Schumacher, M. (2004). Confidence intervals for the effect of a prognostic factor after selection of an “optimal” cutpoint. Stat Med, 23, 1701–1713. https://doi.org/10.1002/sim.1611

true type I error can be much greater than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;"It should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the “optimal” cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub

Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D. G., & Newman, T. B. (2013). Designing Clinical Research (Fourth edition). LWW.

CAST Investigators. (1989). Preliminary report: Effect of Encainide and Flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. NEJM, 321(6), 406–412.

Ioannidis, J. P. A., & Lau, J. (1997). The impact of high-risk patients on the results of clinical trials. J Clin Epi, 50, 1089–1098. https://doi.org/10.1016/S0895-4356(97)00149-2

high risk patients can dominate clinical trials results;high risk patients may be imbalanced even if overall study is balanced;magnesium;differential treatment effect by patient risk;GUSTO;small vs. large trials vs. meta-analysis

Ioannidis, J. P. A. (2010). Expectations, validity, and reality in omics. J Clin Epi, 63, 945–949.

"Each new field has a rapid exponential growth of its literature over 5–8 years (“new field phase”), followed by an “established field” phase when growth rates are more modest, and then an “over-maturity” phase, where the rates of growth are similar to the growth of the scientific literature at large or even smaller. There is a parallel in the spread of an infectious epidemic that emerges rapidly and gets established when a large number of scientists (and articles) are infected with these concepts. Then momentum decreases, although many scientists remain infected and continue to work on this field. New omics infections continuously arise in the scientific community.";"A large number of personal genomic tests are already sold in the market, mostly with direct to consumer advertisement and for “recreational genomics” purposes (translate: information for the fun of information)."

Kent, D. M., & Hayward, R. (2007). Limitations of applying summary results of clinical trials to individual patients. JAMA, 298, 1209–1212. https://doi.org/10.1001/jama.298.10.1209

variation in absolute risk reduction in RCTs;failure of subgroup analysis;covariable adjustment;covariate adjustment;nice summary of individual patient absolute benefit vs. patient risk

Knaus, W. A., Harrell, F. E., Fisher, C. J., Wagner, D. P., Opal, S. M., Sadoff, J. C., Draper, E. A., Walawander, C. A., Conboy, K., & Grasela, T. H. (1993). The clinical evaluation of new drugs for sepsis: A prospective study design based on survival analysis. JAMA, 270, 1233–1241. https://doi.org/10.1001/jama.270.10.1233

Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.

Kornbrot, D. E. (1990). The rank difference test: A new and meaningful alternative to the Wilcoxon signed ranks test for ordinal data. British Journal of Mathematical and Statistical Psychology, 43(2), 241–264. https://doi.org/10.1111/j.2044-8317.1990.tb00939.x

Kotz, S., & Johnson, N. L. (Eds.). (1988). Encyclopedia of Statistical Sciences (Vol. 9). Wiley.

Krouwer, J. S. (2008). Why Bland-Altman plots should use X, not (Y+X)/2 when X is a reference method. Stat Med, 27(5), 778–780. https://doi.org/10.1002/sim.3086

Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347(6228), 1314–1315. https://doi.org/10.1126/science.aaa6146

Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. Am Statistician, 55, 187–193. https://doi.org/10.1198/000313001317098149

problems with Cohen’s method

MacKay, R. J., & Oldford, R. W. (2000). Scientific Method, Statistical Method and the Speed of Light. Statist. Sci., 15(3), 254–278. https://doi.org/10.1214/ss/1009212817

Matthews, J. N. S., Altman, D. G., Campbell, M. J., & Royston, P. (1990). Analysis of serial measurements in medical research. BMJ, 300, 230–235. https://doi.org/10.1136/bmj.300.6719.230

Matthews, J. N. S., & Badi, N. H. (2015). Inconsistent treatment estimates from mis-specified logistic regression analyses of randomized trials. Stat Med, 34(19), 2681–2694. https://doi.org/10.1002/sim.6508

Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet, 365, 488–492.

comment on p. 454;validation;microarray;bioinformatics;machine learning;nearest centroid;severe problems with data splitting;high variability of list of genes;problems with published studies;nice results for effect of training sample size on misclassification error;nice use of confidence intervals on accuracy estimates;unstable molecular signatures;high instability due to dependence on selection of training sample

Moons, K. G. M., & Harrell, F. E. (2003). Sensitivity and specificity should be de-emphasized in diagnostic accuracy studies. Acad Rad, 10, 670–672.

Moons, K. G. M., van Es, G.-A., Deckers, J. W., Habbema, J. D. F., & Grobbee, D. E. (1997). Limitations of sensitivity, specificity, likelihood ratio, and Bayes’ theorem in assessing diagnostic probabilities: A clinical example. Epi, 8(1), 12–17.

non-constancy of sensitivity, specificity, likelihood ratio in a real example

Moore, T. J. (1995). Deadly Medicine: Why Tens of Thousands of Patients Died in America’s Worst Drug Disaster. Simon & Schuster.

Multicenter Postinfarction Research Group. (1983). Risk stratification and survival after myocardial infarction. NEJM, 309, 331–336.

terrible example of dichotomizing continuous variables;figure in Papers/modelingPredictors

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nat Hum Behav, 1(1), 0021. https://doi.org/10.1038/s41562-016-0021

Murrell, P. (2013). InfoVis and statistical graphics: Comment. J Comp Graph Stat, 22(1), 33–37. https://doi.org/10.1080/10618600.2012.751875

Excellent brief how-to list; incorporated into graphscourse

Naggara, O., Raymond, J., Guilbert, F., Roy, D., Weill, A., & Altman, D. G. (2011). Analysis by categorizing or dichotomizing continuous variables is inadvisable: An example from the natural history of unruptured aneurysms. Am J Neuroradiol, 32(3), 437–440. https://doi.org/10.3174/ajnr.A2425

Neuhaus, J. M. (1998). Estimation efficiency with omitted covariates in generalized linear models. J Am Stat Assoc, 93, 1124–1129.

"to improve the efficiency of estimated covariate effects of interest, analysts of randomized clinical trial data should adjust for covariates that are strongly associated with the outcome, and ... analysts of observational data should not adjust for covariates that do not confound the association of interest"

Newman, T. B., & Kohn, M. A. (2009). Evidence-Based Diagnosis. Cambridge University Press.

Nuzzo, R. (2015). How scientists fool themselves — and how they can stop. Nature, 526(7572), 182–185.

O’Brien, P. C. (1988). Comparing two samples: Extensions of the t, rank-sum, and log-rank test. J Am Stat Assoc, 83, 52–61. https://doi.org/10.1080/01621459.1988.10478564

see Hauck WW, Hyslop T, Anderson S (2000) Stat in Med 19:887-899

Ohman, E. M., Armstrong, P. W., Christenson, R. H., Granger, C. B., Katus, H. A., Hamm, C. W., O’Hannesian, M. A., Wagner, G. S., Kleiman, N. S., Harrell, F. E., Califf, R. M., Topol, E. J., Lee, K. L., & Investigators, T. G. (1996). Cardiac troponin T levels for risk stratification in acute myocardial ischemia. NEJM, 335, 1333–1341.

Paré, G., Mehta, S. R., Yusuf, S., & Others. (2010). Effects of CYP2C19 genotype on outcomes of clopidogrel treatment. NEJM, online.

Paul, D., Bair, E., Hastie, T., & Tibshirani, R. (2008). “Preconditioning” for feature selection and regression in high-dimensional problems. Ann Stat, 36(4), 1595–1619. https://doi.org/10.1214/009053607000000578

develop consistent Y using a latent variable structure, using for example supervised principal components. Then run stepwise regression or lasso predicting Y (lasso worked better). Can run into problems when a predictor has importance in an adjusted sense but has no marginal correlation with Y;model approximation;model simplification

Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27, 157–172.

small differences in ROC area can still be very meaningful;example of insignificant test for difference in ROC areas with very significant results from new method;Yates’ discrimination slope;reclassification table;limiting version of this based on whether and amount by which probabilities rise for events and lower for non-events when compare new model to old;comparing two models;see letter to the editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med 31:93-97, 2012

Platt, J. R. (1964). Strong inference. Science, 146(3642), 347–353.

Pryor, D. B., Harrell, F. E., Lee, K. L., Califf, R. M., & Rosati, R. A. (1983). Estimating the likelihood of significant coronary artery disease. Am J Med, 75, 771–780.

R Development Team. (2025). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org

Raab, G. M., Day, S., & Sales, J. (2000). How to select covariates to include in the analysis of a clinical trial. Controlled Clin Trials, 21, 330–342.

how correlated with outcome a variable must be before adding it helps more than it hurts, as a function of sample size;planning;design;variable selection

Robinson, L. D., & Jewell, N. P. (1991). Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev, 59, 227–240.

Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med, 25, 127–141. https://doi.org/10.1002/sim.2331

destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when dichotomize;nice plot showing effect of categorization; PBC data

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized studies. Stat Med, 26, 20–36.

Ruxton, G. D., & Colegrave, N. (2017). Experimental Design for the Life Sciences (Fourth edition). Oxford University Press.

Sargent, D. J., & Hodges, J. S. (1996). A hierarchical model method for subgroup analysis of time-to-event data in the Cox regression setting.

Schoenfeld, D. A. (1983). Sample size formulae for the proportional hazards regression model. Biometrics, 39, 499–503.

Schwemer, G. (2000). General linear models for multicenter clinical trials. Controlled Clin Trials, 21, 21–29.

Senn, S. (2004). Controversies concerning randomization and additivity in clinical trials. Stat Med, 23, 3729–3753. https://doi.org/10.1002/sim.2074

p. 3735: "in the pharmaceutical industry, in analyzing the data, if a linear model is employed, it is usual to fit centre as a factor but unusual to fit block.";p. 3739: a large trial "is not less vulnerable to chance covariate imbalance";p. 3741:"There is no place, in my view, for classical minimization" (vs. the method of Atkinson);"If an investigator uses such [allocation based on covariates] schemes, she or he is honour bound, in my opinion, as a very minimum, to adjust for the factors used to balance, since the fact that they are being used to balance is an implicit declaration that they will have prognostic value.";"The point of view is sometimes defended that analyses that ignore covariates are superior because they are simpler. I do not accept this. A value of $\pi=3$ is a simple one and accurate to one significant figure ... However very few would seriously maintain that it should generally be adopted by engineers.";p. 3742: "as Fisher pointed out ... if we balance by a predictive covariate but do not fit the covariate in the model, not only do we not exploit the covariate, we actually increase the expected declared standard error."; p. 3744:"I would like to see standard errors for group means abolished."; p. 3744:"A common habit, however, in analyzing trials with three or more arms is to pool the variances from all arms when calculating the standard error of a given contrast. In my view this is a curious practice ... it relies on an assumption of additivity of all treatments when comparing only two. ... a classical t-test is robust to heteroscedasticity provided that sample sizes are equal in the groups being compared and that the variance is internal to those two groups but is not robust where an external estimate is being used."; p. 3745: "By adjusting main effects for interactions a type III analysis is similarly illogical to Neyman’s hypothesis test."; "Guyatt et al. ... found a ’method for estimating the proportion of patients who benefit from a treatment ... In fact they had done no such thing."; p. 3746: "When I checked the Web of Science on 29 June 2003, the paper by Horwitz et al. had been cited 28 times and that by Guyatt et al. had been cited 79 times. The letters pointing out the fallacies had been cited only 8 and 5 times respectively."; "if we pool heterogeneous strata, the odds ratio of the treatment effect will be different from that in every stratum, even if from stratum to stratum it does not vary."; p. 3747: "Part of the problem with Poisson, proportional hazard and logistic regression approaches is that they use a single parameter, the linear predictor, with no equivalent of the variance parameter in the Normal case. This means that lack of fit impacts on the estimate of the predictor. ... what is the value of randomization if, in all except the Normal case, we cannot guarantee to have unbiased estimates. My view ... was that the form of analysis envisaged (that is to say, which factors and covariates should be fitted) justified the allocation and not vice versa."; "use the additive measure at the point of analysis and transform to the relevant scale at the point of implementation. This transformation at the point of medical decision-making will require auxiliary information on the level of background risk of the patient."; p. 3748:"The decision to fit prognostic factors has a far more dramatic effect on the precision of our inferences than the choice of an allocation based on covariates or randomization approach and one of my chief objections to the allocation based on covariates approach is that trialists have tended to use the fact that they have balanced as an excuse for not fitting. This is a grave mistake."

Senn, S. J. (2005). Dichotomania: An obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. Proceedings of the International Statistical Institute, 55th Session. http://hbiostat.org/papers/Senn/dichotomania.pdf

Senn, S., Anisimov, V. V., & Fedorov, V. V. (2010). Comparisons of minimization and Atkinson’s algorithm. Stat Med, 29, 721–730.

"fitting covariates may make a more valuable and instructive contribution to inferences about treatment effects than only balancing them"

Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: Simple strategies for analysis using summary measures. Stat Med, 19, 861–877. https://doi.org/10.1002/(SICI)1097-0258(20000330)19:6<861::AID-SIM407>3.0.CO;2-F

Shen, X., Huang, H.-C., & Ye, J. (2004). Inference after model selection. J Am Stat Assoc, 99, 751–762.

uses optimal approximation for estimating mean and variance of complex statistics adjusting for model selection

Siqueira, A. L., & Taylor, J. M. G. (1999). Treatment effects in a logistic model involving the Box-Cox transformation. J Am Stat Assoc, 94, 240–246.

Box-Cox transformation of a covariable; validity of inference for the treatment effect when the exponent for the covariable is treated as fixed

Spanos, A., Harrell, F. E., & Durack, D. T. (1989). Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA, 262, 2700–2707. https://doi.org/10.1001/jama.262.19.2700

Steyerberg, E. W. (2018). Validation in prediction research: The waste by data-splitting. Journal of Clinical Epidemiology, 103, 131–133. https://doi.org/10.1016/j.jclinepi.2018.07.010

Steyerberg, E. W., Bossuyt, P. M. M., & Lee, K. L. (2000). Clinical trials in acute myocardial infarction: Should we adjust for baseline characteristics? Am Heart J, 139, 745–751. https://doi.org/10.1016/S0002-8703(00)90001-2

Subramanian, J., & Simon, R. (2010). Gene expression-based prognostic signatures in lung cancer: Ready for clinical use? J Nat Cancer Inst, 102, 464–474.

none demonstrated to have clinical utility; bioinformatics; quality scoring of papers

Teresi, J. A., Yu, X., Stewart, A. L., & Hays, R. D. (2022). Guidelines for Designing and Evaluating Feasibility Pilot Studies. Medical Care, 60(1), 95. https://doi.org/10.1097/MLR.0000000000001664

Tukey, J. W. (1993). Tightening the Clinical Trial. Controlled Clin Trials, 14, 266–285. https://doi.org/10.1016/0197-2456(93)90225-3

showed that asking clinicians to make up regression coefficients out of thin air is better than not adjusting for covariables

van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology, 14(1), 137. https://doi.org/10.1186/1471-2288-14-137

Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination measure. But much good practical information. Recursive partitioning fared poorly.

van Klaveren, D., Vergouwe, Y., Farooq, V., Serruys, P. W., & Steyerberg, E. W. (2015). Estimates of absolute treatment benefit for individual patients required careful modeling of statistical interactions. J Clin Epi, 68(11), 1366–1374. https://doi.org/10.1016/j.jclinepi.2015.02.012

Vickers, A. J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4), 314–320.

limitations of accuracy metrics; incorporating clinical consequences; nice example of calculation of expected outcome; drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis; use of a threshold on the probability of disease for taking some action; decision curve; has other good references to decision analysis

Vickers, A. J., Basch, E., & Kattan, M. W. (2008). Against diagnosis. Ann Int Med, 149, 200–203.

"The act of diagnosis requires that patients be placed in a binary category of either having or not having a certain disease. Accordingly, the diseases of particular concern for industrialized countries—such as type 2 diabetes, obesity, or depression—require that a somewhat arbitrary cut-point be chosen on a continuous scale of measurement (for example, a fasting glucose level >6.9 mmol/L [>125 mg/dL] for type 2 diabetes). These cut-points do not adequately reflect disease biology, may inappropriately treat patients on either side of the cut-point as 2 homogenous risk groups, fail to incorporate other risk factors, and are invariable to patient preference."

Wainer, H. (2006). Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1), 49–56.

can find bins that yield either positive or negative association; especially pertinent when effects are small; "With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." - John von Neumann

White, I. R., Morris, T. P., & Williamson, E. (2021). Covariate adjustment in randomised trials: Canonical link functions protect against model mis-specification. http://arxiv.org/abs/2107.07278

White, I. R., & Thompson, S. G. (2005). Adjusting for partially missing baseline measurements in randomized trials. Stat Med, 24, 993–1007.

Whitehead, J. (1993). Sample size calculations for ordered categorical data. Stat Med, 12, 2257–2271.

Wilcox, R., Carlson, M., Azen, S., & Clark, F. (2013). Avoid lost discoveries, because of violations of standard assumptions, by using modern robust statistical methods. Journal of Clinical Epidemiology, 66(3), 319–329. https://doi.org/10.1016/j.jclinepi.2012.09.003

Xie, Y. (2015). Dynamic Documents with R and knitr (2nd ed.). Chapman and Hall.

Yamaguchi, T., & Ohashi, Y. (1999). Investigating centre effects in a multi-centre clinical trial of superficial bladder cancer. Stat Med, 18, 1961–1971.