Annotated References

Adcock, C. J. (1997). Sample size determination: A review. The Statistician, 46, 261–283.
Akazawa, K., Nakamura, T., & Palesch, Y. (1997). Power of logrank test and Cox regression model in clinical trials with heterogeneous samples. Stat Med, 16, 583–597.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Schildcrout, J. S., Shepherd, B. E., & Harrell, F. E. (2009). Factors influencing the statistical power of complex data analysis protocols for molecular signature development from microarray data. PLoS ONE, 4(3).
refutation of Michiels et al. (2005)
Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not evidence of absence. BMJ, 311, 485.
Ambler, G., Brady, A. R., & Royston, P. (2002). Simplifying a prognostic model: A simulation study based on clinical data. Stat Med, 21(24), 3803–3822.
ordinary backward stepdown worked well when there was a large fraction of truly irrelevant predictors
Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS, 99(10), 6562–6566.
Relied on an improper accuracy score (proportion classified correct) so had to use the .632 bootstrap unnecessarily
Andersen, P. K., Klein, J. P., & Zhang, M.-J. (1999). Testing for centre effects in multi-centre survival studies: A Monte Carlo comparison of fixed and random effects tests. Stat Med, 18, 1489–1500.
Andersson, P. G. (2023). The Wald Confidence Interval for a Binomial p as an Illuminating Bad Example. The American Statistician, 0(0), 1–6.
Best, W. R., Becktel, J. M., Singleton, J. W., & Kern, F. (1976). Development of a Crohn’s disease activity index. Gastroent, 70, 439–444.
development of CDAI
Bland, J. M., & Altman, D. G. (2011). Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials, 12(1), 264.
Bordley, R. (2007). Statistical decisionmaking without math. Chance, 20(3), 39–44.
Brazer, S. R., Pancotto, F. S., Long III, T. T., Harrell, F. E., Lee, K. L., Tyor, M. P., & Pryor, D. B. (1991). Using ordinal logistic regression to estimate the likelihood of colorectal neoplasia. J Clin Epi, 44, 1263–1270.
Briggs, W. M., & Zaretzki, R. (2008). The skill plot: A graphical technique for evaluating continuous diagnostic tests (with discussion). Biometrics, 64, 250–261.
"statistics such as the AUC are not especially relevant to someone who must make a decision about a particular x_c. ... ROC curves lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. ... ROC curves were first used to check how radio <i>receivers</i> (like radar receivers) operated over a range of frequencies. ... This is not how most ROC curves are used now, particularly in medicine. The receiver of a diagnostic measurement ... wants to make a decision based on some x_c, and is not especially interested in how well he would have done had he used some different cutoff."; in the discussion David Hand states "when integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying a case as a noncase, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassifications."; see Lin, Kvam, Lu Stat in Med 28:798-813;2009
Califf, R. M., Harrell, F. E., Lee, K. L., Rankin, J. S., & Others. (1989). The evolution of medical and surgical therapy for coronary artery disease. JAMA, 261, 2077–2086.
Califf, R. M., McKinnis, R. A., Burks, J., Lee, K. L., Harrell, F. E., Behar, V. S., Pryor, D. B., Wagner, G. S., & Rosati, R. A. (1982). Prognostic implications of ventricular arrhythmias during 24 hour ambulatory monitoring in patients undergoing cardiac catheterization for coronary artery disease. Am J Card, 50, 23–31.
Chang, M. (2016). Principles of Scientific Methods. Chapman and Hall/CRC.
Chen, Q., Nian, H., Zhu, Y., Talbot, H. K., Griffin, M. R., & Harrell, F. E. (2016). Too many covariates and too few cases? - a comparative study. Stat Med, 35(25), 4546–4558.
Choi, L., Blume, J. D., & Dupont, W. D. (2015). Elucidating the Foundations of Statistical Inference with 2 x 2 Tables. PLoS ONE, 10(4), e0121263.
Chotai, S., Devin, C. J., Archer, K. R., Bydon, M., McGirt, M. J., Nian, H., Harrell, F. E., Dittus, R. S., Asher, A. L., & QOD Vanguard Sites. (2017). Effect of patients’ functional status on satisfaction with outcomes 12 months after elective spine surgery for lumbar degenerative disease. Spine J, 17(12), 1783–1793.
Cleveland, W. S. (1984). Graphs in scientific publications. Am Statistician, 38, 261–269.
Cleveland, W. S. (1994). The Elements of Graphing Data. Hobart Press.
Committee for Proprietary Medicinal Products. (2004). Points to consider on adjustment for baseline covariates. Stat Med, 23, 701–709.
Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations in the design and analysis of clinical trials. J Roy Stat Soc A, 159, 93–110.
argues that if results are intended to be interpreted marginally, there may be no need for controlling experimentwise error rate.  FH phrasing: Cook and Farewell point out that when a strong priority order is pre-specified for separate clinical questions, and that same order is also the reporting order (no cherry picking), there is no need for multiplicity adjustment.  This is in contrast with a study whose aim is to find an endpoint or find a patient subgroup that is benefited by treatment, a situation requiring conservative multiplicity adjustment.
Davis, C. S. (2002). Statistical Methods for the Analysis of Repeated Measurements. Springer.
Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. L. (2002). Analysis of Longitudinal Data (second). Oxford University Press.
Dupont, W. D. (2008). Statistical Modeling for Biomedical Researchers (second). Cambridge University Press.
Edwards, D. (1999). On model pre-specification in confirmatory randomized studies. Stat Med, 18, 771–785.
Efron, B., & Morris, C. (1977). Stein’s paradox in statistics. Sci Am, 236(5), 119–127.
Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of dichotomization. Pharm Stat, 8, 50–61.
optimal cutpoint depends on unknown parameters;should only entertain dichotomization when "estimating a value of the cumulative distribution and when the assumed model is very different from the true model";nice graphics
Ford, I., Norrie, J., & Ahmadi, S. (1995). Model inconsistency, illustrated by the Cox proportional hazards model. Stat Med, 14, 735–746.
Friedman, J. H. (1984). A variable span smoother (Technical Report No. 5). Laboratory for Computational Statistics, Department of Statistics, Stanford University.
Gail, M. H. (1986). Adjusting for covariates that have the same distribution in exposed and unexposed cohorts. In S. H. Moolgavkar & R. L. Prentice (Eds.), Modern Statistical Methods in Chronic Disease Epidemiology (pp. 3–18). Wiley.
unadjusted test can have larger type I error than nominal
Gail, M. H., Wieand, S., & Piantadosi, S. (1984). Biased estimates of treatment effect in randomized experiments with nonlinear regressions and omitted covariates. Biometrika, 71, 431–444.
bias if omitted covariables and model is nonlinear
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press.
Giannoni, A., Baruah, R., Leong, T., Rehman, M. B., Pastormerlo, L. E., Harrell, F. E., Coats, A. J., & Francis, D. P. (2014). Do optimal prognostic thresholds in continuous physiological variables really exist? Analysis of origin of apparent thresholds, with systematic review for peak oxygen consumption, ejection fraction and BNP. PLoS ONE, 9(1).
Glass, D. J. (2014). Experimental Design for Biologists (2nd edition). Cold Spring Harbor Laboratory Press.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc, 102, 359–378.
wonderful review article except missing references from Scandinavian and German medical decision making literature
Govers, T. M., Rovers, M. M., Brands, M. T., Dronkers, E. A. C., Jong, R. J. B. de, Merkx, M. A. W., Takes, R. P., & Grutters, J. P. C. (2018). Integrated prediction and decision models are valuable in informing personalized decision making. Journal of Clinical Epidemiology, 0(0).
Greenland, S. (2000). When should epidemiologic regressions use random coefficients? Biometrics, 56, 915–921.
use of statistics in epidemiology is largely primitive;stepwise variable selection on confounders leaves important confounders uncontrolled;composition matrix;example with far too many significant predictors with many regression coefficients absurdly inflated when overfit;lack of evidence for dietary effects mediated through constituents;shrinkage instead of variable selection;larger effect on confidence interval width than on point estimates with variable selection;uncertainty about variance of random effects is just uncertainty about prior opinion;estimation of variance is pointless;instead the analysis should be repeated using different values;"if one feels compelled to estimate $\tau^2$, I would recommend giving it a proper prior concentrated around contextually reasonable values";claim about ordinary MLE being unbiased is misleading because it assumes the model is correct and is the only model entertained;shrinkage towards compositional model;"models need to be complex to capture uncertainty about the honest uncertainty assessment requires parameters for all effects that we know may be present. This advice is implicit in an antiparsimony principle often attributed to L. J. Savage ’All models should be as big as an elephant (see Draper, 1995)’". See also gus06per.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. Eur J Epi, 31(4), 337–350.
Best article on misinterpretation of p-values. Pithy summaries.
Hackam, D. G., & Redelmeier, D. A. (2006). Translation of research evidence from animals to humans. JAMA, 296, 1731–1732.
review of basic science literature that documents systemic methodologic shortcomings. In a personal communication on 20Oct06 the authors reported that they found a few more biostatistical problems that could not make it into the JAMA article (for space constraints);none of the articles contained a sample size calculation;none of the articles identified a primary outcome measure;none of the articles mentioned whether they tested assumptions or did distributional testing (though a few used non-parametric tests);most articles had more than 30 endpoints (but few adjusted for multiplicity, as noted in the article)
Harrell, F. E. (2020a). Hmisc: A package of miscellaneous R functions.
Harrell, F. E. (2020b). rms: R functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit.
Harrell, F. E., Margolis, P. A., Gove, S., Mason, K. E., Mulholland, E. K., Lehmann, D., Muhe, L., Gatchalian, S., & Eichenwald, H. F. (1998). Development of a clinical prediction model for an ordinal outcome: The World Health Organization ARI Multicentre Study of clinical signs and etiologic agents of pneumonia, sepsis, and meningitis in young infants. Stat Med, 17, 909–944.
Hauck, W. W., Anderson, S., & Marcus, S. M. (1998). Should we adjust for covariates in nonlinear regression analyses of randomized trials? Controlled Clin Trials, 19, 249–256.
"For use in a clinician-patient context, there is only a single person, that patient, of interest. The subject-specific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population-averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates."
Herschtal, A. (2023). The effect of dichotomization of skewed adjustment covariates in the analysis of clinical trials. BMC Medical Research Methodology, 23(1), 60.
Hlatky, M. A., Greenland, P., Arnett, D. K., Ballantyne, C. M., Criqui, M. H., Elkind, M. S., Go, A. S., Harrell, F. E., Hong, Y., Howard, B. V., Howard, V. J., Hsue, P. Y., Kramer, C. M., McConnell, J. P., Normand, S. L., O’Donnell, C. J., Smith, S. C., & Wilson, P. W. (2009). Criteria for evaluation of novel markers of cardiovascular risk: A scientific statement from the American Heart Association. Circ, 119(17), 2408–2416.
graph with different symbols for diseased and non-diseased
Hlatky, M. A., Pryor, D. B., Harrell, F. E., Califf, R. M., Mark, D. B., & Rosati, R. A. (1984). Factors affecting the sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med, 77, 64–71.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distributions. Ann Math Stat, 19, 293–325.
Partially reprinted in: Kotz, S., Johnson, N.L. (1992) Breakthroughs in Statistics, Vol I, pp 308-334. Springer-Verlag. ISBN 0-387-94037-5
Holländer, N., Sauerbrei, W., & Schumacher, M. (2004). Confidence intervals for the effect of a prognostic factor after selection of an “optimal” cutpoint. Stat Med, 23, 1701–1713.
true type I error can be much greater than nominal level;one example where nominal is 0.05 and true is 0.5;minimum P-value method;CART;recursive partitioning;bootstrap method for correcting confidence interval;based on heuristic shrinkage coefficient;"It should be noted, however, that the optimal cutpoint approach has disadvantages. One of these is that in almost every study where this method is applied, another cutpoint will emerge. This makes comparisons across studies extremely difficult or even impossible. Altman et al. point out this problem for studies of the prognostic relevance of the S-phase fraction in breast cancer published in the literature. They identified 19 different cutpoints used in the literature; some of them were solely used because they emerged as the “optimal” cutpoint in a specific data set. In a meta-analysis on the relationship between cathepsin-D content and disease-free survival in node-negative breast cancer patients, 12 studies were included with 12 different cutpoints ... Interestingly, neither cathepsin-D nor the S-phase fraction are recommended to be used as prognostic markers in breast cancer in the recent update of the American Society of Clinical Oncology."; dichotomization; categorizing continuous variables; refs alt94dan, sch94out, alt98sub
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D. G., & Newman, T. B. (2013). Designing Clinical Research (Fourth edition). LWW.
CAST Investigators. (1989). Preliminary report: Effect of Encainide and Flecainide on mortality in a randomized trial of arrhythmia suppression after myocardial infarction. NEJM, 321(6), 406–412.
Ioannidis, J. P. A., & Lau, J. (1997). The impact of high-risk patients on the results of clinical trials. J Clin Epi, 50, 1089–1098.
high risk patients can dominate clinical trials results;high risk patients may be imbalanced even if overall study is balanced;magnesium;differential treatment effect by patient risk;GUSTO;small vs. large trials vs. meta-analysis
Ioannidis, J. P. A. (2010). Expectations, validity, and reality in omics. J Clin Epi, 63, 945–949.
"Each new field has a rapid exponential growth of its literature over 5–8 years (“new field phase”), followed by an “established field” phase when growth rates are more modest, and then an “over-maturity” phase, where the rates of growth are similar to the growth of the scientific literature at large or even smaller. There is a parallel in the spread of an infectious epidemic that emerges rapidly and gets established when a large number of scientists (and articles) are infected with these concepts. Then momentum decreases, although many scientists remain infected and continue to work on this field. New omics infections continuously arise in the scientific community.";"A large number of personal genomic tests are already sold in the market, mostly with direct to consumer advertisement and for “recreational genomics” purposes (translate: information for the fun of information)."
Kent, D. M., & Hayward, R. (2007). Limitations of applying summary results of clinical trials to individual patients. JAMA, 298, 1209–1212.
variation in absolute risk reduction in RCTs;failure of subgroup analysis;covariable adjustment;covariate adjustment;nice summary of individual patient absolute benefit vs. patient risk
Knaus, W. A., Harrell, F. E., Fisher, C. J., Wagner, D. P., Opal, S. M., Sadoff, J. C., Draper, E. A., Walawander, C. A., Conboy, K., & Grasela, T. H. (1993). The clinical evaluation of new drugs for sepsis: A prospective study design based on survival analysis. JAMA, 270, 1233–1241.
Koenker, R., & Bassett, G. (1978). Regression quantiles. Econometrica, 46, 33–50.
Kornbrot, D. E. (1990). The rank difference test: A new and meaningful alternative to the Wilcoxon signed ranks test for ordinal data. British Journal of Mathematical and Statistical Psychology, 43(2), 241–264.
Kotz, S., & Johnson, N. L. (Eds.). (1988). Encyclopedia of Statistical Sciences (Vol. 9). Wiley.
Krouwer, J. S. (2008). Why Bland-Altman plots should use X, not (Y+X)/2 when X is a reference method. Stat Med, 27(5), 778–780.
Leek, J. T., & Peng, R. D. (2015). What is the question? Science, 347(6228), 1314–1315.
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. Am Statistician, 55, 187–193.
problems with Cohen’s method
MacKay, R. J., & Oldford, R. W. (2000). Scientific Method, Statistical Method and the Speed of Light. Statist. Sci., 15(3), 254–278.
Matthews, J. N. S., Altman, D. G., Campbell, M. J., & Royston, P. (1990). Analysis of serial measurements in medical research. BMJ, 300, 230–235.
Matthews, J. N. S., & Badi, N. H. (2015). Inconsistent treatment estimates from mis-specified logistic regression analyses of randomized trials. Stat Med, 34(19), 2681–2694.
Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer outcome with microarrays: A multiple random validation strategy. Lancet, 365, 488–492.
comment on p. 454;validation;microarray;bioinformatics;machine learning;nearest centroid;severe problems with data splitting;high variability of list of genes;problems with published studies;nice results for effect of training sample size on misclassification error;nice use of confidence intervals on accuracy estimates;unstable molecular signatures;high instability due to dependence on selection of training sample
Moons, K. G. M., & Harrell, F. E. (2003). Sensitivity and specificity should be de-emphasized in diagnostic accuracy studies. Acad Rad, 10, 670–672.
Moons, K. G. M., van Es, G.-A., Deckers, J. W., Habbema, J. D. F., & Grobbee, D. E. (1997). Limitations of sensitivity, specificity, likelihood ratio, and Bayes’ theorem in assessing diagnostic probabilities: A clinical example. Epi, 8(1), 12–17.
non-constancy of sensitivity, specificity, likelihood ratio in a real example
Moore, T. J. (1995). Deadly Medicine: Why Tens of Thousands of Patients Died in America’s Worst Drug Disaster. Simon & Schuster.
Multicenter Postinfarction Research Group. (1983). Risk stratification and survival after myocardial infarction. NEJM, 309, 331–336.
terrible example of dichotomizing continuous variables;figure in Papers/modelingPredictors
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nat Hum Behav, 1(1), 0021.
Murrell, P. (2013). InfoVis and statistical graphics: Comment. J Comp Graph Stat, 22(1), 33–37.
Excellent brief how-to list; incorporated into graphscourse
Naggara, O., Raymond, J., Guilbert, F., Roy, D., Weill, A., & Altman, D. G. (2011). Analysis by categorizing or dichotomizing continuous variables is inadvisable: An example from the natural history of unruptured aneurysms. Am J Neuroradiol, 32(3), 437–440.
Neuhaus, J. M. (1998). Estimation efficiency with omitted covariates in generalized linear models. J Am Stat Assoc, 93, 1124–1129.
"to improve the efficiency of estimated covariate effects of interest, analysts of randomized clinical trial data should adjust for covariates that are strongly associated with the outcome, and ... analysts of observational data should not adjust for covariates that do not confound the association of interest"
Newman, T. B., & Kohn, M. A. (2009). Evidence-Based Diagnosis. Cambridge University Press.
Nuzzo, R. (2015). How scientists fool themselves — and how they can stop. Nature, 526(7572), 182–185.
O’Brien, P. C. (1988). Comparing two samples: Extensions of the t, rank-sum, and log-rank test. J Am Stat Assoc, 83, 52–61.
see Hauck WW, Hyslop T, Anderson S (2000) Stat in Med 19:887-899
Ohman, E. M., Armstrong, P. W., Christenson, R. H., Granger, C. B., Katus, H. A., Hamm, C. W., O’Hannesian, M. A., Wagner, G. S., Kleiman, N. S., Harrell, F. E., Califf, R. M., Topol, E. J., Lee, K. L., & Investigators, T. G. (1996). Cardiac troponin T levels for risk stratification in acute myocardial ischemia. NEJM, 335, 1333–1341.
Paré, G., Mehta, S. R., Yusuf, S., & Others. (2010). Effects of CYP2C19 genotype on outcomes of clopidogrel treatment. NEJM, online.
Paul, D., Bair, E., Hastie, T., & Tibshirani, R. (2008). Preconditioning for feature selection and regression in high-dimensional problems. Ann Stat, 36(4), 1595–1619.
develop consistent Y using a latent variable structure, using for example supervised principal components. Then run stepwise regression or lasso predicting Y (lasso worked better). Can run into problems when a predictor has importance in an adjusted sense but has no marginal correlation with Y;model approximation;model simplification
Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B., & Vasan, R. S. (2008). Evaluating the added predictive ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27, 157–172.
small differences in ROC area can still be very meaningful;example of insignificant test for difference in ROC areas with very significant results from new method;Yates’ discrimination slope;reclassification table;limiting version of this based on whether and amount by which probabilities rise for events and lower for non-events when compare new model to old;comparing two models;see letter to the editor by Van Calster and Van Huffel, Stat in Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med 31:93-97, 2012
Platt, J. R. (1964). Strong inference. Science, 146(3642), 347–353.
Pryor, D. B., Harrell, F. E., Lee, K. L., Califf, R. M., & Rosati, R. A. (1983). Estimating the likelihood of significant coronary artery disease. Am J Med, 75, 771–780.
R Development Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Raab, G. M., Day, S., & Sales, J. (2000). How to select covariates to include in the analysis of a clinical trial. Controlled Clin Trials, 21, 330–342.
how correlated with outcome must a variable be before adding it helps more than hurts, as a function of sample size;planning;design;variable selection
Robinson, L. D., & Jewell, N. P. (1991). Some surprising results about covariate adjustment in logistic regression models. Int Stat Rev, 59, 227–240.
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Stat Med, 25, 127–141.
destruction of statistical inference when cutpoints are chosen using the response variable; varying effect estimates when change cutpoints;difficult to interpret effects when dichotomize;nice plot showing effect of categorization; PBC data
Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized studies. Stat Med, 26, 20–36.
Ruxton, G. D., & Colegrave, N. (2017). Experimental Design for the Life Sciences (Fourth Edition). Oxford University Press.
Sargent, D. J., & Hodges, J. S. (1996). A hierarchical model method for subgroup analysis of time-to-event data in the Cox regression setting.
Schoenfeld, D. A. (1983). Sample size formulae for the proportional hazards regression model. Biometrics, 39, 499–503.
Schwemer, G. (2000). General linear models for multicenter clinical trials. Controlled Clin Trials, 21, 21–29.
Senn, S. (2004). Controversies concerning randomization and additivity in clinical trials. Stat Med, 23, 3729–3753.
p. 3735: "in the pharmaceutical industry, in analyzing the data, if a linear model is employed, it is usual to fit centre as a factor but unusual to fit block.";p. 3739: a large trial "is not less vulnerable to chance covariate imbalance";p. 3741:"There is no place, in my view, for classical minimization" (vs. the method of Atkinson);"If an investigator uses such [allocation based on covariates] schemes, she or he is honour bound, in my opinion, as a very minimum, to adjust for the factors used to balance, since the fact that they are being used to balance is an implicit declaration that they will have prognostic value.";"The point of view is sometimes defended that analyses that ignore covariates are superior because they are simpler. I do not accept this. A value of $\pi=3$ is a simple one and accurate to one significant figure ... However very few would seriously maintain that it should generally be adopted by engineers.";p. 3742: "as Fisher pointed out ... if we balance by a predictive covariate but do not fit the covariate in the model, not only do we not exploit the covariate, we actually increase the expected declared standard error."; p. 3744:"I would like to see standard errors for group means abolished."; p. 3744:"A common habit, however, in analyzing trials with three or more arms is to pool the variances from all arms when calculating the standard error of a given contrast. In my view this is a curious practice ... it relies on an assumption of additivity of all treatments when comparing only two. ... a classical t-test is robust to heteroscedasticity provided that sample sizes are equal in the groups being compared and that the variance is internal to those two groups but is not robust where an external estimate is being used."; p. 3745: "By adjusting main effects for interactions a type III analysis is similarly illogical to Neyman’s hypothesis test."; "Guyatt et al. ... found a ‘method for estimating the proportion of patients who benefit from a treatment’ ... In fact they had done no such thing."; p. 3746: "When I checked the Web of Science on 29 June 2003, the paper by Horwitz et al. had been cited 28 times and that by Guyatt et al. had been cited 79 times. The letters pointing out the fallacies had been cited only 8 and 5 times respectively."; "if we pool heterogeneous strata, the odds ratio of the treatment effect will be different from that in every stratum, even if from stratum to stratum it does not vary."; p. 3747: "Part of the problem with Poisson, proportional hazard and logistic regression approaches is that they use a single parameter, the linear predictor, with no equivalent of the variance parameter in the Normal case. This means that lack of fit impacts on the estimate of the predictor. ... what is the value of randomization if, in all except the Normal case, we cannot guarantee to have unbiased estimates. My view ... was that the form of analysis envisaged (that is to say, which factors and covariates should be fitted) justified the allocation and not vice versa."; "use the additive measure at the point of analysis and transform to the relevant scale at the point of implementation. This transformation at the point of medical decision-making will require auxiliary information on the level of background risk of the patient."; p. 3748:"The decision to fit prognostic factors has a far more dramatic effect on the precision of our inferences than the choice of an allocation based on covariates or randomization approach and one of my chief objections to the allocation based on covariates approach is that trialists have tended to use the fact that they have balanced as an excuse for not fitting. This is a grave mistake."
Senn, S. (2008). Statistical Issues in Drug Development (Second). Wiley.
Senn, S. J. (2005). Dichotomania: An obsessive compulsive disorder that is badly affecting the quality of analysis of pharmaceutical trials. Proceedings of the International Statistical Institute, 55th Session.
Senn, S., Anisimov, V. V., & Fedorov, V. V. (2010). Comparisons of minimization and Atkinson’s algorithm. Stat Med, 29, 721–730.
"fitting covariates may make a more valuable and instructive contribution to inferences about treatment effects than only balancing them"
Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in clinical trials: Simple strategies for analysis using summary measures. Stat Med, 19, 861–877.
Shen, X., Huang, H.-C., & Ye, J. (2004). Inference after model selection. J Am Stat Assoc, 99, 751–762.
uses optimal approximation for estimating mean and variance of complex statistics adjusting for model selection
Siqueira, A. L., & Taylor, J. M. G. (1999). Treatment effects in a logistic model involving the Box-Cox transformation. J Am Stat Assoc, 94, 240–246.
Box-Cox transformation of a covariable;validity of inference for treatment effect when treat exponent for covariable as fixed
Spanos, A., Harrell, F. E., & Durack, D. T. (1989). Differential diagnosis of acute meningitis: An analysis of the predictive value of initial observations. JAMA, 262, 2700–2707.
Steyerberg, E. W. (2018). Validation in prediction research: The waste by data-splitting. Journal of Clinical Epidemiology, 0(0).
Steyerberg, E. W., Bossuyt, P. M. M., & Lee, K. L. (2000). Clinical trials in acute myocardial infarction: Should we adjust for baseline characteristics? Am Heart J, 139, 745–751.
Subramanian, J., & Simon, R. (2010). Gene expression-based prognostic signatures in lung cancer: Ready for clinical use? J Nat Cancer Inst, 102, 464–474.
none demonstrated to have clinical utility;bioinformatics;quality scoring of papers
Teresi, J. A., Yu, X., Stewart, A. L., & Hays, R. D. (2022). Guidelines for Designing and Evaluating Feasibility Pilot Studies. Medical Care, 60(1), 95.
Tukey, J. W. (1993). Tightening the Clinical Trial. Controlled Clin Trials, 14, 266–285.
showed that asking clinicians to make up regression coefficients out of thin air is better than not adjusting for covariables
van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern modelling techniques are data hungry: A simulation study for predicting dichotomous endpoints. BMC Medical Research Methodology, 14(1), 137.
Would be better to use proper accuracy scores in the assessment. Too much emphasis on optimism as opposed to final discrimination measure. But much good practical information. Recursive partitioning fared poorly.
van Klaveren, D., Vergouwe, Y., Farooq, V., Serruys, P. W., & Steyerberg, E. W. (2015). Estimates of absolute treatment benefit for individual patients required careful modeling of statistical interactions. J Clin Epi, 68(11), 1366–1374.
Vickers, A. J. (2008). Decision analysis for the evaluation of diagnostic tests, prediction models, and molecular markers. Am Statistician, 62(4), 314–320.
limitations of accuracy metrics;incorporating clinical consequences;nice example of calculation of expected outcome;drawbacks of conventional decision analysis, especially because of the difficulty of eliciting the expected harm of a missed diagnosis;use of a threshold on the probability of disease for taking some action;decision curve;has other good references to decision analysis
Vickers, A. J., Basch, E., & Kattan, M. W. (2008). Against diagnosis. Ann Int Med, 149, 200–203.
"The act of diagnosis requires that patients be placed in a binary category of either having or not having a certain disease. Accordingly, the diseases of particular concern for industrialized countries—such as type 2 diabetes, obesity, or depression—require that a somewhat arbitrary cut-point be chosen on a continuous scale of measurement (for example, a fasting glucose level >6.9 mmol/L [>125 mg/dL] for type 2 diabetes). These cut-points do not ade- quately reflect disease biology, may inappropriately treat patients on either side of the cut-point as 2 homogenous risk groups, fail to incorporate other risk factors, and are invariable to patient preference."
Wainer, H. (2006). Finding what is not there through the unfortunate binning of results: The Mendel effect. Chance, 19(1), 49–56.
can find bins that yield either positive or negative association;especially pertinent when effects are small;"With four parameters, I can fit an elephant; with five, I can make it wiggle its trunk." - John von Neumann
White, I. R., Morris, T. P., & Williamson, E. (2021). Covariate adjustment in randomised trials: Canonical link functions protect against model mis-specification.
White, I. R., & Thompson, S. G. (2005). Adjusting for partially missing baseline measurements in randomized trials. Stat Med, 24, 993–1007.
Whitehead, J. (1993). Sample size calculations for ordered categorical data. Stat Med, 12, 2257–2271.
Wilcox, R., Carlson, M., Azen, S., & Clark, F. (2013). Avoid lost discoveries, because of violations of standard assumptions, by using modern robust statistical methods. Journal of Clinical Epidemiology, 66(3), 319–329.
Xie, Y. (2015). Dynamic Documents with R and knitr (second). Chapman and Hall.
Yamaguchi, T., & Ohashi, Y. (1999). Investigating centre effects in a multi-centre clinical trial of superficial bladder cancer. Stat Med, 18, 1961–1971.