# Annotated References

Adcock, C. J. (1997). Sample size determination: A review.

*The Statistician*, *46*, 261–283.
Akazawa, K., Nakamura, T., & Palesch, Y. (1997). Power of logrank
test and Cox regression model in clinical trials with
heterogeneous samples.

*Stat Med*, *16*, 583–597.
Aliferis, C. F., Statnikov, A., Tsamardinos, I., Schildcrout, J. S.,
Shepherd, B. E., & Harrell, F. E. (2009). Factors influencing the
statistical power of complex data analysis protocols for molecular
signature development from microarray data.

*PLoS ONE*, *4*(3). Refutation of mic05pre.

Altman, D. G., & Bland, J. M. (1995). Absence of evidence is not
evidence of absence.

*BMJ*, *311*, 485.
Ambler, G., Brady, A. R., & Royston, P. (2002). Simplifying a
prognostic model: A simulation study based on clinical data.

*Stat Med*, *21*(24), 3803–3822. https://doi.org/10.1002/sim.1422
Ordinary backward stepdown worked well when there was
a large fraction of truly irrelevant predictors.

Ambroise, C., & McLachlan, G. J. (2002). Selection bias in gene
extraction on the basis of microarray gene-expression data.

*PNAS*, *99*(10), 6562–6566. https://doi.org/10.1073/pnas.102102699
Relied on an improper accuracy score (proportion
classified correct), so had to use the .632 bootstrap unnecessarily.

Andersen, P. K., Klein, J. P., & Zhang, M.-J. (1999). Testing for
centre effects in multi-centre survival studies: A Monte
Carlo comparison of fixed and random effects tests.

*Stat Med*, *18*, 1489–1500.
Best, W. R., Becktel, J. M., Singleton, J. W., & Kern, F. (1976).
Development of a Crohn’s disease activity index.

*Gastroent*, *70*, 439–444. Development of CDAI.

Bland, J. M., & Altman, D. G. (2011). Comparisons against baseline
within randomised groups are often used and can be highly misleading.

*Trials*, *12*(1), 264. https://doi.org/10.1186/1745-6215-12-264
Bordley, R. (2007). Statistical decisionmaking without math.

*Chance*, *20*(3), 39–44.
Brazer, S. R., Pancotto, F. S., Long III, T. T., Harrell, F. E., Lee, K.
L., Tyor, M. P., & Pryor, D. B. (1991). Using ordinal logistic
regression to estimate the likelihood of colorectal neoplasia.

*J Clin Epi*, *44*, 1263–1270.
Briggs, W. M., & Zaretzki, R. (2008). The skill plot: A
graphical technique for evaluating continuous diagnostic tests (with
discussion).

*Biometrics*, *64*, 250–261. "statistics such as the AUC are not especially
relevant to someone who must make a decision about a particular x_c. ...
ROC curves lack or obscure several quantities that are necessary for
evaluating the operational effectiveness of diagnostic tests. ... ROC
curves were first used to check how radio *receivers* (like radar receivers) operated
over a range of frequencies. ... This is not how most ROC curves are
used now, particularly in medicine. The receiver of a diagnostic
measurement ... wants to make a decision based on some x_c, and is not
especially interested in how well he would have done had he used some
different cutoff."; in the discussion David Hand states "when
integrating to yield the overall AUC measure, it is necessary to decide
what weight to give each value in the integration. The AUC implicitly
does this using a weighting derived empirically from the data. This is
nonsensical. The relative importance of misclassifying a case as a
noncase, compared to the reverse, cannot come from the data itself. It
must come externally, from considerations of the severity one attaches
to the different kinds of misclassifications."; see Lin, Kvam, Lu Stat
in Med 28:798-813;2009

Califf, R. M., Harrell, F. E., Lee, K. L., Rankin, J. S., & Others.
(1989). The evolution of medical and surgical therapy for coronary
artery disease.

*JAMA*, *261*, 2077–2086.
Califf, R. M., McKinnis, R. A., Burks, J., Lee, K. L., Harrell FE, V.
S., Pryor, D. B., Wagner, G. S., & Rosati, R. A. (1982). Prognostic
implications of ventricular arrhythmias during 24 hour ambulatory
monitoring in patients undergoing cardiac catheterization for coronary
artery disease.

*Am J Card*, *50*, 23–31.
Chang, M. (2016).

*Principles of Scientific Methods*. Chapman and Hall/CRC. https://doi.org/10.1201/b17167
Chen, Q., Nian, H., Zhu, Y., Talbot, H. K., Griffin, M. R., &
Harrell, F. E. (2016). Too many covariates and too few cases? - a
comparative study.

*Stat Med*, *35*(25), 4546–4558. https://doi.org/10.1002/sim.7021
Choi, L., Blume, J. D., & Dupont, W. D. (2015). Elucidating the
Foundations of Statistical Inference with 2 x
2 Tables.

*PLoS ONE*, *10*(4), e0121263. https://doi.org/10.1371/journal.pone.0121263
Chotai, S., Devin, C. J., Archer, K. R., Bydon, M., McGirt, M. J., Nian,
H., Harrell, F. E., Dittus, R. S., Asher, A. L., & QOD Vanguard
Sites. (2017). Effect of patients’ functional status on satisfaction
with outcomes 12 months after elective spine surgery for lumbar
degenerative disease.

*Spine J*, *17*(12), 1783–1793. https://doi.org/10.1016/j.spinee.2017.05.027
Cleveland, W. S. (1984). Graphs in scientific publications.

*Am Statistician*, *38*, 261–269.
Cleveland, W. S. (1994).

*The Elements of Graphing Data*. Hobart Press.
Committee for Proprietary Medicinal Products. (2004). Points to consider
on adjustment for baseline covariates.

*Stat Med*, *23*, 701–709.
Cook, R. J., & Farewell, V. T. (1996). Multiplicity considerations
in the design and analysis of clinical trials.

*J Roy Stat Soc A*, *159*, 93–110. Argues that if
results are intended to be interpreted marginally, there may be no need
for controlling experimentwise error rate. FH phrasing: Cook and
Farewell point out that when a strong priority order is pre-specified
for separate clinical questions, and that same order is also the
reporting order (no cherry picking), there is no need for multiplicity
adjustment. This is in contrast with a study whose aim is to find an
endpoint or find a patient subgroup that is benefited by treatment, a
situation requiring conservative multiplicity adjustment.

Davis, C. S. (2002).

*Statistical Methods for the Analysis of Repeated Measurements*. Springer.
Diggle, P. J., Heagerty, P., Liang, K.-Y., & Zeger, S. L. (2002).

*Analysis of Longitudinal Data* (2nd ed.). Oxford University Press.
Dupont, W. D. (2008).

*Statistical Modeling for Biomedical Researchers* (2nd ed.). Cambridge University Press.
Edwards, D. (1999). On model pre-specification in confirmatory
randomized studies.

*Stat Med*, *18*, 771–785.
Efron, B., & Morris, C. (1977). Stein’s paradox in statistics.

*Sci Am*, *236*(5), 119–127.
Fedorov, V., Mannino, F., & Zhang, R. (2009). Consequences of
dichotomization.

*Pharm Stat*, *8*, 50–61. https://doi.org/10.1002/pst.331
Optimal cutpoint depends on unknown parameters; should
only entertain dichotomization when "estimating a value of the
cumulative distribution and when the assumed model is very different
from the true model"; nice graphics

Ford, I., Norrie, J., & Ahmadi, S. (1995). Model inconsistency,
illustrated by the Cox proportional hazards model.

*Stat Med*, *14*, 735–746.
Friedman, J. H. (1984).

*A variable span smoother* (Technical Report No. 5). Laboratory for Computational Statistics, Department of Statistics, Stanford University.
Gail, M. H. (1986). Adjusting for covariates that have the same
distribution in exposed and unexposed cohorts. In S. H. Moolgavkar &
R. L. Prentice (Eds.),

*Modern Statistical Methods in Chronic Disease Epidemiology* (pp. 3–18). Wiley. Unadjusted test can have
larger type I error than nominal.

Gail, M. H., Wieand, S., & Piantadosi, S. (1984). Biased estimates
of treatment effect in randomized experiments with nonlinear regressions
and omitted covariates.

*Biometrika*, *71*, 431–444. Bias arises if covariables are omitted and the model is
nonlinear.

Gelman, A., & Hill, J. (2006).

*Data Analysis Using Regression and Multilevel/Hierarchical Models* (1st ed.). Cambridge University Press. http://www.amazon.com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/052168689X
Giannoni, A., Baruah, R., Leong, T., Rehman, M. B., Pastormerlo, L. E.,
Harrell, F. E., Coats, A. J., & Francis, D. P. (2014). Do optimal
prognostic thresholds in continuous physiological variables really
exist? Analysis of origin of apparent thresholds, with
systematic review for peak oxygen consumption, ejection fraction and
BNP.

*PLoS ONE*, *9*(1). https://doi.org/10.1371/journal.pone.0081699
Glass, D. J. (2014).

*Experimental Design for Biologists* (2nd ed.). Cold Spring Harbor Laboratory Press.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring
rules, prediction, and estimation.

*J Am Stat Assoc*, *102*, 359–378. Wonderful review article
except missing references from Scandinavian and German medical decision
making literature

Govers, T. M., Rovers, M. M., Brands, M. T., Dronkers, E. A. C., Jong,
R. J. B. de, Merkx, M. A. W., Takes, R. P., & Grutters, J. P. C.
(2018). Integrated prediction and decision models are valuable in
informing personalized decision making.

*Journal of Clinical Epidemiology*, *0*(0). https://doi.org/10.1016/j.jclinepi.2018.08.016
Greenland, S. (2000). When should epidemiologic regressions use random
coefficients?

*Biometrics*, *56*, 915–921. https://doi.org/10.1111/j.0006-341X.2000.00915.x
Use of statistics in epidemiology is largely
primitive; stepwise variable selection on confounders leaves important
confounders uncontrolled; composition matrix; example with far too many
significant predictors with many regression coefficients absurdly
inflated when overfit; lack of evidence for dietary effects mediated
through constituents; shrinkage instead of variable selection; larger
effect on confidence interval width than on point estimates with
variable selection; uncertainty about variance of random effects is just
uncertainty about prior opinion; estimation of variance is
pointless; instead the analysis should be repeated using different
values; "if one feels compelled to estimate $\tau^2$, I would recommend
giving it a proper prior concentrated around contextually reasonable
values"; claim about ordinary MLE being unbiased is misleading because it
assumes the model is correct and is the only model entertained; shrinkage
towards compositional model; "models need to be complex to capture
uncertainty about the relations...an honest uncertainty assessment
requires parameters for all effects that we know may be present. This
advice is implicit in an antiparsimony principle often attributed to L.
J. Savage ’All models should be as big as an elephant (see Draper,
1995)’". See also gus06per.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C.,
Goodman, S. N., & Altman, D. G. (2016). Statistical tests,
P values, confidence intervals, and power: A guide to
misinterpretations.

*Eur J Epi*, *31*(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3
Best article on misinterpretation of p-values. Pithy
summaries.

Hackam, D. G., & Redelmeier, D. A. (2006). Translation of research
evidence from animals to humans.

*JAMA*, *296*, 1731–1732. Review of basic science literature that
documents systemic methodologic shortcomings. In a personal
communication on 20Oct06 the authors reported that they found a few more
biostatistical problems that could not make it into the JAMA article
(for space constraints); none of the articles contained a sample size
calculation; none of the articles identified a primary outcome
measure; none of the articles mentioned whether they tested assumptions
or did distributional testing (though a few used non-parametric
tests); most articles had more than 30 endpoints (but few adjusted for
multiplicity, as noted in the article)

Harrell, F. E. (2020a).

*Hmisc: A package of miscellaneous R functions*. https://hbiostat.org/R/Hmisc
Harrell, F. E. (2020b).

*rms: R functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit*. https://hbiostat.org/R/rms
Harrell, F. E., Margolis, P. A., Gove, S., Mason, K. E., Mulholland, E.
K., Lehmann, D., Muhe, L., Gatchalian, S., & Eichenwald, H. F.
(1998). Development of a clinical prediction model for an ordinal
outcome: The World Health Organization ARI Multicentre
Study of clinical signs and etiologic agents of pneumonia,
sepsis, and meningitis in young infants.

*Stat Med*, *17*, 909–944. http://onlinelibrary.wiley.com/doi/10.1002/(SICI)1097-0258(19980430)17:8%3C909::AID-SIM753%3E3.0.CO;2-O/abstract
Hauck, W. W., Anderson, S., & Marcus, S. M. (1998). Should we adjust
for covariates in nonlinear regression analyses of randomized trials?

*Controlled Clin Trials*, *19*, 249–256. https://doi.org/10.1016/S0197-2456(97)00147-5
"For use in a clinician-patient context, there is only
a single person, that patient, of interest. The subject-specific measure
then best reflects the risks or benefits for that patient. Gail has
noted this previously [ENAR Presidential Invited Address, April 1990],
arguing that one goal of a clinical trial ought to be to predict the
direction and size of a treatment benefit for a patient with specific
covariate values. In contrast, population-averaged estimates of
treatment effect compare outcomes in groups of patients. The groups
being compared are determined by whatever covariates are included in the
model. The treatment effect is then a comparison of average outcomes,
where the averaging is over all omitted covariates."

Hlatky, M. A., Greenland, P., Arnett, D. K., Ballantyne, C. M., Criqui,
M. H., Elkind, M. S., Go, A. S., Harrell, F. E., Hong, Y., Howard, B.
V., Howard, V. J., Hsue, P. Y., Kramer, C. M., McConnell, J. P.,
Normand, S. L., O’Donnell, C. J., Smith, S. C., & Wilson, P. W.
(2009). Criteria for evaluation of novel markers of cardiovascular risk:
A scientific statement from the American Heart Association.

*Circ*, *119*(17), 2408–2416. Graph
with different symbols for diseased and non-diseased

Hlatky, M. A., Pryor, D. B., Harrell, F. E., Califf, R. M., Mark, D. B.,
& Rosati, R. A. (1984). Factors affecting the sensitivity and
specificity of exercise electrocardiography. Multivariable
analysis.

*Am J Med*, *77*, 64–71. http://www.sciencedirect.com/science/article/pii/0002934384904376#
Hoeffding, W. (1948). A class of statistics with asymptotically normal
distributions.

*Ann Math Stat*, *19*, 293–325. Partially reprinted in: Kotz, S., Johnson, N.L. (1992)
Breakthroughs in Statistics, Vol I, pp 308-334. Springer-Verlag. ISBN
0-387-94037-5

Holländer, N., Sauerbrei, W., & Schumacher, M. (2004). Confidence
intervals for the effect of a prognostic factor after selection of an
“optimal” cutpoint.

*Stat Med*, *23*, 1701–1713. https://doi.org/10.1002/sim.1611
True type I error can be much greater than nominal
level; one example where nominal is 0.05 and true is 0.5; minimum P-value
method; CART; recursive partitioning; bootstrap method for correcting
confidence interval; based on heuristic shrinkage coefficient; "It should
be noted, however, that the optimal cutpoint approach has disadvantages.
One of these is that in almost every study where this method is applied,
another cutpoint will emerge. This makes comparisons across studies
extremely difficult or even impossible. Altman et al. point out this
problem for studies of the prognostic relevance of the S-phase fraction
in breast cancer published in the literature. They identified 19
different cutpoints used in the literature; some of them were solely
used because they emerged as the “optimal” cutpoint in a
specific data set. In a meta-analysis on the relationship between
cathepsin-D content and disease-free survival in node-negative breast
cancer patients, 12 studies were included with 12 different cutpoints
... Interestingly, neither cathepsin-D nor the S-phase fraction are
recommended to be used as prognostic markers in breast cancer in the
recent update of the American Society of Clinical Oncology.";
dichotomization; categorizing continuous variables; refs alt94dan,
sch94out, alt98sub

Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D. G., &
Newman, T. B. (2013).

*Designing Clinical Research* (4th ed.). LWW.
Investigators, C. (1989). Preliminary report: Effect of
Encainide and Flecainide on mortality in a
randomized trial of arrhythmia suppression after myocardial infarction.

*NEJM*, *321*(6), 406–412.
Ioannidis, J. P. A., & Lau, J. (1997). The impact of high-risk
patients on the results of clinical trials.

*J Clin Epi*, *50*, 1089–1098. https://doi.org/10.1016/S0895-4356(97)00149-2
High risk patients can dominate clinical trials
results; high risk patients may be imbalanced even if overall study is
balanced; magnesium; differential treatment effect by patient
risk; GUSTO; small vs. large trials vs. meta-analysis

Ioannidis, J. P. A. (2010). Expectations, validity, and reality in omics.

*J Clin Epi*, *63*, 945–949. "Each
new field has a rapid exponential growth of its literature over 5–8
years (“new field phase”), followed by an
“established field” phase when growth rates are more
modest, and then an “over-maturity” phase, where the rates
of growth are similar to the growth of the scientific literature at
large or even smaller. There is a parallel in the spread of an
infectious epidemic that emerges rapidly and gets established when a
large number of scientists (and articles) are infected with these
concepts. Then momentum decreases, although many scientists remain
infected and continue to work on this field. New omics infections
continuously arise in the scientific community."; "A large number of
personal genomic tests are already sold in the market, mostly with
direct to consumer advertisement and for “recreational
genomics” purposes (translate: information for the fun of
information)."

Kent, D. M., & Hayward, R. (2007). Limitations of applying summary
results of clinical trials to individual patients.

*JAMA*, *298*, 1209–1212. https://doi.org/10.1001/jama.298.10.1209
Variation in absolute risk reduction in RCTs; failure
of subgroup analysis; covariable adjustment; covariate adjustment; nice
summary of individual patient absolute benefit vs. patient risk

Knaus, W. A., Harrell, F. E., Fisher, C. J., Wagner, D. P., Opal, S. M.,
Sadoff, J. C., Draper, E. A., Walawander, C. A., Conboy, K., &
Grasela, T. H. (1993). The clinical evaluation of new drugs for sepsis:
A prospective study design based on survival analysis.

*JAMA*, *270*, 1233–1241. https://doi.org/10.1001/jama.270.10.1233
Koenker, R., & Bassett, G. (1978). Regression quantiles.

*Econometrica*, *46*, 33–50.
Kornbrot, D. E. (1990). The rank difference test: A new and
meaningful alternative to the Wilcoxon signed ranks test
for ordinal data.

*British Journal of Mathematical and Statistical Psychology*, *43*(2), 241–264. https://doi.org/10.1111/j.2044-8317.1990.tb00939.x
Kotz, S., & Johnson, N. L. (Eds.). (1988).

*Encyclopedia of Statistical Sciences* (Vol. 9). Wiley.
Krouwer, J. S. (2008). Why Bland-Altman plots should use
X, not (Y+X)/2 when
X is a reference method.

*Stat Med*, *27*(5), 778–780. https://doi.org/10.1002/sim.3086
Leek, J. T., & Peng, R. D. (2015). What is the question?

*Science*, *347*(6228), 1314–1315. https://doi.org/10.1126/science.aaa6146
Lenth, R. V. (2001). Some practical guidelines for effective sample size
determination.

*Am Statistician*, *55*, 187–193. https://doi.org/10.1198/000313001317098149
Problems with Cohen’s method.

MacKay, R. J., & Oldford, R. W. (2000). Scientific
Method, Statistical Method and the
Speed of Light.

*Statist. Sci.*, *15*(3), 254–278. https://doi.org/10.1214/ss/1009212817
Matthews, J. N. S., Altman, D. G., Campbell, M. J., & Royston, P.
(1990). Analysis of serial measurements in medical research.

*BMJ*, *300*, 230–235. https://doi.org/10.1136/bmj.300.6719.230
Matthews, J. N. S., & Badi, N. H. (2015). Inconsistent treatment
estimates from mis-specified logistic regression analyses of randomized
trials.

*Stat Med*, *34*(19), 2681–2694. https://doi.org/10.1002/sim.6508
Michiels, S., Koscielny, S., & Hill, C. (2005). Prediction of cancer
outcome with microarrays: A multiple random validation strategy.

*Lancet*, *365*, 488–492. Comment on
p. 454; validation; microarray; bioinformatics; machine learning; nearest
centroid; severe problems with data splitting; high variability of list of
genes; problems with published studies; nice results for effect of
training sample size on misclassification error; nice use of confidence
intervals on accuracy estimates; unstable molecular signatures; high
instability due to dependence on selection of training sample

Moons, K. G. M., & Harrell, F. E. (2003). Sensitivity and
specificity should be de-emphasized in diagnostic accuracy studies.

*Acad Rad*, *10*, 670–672.
Moons, K. G. M., van Es, G.-A., Deckers, J. W., Habbema, J. D. F., &
Grobbee, D. E. (1997). Limitations of sensitivity, specificity,
likelihood ratio, and Bayes’ theorem in assessing
diagnostic probabilities: A clinical example.

*Epi*, *8*(1), 12–17. Non-constancy of
sensitivity, specificity, likelihood ratio in a real example

Moore, T. J. (1995).

*Deadly Medicine: Why Tens of Thousands of Patients Died in America’s Worst Drug Disaster*. Simon & Schuster.
Multicenter Postinfarction Research Group. (1983). Risk stratification
and survival after myocardial infarction.

*NEJM*, *309*, 331–336. Terrible example of dichotomizing
continuous variables; figure in Papers/modelingPredictors

Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers,
C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J.
J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible
science.

*Nat Hum Behav*, *1*(1), 0021. https://doi.org/10.1038/s41562-016-0021
Murrell, P. (2013). InfoVis and statistical graphics:
Comment.

*J Comp Graph Stat*, *22*(1), 33–37. https://doi.org/10.1080/10618600.2012.751875
Excellent brief how-to list; incorporated into
graphscourse

Naggara, O., Raymond, J., Guilbert, F., Roy, D., Weill, A., &
Altman, D. G. (2011). Analysis by categorizing or dichotomizing
continuous variables is inadvisable: An example from the
natural history of unruptured aneurysms.

*Am J Neuroradiol*, *32*(3), 437–440. https://doi.org/10.3174/ajnr.A2425
Neuhaus, J. M. (1998). Estimation efficiency with omitted covariates in
generalized linear models.

*J Am Stat Assoc*, *93*, 1124–1129. "to improve the efficiency of estimated
covariate effects of interest, analysts of randomized clinical trial
data should adjust for covariates that are strongly associated with the
outcome, and ... analysts of observational data should not adjust for
covariates that do not confound the association of interest"

Newman, T. B., & Kohn, M. A. (2009).

*Evidence-Based Diagnosis*. Cambridge University Press.
Nuzzo, R. (2015). How scientists fool themselves — and how they can
stop.

*Nature*, *526*(7572), 182–185.
O’Brien, P. C. (1988). Comparing two samples: Extensions of
the t, rank-sum, and log-rank test.

*J Am Stat Assoc*, *83*, 52–61. https://doi.org/10.1080/01621459.1988.10478564
See Hauck WW, Hyslop T, Anderson S (2000) Stat in Med
19:887-899

Ohman, E. M., Armstrong, P. W., Christenson, R. H., Granger, C. B.,
Katus, H. A., Hamm, C. W., O’Hannesian, M. A., Wagner, G. S., Kleiman,
N. S., Harrell, F. E., Califf, R. M., Topol, E. J., Lee, K. L., &
Investigators, T. G. (1996). Cardiac troponin T levels for
risk stratification in acute myocardial ischemia.

*NEJM*, *335*, 1333–1341.
Paré, G., Mehta, S. R., Yusuf, S., & Others. (2010). Effects of
CYP2C19 genotype on outcomes of clopidogrel treatment.

*NEJM*, *online*.
Paul, D., Bair, E., Hastie, T., & Tibshirani, R. (2008).
“Preconditioning” for feature selection and
regression in high-dimensional problems.

*Ann Stat*, *36*(4), 1595–1619. https://doi.org/10.1214/009053607000000578
Develop consistent Y using a latent variable
structure, using for example supervised principal components. Then run
stepwise regression or lasso predicting Y (lasso worked better). Can run
into problems when a predictor has importance in an adjusted sense but
has no marginal correlation with Y; model approximation; model
simplification

Pencina, M. J., D’Agostino Sr, R. B., D’Agostino Jr, R. B., & Vasan,
R. S. (2008). Evaluating the added predictive ability of a new marker:
From area under the ROC curve to
reclassification and beyond.

*Stat Med*, *27*, 157–172. Small differences in ROC area can still
be very meaningful; example of insignificant test for difference in ROC
areas with very significant results from new method; Yates’
discrimination slope; reclassification table; limiting version of this
based on whether and amount by which probabilities rise for events and
lower for non-events when comparing new model to old; comparing two
models; see letter to the editor by Van Calster and Van Huffel, Stat in
Med 29:318-319, 2010 and by Cook and Paynter, Stat in Med 31:93-97,
2012

Platt, J. R. (1964). Strong inference.

*Science*, *146*(3642), 347–353.
Pryor, D. B., Harrell, F. E., Lee, K. L., Califf, R. M., & Rosati,
R. A. (1983). Estimating the likelihood of significant coronary artery
disease.

*Am J Med*, *75*, 771–780.
R Development Team. (2020).

*R: A language and environment for statistical computing*. R Foundation for Statistical Computing. http://www.R-project.org
Raab, G. M., Day, S., & Sales, J. (2000). How to select covariates
to include in the analysis of a clinical trial.

*Controlled Clin Trials*, *21*, 330–342. How correlated
with outcome must a variable be before adding it helps more than hurts, as
a function of sample size; planning; design; variable selection

Robinson, L. D., & Jewell, N. P. (1991). Some surprising results
about covariate adjustment in logistic regression models.

*Int Stat Rev*, *59*, 227–240.
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing
continuous predictors in multiple regression: A bad idea.

*Stat Med*, *25*, 127–141. https://doi.org/10.1002/sim.2331
Destruction of statistical inference when cutpoints
are chosen using the response variable; varying effect estimates when
cutpoints are changed; difficult to interpret effects when dichotomizing; nice
plot showing effect of categorization; PBC data

Rubin, D. B. (2007). The design versus the analysis of observational
studies for causal effects: Parallels with the design of
randomized studies.

*Stat Med*, *26*, 20–36.
Ruxton, G. D., & Colegrave, N. (2017).

*Experimental Design for the Life Sciences* (4th ed.). Oxford University Press.
Sargent, D. J., & Hodges, J. S. (1996).

*A hierarchical model method for subgroup analysis of time-to-event data in the Cox regression setting*.
Schoenfeld, D. A. (1983). Sample size formulae for the proportional
hazards regression model.

*Biometrics*, *39*, 499–503.
Schwemer, G. (2000). General linear models for multicenter clinical
trials.

*Controlled Clin Trials*, *21*, 21–29.
Senn, S. (2004). Controversies concerning randomization and additivity
in clinical trials.

*Stat Med*, *23*, 3729–3753. https://doi.org/10.1002/sim.2074
p. 3735: "in the pharmaceutical industry, in analyzing
the data, if a linear model is employed, it is usual to fit centre as a
factor but unusual to fit block."; p. 3739: a large trial "is not less
vulnerable to chance covariate imbalance"; p. 3741: "There is no place, in
my view, for classical minimization" (vs. the method of Atkinson); "If an
investigator uses such [allocation based on covariates] schemes, she or
he is honour bound, in my opinion, as a very minimum, to adjust for the
factors used to balance, since the fact that they are being used to
balance is an implicit declaration that they will have prognostic
value."; "The point of view is sometimes defended that analyses that
ignore covariates are superior because they are simpler. I do not accept
this. A value of $\pi=3$ is a simple one and accurate to one significant
figure ... However very few would seriously maintain that it should
generally be adopted by engineers."; p. 3742: "as Fisher pointed out ...
if we balance by a predictive covariate but do not fit the covariate in
the model, not only do we not exploit the covariate, we actually
increase the expected declared standard error."; p. 3744: "I would like
to see standard errors for group means abolished."; p. 3744: "A common
habit, however, in analyzing trials with three or more arms is to pool
the variances from all arms when calculating the standard error of a
given contrast. In my view this is a curious practice ... it relies on
an assumption of additivity of *all* treatments when comparing only
*two*. ... a classical t-test is robust
to heteroscedasticity provided that sample sizes are equal in the groups
being compared and that the variance is internal to those two groups but
is not robust where an external estimate is being used."; p. 3745: "By
adjusting main effects for interactions a type III analysis is similarly
illogical to Neyman’s hypothesis test."; "Guyatt *et al.* ... found a ’method for
estimating the proportion of patients who benefit from a treatment ...
In fact they had done no such thing."; p. 3746: "When I checked the Web
of Science on 29 June 2003, the paper by Horwitz *et al.* had been cited 28 times and that
by Guyatt *et al.* had been cited 79 times. The
letters pointing out the fallacies had been cited only 8 and 5 times
respectively."; "if we pool heterogeneous strata, the odds ratio of the
treatment effect will be different from that in every stratum, even if
from stratum to stratum it does not vary."; p. 3747: "Part of the
problem with Poisson, proportional hazard and logistic regression
approaches is that they use a single parameter, the linear predictor,
with no equivalent of the variance parameter in the Normal case. This
means that lack of fit impacts on the estimate of the predictor. ...
what is the value of randomization if, in all except the Normal case, we
cannot guarantee to have unbiased estimates. My view ... was that the
form of analysis envisaged (that is to say, which factors and covariates
should be fitted) justified the allocation and *not vice versa*."; "use the additive measure at
the point of analysis and transform to the relevant scale at the point
of implementation. This transformation at the point of medical
decision-making will require auxiliary information on the level of
background risk of the patient."; p. 3748: "The decision to fit
prognostic factors has a far more dramatic effect on the precision of
our inferences than the choice of an allocation based on covariates or
randomization approach and one of my chief objections to the allocation
based on covariates approach is that trialists have tended to use the
fact that they have balanced as an excuse for not fitting. This is a
grave mistake."

Senn, S. (2008).

*Statistical Issues in Drug Development* (2nd ed.). Wiley.
Senn, S. J. (2005). Dichotomania: An obsessive compulsive disorder that
is badly affecting the quality of analysis of pharmaceutical trials.

*Proceedings of the International Statistical Institute, 55th Session*. http://hbiostat.org/papers/Senn/dichotomania.pdf
Senn, S., Anisimov, V. V., & Fedorov, V. V. (2010). Comparisons of
minimization and Atkinson’s algorithm.

*Stat Med*, *29*, 721–730. "fitting covariates may make
a more valuable and instructive contribution to inferences about
treatment effects than only balancing them"

Senn, S., Stevens, L., & Chaturvedi, N. (2000). Repeated measures in
clinical trials: Simple strategies for analysis using
summary measures.

*Stat Med*,*19*, 861–877. https://doi.org/10.1002/(SICI)1097-0258(20000330)19:6<861::AID-SIM407>3.0.CO;2-F
Shen, X., Huang, H.-C., & Ye, J. (2004). Inference after model
selection.

*J Am Stat Assoc*, *99*, 751–762. Uses optimal approximation for estimating mean and
variance of complex statistics adjusting for model selection

Siqueira, A. L., & Taylor, J. M. G. (1999). Treatment effects in a
logistic model involving the Box-Cox transformation.

*J Am Stat Assoc*, *94*, 240–246. Box-Cox
transformation of a covariable; validity of inference for treatment
effect when treating the exponent for the covariable as fixed

Spanos, A., Harrell, F. E., & Durack, D. T. (1989). Differential
diagnosis of acute meningitis: An analysis of the
predictive value of initial observations.

*JAMA*,*262*, 2700–2707. https://doi.org/10.1001/jama.262.19.2700
Steyerberg, E. W. (2018). Validation in prediction research: The waste
by data-splitting.

*Journal of Clinical Epidemiology*,*0*(0). https://doi.org/10.1016/j.jclinepi.2018.07.010
Steyerberg, E. W., Bossuyt, P. M. M., & Lee, K. L. (2000). Clinical
trials in acute myocardial infarction: Should we adjust for
baseline characteristics?

*Am Heart J*,*139*, 745–751. https://doi.org/10.1016/S0002-8703(00)90001-2
Subramanian, J., & Simon, R. (2010). Gene expression-based
prognostic signatures in lung cancer: Ready for clinical
use?

*J Nat Cancer Inst*,*102*, 464–474. none demonstrated to have clinical
utility; bioinformatics; quality scoring of papers

Tukey, J. W. (1993). Tightening the Clinical Trial.

*Controlled Clin Trials*,*14*, 266–285. https://doi.org/10.1016/0197-2456(93)90225-3 showed that asking clinicians to make up regression
coefficients out of thin air is better than not adjusting for
covariables

van der Ploeg, T., Austin, P. C., & Steyerberg, E. W. (2014). Modern
modelling techniques are data hungry: A simulation study for predicting
dichotomous endpoints.

*BMC Medical Research Methodology*,*14*(1), 137+. https://doi.org/10.1186/1471-2288-14-137 Would be better to use proper accuracy scores in the
assessment. Too much emphasis on optimism as opposed to final
discrimination measure. But much good practical information. Recursive
partitioning fared poorly.

van Klaveren, D., Vergouwe, Y., Farooq, V., Serruys, P. W., &
Steyerberg, E. W. (2015). Estimates of absolute treatment benefit for
individual patients required careful modeling of statistical
interactions.

*J Clin Epi*,*68*(11), 1366–1374. https://doi.org/10.1016/j.jclinepi.2015.02.012
Vickers, A. J. (2008). Decision analysis for the evaluation of
diagnostic tests, prediction models, and molecular markers.

*Am Statistician*,*62*(4), 314–320. limitations of accuracy metrics; incorporating clinical
consequences; nice example of calculation of expected outcome; drawbacks
of conventional decision analysis, especially because of the difficulty
of eliciting the expected harm of a missed diagnosis; use of a threshold
on the probability of disease for taking some action; decision curve; has
other good references to decision analysis

Vickers, A. J., Basch, E., & Kattan, M. W. (2008). Against
diagnosis.

*Ann Int Med*,*149*, 200–203. "The act of diagnosis requires that patients be placed
in a binary category of either having or not having a certain disease.
Accordingly, the diseases of particular concern for industrialized
countries—such as type 2 diabetes, obesity, or depression—require that a
somewhat arbitrary cut-point be chosen on a continuous scale of
measurement (for example, a fasting glucose level >6.9 mmol/L [>125 mg/dL] for type 2 diabetes).
These cut-points do not adequately reflect disease biology, may
inappropriately treat patients on either side of the cut-point as 2
homogenous risk groups, fail to incorporate other risk factors, and are
invariable to patient preference."

Wainer, H. (2006). Finding what is not there through the unfortunate
binning of results: The Mendel effect.

*Chance*,*19*(1), 49–56. can find bins that yield
either positive or negative association; especially pertinent when
effects are small; "With four parameters, I can fit an elephant; with
five, I can make it wiggle its trunk." - John von Neumann

White, I. R., Morris, T. P., & Williamson, E. (2021).

*Covariate adjustment in randomised trials: Canonical link functions protect against model mis-specification*. http://arxiv.org/abs/2107.07278

White, I. R., & Thompson, S. G. (2005). Adjusting for partially
missing baseline measurements in randomized trials.

*Stat Med*,*24*, 993–1007.
Whitehead, J. (1993). Sample size calculations for ordered categorical
data.

*Stat Med*,*12*, 2257–2271.
Wilcox, R., Carlson, M., Azen, S., & Clark, F. (2013). Avoid lost
discoveries, because of violations of standard assumptions, by using
modern robust statistical methods.

*Journal of Clinical Epidemiology*,*66*(3), 319–329. https://doi.org/10.1016/j.jclinepi.2012.09.003
Xie, Y. (2015).

*Dynamic Documents with R and knitr* (2nd ed.). Chapman and Hall.
Yamaguchi, T., & Ohashi, Y. (1999). Investigating centre effects in
a multi-centre clinical trial of superficial bladder cancer.

*Stat Med*,*18*, 1961–1971.