New Year Goals

Methodologic goals and wishes for research and clinical practice for 2018

Vanderbilt University
School of Medicine
Department of Biostatistics


December 29, 2017

Here are some goals related to scientific research and clinical medicine that I’d like to see accomplished in 2018. These are followed by some updates for 2019.

Some wishes expressed by others on Twitter:

Updates for 2019

My goals for 2018 were lofty, so it’s not surprising that I’m disappointed overall with how little progress has been made on many fronts. But I am heartened by seven things:

  • Clinicians are getting noticeably more dubious about personalized/precision medicine
  • Researchers and clinicians are more dubious about benefits of machine learning
  • Researchers are more enlightened about problems with p-values and the dichotomous thinking that usually comes with them, and are especially starting to understand what’s wrong with “significant”
  • Researchers are more enlightened about harm caused by dichotomania in general
  • We successfully launched and have created in-depth discussion in the community about many of the issues listed under goals for 2018
  • More researchers are seeing what a waste of ink ROC curves are
  • More high-profile Bayesian analyses of clinical trials are being published

Areas that remain particularly frustrating are:

  • Too many clinicians still believe that randomized clinical trials do not provide valuable efficacy data outside of the types of patients enrolled in the trials
  • Clinical researchers are still computing change from baseline
  • Sequential clinical trials are not being done (trials in which the sample size is not pretended to be known)
  • A failure to understand conditioning (as in what is assumed when computing a conditional probability)

If I had to make just one plea for 2019, a general one is this: Recognize that actionable statistical information comes from thinking in a predictive mode. Condition on what you already know to predict what you don’t. Use forward-time, complete conditioning, as opposed to type I errors, p-values, sensitivity, specificity, and marginal (sample-averaged) estimates.
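As a small illustration of that contrast (a sketch, not part of the original post): sensitivity and specificity condition on the unknown disease status, so on their own they do not give the forward-time probability of disease given an observed test result. The function name `prob_disease_given_positive` and all numeric values below are hypothetical.

```python
def prob_disease_given_positive(sensitivity, specificity, prevalence):
    """Forward-time probability P(disease | positive test) via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, held fixed across settings.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# Hypothetical pre-test probabilities (what is already known about the patient).
for prevalence in (0.01, 0.10, 0.30):
    p = prob_disease_given_positive(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"pre-test probability={prevalence:.2f}  P(disease | test+)={p:.3f}")
```

The same sensitivity and specificity yield very different predictive probabilities depending on what is conditioned on beforehand, which is one way to read the plea for forward-time conditioning.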

Discussion Archive (2018)

AS: Interested in some more details for some of these!

“Physicians come to know that precision/personalized medicine for the most part is based on a false premise” Do you mind stating the false premise you seem to have in mind? Is it the seemingly indistinguishable nature of natural variance in outcomes vs variance in treatment effects due to individual/genetic differences?

“Machine learning/deep learning is understood to not find previously unknown information in data in the majority of cases, and tends to work better than traditional statistical models only when dominant non-additive effects are present and the signal:noise ratio is decently high” While you state this as a goal for the future, are there any existing works that seem to point in this direction? I’d be interested in reading relevant work.

“Clinical quality improvement initiatives will rely on randomized trial evidence and de-emphasize purely observational evidence; learning health systems will learn things that are actually true” What about systems that try to leverage observational evidence by using causal inference techniques (e.g., accounting for selection bias through inverse propensity scores or related methods)?

“Classification accuracy will be mistrusted as a measure of predictive accuracy” Any example measures you have in mind?

Frank Harrell: Thanks for the good questions. Regarding precision medicine, I was referring to the general lack of evidence for heterogeneity of treatment effect when effects are measured on the appropriate relative scale (log odds, log hazard, etc.). Very few demonstrations of treatment interactions have been published and validated. A different way of looking at this came from a survey of clinical trials, which found that outcome variation was less in the treatment arm at follow-up than at baseline. This is good indirect evidence against heterogeneity of treatment effect, or at least an interesting way to look at it.
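A minimal sketch of why the choice of scale matters (an illustration added here, not from the original reply): a treatment effect that is exactly constant on the log odds scale still produces different absolute risk reductions for patients with different baseline risks, with no treatment interaction at all. The odds ratio and baseline risks below are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def treated_risk(baseline_risk, odds_ratio):
    """Risk under treatment when treatment multiplies the odds by odds_ratio."""
    return expit(logit(baseline_risk) + math.log(odds_ratio))


ODDS_RATIO = 0.7                            # hypothetical constant relative effect
BASELINE_RISKS = [0.02, 0.10, 0.30, 0.50]   # hypothetical untreated risks

risk_differences = [b - treated_risk(b, ODDS_RATIO) for b in BASELINE_RISKS]

for b, d in zip(BASELINE_RISKS, risk_differences):
    print(f"baseline risk={b:.2f}  absolute risk reduction={d:.4f}")
```

Variation in absolute benefit of this kind follows from differences in baseline risk alone, so it is not evidence of heterogeneity of treatment effect on the relative scale.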

Regarding machine learning, this is based on personal experiences and reading. The experiences are related to lack of rigorous validation of ML prediction rules and seeing some rules validate very poorly. Related to signal:noise ratio, many ML practitioners use classification methods, and classification is not appropriate for medical outcomes. But for pattern recognition (voice, images, etc.), classifiers can work well. What distinguishes the two situations is that you can train a pattern recognition algorithm on test images/sound where you absolutely know the truth and you can repeat training on a single image as many times as you want. That’s not the case with medical outcomes, and the \(R^2\) in our typical outcomes study is very low.

On quality improvement initiatives, propensity adjustment and related methods strongly assume you’ve captured the confounder variables. With casual data collection (say with EHRs) that is often not the case.
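A small simulation can make the assumption concrete. This is a sketch under invented numbers, not an analysis from the original reply: a binary confounder `u` drives both treatment assignment and outcome, and the treatment has no effect at all. With a single binary confounder, stratifying on `u` is the same as stratifying on the propensity score, so the adjusted estimate is only available if `u` was actually recorded.

```python
import random


def simulate(n=20000, seed=1):
    """Simulate (confounder, treated, outcome) rows; treatment has NO effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                    # confounder, e.g. disease severity
        t = rng.random() < (0.8 if u else 0.2)    # sicker patients treated more often
        y = rng.random() < (0.6 if u else 0.2)    # outcome depends on u only, not on t
        rows.append((u, t, y))
    return rows


def mean_outcome(rows, treated, confounder=None):
    """Outcome proportion in one arm, optionally within one confounder stratum."""
    selected = [y for u, t, y in rows
                if t == treated and (confounder is None or u == confounder)]
    return sum(selected) / len(selected)


rows = simulate()

# Analysis that ignores u (as if it had never been captured).
naive = mean_outcome(rows, True) - mean_outcome(rows, False)

# Analysis that stratifies on u, then averages over the distribution of u.
share_u = sum(u for u, _, _ in rows) / len(rows)
diff_u1 = mean_outcome(rows, True, True) - mean_outcome(rows, False, True)
diff_u0 = mean_outcome(rows, True, False) - mean_outcome(rows, False, False)
adjusted = share_u * diff_u1 + (1.0 - share_u) * diff_u0

print(f"naive difference:    {naive:.3f}")
print(f"adjusted difference: {adjusted:.3f}")
```

The naive comparison shows a sizeable apparent treatment effect, while the stratified one is close to zero. If `u` is missing from the data, no propensity method can recover the stratified result.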

Re: classification accuracy see this.
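The linked material is not reproduced here. As one hedged illustration of the concern, the sketch below compares classification accuracy at a 0.5 cutoff with the Brier score, a proper scoring rule for probability forecasts, chosen here as an assumption about the kind of measure meant. The outcomes and both sets of predicted probabilities are invented.

```python
def accuracy(probs, outcomes, threshold=0.5):
    """Proportion classified correctly after dichotomizing at threshold."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Hypothetical observed outcomes and two hypothetical sets of predictions.
outcomes = [1, 1, 1, 0, 0, 0, 1, 0]
informative = [0.90, 0.80, 0.90, 0.10, 0.20, 0.10, 0.40, 0.60]
near_coin_flip = [0.55, 0.51, 0.60, 0.45, 0.49, 0.40, 0.49, 0.51]

for name, probs in (("informative", informative),
                    ("near coin flip", near_coin_flip)):
    print(f"{name:15s} accuracy={accuracy(probs, outcomes):.2f}  "
          f"Brier={brier_score(probs, outcomes):.3f}")
```

Both sets of predictions get the same accuracy, yet the Brier score separates them, because accuracy discards everything about a predicted probability except which side of the cutoff it falls on.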