New Year Goals

2018

2019

Methodologic goals and wishes for research and clinical practice for 2018

Author

Affiliation

Frank Harrell

Vanderbilt University
School of Medicine
Department of Biostatistics

Published

December 29, 2017

Modified

January 3, 2019

Here are some goals related to scientific research and clinical medicine that I’d like to see accomplished in 2018. These are followed by some updates for 2019.

Physicians come to know that precision/personalized medicine for the most part is based on a false premise
Machine learning/deep learning is understood to not find previously unknown information in data in the majority of cases, and tends to work better than traditional statistical models only when dominant non-additive effects are present and the signal:noise ratio is decently high
Practitioners will make more progress in correctly using “old” statistical tools such as regression models
Medical diagnosis is finally understood as a task in probabilistic thinking, and sensitivity and specificity (which are characteristics not only of tests but also of patients) are seldom used
Practitioners using cutpoints/thresholds for inherently continuous measurements will finally go back to primary references and find that the thresholds were never supported by data
Dichotomania is seen as a failure to understand utility/loss/cost functions and as a tragic loss of information
Clinical quality improvement initiatives will rely on randomized trial evidence and de-emphasize purely observational evidence; learning health systems will learn things that are actually true
Clinicians will give up on the idea that randomized clinical trials do not generalize to real-world settings
Fewer pre-post studies will be done
More research will be reproducible with sounder sample size calculations, all data manipulation and analysis fully scripted, and data available for others to analyze in different ways
Fewer sample size calculations will be based on a ‘miracle’ effect size
Non-inferiority studies will no longer use non-inferiority margins that are far larger than any clinically significant difference
Fewer sample size calculations will be undertaken and more sequential experimentation done
More Bayesian studies will be designed and executed
Classification accuracy will be mistrusted as a measure of predictive accuracy
More researchers will realize that estimation rather than hypothesis testing is the goal
Change from baseline will seldom be computed, let alone used in an analysis
Percents will begin to be replaced with fractions and ratios
Fewer researchers will draw any conclusion from large p-values other than “the money was spent”
Fewer researchers will draw conclusions from small p-values
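
To make the information-loss point about cutpoints and dichotomania concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration and not part of the original post: the simulated data, the seed, the sample size, the median as the cutpoint, and the helper name `pearson`. It compares how strongly a continuous predictor and its median-split version correlate with the same outcome.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(2018)  # arbitrary seed, chosen only for repeatability
n = 50_000                 # hypothetical sample size

# Simulated continuous predictor and an outcome that depends on it linearly.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [xi + rng.gauss(0.0, 1.0) for xi in x]

# Dichotomize the predictor at its sample median ("high" vs "low").
median_x = sorted(x)[n // 2]
x_split = [1.0 if xi > median_x else 0.0 for xi in x]

r_continuous = pearson(x, y)
r_split = pearson(x_split, y)

# Share of the squared correlation that survives the median split.
retained = (r_split / r_continuous) ** 2

print(f"r using continuous x:   {r_continuous:.3f}")
print(f"r using median split:   {r_split:.3f}")
print(f"squared-r retained:     {retained:.2f}")
```

Under these assumptions the median split keeps only about two thirds of the squared correlation. For a bivariate normal pair the correlation itself is attenuated by a factor of roughly sqrt(2/π) ≈ 0.80, and cutpoints away from the median lose more.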

Some wishes expressed by others on Twitter:

No more ROC curves
No more bar plots
Ban the terms ‘statistical significance’ and ‘statistically insignificant’

Updates for 2019

My goals for 2018 were lofty, so it’s not surprising that I’m disappointed overall with how little progress has been made on many fronts. But I am heartened by seven things:

Clinicians are getting noticeably more dubious about personalized/precision medicine
Researchers and clinicians are more dubious about benefits of machine learning
Researchers are more enlightened about problems with p-values and the dichotomous thinking that usually comes with them, and are especially starting to understand what’s wrong with “significant”
Researchers are more enlightened about harm caused by dichotomania in general
We successfully launched datamethods.org and have created in-depth discussion in the community about many of the issues listed under goals for 2018
More researchers are seeing what a waste of ink ROC curves are
More high-profile Bayesian analyses of clinical trials are being published

Areas that remain particularly frustrating are:

Too many clinicians still believe that randomized clinical trials do not provide valuable efficacy data outside of the types of patients enrolled in the trials
Clinical researchers are still computing change from baseline
Sequential clinical trials are not being done (trials in which the sample size is not pretended to be known)
A failure to understand conditioning (as in what is assumed when computing a conditional probability)
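
One commonly cited problem with pre-post comparisons and change from baseline is regression to the mean. The sketch below is an illustration under stated assumptions, not the author's own analysis: the measurement model, the seed, the sample size, and the enrollment `cutoff` are all hypothetical. It simulates patients whose true level never changes and who receive no treatment effect at all, enrolls only those with a high baseline measurement, and computes their mean change from baseline.

```python
import random

rng = random.Random(2019)  # arbitrary seed, chosen only for repeatability
n = 20_000                 # hypothetical number of screened patients
cutoff = 1.0               # hypothetical enrollment threshold on the baseline

changes = []
for _ in range(n):
    true_level = rng.gauss(0.0, 1.0)                # stable underlying level
    baseline = true_level + rng.gauss(0.0, 1.0)     # noisy first measurement
    follow_up = true_level + rng.gauss(0.0, 1.0)    # noisy second measurement, no treatment effect
    if baseline > cutoff:                           # enroll only "high" patients
        changes.append(follow_up - baseline)

mean_change = sum(changes) / len(changes)

print(f"enrolled: {len(changes)} of {n}")
print(f"mean change from baseline with zero true effect: {mean_change:.2f}")
```

Even though nothing changed for any patient, the enrolled group shows a clearly negative mean change, purely because enrollment was conditioned on a noisy baseline. An uncontrolled pre-post study would read that as improvement.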

If I had to make just one plea for 2019, a general one is this: Recognize that actionable statistical information comes from thinking in a predictive mode. Condition on what you already know to predict what you don’t. Use forward-time, complete conditioning, as opposed to type-I errors, p-values, sensitivity, specificity, and marginal (sample-averaged) estimates.
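
As a small worked example of forward-time conditioning, the sketch below converts backward-time quantities (sensitivity and specificity, which condition on the unknown disease status) into the forward-time probability of disease given an observed test result. The function name `post_test_probability` and all numeric inputs are hypothetical. Treating sensitivity and specificity as constants is a simplifying assumption that the post itself questions, since they can vary with patient characteristics.

```python
def post_test_probability(pre_test_prob, sensitivity, specificity, test_positive=True):
    """P(disease | test result) by Bayes' rule, assuming fixed sensitivity and specificity."""
    if test_positive:
        p_result_given_disease = sensitivity
        p_result_given_healthy = 1.0 - specificity
    else:
        p_result_given_disease = 1.0 - sensitivity
        p_result_given_healthy = specificity
    numerator = p_result_given_disease * pre_test_prob
    denominator = numerator + p_result_given_healthy * (1.0 - pre_test_prob)
    return numerator / denominator


# Same test, same positive result, three different patients (pre-test probabilities).
for pre in (0.01, 0.10, 0.50):
    post = post_test_probability(pre, sensitivity=0.90, specificity=0.90)
    print(f"pre-test {pre:.2f} -> P(disease | positive test) = {post:.3f}")
```

With these illustrative numbers the same positive result corresponds to a probability of disease of about 0.08, 0.50, or 0.90 depending on what was already known about the patient. That is why the quantity to act on is the probability conditional on everything known, not the test's sensitivity or specificity.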

Discussion Archive (2018)

AS: Interested in some more details for some of these!

“Physicians come to know that precision/personalized medicine for the most part is based on a false premise” Do you mind stating the false premise you seem to have in mind? Is it the seemingly indistinguishable nature of natural variance in outcomes vs variance in treatment effects due to individual/genetic differences?

“Machine learning/deep learning is understood to not find previously unknown information in data in the majority of cases, and tends to work better than traditional statistical models only when dominant non-additive effects are present and the signal:noise ratio is decently high” While you state this as a goal for the future, are there any existing works that seem to point in this direction? I’d be interested in reading relevant work.

“Clinical quality improvement initiatives will rely on randomized trial evidence and de-emphasize purely observational evidence; learning health systems will learn things that are actually true” What about systems that try to leverage observational evidence by using causal inference techniques (e.g., accounting for selection bias through inverse propensity scores or related methods)?

“Classification accuracy will be mistrusted as a measure of predictive accuracy” Any example measures you have in mind?

Frank Harrell: Thanks for the good questions. Regarding precision medicine, I was referring to the general lack of evidence for heterogeneity of treatment effect when effects are measured on the appropriate relative scale (log odds, log hazard, etc.). Very few demonstrations of treatment interactions have been published and validated. A different way of looking at this comes from a survey of clinical trials that found that outcome variation was less in the treatment arm at follow-up than at baseline. This is good indirect evidence against heterogeneity of treatment effect, or at least an interesting way to look at it.

Regarding machine learning, this is based on personal experiences and reading. The experiences are related to lack of rigorous validation of ML prediction rules and seeing some rules validate very poorly. Related to signal:noise ratio, many ML practitioners use classification methods, and classification is not appropriate for medical outcomes. But for pattern recognition (voice, images, etc.), classifiers can work well. What distinguishes the two situations is that you can train a pattern recognition algorithm on test images/sound where you absolutely know the truth, and you can repeat training on a single image as many times as you want. That’s not the case with medical outcomes, and the $R^2$ in our typical outcomes study is very low.

On quality improvement initiatives, propensity adjustment and related methods strongly assume you’ve captured the confounder variables. With casual data collection (say with EHRs) that is often not the case.

Re: classification accuracy, see this post (../class-damage).
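
To illustrate why classification accuracy can mislead as a measure of predictive accuracy, here is a minimal sketch comparing it with the Brier score, a standard proper scoring rule for probability forecasts. The Brier score is chosen here as an assumption for illustration. The linked post is not reproduced, and the outcomes, the two sets of predicted probabilities, and the 0.5 threshold are all hypothetical.

```python
def accuracy(y, p, threshold=0.5):
    """Proportion classified correctly after forcing probabilities into 0/1 at a threshold."""
    predicted = [1 if pi >= threshold else 0 for pi in p]
    return sum(int(a == b) for a, b in zip(y, predicted)) / len(y)


def brier_score(y, p):
    """Mean squared difference between predicted probability and observed 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


# Hypothetical observed outcomes (1 = event, 0 = no event).
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Two hypothetical models: one barely leans the right way, one is confident and right.
p_vague = [0.51 if yi == 1 else 0.49 for yi in y]
p_sharp = [0.90 if yi == 1 else 0.10 for yi in y]

for name, p in (("vague", p_vague), ("sharp", p_sharp)):
    print(f"{name}: accuracy = {accuracy(y, p):.2f}, Brier score = {brier_score(y, p):.4f}")
```

Both hypothetical models score 100% accuracy, yet their Brier scores are 0.2401 and 0.01 (lower is better). Accuracy discards how close each probability was to the truth, which is the kind of information loss the post attributes to dichotomization.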

Reuse

CC BY 4.0
