A Litany of Problems With p-values

decision-making

bayes

multiplicity

p-value

hypothesis-testing

2017

p-values are very often misinterpreted. p-values and null hypothesis significance testing have hurt science. This article attempts to catalog all the ways in which this happens.

Author

Affiliation

Frank Harrell

Vanderbilt University
School of Medicine
Department of Biostatistics

Published

February 5, 2017

With the many problems that p-values have, and the temptation to “bless” research when the p-value falls below an arbitrary threshold such as 0.05 or 0.005, researchers using p-values should at least be fully aware of what they are getting. They need to know exactly what a p-value means and which assumptions are required for it to have that meaning.

♦ A p-value is the probability of getting, in another study, a test statistic that is more extreme than the one obtained in your study if a series of assumptions hold. It is strictly a probability about data, not a probability about a hypothesis or about the effect of a variable.
♦ The study must be capable of being repeated infinitely often, or one must play a mind game in which this is so.
♦ The repeated studies have data generated by exactly the same data model as the original study with one exception: the null hypothesis is forced to be exactly true. [And we don’t know how to count negative treatment benefit.]
♦ To do the repetitions, you must know the exact design and sampling plan that was in effect for your study. [It’s not even clear that one would want to use the same sample size in repeating a study.]
♦ You must know all the investigators’ intentions for testing, including intended timing and frequency of looks at the data.
♦ You must know the exact stopping rule for the study, or pretend that the actual final sample size was magical and would be chosen again and again. Note: all of these design features must be the ones actually used in conducting the study, not those in the original study plan. If calculations are based on the original study plan, no deviations from that plan during the study are allowed.
♦ The study repetitions used to compute the p-value must be executed using exactly this sampling plan, stopping rule, data look schedule, and investigator intentions.
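The “repeat the study with the null forced to be true” definition above can be made concrete with a small simulation. This is only a sketch under simplifying assumptions that are not in the original text: a one-sample mean with known standard deviation, a fixed sample size with no interim looks, and a two-sided notion of “more extreme”. The function names are hypothetical.

```python
import math
import random

def simulated_p_value(observed_mean, n, sigma=1.0, reps=20000, seed=1):
    """Two-sided Monte Carlo p-value for H0: mean = 0, known sigma, fixed n."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)
    more_extreme = 0
    for _ in range(reps):
        # One imagined repetition of the study with H0 exactly true.
        # Shortcut: under H0 with fixed n the sample mean is N(0, se^2),
        # so we draw it directly instead of drawing n observations.
        replicate_mean = rng.gauss(0.0, se)
        if abs(replicate_mean) >= abs(observed_mean):
            more_extreme += 1
    return more_extreme / reps

def analytic_p_value(observed_mean, n, sigma=1.0):
    """Closed form for the same setting: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))."""
    z = abs(observed_mean) / (sigma / math.sqrt(n))
    return math.erfc(z / math.sqrt(2))

print(simulated_p_value(0.4, 25), analytic_p_value(0.4, 25))
```

Every repetition reuses the same sample size and the same single look at the data. Changing the stopping rule or the look schedule would change what the repetitions have to do, and therefore the p-value, which is the point the list above makes.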

In my opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

Comments

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values. Now it’s time to act. To create the needed motivation to change, we need to fully describe the depth of the problem.

It is important to note that no statistical paradigm is perfect. Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.

Consider an assertion such as “the coin is fair”, “treatment A yields the same blood pressure as treatment B”, “B yields lower blood pressure than A”, or “B lowers blood pressure at least 5mmHg below A.” Consider also a compound assertion such as “A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke.”

A. Problems With Conditioning

p-values condition on what is unknown (the assertion of interest; H₀) and do not condition on what is known (the data).
This conditioning does not respect the flow of time and information; p-values are backward probabilities.
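To see the difference between conditioning on the assertion and conditioning on the data, here is a minimal sketch using the “coin is fair” assertion. The rival hypothesis (P(heads) = 0.7), the 50/50 prior, and the data (60 heads in 100 tosses) are all illustrative assumptions, and the function names are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def one_sided_p_value(heads, n, p_null=0.5):
    """P(at least `heads` heads | coin is fair): conditions on the unknown H0."""
    return sum(binom_pmf(k, n, p_null) for k in range(heads, n + 1))

def posterior_prob_fair(heads, n, p_alt=0.7, prior_fair=0.5):
    """P(coin is fair | data): conditions on the known data.

    Assumes the only rival to fairness is P(heads) = p_alt.
    """
    like_fair = binom_pmf(heads, n, 0.5)
    like_alt = binom_pmf(heads, n, p_alt)
    numerator = prior_fair * like_fair
    return numerator / (numerator + (1 - prior_fair) * like_alt)

print(one_sided_p_value(60, 100))    # probability about data, given the assertion
print(posterior_prob_fair(60, 100))  # probability about the assertion, given data
```

The two numbers answer different questions and can differ substantially. The posterior depends on the assumed alternative and prior, so a different choice would give a different answer.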

B. Indirectness

Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics. They are sometimes monotonically related to the evidence we need (e.g., when the prior distribution is flat) but are not properly calibrated for decision making.
p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion.
As detailed here, the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground.
Because of A, p-values are difficult to interpret and very few non-statisticians get it right. The best article on misinterpretations I’ve found is here.

C. Problem Defining the Event Whose Probability is Computed

In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result more extreme than that observed. Is this the correct point of reference?
How does “more extreme” get defined if there are sequential analyses and multiple endpoints or subgroups? For sequential analyses, do we consider planned analyses, or analyses intended to be run even if they were not?

D. Problems Actually Computing p-values

In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated. In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0. And many statisticians do not realize that Fisher’s so-called “exact” test is not very accurate in many cases.
Outside of binomial, exponential, and normal (with equal variance) and a few other cases, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed effects models). The more non-quadratic the log likelihood function the more problematic this becomes in many cases.
One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible. One example: one can control the false discovery probability (usually incorrectly referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value.
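The disagreement over how to compute a p-value for a 2x2 table is easy to reproduce. The sketch below applies three common methods to one small table. The table (7/10 successes vs. 2/10) is made up for illustration and is not the ECMO data, and the function names are hypothetical.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Fisher's test for [[a, b], [c, d]]: sum the probabilities of all tables
    with the same margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    total = math.comb(row1 + row2, col1)

    def prob(k):
        return math.comb(row1, k) * math.comb(row2, col1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

def chi_square_p(a, b, c, d, yates=False):
    """Pearson chi-square p-value (1 d.f.), optionally with Yates' correction."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)
    chi2 = n * diff**2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

table = (7, 3, 2, 8)  # hypothetical data: 7/10 successes vs. 2/10 successes
print("Pearson:", chi_square_p(*table))
print("Yates:  ", chi_square_p(*table, yates=True))
print("Fisher: ", fisher_exact_two_sided(*table))
```

For this table the uncorrected Pearson p-value falls below 0.05 while the other two fall above it, so the “significance” of the same data depends on which computation one prefers.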

E. The Multiplicity Mess

Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments. A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested. By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect.
There remains controversy over the choice of 1-tailed vs. 2-tailed tests. The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment. But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm. So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value?
Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments.
Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex. Scientific flexibility is discouraged. The p-value for an early data look must be adjusted for future looks. The p-value at the final data look must be adjusted for the earlier inconsequential looks. Unblinded sample size re-estimation is another case in point. If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects. How can that make any scientific sense?
Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present.
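The inequalities for unions and intersections of assertions can be checked directly once one has posterior draws. Below is a minimal sketch for the compound assertion “A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke”. The two independent normal posteriors and their parameters are assumptions invented for illustration, and the function name is hypothetical.

```python
import random

def compound_posterior_probs(n_draws=20000, seed=7):
    """Estimate posterior probabilities of two assertions and their
    intersection and union by counting posterior draws."""
    rng = random.Random(seed)
    n_a = n_b = n_both = n_either = 0
    for _ in range(n_draws):
        bp_reduction = rng.gauss(4.0, 1.5)      # hypothetical posterior, mmHg
        log_stroke_rr = rng.gauss(-0.05, 0.10)  # hypothetical posterior, log risk ratio
        a = bp_reduction >= 3.0                 # lowers BP by at least 3 mmHg
        b = log_stroke_rr <= 0.0                # does not raise stroke risk
        n_a += a
        n_b += b
        n_both += a and b
        n_either += a or b
    return {
        "A": n_a / n_draws,
        "B": n_b / n_draws,
        "A and B": n_both / n_draws,
        "A or B": n_either / n_draws,
    }

print(compound_posterior_probs())
```

No multiplicity adjustment appears anywhere: the probability of the compound assertion is simply the fraction of draws in which both parts hold, and it can never exceed the probability of either part alone.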

F. Problems With Non-Trivial Hypotheses

It is difficult to test non-point hypotheses such as “drug A is similar to drug B”.
There is no straightforward way to test compound hypotheses coming from logical unions and intersections.

G. Inability to Incorporate Context and Other Information

Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value’s inability to incorporate context or prior evidence. A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a “significant” p-value is fairly low. Frequentist inference has a greater risk for getting the direction of an effect wrong (see here for more).
p-values are unable to incorporate outside evidence. As a converse to 1, strong prior beliefs are unable to be handled by p-values, and in some cases this results in a lack of progress. Nate Silver in The Signal and the Noise beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect. A Bayesian prior that was very strongly against the belief that smoking was causal is obliterated by the incredibly strong observational data. Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate.
p-values require subjective input from the producer of the data rather than from the consumer of the data.

H. Problems Interpreting and Acting on “Positive” Findings

With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance).
Statisticians and subject matter researchers (especially the latter) sought a “seal of approval” for their research by naming a cutoff on what should be considered “statistically significant”, and a cutoff of p=0.05 is most commonly used. Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant. Hypotheses are exchanged if the original H₀ is not rejected, subjects are excluded, and because statistical analysis plans are not pre-specified as required in clinical trials and regulatory activities, researchers and their all-too-accommodating statisticians play with the analysis until something “significant” emerges.
When the p-value is small, researchers act as though the point estimate of the effect is a population value.
When the p-value is small, researchers believe that their conceptual framework has been validated.
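The first item in this list is simple arithmetic. As a sketch, assume a one-sample z-test with known variance and an observed effect of exactly 0.01 standard deviations (an effect size chosen here only for illustration). The p-value then depends on nothing but the sample size.

```python
import math

def two_sided_p(effect_in_sd_units, n):
    """Two-sided z-test p-value when the observed mean equals the given effect."""
    z = abs(effect_in_sd_units) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.01, n))
```

The effect is equally trivial in all three rows. Only the sample size changes, yet the p-value moves from unremarkable to astronomically small.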

I. Problems Interpreting and Acting on “Negative” Findings

Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means “get more data”).

J. Distortion of Scientific Conclusions

Greenwald, Gonzalez, Harris, and Guthrie’s paper Effect sizes and p values: What should be reported and what should be replicated? nicely describes subtle distortions in the scientific research process caused by the usage of null hypotheses:

One of the more important varieties of prejudice against the null hypothesis … comes about as a consequence of researchers much more identifying their own theoretical predictions with rejections (rather than with acceptances) of the null hypothesis. The consequence is an ego involvement with rejection of the null hypothesis that often leads researchers to interpret null hypothesis rejections as valid confirmations of their theoretical beliefs while interpreting nonrejections as uninformative and possibly the result of flawed methods.

Discussion Archive (2017)

Jeffrey Blume: One interesting point is that likelihoodists don’t really care about the proof of the LP from the CP or SP. This is because the LP is implied by the Law of Likelihood. The same holds if one defines the measure of the strength of evidence to be a Bayes factor, posterior probability, or some distance between the two hypotheses. However, if the measure of the strength of evidence is defined to be a probability or some other metric that depends on the sample space, then the LP will not apply. It all boils down to the fundamental building blocks: (1) what is the measure of the strength of evidence, (2) what is the probability that a study will generate misleading evidence, and (3) what is the probability that an observed measure is misleading. This is how systems for measuring evidence ought to be evaluated and compared.

I posted because I think it is good to see alternative viewpoints and because I think it illustrates an important issue: The class of evidence functions being considered must be large enough to include functions that depend on the sample space and those that do not depend on the sample space. Otherwise the argument is effectively tautological.

Deborah Mayo: No, Greg doesn’t even purport to disprove me nor can he. As a logic professor, I’m clear on the logical mistake that led people to think Birnbaum had proved the LP–though it’s quite subtle, and took a long time to explain (not to spot). My disproof is even deeper than Evans’s disproof, because, as I explain in my paper, it requires more than a mere counterexample.

Jeffrey Blume: The solution here is to be specific about what is communicated. If we were to report only the result of the hypothesis test (Reject or Accept), then the false discovery rates would indeed be direct functions of the Type I and Type II error rates. That is, if you only tell me that you rejected the null, that information is more likely misleading if you used a design with a large Type I Error rate. However, if you report the data, or some summary of it, then the above argument no longer holds. The probability that the null hypothesis is true given the data (this is now the false discovery rate if the test rejects) does not depend on the Type I and Type II errors. Why? Because here the likelihood function for the observations depends on the data and the model (and not the sample space), whereas in the first example the likelihood function is for the test result (not the data) and that likelihood depends on a binomial model where the error rates determine the likelihood function. So, really, it’s all about the likelihood.
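The first case in this comment, where only “Reject” is reported, can be written out with Bayes’ rule. This is a minimal sketch; the prior probability that the null is true is an extra input assumed here for illustration, and the function name is hypothetical.

```python
def prob_null_given_reject(alpha, power, prior_null):
    """P(H0 true | test rejected) when the only information is the test result.

    alpha      : Type I error rate, P(reject | H0)
    power      : 1 - Type II error rate, P(reject | H1)
    prior_null : assumed prior probability that H0 is true
    """
    false_rejections = alpha * prior_null
    true_rejections = power * (1.0 - prior_null)
    return false_rejections / (false_rejections + true_rejections)

# Same power and prior, two different Type I error rates
print(prob_null_given_reject(alpha=0.05, power=0.8, prior_null=0.5))
print(prob_null_given_reject(alpha=0.20, power=0.8, prior_null=0.5))
```

Here the error rates enter because the “data” is the binary test result, whose likelihood is determined by alpha and power. Once the actual observations are reported, the likelihood is that of the observations themselves and the design’s error rates drop out, as the comment argues.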

Just a quick point on something alluded to above. Much of the discordance between the three schools of inference comes from how composite hypotheses are handled. For what it is worth, Neyman and Pearson acknowledged in their 1933 paper that there is no general solution to this problem. Their approach was to take the best supported alternative hypothesis in the alternative space and let that data-chosen alternative represent the alternative hypothesis when they applied their optimal solution to the simple-versus-simple case. It is a bit of a cheat, which they acknowledged, but a decent practical solution. The problem with this solution is that it can break down when the null hypothesis is true, because in that case the best supported alternative ends up being virtually identical to the null, but still regarded as an “alternative”. The Bayesian solution, of course, was to average over the alternative space to come up with a new simple hypothesis. Then the two simple hypotheses are compared. This has pros and cons too. So Savage is right on the mark here.

Re the concern about error rates outside of formal NP theory: I think this is overblown. Just because NP theory is not used does not imply that all resulting inferences are suspect. Likelihood methods are an excellent example. In the likelihood paradigm, both the Type I and Type II error rates go to zero. In fact, if Neyman and Pearson had chosen to minimize the average error rate (instead of holding one constant), then they would have been likelihoodists, since that solution is given by the Law of Likelihood. From here, one could make a strong argument that likelihoodists have better frequency properties than those afforded by hypothesis testing, solely because it makes little sense to hold the type I error fixed over the sample size. In many cases, this is what causes hypothesis tests to go awry. Bayesian analyses benefit from this behavior as long as the prior does not change too quickly and is relatively smooth.

I don’t buy the argument that the user is the problem. These procedures have been in use for almost a century by many disciplines. Is the claim really that all the problems, counterexamples and unintuitiveness that arise are due to user error? Frank’s list might seem daunting, and there are perhaps quibbles to be made, but the theme is on target: hypothesis and significance test procedures are flawed when it comes to measuring and communicating the strength of evidence in a given body of observations. I would not claim that they are useless, although a strong case can be made when only the p-value is reported, but rather that they have serious shortcomings that require attention.

For a disproof of the alleged disproof of the alleged proof of the likelihood principle see http://gandenberger.org/research.

Gandenberger G. “A New Proof of the Likelihood Principle.” The British Journal for the Philosophy of Science 66, 3 (2015): 475-503.

FH: The article is now posted.

I didn’t mean to imply that frequentists use priors. But sometimes you can solve for the prior that is consistent with how they operate. Regarding changing the prior I’m more skeptical that that is OK but I am influenced by working in a regulatory environment where pre-specification is all important.

Deborah Mayo: First off, thanks. The frequentist doesn’t assign priors to the hypotheses you mention; it is a fallacy to spoze that a match between numbers (error probabilities and posterior) means the frequentist makes those prior assignments. But on cheating, I don’t see how you can say “a Bayesian can cheat by changing the prior after observing data”. You need a notion of cheating. Error statisticians have one, what’s the Bayesian’s? Bayesians, by and large, aren’t troubled by changing their prior post data as you can see from this post: “Can you change your Bayesian prior?”. I think only subjective Bayesians may say no, but even Dawid says yes. I’d like to know what you think.

FH: Very well written article and interesting history. I think that consideration of what constitutes cheating is a very useful exercise. It is also useful to back-construct a prior that makes certain things possible or likely. For example, sampling to a foregone conclusion happens when a statistician uses a smooth prior but her critic uses a prior with an absorbing state (point mass) at the null. Bonferroni correction is equivalent to a prior that specifies that the probability that all null hypotheses are true is the same no matter how many hypotheses are tested (a very strange assumption). A Bayesian can cheat by changing the prior after observing data or by improper conditioning, e.g., acquiring more data, finding the cumulative result to be less impressive than it was before, and rolling back the data to only analyze the smaller sample. But choosing a smooth prior before looking at the data and having the prior at least as skeptical as the critic’s implied prior, will result in a stream of posterior probabilities that are well calibrated independent of how aggressive were the ‘data looks’. Not only are the posterior probabilities calibrated, but the posterior mean is perfectly calibrated, discounted by the prior more when stopping is very early. The frequentist correction for bias in the sample mean upon early stopping is quite complex. Frequentists tend to be very good at correcting p-values for multiplicity but very bad at correcting point estimates for same.

Deborah Mayo: If I can’t interest you in learning that someone tried and tried again to achieve a stat sig result (or a HPD interval excluding the true value), even though with high or max probability this can be achieved erroneously, then your view of “being cheated” and mine are very different. But I’m glad you stick with this, it’s the Bayesians who try to wrangle out of the consequences of accepting the LP that bother me. Please see this blog post.

FH: You’ll have a hard time convincing me of the relevance of things that might have happened that didn’t. And simulation studies for the frameworks I envision demonstrate that the stopping rule is really irrelevant. The simplest example is a one-sample normal problem with known variance where one tests after each observation is acquired, resulting in n tests for an ultimate sample size n. If you are thinking of a particular issue you might sketch the flow of a simulation that would demonstrate it.
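A simulation of the kind described here might be sketched as follows. It uses the one-sample normal problem with known variance, looks at the data after every observation, and stops as soon as the posterior probability of a positive mean reaches 0.95. The specific choices (a N(0, 1) prior, a maximum of 50 observations, the 0.95 threshold, and true means drawn from the same prior used in the analysis) are assumptions made for this sketch, not details given in the comment.

```python
import math
import random

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def simulate_stopping(n_sims=4000, max_n=50, prior_sd=1.0, threshold=0.95, seed=3):
    """Return (number stopped, fraction of stopped runs with mu > 0,
    mean posterior P(mu > 0) at the moment of stopping)."""
    rng = random.Random(seed)
    stopped = 0
    truly_positive = 0
    posterior_sum = 0.0
    for _ in range(n_sims):
        mu = rng.gauss(0.0, prior_sd)        # true mean drawn from the prior
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(mu, 1.0)      # one new observation, variance 1
            post_var = 1.0 / (1.0 / prior_sd**2 + n)
            post_mean = post_var * total     # conjugate normal update, prior mean 0
            p_positive = normal_cdf(post_mean / math.sqrt(post_var))
            if p_positive >= threshold:      # a look after every observation
                stopped += 1
                truly_positive += mu > 0
                posterior_sum += p_positive
                break
    return stopped, truly_positive / stopped, posterior_sum / stopped

print(simulate_stopping())
```

If the posterior probabilities are calibrated despite the aggressive looks, the fraction of stopped runs in which the mean really is positive should be close to the average posterior probability reported at stopping. The sketch says nothing about what happens when the analysis prior differs from the distribution that generated the true means.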

Deborah Mayo: Of course it comes from increasing the chances you give for the data to be more extreme, but the relevance of such outcomes that didn’t occur is just what’s denied by Bayesians who endorse the Likelihood Principle. Sequential trials were advocated by frequentists long ago (Armitage). He also argued that optional stopping also results in posteriors being wrong with high probability. But Savage switched to a simple point against point hypothesis to defend the LP. That is still going on today as the latest reforms champion the irrelevance of optional stopping––to them.

Aside from inability to control error probabilities, the key problem with all Bayesian accounts is they never quite tell us what they’re talking about (and they certainly don’t agree with each other), except perhaps empirical Bayesians. Is the prior/posterior an expression of degree of belief in various values of parameters? How frequently they occur in universes of parameters? As the number of parameters increases, the assessments–generally default priors–move further and further from anything we can get a grip on. They are not representing background information. And if we’re going to test them, we’ll have to resort to something like significance tests. But given what you said earlier, about only caring to match the beliefs of “the judge” the whole business of outsiders critically appraising you may not matter.

FH: I appreciate that Steven, and only take issue with your “find fault” sentence. True, more fault lies with unscientific work than with statistical paradigms, but there are major problems with p-values, and p-values lead to many downstream problems as I’ve tried to catalog. The paradigm really matters. Not all of the fault lies with practitioners. This becomes more clear for those like me (a follower of David Spiegelhalter) who embrace Bayesian posterior probabilities and favor skeptical priors. Once you do the right simulations or grasp the theory (the former being easier for me) you’ll see things such as the fact that multiplicity comes from the chances you give data to be more extreme, not the chances you give assertions to be true. And the fact that frequentist thinking leads usually to fixed sample size designs turns out to be a huge issue in experimental work.

You’re giving me the idea to post a separate article on my journey and how this relates to RMS, which I hope to complete in the next few days.

Steven McKinney: I apologize for my choice of words. I am not threatened by the amount of work we have ahead of us to improve the situation, my entire career has been working to improve the situation. That’s why I own two copies of your RMS book and regularly use your software. I am threatened when an avalanche of pop-culture blog posts appear, inappropriately attributing fault to a statistic or paradigm when the fault lies with people mishandling and misinterpreting that statistic, or paradigm.

As you say, you can do unscientific non-reproducible research using any paradigm and statistics thereof. As you state in the introduction above, “As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.” I hope that the title and opening sentence, and further general discussion of this topic, progresses and becomes

“A Litany of Problems With Misinterpretation of p-values”

“In my opinion, inappropriate handling and interpretation of null hypothesis testing and p-values have done significant harm to science.”

Precisely. This is why I do not understand this odd stance you have adopted. Bayesian methods require the same discipline in handling analyses and interpreting findings as any other paradigm.

When you and Gelman and others lead the way in demonstrating Bayesian analytical methods, I predict other blog postings such as “The litany of problems with Bayes factors” after groups with no statistical discipline misinterpret and mangle findings from such analyses.

But for those of us still striving to find truth in data, I do genuinely look forward to seeing more Bayesian-based approaches in future revisions of Regression Modeling Strategies, along with the attendant steps needed to interpret findings in a disciplined manner.

FH: I take offense at your choice of words. The fact that I have not been a Bayesian my whole career and hence do not have a plethora of Bayesian examples doesn’t hold me back from seeking better approaches. You seem to be threatened by the amount of work we have ahead of us to improve the situation. I am not, which is one reason I’m working closely with FDA on this very problem. You can do unscientific, non-reproducible research using any paradigm. I want to do good science and have results that have sensible interpretations.

Steven McKinney: This is exactly my point. Whether you use Bayesian methods, as the Duke group did, or you use frequentist methods, as Baggerly and Coombes did in reviewing several of the Duke analyses, you need to exercise discipline, using many ideas some of which are very ably described in Regression Modeling Strategies.

If you want to start using more Bayesian based methods, have at it. I look forward to seeing your examples.

What I find disingenuous and concerning are your statements “Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.” Where do you show this quantification, that Bayesian and likelihood paradigms solve the greatest number of real problems and have the fewest number of faults. That certainly wasn’t apparent in the Duke fiasco.

I found some course notes on line, “An Introduction to Bayesian Methods with Clinical Applications” from July 8, 1998, by Frank Harrell and Mario Peruggia. That’s nearly 20 years ago. Why does it take so long to bring Bayesian methods into practice? If you haven’t been able to do it in nearly 20 years, how are other mere mortals to do it?

FH: The statistical paradigm had almost nothing to do with this. Ask the forensic biostatisticians Baggerly and Coombes who uncovered the whole problem. In their wonderful paper (Annals of Applied Statistics 2009) neither Bayes nor prior appear.

Deborah Mayo: Following McKinney on this, it was their confidence that their prior freed them from having to have genuine hold out data, that they thought made it permissible. McKinney knows more about the specifics. The blatant ignoring of unwelcome results was brought out by the whistleblowers. Of course if you’re not in the business of ensuring error control, actions that alter them will be irrelevant.

FH: Ask yourself how many investigative journalism articles mentioned Bayes when writing about the Duke fiasco. I think the answer is zero, because Bayesian modeling had nothing at all to do with the problem. Your comments are most curious. I can’t think of any analytical method that cannot be abused, even just using descriptive statistics. You might also look at why Mike West resigned from the collaboration early on.

Steven McKinney: Anil Potti and Joseph Nevins used Bayesian methods recommended by Mike West during the Duke genomics fiasco a few years back. They overfitted models and committed many other analysis gaffes. Bayesian modelers are not immune from running afoul of proper interpretation of modeling findings to shed light on scientific phenomena.

Proper handling of statistics and interpretation of findings is needed in any statistical exercise, Bayesian, frequentist or otherwise.

Given the wealth of great advice in Regression Modeling Strategies and other writings, I am taken aback to see Frank Harrell blame a statistic when the problem is people misinterpreting a statistic.

A p-value is just a statistic, with certain knowable distributional properties under this and that condition. A Bayesian Highest Posterior Density region can be improperly obtained and misinterpreted just as readily.

omaclaren.com: I’m glad we can agree on A. I don’t think this is a satisfactory argument against pvalues and neither is it satisfactory against likelihood.

We can leave the other argument for another day. For now though I’ll note that while I have quite a lot of sympathy for the ‘pure’ likelihood approach and/or evidential approaches, I don’t find your axiom satisfactory.

There is of course a basic axiom behind pvalues - Fisher’s disjunction. But I don’t find this fully satisfactory either.

Jeffrey Blume: As for details on the evidential framework I alluded to above, a reference is: Blume JD. Likelihood and its evidential framework. In: Dov M. Gabbay and John Woods, editors, Handbook of The Philosophy of Science: Philosophy of Statistics. San Diego: North Holland, 2011, pp. 493-511.

Any system purporting to measure evidence for or against a hypothesis ought to be subjected to the same scrutiny. This entails identifying three things: (1) the metric that will be used for measuring evidence, (2) the probability of observing a misleading metric under certain experimental conditions (i.e., error probabilities), and (3) the probability that an observed metric is indeed misleading (e.g., this would be a false discovery rate). Systems for measuring evidence can then be compared on the basis of these three criteria and perhaps their axiomatic justification. These three concepts are distinct, so a single mathematical quantity like the tail area probability can’t possibly represent all three. The above paper uses the likelihood paradigm to illustrate the point.

Those mills are going to be running overtime. I simulated a string of 100,000 standard normal deviates and then computed the running z-statistic for testing the null hypothesis that the mean is zero (assuming the variance is known). In 10,000 simulations, only 74% rejected at some point (not bad for 100,000 looks at the data). The 25th, 50th, and 75th percentiles of the stopping time were 10, 98, 1533. The mean stopping time was ~5200. That means, for example, that 25% of the rejections occurred when the absolute sample mean was around 0.05 (=1.96/sqrt(1533)) or smaller. That’s 5% of a standard deviation. In practice, an observed difference that small is often a rounding error.

The point is not that these are desirable properties, but rather that when these tests make a Type I Error, they often support hypotheses very close to the null. So close, in fact, that they may be practically indistinguishable from the null. If we only counted the rejections where the difference was at least 25% of a standard deviation, the rejection probability drops to about 20%. Not great, but not terrible for 100,000 looks at the data.
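The simulation described in this comment can be sketched at a much smaller scale. The version below uses 1,000 looks and 400 simulated sequences instead of 100,000 looks and 10,000 simulations, purely so that it runs in about a second, so its numbers will not match the ones quoted above. The function name and defaults are hypothetical.

```python
import math
import random

def first_rejection_times(n_sims=400, max_looks=1000, z_crit=1.96, seed=11):
    """For each simulated sequence under a true null (mean 0, variance 1),
    record the first n at which the running |z| reaches z_crit, if any."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_looks + 1):
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(n)         # running z-statistic, known variance
            if abs(z) >= z_crit:
                times.append(n)
                break
    return times

n_sims = 400
times = sorted(first_rejection_times(n_sims=n_sims))
print("fraction ever rejecting:", len(times) / n_sims)
print("median stopping time:", times[len(times) // 2])
# Sample mean needed to reject at look n is 1.96 / sqrt(n), which shrinks with n
print("mean needed at n = 1000:", 1.96 / math.sqrt(1000))
```

The last line shows why late rejections correspond to tiny observed differences: the rejection boundary is measured in standard errors, so the sample mean required to cross it falls toward zero as the number of observations grows.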

Deborah Mayo: No time but to register: “absolutely absurd” though grist for my mills if Blume is for real. See my blog for details. errorstatistics.com

Jeffrey Blume: The key is that once data is observed, we compute the likelihood ratio to measure the evidence. This, of course, does not depend on the stopping rule. And if we are concerned that the data are likely to be misleading because we looked a lot during our study, we would compute the probability that our observed data are mistaken. The point is that the measure of the evidence (the LR) and the probability that the observed data are mistaken (written as P(H_0|LR>k)), are not the same as the probability that the study design will generate misleading evidence (written as P(LR>k|H_0)). Error probabilities have an important role in statistics; they just don’t represent the strength of the evidence in the data or the probability that the observed data are misleading.

The third sentence is a nice example of a common misinterpretation of the LP. The LP only says that the stopping rule is irrelevant for the measurement of the strength of evidence in data. It does not say the stopping rule is irrelevant for everything. Confusion reigns because we often fail to distinguish between the measure of the strength of evidence and the probability that the evidence will be misleading.
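The distinction can be illustrated with the textbook binomial versus negative binomial comparison. The sketch below is an illustration under assumed numbers (9 successes and 3 failures, comparing theta = 0.75 against theta = 0.5), not an example taken from this discussion. The same data yield the same likelihood ratio under either stopping rule, while the tail-area probability under theta = 0.5 changes with the rule.

```python
from math import comb, isclose

def binomial_likelihood(theta, successes, failures):
    # Fixed number of trials n = successes + failures
    n = successes + failures
    return comb(n, successes) * theta**successes * (1 - theta)**failures

def negative_binomial_likelihood(theta, successes, failures):
    # Sample until the `failures`-th failure; the last trial is a failure
    n = successes + failures
    return comb(n - 1, successes) * theta**successes * (1 - theta)**failures

def binomial_tail(theta, successes, failures):
    # P(at least `successes` successes in n fixed trials)
    n = successes + failures
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(successes, n + 1))

def negative_binomial_tail(theta, successes, failures):
    # P(at least `successes` successes before the `failures`-th failure)
    # = P(at most failures-1 failures in the first successes+failures-1 trials)
    m = successes + failures - 1
    return sum(comb(m, j) * (1 - theta)**j * theta**(m - j)
               for j in range(failures))

s, f = 9, 3
lr_binomial = binomial_likelihood(0.75, s, f) / binomial_likelihood(0.5, s, f)
lr_negbin = (negative_binomial_likelihood(0.75, s, f)
             / negative_binomial_likelihood(0.5, s, f))
p_binomial = binomial_tail(0.5, s, f)
p_negbin = negative_binomial_tail(0.5, s, f)

print("likelihood ratios:", lr_binomial, lr_negbin)   # identical
print("one-sided tail areas:", p_binomial, p_negbin)  # differ by stopping rule
```

The combinatorial constants differ between the two sampling models but cancel in each ratio, which is why the measured strength of evidence is unaffected while the error probabilities are not.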

Likelihoodists follow the LP and use error probabilities. When we design a study, we compute its frequency properties. If we plan to look at the data many times, the probability of observing misleading evidence gets inflated (lucky for us, however, that inflation is bounded above). These frequency properties help us choose between designs. Modern Bayesians do the same.
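A rough simulation of the bounded inflation, assuming the bound in question is the standard one: when the null is true, the probability that the likelihood ratio for a fixed simple alternative ever reaches k is at most 1/k, however many looks are taken. The model (normal data with known unit variance), the alternative `delta = 0.5`, the threshold `k = 8`, and the simulation sizes are all illustrative assumptions.

```python
import numpy as np

def prob_misleading_evidence(delta=0.5, k=8.0, n_obs=500, n_sims=4000, seed=1):
    """Estimate P(LR >= k at some look | H0) with a look after every observation.

    H0: mean = 0, H1: mean = delta, variance known and equal to 1, so
    log LR_n = delta * S_n - n * delta**2 / 2 with S_n the running sum.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_sims, n_obs))      # data generated under H0
    n = np.arange(1, n_obs + 1)
    log_lr = delta * np.cumsum(x, axis=1) - n * delta**2 / 2
    return (log_lr.max(axis=1) >= np.log(k)).mean()

k = 8.0
estimate = prob_misleading_evidence(k=k)
print("estimated P(misleading evidence at some look):", estimate)
print("universal bound 1/k:", 1 / k)
```

Contrast this with the running z-test, whose probability of rejecting at some look keeps growing as looks are added.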

I think it is important to understand why this happens. When the null hypothesis is true the likelihood masses right on top of the null value. Every now and then, the tails of the likelihood shift by an infinitesimal amount and this tiny tiny shift causes the classical hypothesis test to reject because the benchmark for rejecting is measured in standard errors (which are rapidly shrinking to zero) and not standard deviations. It leaves us in the odd position of claiming to reject the null hypothesis when data support hypotheses that are arbitrarily close to the null. If instead we decided to only count the statistical rejections that also supported clinically meaningful differences, the resulting probability would not approach one or even be all that high.

Overly broad generalizations can lead to confusion. So let’s be specific and consider the points. The first clause is true for routine usage of classical hypothesis and significance tests when the sample size is allowed to grow forever. I have yet to see a study where this was possible, so the force of this point is diminished I think. If we take the case where the sample size is finite, the claim is false. The probability might be high in some cases, but it will not be one, and certain cases can be constructed so the probability is not so high. Regardless, this is an excellent reason to avoid using p-values in my opinion. Why not use a tool that does not have this property (e.g., a likelihood ratio)?

And…I would not be ‘ok’ with “inference based on the p-value function” largely because there is no axiom to support its use. The axiom would be something like “A set of data supports the hypotheses that do a better job at predicting the data and data more extreme”. I don’t find this compelling because of the inclusion of “more extreme”, which means different things to different people. The NP hypothesis testing framework is clear that “more extreme” means large likelihood ratios. In contrast, significance testing often defines “more extreme” as further away in hypothesis space (tail areas). These two definitions are not the same, which confuses matters. Large likelihood ratios can lead to instances where the rejection regions are near the null hypothesis and not in the tails (e.g., comparing means of two normal models with different variances).

So I’ll jump in as a Likelihoodist. Likelihood methods for measuring statistical evidence have a very precise framework. The basic axiom is this: the hypothesis that does a better job predicting the observed data is better supported by the data.

For (a), I don’t see the “reverse” conditioning here as problematic; this is the natural way to compare predictions.

For (b), no issues here for the likelihood approach. The predictions from each hypothesis are directly compared and we just report which hypotheses did the better job of predicting the observed events. No need to rely on proof by contradiction, which Frank correctly points out is problematic when the direct implication is replaced by probabilistic tendency.

For (G): By design, likelihood methods report which hypotheses are better supported than others given the observed data and model. You need a model under which to specify the predictions of each hypothesis and that model is often prescribed by context. But I would guess this is not what Frank is referring to, if only because everyone generally agrees on the form of the likelihood function. Additional information, such as data from a previous study, would simply be combined in the likelihood (e.g., if the studies are independent, one could multiply the likelihoods). Information that represents personal belief or some other hunch would best be incorporated in the Bayesian framework. Likelihoodists want to know ‘what the data themselves say’, not ‘what the data say after I add in prior information’.

The first sentence is a little ironic, no? Many people have thought long and hard about these issues and we’ve been debating them for over a century. And there are plenty of examples that don’t assume a Bayesian solution that make hypothesis tests look downright insane (Pratt’s voltage detector example, in 1961 I think, is one).

p-values are often not reproducible because they confound the effect size and the level of precision. Also two equal p-values don’t imply the same amount of “evidence”, so it’s not clear why one would care or expect them to replicate. The thing to replicate is the effect size, not the p-value.
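A small numerical illustration of the confounding, using made-up numbers and a normal approximation. Two hypothetical studies report effects that differ tenfold, yet because their standard errors differ by the same factor they produce the same z-statistic and hence the same two-sided p-value.

```python
from math import erfc, sqrt, isclose

def two_sided_p(effect, se):
    """Two-sided p-value for z = effect / se under a standard normal reference."""
    z = effect / se
    return erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))

# Hypothetical studies: (estimated effect, standard error)
small_imprecise_study = (0.50, 0.25)    # large effect, low precision
large_precise_study = (0.05, 0.025)     # tiny effect, high precision

p_a = two_sided_p(*small_imprecise_study)
p_b = two_sided_p(*large_precise_study)
print("p-values:", p_a, p_b)
```

The p-value alone cannot tell these two situations apart; reporting the effect estimate with its interval can.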

I’d agree but with a caveat. The scientific benchmark for what is discrepant should not change. We can use statistical tools to assess if that benchmark is achieved or not, but we should not be using statistical tools to set the benchmark, which is what hypothesis testing effectively does. The problem is that the scientific force of the discrepancy changes (it depends on standard errors), so we can end up with statistically significant results that are not actually scientifically discrepant. Personally, this is why I prefer other approaches (e.g., likelihood, Bayesian) that respect the original scale of the data upfront.

FH: When the analyst uses the same prior as the judge, posterior probabilities are perfectly calibrated independent of the stopping rule.

Deborah Mayo: Anyone who ignores the stopping rule can erroneously declare significance with probability 1, and with the corresponding Bayesian priors can erroneously leave the true parameter value out of the Bayesian HPD interval. See stopping rules on my blog. Anyone who obeys the LP cannot use error probabilities. I think only subjective Bayesians are still prepared to endorse that and staunch likelihoodists. Just what today’s practice needs. For that matter, let replicationists try and try again until they can get a stat sig effect. Then the replication rate will be 100%.

I have, incidentally, disproved alleged proofs of the LP.

FH: Not sure. I’m going to invite a likelihoodist to join the conversation.

omaclaren.com: Well, it is related to Fisher’s fiducial approach - see eg the reminiscences of Fraser at the end here - but more generally is just standard Fisherian-style confidence theory as used by eg Cox, Fraser etc and as opposed to Neymanian confidence and decisions.

I assume the p-value function will depend on stopping rules etc as generally understood - ie they violate the strong likelihood principle.

But, my more general point - rather than advocating for likelihood, Bayes or confidence theory - is that points A and B seem to either (logically) apply to both likelihood and confidence theory/pvalue functions or to neither.

FH: Not sure. Are you trying to get at Fisher’s fiducial method? I prefer having a full model for inference, but I do like the likelihood school of inference because it respects the likelihood principle. For example inference is independent of a stopping rule. How do you handle multiplicity/sequential stopping rules for a p-value function?

omaclaren.com: Fair enough.

But would you agree it is no more or less subject to A and B than Likelihood? Or do you also disagree with this?

FH: That’s the paper. I haven’t looked into the Box approach. I would use a prior that is eventually overwhelmed by data, get agreement on it, and not often revisit the prior.

Bill R: Are you referring to the “Bayesian Approaches to Randomized Trials” paper (JRSS A, 157, part 3, 357-416)? In section 6.3 Spiegelhalter recommends using Box’s generalized p-value to check prior data compatibility. If I have a point prior (null) would that not simplify to Fisher’s p-value?

FH: That’s not satisfying. High-level view: Aside from non-study information, the p-value is monotonically related to what I need, but it is not calibrated to be the metric I need.

omaclaren.com: Would you then be OK with inference based on a pvalue function defined analogously to a likelihood function? That is,

PF(theta;y0) := prob(y>y0; theta)

considered as a function of theta for y0 fixed. Is this still subject to A or B?
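For concreteness, here is a sketch of such a p-value function under an assumed model, a single observation y from a normal distribution with mean theta and unit variance. The model and the name `p_value_function` are illustrative assumptions, not part of the proposal above.

```python
from math import erfc, sqrt

def p_value_function(theta, y0):
    """PF(theta; y0) = P(Y > y0; theta) for Y ~ Normal(theta, 1)."""
    return 0.5 * erfc((y0 - theta) / sqrt(2))

y0 = 1.5                                   # the observed value, held fixed
thetas = [-1.0, 0.0, 1.0, 1.5, 2.0, 3.0]   # candidate parameter values
curve = [p_value_function(t, y0) for t in thetas]
for t, pf in zip(thetas, curve):
    print(f"theta = {t:4.1f}  PF = {pf:.4f}")
```

Like a likelihood function, the curve is read across theta with the data fixed, but each ordinate is a tail probability over unobserved outcomes rather than the probability density of the observed one.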

FH: I believe this would be Edwards-Royall. I think that likelihood has a bit of a problem with G but not with A or B.

Reuse

CC BY 4.0

--- title: A Litany of Problems With p-values author: - name: Frank Harrell url: https://hbiostat.org affiliation: Vanderbilt University School of Medicine Department of Biostatistics date: 2017-02-05 modified: 2019-08-04 categories: [decision-making, bayes, multiplicity, p-value, hypothesis-testing, 2017] description: "p-values are very often misinterpreted. p-values and null hypothesis significant testing have hurt science. This article attempts to catalog all the ways in which these happen." --- ::: {.quoteit} With the many problems that p-values have, and the temptation to "bless" research when the p-value falls below an arbitrary threshold such as 0.05 or 0.005, researchers using p-values should at least be fully aware of what they are getting. They need to know exactly what a p-value means and what are the assumptions required for it to have that meaning. ♦ A p-value is the probability of getting, in another study, a test statistic that is **more** extreme than the one obtained in your study if a series of assumptions hold. It is strictly a probability about data, not a probability about a hypothesis or about the _effect_ of a variable. ♦ The study must be capable of being repeated infinitely often, or one must play a mind game in which this is so. ♦ The repeated studies have data generated by exactly the same data model as the original study with one exception: the null hypothesis is forced to be exactly true. [And we don't know how to count negative treatment benefit.] ♦ To do the repetitions, you must know the exact design and sampling plan that was in effect for your study. [It's not even clear that one would _want_ to use the same _sample size_ in repeating a study.] ♦ You must know all the investigators' intensions for testing, including intended timing and frequency of looks at the data. ♦ You must know the exact stopping rule for the study, or to pretend that the actual final sample size was magical and would be chosen again and again. 
**Note**: all of these design features must be the ones actually used in conducting the study, not those in the original study plan. If calculations are based on the original study plan, no deviations from that plan during the study are allowed. ♦ The study repetitions used to compute the p-value must be executed using exactly this sampling plan, stopping rule, data look schedule, and investigator intentions. ::: In my opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress. [[Comments](https://hbiostat.org/comment.html)]{.aside} The American Statistical Association has done a great service by issuing its [Statement on Statistical Significance and P-values](http://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf). Now it's time to act. To create the needed motivation to change, we need to fully describe the depth of the problem. It is important to note that no statistical paradigm is perfect. Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference. Consider an assertion such as "the coin is fair", "treatment A yields the same blood pressure as treatment B", "B yields lower blood pressure than A", or "B lowers blood pressure at least 5mmHg before A." Consider also a compound assertion such as "A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke." ### A. Problems With Conditioning 1. p-values condition on what is unknown (the assertion of interest; H0) and do not condition on what is known (the data). 2. This conditioning does not respect the flow of time and information; p-values are backward probabilities. ### B. Indirectness 1. 
Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics. They are sometimes monotonically related to the evidence (e.g., when the prior distribution is flat) we need but are not properly calibrated for decision making. 2. p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion. 3. As detailed [here](http://www.fharrell.com/2017/01/null-hypothesis-significance-testing.html), the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground. 4. Because of A, p-values are difficult to interpret and very few non-statisticians get it right. The best article on misinterpretations I've found is [here](http://dx.doi.org/10.1007/s10654-016-0149-3). ### C. Problem Defining the Event Whose Probability is Computed 1. In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result *more extreme* than that observed. Is this the correct point of reference? 2. How does *more extreme* get defined if there are sequential analyses and multiple endpoints or subgroups? For sequential analyses do we consider planned analyses are analyses intended to be run even if they were not? ### D. Problems Actually Computing p-values 1. In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated. In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0. And many statisticians do not realize that Fisher's so-called "exact" test is not very accurate in many cases. 2. 
Outside of binomial, exponential, and normal (with equal variance) and a few other cases, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed effects models). The more non-quadratic the log likelihood function the more problematic this becomes in many cases. 3. One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible. One example: one can control the false discovery probability (incorrectly usually referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value. ### E. The Multiplicity Mess 1. Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments. A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested. By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect. 2. There remains controversy over the choice of 1-tailed vs. 2-tailed tests. The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment. But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm. 
So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value? 3. Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments. 4. Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex. Scientific flexibility is discouraged. The p-value for an early data look must be adjusted for future looks. The p-value at the final data look must be adjusted for the earlier inconsequential looks. Unblinded sample size re-estimation is another case in point. If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects. How can that make any scientific sense? 5. Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present. ### F. Problems With Non-Trivial Hypotheses 1. It is difficult to test non-point hypotheses such as "drug A is similar to drug B". 2. There is no straightforward way to test compound hypotheses coming from logical unions and intersections. ### G. Inability to Incorporate Context and Other Information 1. Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value's inability to incorporate context or prior evidence. A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a "significant" p-value is fairly low. 
Frequentist inference has a greater risk for getting the direction of an effect wrong (see [here](http://andrewgelman.com/) for more). 2. p-values are unable to incorporate outside evidence. As a converse to 1, strong prior beliefs are unable to be handled by p-values, and in some cases the results in a lack of progress. Nate Silver in *The Signal and the Noise* beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect. A Bayesian prior that was very strongly against the belief that smoking was causal is obliterated by the incredibly strong observational data. Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate. 3. p-values require subjective input from the producer of the data rather than from the consumer of the data. ### H. Problems Interpreting and Acting on "Positive" Findings 1. With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance). 2. Statisticians and subject matter researchers (especially the latter) sought a "seal of approval" for their research by naming a cutoff on what should be considered "statistically significant", and a cutoff of p=0.05 is most commonly used. Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant. Hypotheses are exchanged if the original H0 is not rejected, subjects are excluded, and because statistical analysis plans are not pre-specified as required in clinical trials and regulatory activities, researchers and their all-too-accommodating statisticians play with the analysis until something "significant" emerges. 3. 
When the p-value is small, researchers act as though the point estimate of the effect is a population value. 4. When the p-value is small, researchers believe that their conceptual framework has been validated. ### I. Problems Interpreting and Acting on "Negative" Findings 1. Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means "get more data"). ### J. Distortion of Scientific Conclusions 1. Greenwald, Gonzalez, Harris, and Guthrie's paper [Effect sizes and p values: What should be reported and what should be replicated?](https://faculty.washington.edu/agg/pdf/Gwald_Gonz_Har_Guth_Psychophys_1996.OCR.pdf) nicely describes subtle distortions in the scientific research process caused by the usage of null hypotheses: ::: {.quoteit} One of the more important varieties of prejudince against the null hypothesis ... comes about as a consequence of researchers much more identifying their own theoretical predictions with rejections (rather than with acceptances) of the null hypothesis. The consequence is an ego involvement with rejection of the null hypothesis that often leads researchers to interpret null hypothesis rejections as valid confirmations of their theoretical beliefs while interpreting nonrejections as uninformative and possibly the result of flawed mehods. ::: ------ More recommended reading: * William Briggs' [Everything Wrong With P-values Under One Roof](http://wmbriggs.com/post/9338) -------- ## Discussion Archive (2017) **Jeffrey Blume**: One interesting point is that likelihoodists don’t really care about the proof of the LP from the CP or SP. This is because the LP is implied by the Law of Likelihood. Likewise if one defines the measure of the strength of evidence to be a Bayes factor, posterior probability, or some distance between the two hypothesis. 
However, if the measure of the strength of evidence is defined to be a probability or some other metric that depends on the sample space, then the LP will not apply. It all boils down to the fundamental building blocks: (1) what is the measure of the strength of evidence, (2) what is the probability that a study will generate misleading evidence, and (3) what is the probability that an observed measure is misleading. This is how systems for measuring evidence ought to be evaluated and compared. I posted because I think it is good to see alternative viewpoints and because I think it illustrates an important issue: The class of evidence functions being considered must be large enough to include functions that depend on the sample space and those that do not depend on the sample space. Otherwise the argument is effectively tautological. **Deborah Mayo**: No, Greg doesn't even purport to disprove me nor can he. As a logic professor, I'm clear on the logical mistake that led people to think Birnbaum had proved the LP--though it's quite subtle, and took a long time to explain (not to spot). My disproof is even deeper than Evan's disproof, because, as I explain in my paper, it requires more than a mere counterexample. **Jeffrey Blume**: The solution here is to be specific about what is communicated. If we were to only report that the result of the hypothesis test (Reject or Accept), then the false discovery rates would indeed be direct functions of the Type I and Type II error rates. That is, if you only tell me that you rejected the null, that information is more likely misleading if you used a design with a large Type I Error rate. However, if you report the data, or some summary of it, then the above argument no longer holds. The probability that the null hypothesis is true given the data (this is now the false discovery rate if the test rejects) does not depend on the Type I and Type II errors. Why? 
Because here the likelihood function for the observations depends on the data and the model (and not the sample space), whereas in the first example the likelihood function is for the test result (not the data) and that likelihood depends on a binomial model where the error rates determine the likelihood function. So, really, it’s all about the likelihood. Just a quick point on something alludes to above. Much of the discordance between the three schools of inference comes from how composite hypotheses are handled. For what it is worth, Neyman and Pearson acknowledged in their 1933 paper that there is no general solution to this problem. Their approach was to take the best supported alternative hypothesis in the alternative space and let that data-chosen alternative represent the alternative hypothesis when they applied their optimal solution to the simple -versus-simple case. It is a bit of a cheat, which they acknowledged, but a decent practical solution. The problem with this solution is that it can break down when the null hypothesis is true, because in that case the best supported alternative ends up being virtually identical to the null, but still regarded as an “alternative”. The Bayesian solution, of course, was to average over the alternative space to come up with a new simple hypothesis. Then the two simple hypotheses are compared. This has pros and cons too. So Savage is right on the mark here. Re the concern about errors rates outside of formal NP theory: I think this is overblown. Just because NP theory is not used does not imply that all resulting inferences are suspect. Likelihood methods are an excellent example. In the likelihood paradigm, both the Type I and Type II error rates go to zero. In fact, if Neyman and Pearson had chosen to minimize the average error rate (instead of holding one constant), then they would have been likelihoodists, since that solution is given by the Law of Likelihood. 
From here, one could make a strong argument that likelihoodist have better frequency properties than those afforded by hypothesis testing, solely because it makes little sense to hold the type I error fixed over the sample size. In many cases, this is what is causes hypothesis tests to go awry. Bayesian analyses benefit from this behavior as long as the prior does not change too quickly and it relatively smooth. I don’t buy the argument that the user is the problem. These procedures have been in use for almost a century by many disciplines. Is the claim really that all the problems, counterexamples and unintuitiveness that arise are due to user error? Frank’s list might seem daunting, and there are perhaps quibbles to be made, but the theme is on target: hypothesis and significance test procedures are flawed when it comes to measuring and communicating the strength of evidence in a given body of observations. I would not claim that that they are useless, although a strong case can be made when only the p-value is reported, but rather that they have serious shortcoming that require attention. For a disproof of the alleged disproof of the alleged proof of the likelihood principle see <http://gandenberger.org/research>. Gandenberger G. “A New Proof of the Likelihood Principle.” The British Journal for the Philosophy of Science 66, 3 (2015): 475-503. **FH**: The article is now posted. I didn't mean to imply that frequentists use priors. But sometimes you can solve for the prior that is consistent with how they operate. Regarding changing the prior I'm more skeptical that that is OK but I am influenced by working in a regulatory environment where pre-specification is all important. **Deborah Mayo**: First off, thanks. The frequentist doesn't assign priors to the hypotheses you mention; it is a fallacy to spoze that a match between numbers (error probabilities and posterior) means the frequentist makes those prior assignments. 
But on cheating, I don't see how you can say "a Bayesian can cheat by changing the prior after observing data". You need a notion of cheating. Error statisticians have one, what's the Bayesian's? Bayesians, by and large, aren't troubled by changing their prior post data as you can see from this post: ["Can you change your Bayesian prior?"](https://errorstatistics.com/2015/06/18/can-you-change-your-bayesian-prior-i). I think only subjective Bayesians may say no, but even Dawid says yes. I'd like to know what you think. **FH**: Very well written article and interesting history. I think that consideration of what constitutes cheating is a very useful exercise. It is also useful to back-construct a prior that makes certain things possible or likely. For example, sampling to a foregone conclusion happens when a statistician uses a smooth prior but her critic uses a prior with an absorbing state (point mass) at the null. Bonferroni correction is equivalent to a prior that specifies that the probability that all null hypotheses are true is the same no matter how many hypotheses are tested (a very strange assumption). A Bayesian can cheat by changing the prior after observing data or by improper conditioning, e.g., acquiring more data, finding the cumulative result to be less impressive than it was before, and rolling back the data to only analyze the smaller sample. But choosing a smooth prior before looking at the data and having the prior at least as skeptical as the critic's implied prior, will result in a stream of posterior probabilities that are well calibrated independent of how aggressive were the 'data looks'. Not only are the posterior probabilities calibrated, but the posterior mean is perfectly calibrated, discounted by the prior more when stopping is very early. The frequentist correction for bias in the sample mean upon early stopping is quite complex. 
Frequentists tend to be very good at correcting p-values for multiplicity but very bad at correcting point estimates for same. **Deborah Mayo**: If I can't interest you in learning that someone tried and tried again to achieve a stat sig result (or a HPD interval excluding the true value), even though with high or max probability this can be achieved erroneously, then your view of "being cheated" and mine are very different. But I'm glad you stick with this, it's the Bayesians who try to wrangle out of the consequences of accepting the LP that bother me. Please see [this](https://errorstatistics.com/2014/04/05/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour-2) blog post. **FH**: You'll have a hard time convincing me of the relevance of things that might have happened that didn't. And simulation studies for the frameworks I envision demonstrate that the stopping rule is really irrelevant. The simplest example is a one-sample normal problem with known variance where one tests after each observation is acquired, resulting in n tests for an ultimate sample size n. If you are thinking of a particular issue you might sketch the flow of a simulation that would demonstrate it. **Deborah Mayo**: Of course it comes from increasing the chances you give for the data to be more extreme, but the relevance of such outcomes that didn't occur is just what's denied by Bayesians who endorse the Likelihood Principle. Sequential trials were advocated by frequentists long ago (Armitage). He also argued that optional stopping also results in posteriors being wrong with high probability. But Savage switched to a simple point against point hypothesis to defend the LP. That is still going on today as the latest reforms champion the irrelevance of optional stopping––to them. 
Aside from inability to control error probabilities, the key problem with all Bayesian accounts is they never quite tell us what they're talking about (and they certainly don't agree with each other), except perhaps empirical Bayesians. Is the prior/posterior an expression of degree of belief in various values of parameters? How frequently they occur in universes of parameters? As the number of parameters increases, the assessments (generally default priors) move further and further from anything we can get a grip on. They are not representing background information. And if we're going to test them, we'll have to resort to something like significance tests. But given what you said earlier, about only caring to match the beliefs of "the judge", the whole business of outsiders critically appraising you may not matter.

**FH**: I appreciate that Steven, and only take issue with your "find fault" sentence. True, more fault lies with unscientific work than with statistical paradigms, but there are major problems with p-values, and p-values lead to many downstream problems as I've tried to catalog. The paradigm really matters. Not all of the fault lies with practitioners. This becomes more clear for those like me (a follower of David Spiegelhalter) who embrace Bayesian posterior probabilities and favor skeptical priors. Once you do the right simulations or grasp the theory (the former being easier for me) you'll see things such as the fact that multiplicity comes from the chances you give data to be more extreme, not the chances you give assertions to be true. And the fact that frequentist thinking leads usually to fixed sample size designs turns out to be a huge issue in experimental work. You're giving me the idea to post a separate article on my journey and how this relates to RMS, which I hope to complete in the next few days.

**Steven McKinney**: I apologize for my choice of words.
I am not threatened by the amount of work we have ahead of us to improve the situation, my entire career has been working to improve the situation. That’s why I own two copies of your RMS book and regularly use your software. I am threatened when an avalanche of pop-culture blog posts appear, inappropriately attributing fault to a statistic or paradigm when the fault lies with people mishandling and misinterpreting that statistic, or paradigm. As you say, you can do unscientific non-reproducible research using any paradigm and statistics thereof. As you state in the introduction above, “As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.” I hope that the title and opening sentence, and further general discussion of this topic, progresses and becomes “A Litany of Problems With Misinterpretation of p-values” “In my opinion, inappropriate handling and interpretation of null hypothesis testing and p-values have done significant harm to science.” Precisely. This is why I do not understand this odd stance you have adopted. Bayesian methods require the same discipline in handling analyses and interpreting findings as any other paradigm. When you and Gelman and others lead the way in demonstrating Bayesian analytical methods, I predict other blog postings such as "The litany of problems with Bayes factors" after groups with no statistical discipline misinterpret and mangle findings from such analyses. But for those of us still striving to find truth in data, I do genuinely look forward to seeing more Bayesian-based approaches in future revisions of Regression Modeling Strategies, along with the attendant steps needed to interpret findings in a disciplined manner.

**FH**: I take offense at your choice of words. The fact that I have not been a Bayesian my whole career and hence do not have a plethora of Bayesian examples doesn't hold me back from seeking better approaches.
You seem to be threatened by the amount of work we have ahead of us to improve the situation. I am not, which is one reason I'm working closely with FDA on this very problem. You can do unscientific, non-reproducible research using any paradigm. I want to do good science and have results that have sensible interpretations.

**Steven McKinney**: This is exactly my point. Whether you use Bayesian methods, as the Duke group did, or you use frequentist methods, as Baggerly and Coombes did in reviewing several of the Duke analyses, you need to exercise discipline, using many ideas some of which are very ably described in Regression Modeling Strategies. If you want to start using more Bayesian based methods, have at it. I look forward to seeing your examples. What I find disingenuous and concerning are your statements “Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.” Where do you show this quantification, that Bayesian and likelihood paradigms solve the greatest number of real problems and have the fewest number of faults? That certainly wasn’t apparent in the Duke fiasco. I found some course notes online, “An Introduction to Bayesian Methods with Clinical Applications” from July 8, 1998, by Frank Harrell and Mario Peruggia. That’s nearly 20 years ago. Why does it take so long to bring Bayesian methods into practice? If you haven’t been able to do it in nearly 20 years, how are other mere mortals to do it?

**FH**: The statistical paradigm had almost nothing to do with this. Ask the forensic biostatisticians Baggerly and Coombes who uncovered the whole problem. In their wonderful paper (Annals of Applied Statistics 2009) neither Bayes nor prior appear.
**Deborah Mayo**: Following McKinney on this, it was their confidence that their prior freed them from having to have genuine hold out data, that they thought made it permissible. McKinney knows more about the specifics. The blatant ignoring of unwelcome results was brought out by the whistleblowers. Of course if you're not in the business of ensuring error control, actions that alter them will be irrelevant.

**FH**: Ask yourself how many investigative journalism articles mentioned Bayes when writing about the Duke fiasco. I think the answer is zero, because Bayesian modeling had nothing at all to do with the problem. Your comments are most curious. I can't think of any analytical method that cannot be abused, even just using descriptive statistics. You might also look at why Mike West resigned from the collaboration early on.

**Steven McKinney**: Anil Potti and Joseph Nevins used Bayesian methods recommended by Mike West during the Duke genomics fiasco a few years back. They overfitted models and committed many other analysis gaffes. Bayesian modelers are not immune from running afoul of proper interpretation of modeling findings to shed light on scientific phenomena. Proper handling of statistics and interpretation of findings is needed in any statistical exercise, Bayesian, frequentist or otherwise. Given the wealth of great advice in Regression Modeling Strategies and other writings, I am taken aback to see Frank Harrell blame a statistic when the problem is people misinterpreting a statistic. A p-value is just a statistic, with certain knowable distributional properties under this and that condition. A Bayesian Highest Posterior Density region can be improperly obtained and misinterpreted just as readily.

**omaclaren.com**: I'm glad we can agree on A. I don't think this is a satisfactory argument against p-values and neither is it satisfactory against likelihood. We can leave the other argument for another day.
For now though I'll note that while I have quite a lot of sympathy for the 'pure' likelihood approach and/or evidential approaches, I don't find your axiom satisfactory. There is of course a basic axiom behind p-values - Fisher's disjunction. But I don't find this fully satisfactory either.

**Jeffrey Blume**: As for details on the evidential framework I alluded to above, a reference is: Blume JD. Likelihood and its evidential framework. In: Dov M. Gabbay and John Woods, editors, Handbook of The Philosophy of Science: Philosophy of Statistics. San Diego: North Holland, 2011, pp. 493-511. Any system purporting to measure evidence for or against a hypothesis ought to be subjected to the same scrutiny. This entails identifying three things: (1) the metric that will be used for measuring evidence, (2) the probability of observing a misleading metric under certain experimental conditions (i.e., error probabilities), and (3) the probability that an observed metric is indeed misleading (e.g., this would be a false discovery rate). Systems for measuring evidence can then be compared on the basis of these three criteria and perhaps their axiomatic justification. These three concepts are distinct, so a single mathematical quantity like the tail area probability can’t possibly represent all three. The above paper uses the likelihood paradigm to illustrate the point. Those mills are going to be running overtime. I simulated a string of 100,000 standard normal deviates and then computed the running z-statistic for testing the null hypothesis that the mean is zero (assuming the variance is known). In 10,000 simulations, only 74% rejected at some point (not bad for 100,000 looks at the data). The 25th, 50th, and 75th percentiles of the stopping time were 10, 98, 1533. The mean stopping time was ~5200. That means, for example, that 25% of the rejections occurred when the sample mean was around 0.05 (=1.96/sqrt(1533)). That’s 5% of a standard deviation.
In practice, an observed difference that small is often a rounding error. The point is not that these are desirable properties, but rather that when these tests make a Type I Error, they often support hypotheses very close to the null. So close, in fact, that they may be practically indistinguishable from the null. If we only counted the rejections where the difference was at least 25% of a standard deviation, the rejection probability drops to about 20%. Not great, but not terrible for 100,000 looks at the data.

**Deborah Mayo**: No time but to register: "absolutely absurd" though grist for my mills if Blume is for real. See my blog for details. errorstatistics.com

**Jeffrey Blume**: The key is that once data is observed, we compute the likelihood ratio to measure the evidence. This, of course, does not depend on the stopping rule. And if we are concerned that the data are likely to be misleading because we looked a lot during our study, we would compute the probability that our observed data are mistaken. The point is that the measure of the evidence (the LR) and the probability that the observed data are mistaken (written as P(H_0|LR>k)), are not the same as the probability that the study design will generate misleading evidence (written as P(LR>k|H_0)). Error probabilities have an important role in statistics; they just don’t represent the strength of the evidence in the data or the probability that the observed data are misleading. The third sentence is a nice example of a common misinterpretation of the LP. The LP only says that the stopping rule is irrelevant for the measurement of the strength of evidence in data. It does not say the stopping rule is irrelevant for everything. Confusion reigns because we often fail to distinguish between the measure of the strength of evidence and the probability that the evidence will be misleading. Likelihoodists follow the LP and use error probabilities. When we design a study, we compute its frequency properties.
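The running z-statistic simulation described earlier in this thread can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the function name, the seed, and the much smaller default sizes (2,000 looks and 500 simulated studies, chosen so it runs quickly) are assumptions, so the printed figures will not match the 100,000-look results quoted in the thread. The last line takes one possible reading of "only counting rejections where the difference was at least 25% of a standard deviation", namely the sample mean at the first rejection.

```python
import numpy as np

def simulate_first_rejections(n_looks=2_000, n_sims=500, z_crit=1.96, seed=0):
    """Under a true null (mean 0, known variance 1), test after every
    observation and record the first look at which |z| exceeds z_crit.
    All names and default sizes here are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n = np.arange(1, n_looks + 1)
    stop_time = np.full(n_sims, -1)           # -1 means "never rejected"
    mean_at_stop = np.full(n_sims, np.nan)    # sample mean at first rejection
    for i in range(n_sims):
        y = rng.standard_normal(n_looks)
        running_mean = np.cumsum(y) / n
        z = running_mean * np.sqrt(n)         # running z-statistic, sigma = 1
        hits = np.flatnonzero(np.abs(z) > z_crit)
        if hits.size:
            stop_time[i] = n[hits[0]]
            mean_at_stop[i] = running_mean[hits[0]]
    return stop_time, mean_at_stop

stop_time, mean_at_stop = simulate_first_rejections()
rejected = stop_time > 0
print("P(reject at some look):", rejected.mean())
print("stopping-time percentiles (25, 50, 75):",
      np.percentile(stop_time[rejected], [25, 50, 75]))
# Share of all simulated studies that rejected with |sample mean| >= 0.25 SD
print("P(reject with |mean| >= 0.25):",
      (np.abs(mean_at_stop[rejected]) >= 0.25).sum() / stop_time.size)
```

Raising `n_looks` and `n_sims` toward the values quoted in the thread should move the results toward those figures, at a considerable cost in run time.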
If we plan to look at the data many times, the probability of observing misleading evidence gets inflated (lucky for us, however, that inflation is bounded above). These frequency properties help us choose between designs. Modern Bayesians do the same. I think it is important to understand why this happens. When the null hypothesis is true the likelihood masses right on top of the null value. Every now and then, the tails of the likelihood shift by an infinitesimal amount and this tiny tiny shift causes the classical hypothesis test to reject because the benchmark for rejecting is measured in standard errors (which are rapidly shrinking to zero) and not standard deviations. It leaves us in the odd position of claiming to reject the null hypothesis when data support hypotheses that are arbitrarily close to the null. If instead we decided to only count the statistical rejections that also supported clinically meaningful differences, the resulting probability would not approach one or even be all that high. Overly broad generalizations can lead to confusion. So let’s be specific and consider the points. The first clause is true for routine usage of classical hypothesis and significance tests when the sample size is allowed to grow forever. I have yet to see a study where this was possible, so the force of this point is diminished I think. If we take the case where the sample size is finite, the claim is false. The probability might be high in some cases, but it will not be one, and certain cases can be constructed so the probability is not so high. Regardless, this is an excellent reason to avoid using p-values in my opinion. Why not use a tool that does not have this property (e.g., a likelihood ratio)? And...I would not be 'ok' with “inference based on the p-value function” largely because there is no axiom to support its use. The axiom would be something like “A set of data supports the hypotheses that do a better job at predicting the data and data more extreme”.
I don’t find this compelling because of the inclusion of “more extreme”, which means different things to different people. The NP hypothesis testing framework is clear that “more extreme” means large likelihood ratios. In contrast, significance testing often defines “more extreme” as further away in hypothesis space (tail areas). These two definitions are not the same, which confuses matters. Large likelihood ratios can lead to instances where the rejection regions are near the null hypothesis and not in the tails (e.g., comparing means of two normal models with different variances). So I'll jump in as a Likelihoodist. Likelihood methods for measuring statistical evidence have a very precise framework. The basic axiom is this: the hypothesis that does a better job predicting the observed data is better supported by the data. For (a), I don’t see the “reverse” conditioning here as problematic; this is the natural way to compare predictions. For (b), no issues here for the likelihood approach. The predictions from each hypothesis are directly compared and we just report which hypotheses did the better job of predicting the observed events. No need to rely on proof by contradiction, which Frank correctly points out is problematic when the direct implication is replaced by probabilistic tendency. For (G): By design, likelihood methods report which hypotheses are better supported than others given the observed data and model. You need a model under which to specify the predictions of each hypothesis and that model is often prescribed by context. But I would guess this is not what Frank is referring to, if only because everyone generally agrees on the form of the likelihood function. Additional information, such as data from a previous study, would simply be combined in the likelihood (e.g., if the studies are independent, one could multiply the likelihoods).
Information that represents personal belief or some other hunch would best be incorporated in the Bayesian framework. Likelihoodists want to know ‘what the data themselves say’, not ‘what the data say after I add in prior information’. The first sentence is a little ironic, no? Many people have thought long and hard about these issues and we’ve been debating them for over a century. And there are plenty of examples that don’t assume a Bayesian solution that make hypothesis tests look downright insane (Pratt’s voltage detector example, in 1961 I think, is one). p-values are often not reproducible because they confound the effect size and the level of precision. Also two equal p-values don’t imply the same amount of “evidence”, so it’s not clear why one would care or expect them to replicate. The thing to replicate is the effect size, not the p-value. I’d agree but with a caveat. The scientific benchmark for what is discrepant should not change. We can use statistical tools to assess if that benchmark is achieved or not, but we should not be using statistical tools to set the benchmark, which is what hypothesis testing effectively does. The problem is that the scientific force of the discrepancy changes (it depends on standard errors), so we can end up with statistically significant results that are not actually scientifically discrepant. Personally, this is why I prefer other approaches (e.g., likelihood, Bayesian) that respect the original scale of the data upfront.

**FH**: When the analyst uses the same prior as the judge, posterior probabilities are perfectly calibrated independent of the stopping rule.

**Deborah Mayo**: Anyone who ignores the stopping rule can erroneously declare significance with probability 1, and with the corresponding Bayesian priors can erroneously leave the true parameter value out of the Bayesian HPD interval. See stopping rules on my blog. Anyone who obeys the LP cannot use error probabilities.
I think only subjective Bayesians are still prepared to endorse that and staunch likelihoodists. Just what today's practice needs. For that matter, let replicationists try and try again until they can get a stat sig effect. Then the replication rate will be 100%. I have, incidentally, disproved alleged proofs of the LP.

**FH**: Not sure. I'm going to invite a likelihoodist to join the conversation.

**omaclaren.com**: Well, it is related to Fisher's fiducial approach - see e.g. the reminiscences of Fraser at the end [here](https://www.utstat.toronto.edu/dfraser/documents/ARST04-Fraser-copyedited.pdf) - but more generally is just standard Fisherian-style confidence theory as used by e.g. Cox, Fraser etc. and as opposed to Neymanian confidence and decisions. I assume the p-value function will depend on stopping rules etc. as generally understood - i.e. they violate the strong likelihood principle. But, my more general point - rather than advocating for likelihood, Bayes or confidence theory - is that points A and B seem to either (logically) apply to both likelihood and confidence theory/p-value functions or to neither.

**FH**: Not sure. Are you trying to get at Fisher's fiducial method? I prefer having a full model for inference, but I do like the likelihood school of inference because it respects the likelihood principle. For example inference is independent of a stopping rule. How do you handle multiplicity/sequential stopping rules for a p-value function?

**omaclaren.com**: Fair enough. But would you agree it is no more or less subject to A and B than Likelihood? Or do you also disagree with this?

**FH**: That's the paper. I haven't looked into the Box approach. I would use a prior that is eventually overwhelmed by data, get agreement on it, and not often revisit the prior.

**Bill R**: Are you referring to the "Bayesian Approaches to Randomized Trials" paper (JRSS A, 157, part 3, 357-416)?
In section 6.3 Spiegelhalter recommends using Box's generalized p-value to check prior data compatibility. If I have a point prior (null) would that not simplify to Fisher's p-value?

**FH**: That's not satisfying. High-level view: Aside from non-study information, the p-value is monotonically related to what I need, but it is not calibrated to be the metric I need.

**omaclaren.com**: Would you then be OK with inference based on a p-value function defined analogously to a likelihood function? That is, PF(theta;y0) := prob(y>y0; theta) considered as a function of theta for y0 fixed. Is this still subject to A or B?

**FH**: I believe this would be Edwards-Royall. I think that likelihood has a bit of a problem with G but not with A or B.
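To make the p-value function PF(theta; y0) := prob(y > y0; theta) from the exchange above concrete, here is a minimal sketch for one simple case. The setting is an assumption chosen for illustration and is not taken from the thread: y is the mean of n normal observations with known standard deviation sigma, and the observed mean of 0.3 with n = 50 is hypothetical.

```python
import math

def p_value_function(theta, ybar_obs, n, sigma=1.0):
    """PF(theta; y0) = P(Ybar > ybar_obs; theta) when Ybar ~ N(theta, sigma^2 / n).
    The normal model with known sigma is an illustrative assumption."""
    z = (ybar_obs - theta) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2.0))   # upper-tail normal probability

ybar_obs, n = 0.3, 50   # hypothetical observed mean and sample size
for theta in (-0.2, 0.0, 0.3, 0.6):
    print(f"theta = {theta:+.1f}  PF = {p_value_function(theta, ybar_obs, n):.4f}")
```

With y0 held fixed, PF increases with theta and equals 0.5 at theta = y0; evaluating it at theta = 0 gives the usual one-sided p-value for a null mean of zero in this fixed-n setting. The sketch assumes a fixed sample size, so it says nothing about how a stopping rule would change the function.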