I am certainly interested in this and it is a part of a lecture I have been giving lately to clinicians on Understanding What Works in Medicine. But I think we might need to broaden the scope a bit. There is, as I am sure you know, a huge literature debating the proper role and value of the p value in science, much of it outside the statistics journals, and I think the fact that we are even talking about adding to that literature means that the discussion is not progressing. There are, I believe, at least four reasons for that.

First, the p value's flaws are counterbalanced by the fact that it seems to simplify something that otherwise is not easy, namely understanding what data say. In that regard, it is like the journal Impact Factor, or the US News Ranking of Hospitals or Universities: a very poor-quality measure, but nothing else of equal simplicity has been offered to compete. This is tough to fix. I have been trying to figure out how to upset the dominance of the Impact Factor for journal ranking, but find that even people who know better still worship at that shrine.

Second, most people don't actually understand what the p value is. (That is also true of the Impact Factor.) It is so counter-intuitive that it is difficult for people to remember what it actually represents. The formal definition is not much help. The metaphor I used in my lecture last month at Frank's old stomping grounds of UVA was that the p value is like a clinical diagnostic test for data, a serum p value of sorts. A comparable test in medicine is the ESR, or sed rate, which provides information about "inflammation" (another metaphor) and locates the patient in one of three spaces (no inflammation evident from the test, mild to moderate inflammation, severe inflammation), but provides absolutely no information as to the cause of the abnormality. So to figure out what is actually going on with the patient (data) after receiving an abnormal ESR (p value) result, one needs to do more testing and examination. Confidence intervals, anybody? But even there, the interpretation is not so easy, and many make the mistake of assuming that all values in the CI are equally likely given the data. Ken Rothman published some nice figures showing p value functions (a full stack of CIs) that illustrate the difference between the p<0.05 rule and what the data actually say; I think they show that we have been looking at the tail (no pun intended) in order to understand what sort of dog we have. Actually, that might be a pretty good metaphor! A rough sketch of the p value function idea follows below.

Third, most do not understand the difference between the Fisher p value and the Neyman-Pearson (NP) hypothesis test, and how those tools were intended for different purposes. I think the hybridization of the two has added to the confusion, particularly the conflation of the alpha/Type I error rate with the observed p value and the resulting desire to adjust the p value for multiplicity. I have long been thinking about writing on the confusion around the multiple comparisons issue, since I have been surprised at how many really thoughtful people seem to accept something that makes no sense to me: the assertion that looking at the data changes something about the data (it doesn't, of course, but that is often how it is presented).

Finally, the whole p value debate gets tangled up in the larger conflicts among the advocates of the Bayesian vs Frequentist vs Likelihood perspectives.
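Since not everyone has Rothman's figures at hand, here is a minimal sketch of what a p value function is, using synthetic data and a normal approximation. The data, the numbers, and the use of scipy are my own illustration, not Rothman's example: for every candidate value of the true mean, compute the two-sided p value; each horizontal slice of the resulting curve at level alpha is the corresponding (1 - alpha) confidence interval.

```python
# Sketch of a Rothman-style p value function for a sample mean,
# using synthetic data and a normal approximation (illustration only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=30)   # hypothetical study data

mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))

# Two-sided p value for every candidate value of the true mean
candidates = np.linspace(mean - 4 * se, mean + 4 * se, 201)
z = (mean - candidates) / se
p_values = 2 * stats.norm.sf(np.abs(z))

# A horizontal slice of this curve at level alpha is the (1 - alpha)
# confidence interval; the curve peaks at the point estimate, so values
# inside a CI are not all equally supported by the data.
for cand, p in zip(candidates[::20], p_values[::20]):
    print(f"candidate true mean = {cand:5.2f}  ->  two-sided p = {p:.3f}")
```

Plotting p_values against candidates gives the tent-shaped curve peaking at the point estimate, which makes it plain how much more the data say than the single yes/no verdict of p<0.05.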
All these debates focus on the data that have been collected and ignore the huge ambiguities outside the sphere of the data, the dark matter of science. We make the convenient but highly improbable assumptions that we have somehow obtained a representative sample of some theoretical underlying population and that, if we just do the correct form of analysis, we will arrive at the truth for sure. And when it does not work out that way, we never fail to be surprised! And then there is the question of what the mathematical theory of probability corresponds to in the natural world, if anything.