The use of the p-value, with the five percent probability (or any other percentage) as the be-all and end-all for establishing whether there is a significant effect or not, lacks coherency.
At Easter, almost every Spanish family gets ready to prepare a delicious dessert called torrijas. Torrijas are basically the Spanish high-sugar, deep-fried version of French toast. After soaking the slices of bread in milk, dipping them in beaten eggs and frying them in hot olive oil, one needs to cover those fried slices in a mixture of cinnamon and sugar (lots of both!). It seems simple, right? However, the last step was passionately debated by my mother and her five sisters this last Easter. There were two camps: one defending that the sugar must be added before the cinnamon, and the other defending the reverse. And only I dared to say that the discussion was pointless, because there is no way in the world to notice such a difference. I was actually ready to perform a sort of experiment to prove my point, but my cousins were too hungry and, I am afraid, after years of pestering them with my scientific curiosity, they do not have any patience left.
Nevertheless, that discussion brought to my mind the actual topic I want to discuss in this post: the P-value and its use in hypothesis testing. As with the torrijas, it all started more than 80 years ago with a British woman who claimed to be able to tell whether milk or tea had been added first to a cup. R.A. Fisher, a young and curious statistician, devised an experiment aiming to prove that she could not do what she claimed (Figure 1). He presented her with 8 cups in random order – 4 with the milk poured in first and 4 with the tea poured in first. She was then challenged to pick the 4 cups in which the milk was added before the tea. It turned out that the British woman was actually able to notice the difference. Indeed, as the story goes, she successfully identified the cups, and this, ladies and gentlemen, laid the basis of the P-value and its use in hypothesis testing – a concept deeply embedded in science in general, and in ecology and biology in particular. At this point, I need to be honest and confess that I am not a ‘P-value lover’, so this post is by nature biased, as I will use it to criticize the P-value. However, there is still an ongoing debate around this topic; those who want to form their own opinion should check this forum in the journal Ecology. Avoiding mathematical formalities, here I will present only a simple example that supports my suspicions against this practice.
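Under the null hypothesis that the lady was merely guessing, the chance of picking all 4 milk-first cups correctly follows directly from the design described above: there is exactly one favourable choice out of all the ways to select 4 cups from 8. A quick sketch in Python:

```python
from math import comb

# Number of equally likely ways to choose 4 cups out of 8
n_ways = comb(8, 4)          # 70

# Only one of those choices names all 4 milk-first cups correctly
p_all_correct = 1 / n_ways
print(p_all_correct)         # 1/70, roughly 0.014
```

A pure guesser would succeed only about 1.4% of the time, which is why her perfect score was taken as evidence against guessing.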
P-VALUE DEFINITION AND ITS USE
The American Statistical Association (ASA) has done the scientific community a great service by describing what a P-value is and what it is not, and by highlighting its misuses. According to the ASA: “Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g. the mean difference between two compared groups) would be equal to or more extreme than its observed value”. In other words, the P-value is the probability of obtaining a result (e.g. the mean difference between two samples) at least as extreme as the observed result. In experiments, for example, P-values are used to decide whether there is a significant difference between the control and the treatment samples, with P-value ≤ 0.05 the most commonly used threshold. Indeed, those of us dealing with experiments have been tempted to coax another 0.001 out of our P-value = 0.051 output (maybe if I just rounded to two decimal places…), raise a toast to P = 0.049, or desperately search for other tests to successfully cross the finish line and give credit to our experiments. The P-value ≤ 0.05 red line is what triggered my suspicion (and the criticism of a large part of the scientific community, already in the 1990s) against this overused and widely accepted practice.
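The ASA definition can be read computationally: shuffle the group labels many times and count how often the shuffled mean difference is at least as extreme as the observed one. A minimal permutation-test sketch in Python (the sweetness measurements below are made up purely for illustration):

```python
import random
import statistics

def permutation_p_value(group1, group2, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    The p-value is the fraction of label shufflings whose mean
    difference is at least as extreme as the observed one -- a
    direct computational reading of the ASA definition.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(group1) - statistics.mean(group2))
    pooled = list(group1) + list(group2)
    n1 = len(group1)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm

# Hypothetical sugar measurements (g per piece) for two torrija groups
group1 = [2.1, 2.4, 1.9, 2.6, 2.3]   # cinnamon added first
group2 = [2.0, 2.2, 1.8, 2.1, 1.9]   # sugar added first
print(permutation_p_value(group1, group2))
```

The permutation approach makes the “at least as extreme” clause explicit, with no distributional assumptions beyond exchangeability of the labels under the null.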
P-VALUE CAN BE BOGUS: AN EXAMPLE OF LACK OF COHERENCE
Let us come back to the sweet torrijas. Imagine that we run an experiment with two groups: Group 1 had the cinnamon added before the sugar and Group 2 had the sugar added before the cinnamon. We measure the amount of sugar per piece – as a sweetness indicator – of each torrija in both groups. For simplicity, we scale those values to have unit variance (i.e. the variance of the sample is equal to one). Then, we can compute the mean difference in sugar per piece between Group 1 and Group 2. Say now that we want to test two hypotheses: (1) the difference in the amount of sugar per piece between groups ranges from -0.5 to 0.5 g/piece, and (2) the difference ranges from -0.82 to 0.52 g/piece. Suppose that the observed value of this difference is 2.18 g/piece (please note that the numbers in this example are chosen to coincide with the example in Schervish, 1996). Under these circumstances, Schervish (1996) shows that the P-value of the first hypothesis test is 0.0502, while the P-value of the second test is 0.0498. In other words, if we were to test each hypothesis at an error rate of 0.05, we would reject the second hypothesis (-0.82, 0.52) but fail to reject the first one (-0.5, 0.5)! This is incoherent: the first interval is entirely contained within the second, so any evidence strong enough to reject the wider hypothesis should be at least as strong against the narrower one. This shows that the P-value can be bogus and that its use for hypothesis testing lacks coherency. And this is just one among thousands of examples.
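For a single observation x from a normal model with unit variance, the two numbers above can be reproduced with a short calculation. The formula below – the tail mass beyond x under each endpoint of the interval null, for an observation lying above the interval – is my reconstruction of a calculation that yields the values reported by Schervish (1996), not necessarily his exact derivation:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_p_value(x, a, b):
    """P-value for the interval null H0: a <= theta <= b, given a
    single observation x ~ N(theta, 1) with x above the interval:
    probability mass at least as extreme as x under the endpoints."""
    return phi(b - x) + phi(a - x)

x = 2.18  # observed mean difference (g/piece)
print(round(interval_p_value(x, -0.50, 0.50), 4))   # hypothesis 1: 0.0502
print(round(interval_p_value(x, -0.82, 0.52), 4))   # hypothesis 2: 0.0498
```

The narrower hypothesis receives the larger P-value, so a fixed 0.05 cutoff rejects the wider interval while retaining the subset it contains – exactly the incoherence described above.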
WHAT CAN WE DO?
We need to accept that nature is messy and inference is just hard. I do not intend to make you stop using the P-value in hypothesis testing, because this practice is somewhat useful and has been used for ages. However, we need to be cautious and, more importantly, we need to move forward. I would like to encourage the whole scientific community to embrace, or at least explore, new techniques such as Bayesian and/or information-theoretic approaches (which might be the focus of future posts if I manage to find the perfect dessert to write about them!). Bottom line: do not follow the P-value rule without understanding what you are doing. Explore, read about it, and form your own opinion! Learning new things is always interesting and FUN!
Schervish, M. J. (1996). P values: what they are and what they are not. The American Statistician, 50(3), 203-206. DOI: https://doi.org/10.1080/00031305.1996.10474380