The Nerdiest Debate of the Century
Ah, the classic p-value. Hypothesis testing and the p-value are familiar concepts to most of us. We were first taught about them in high school and again at university if we took fundamental statistics. At first glance, they seem like all-powerful, reliable tools to aid our decision-making, and in certain applications they are. However, little did I know that the dynamic duo is not infallible.
In 2016, a statement by the American Statistical Association (ASA) centred on p-values sparked one of the most important and nerdiest debates of the century¹. The crux of the problem? The reproducibility and replicability of scientific conclusions. Every piece of scientific research has the same KPI: does it have enough scientific evidence to support its hypotheses? Only when there is “enough” scientific evidence can those hypotheses be regarded as “correct” conclusions.
Classically, hypothesis testing and the p-value are used in tandem as proof that there is enough evidence to support the hypotheses. But there are many examples where they lead to spurious conclusions instead. How could this happen? We will look at two of the most cited causes.
Cause no. 1: Misinterpretation of the p value
The p value is one of the most misinterpreted quantities in the statistical world. As a university student taking statistics classes, I struggled with it too. The p value is defined as:
Given that the null hypothesis is true, the p value is the probability of obtaining a test statistic at least as extreme as the one observed.
Perhaps the convoluted definition of the p value contributes to its misinterpretation. I initially thought that the p value was equivalent to the probability that a given hypothesis is true. But they are not equivalent; they are two different things! The p value is always computed under the assumption that the null hypothesis is true, so we can think of the null as a working assumption that we hold. In the classical context of coin flipping, for example, if a coin is fair, we know that on average, over many flips, it will come up heads half the time. That will be our assumption (although in this case we know it is true). When we do hypothesis testing, we evaluate the observed values against that assumption, which means any conclusion we draw will always be in the context of our assumed hypotheses. For our coin-flipping example, if we reject the null hypothesis that the coin is fair, we can conclude that the coin is likely unfair.
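To make the coin-flipping example concrete, here is a minimal sketch in Python (the function name and the example numbers are my own, not from any real experiment) that computes the exact two-sided p value for a coin-flip experiment using only the standard library:

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_fair: float = 0.5) -> float:
    """Two-sided p value for a coin-flip experiment: the probability,
    assuming the null hypothesis (a fair coin) is true, of observing a
    head count at least as far from the expected count as the one seen."""
    observed_dev = abs(heads - flips * p_fair)
    total = 0.0
    for k in range(flips + 1):
        # Sum the probability of every outcome at least as extreme
        # (as far from the expected count) as the observed one.
        if abs(k - flips * p_fair) >= observed_dev:
            total += comb(flips, k) * p_fair**k * (1 - p_fair)**(flips - k)
    return total

# 60 heads in 100 flips of a supposedly fair coin
print(binomial_p_value(60, 100))
```

For 60 heads in 100 flips, the p value comes out just above 0.05, so at the conventional threshold we would narrowly fail to reject the fair-coin hypothesis.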
Things, however, are not so clear cut when it comes to experiments, because we don’t actually know whether our null hypothesis is correct in the first place. Just because we reject the null hypothesis does not mean the observed data supports the alternative. It could also mean that our assumption about what the null hypothesis should be is utter nonsense. In a courtroom setting, for example, when a judge rules that a defendant is not guilty, does that mean he is innocent? Not necessarily. Maybe he was not found guilty because we were trying him for the wrong crime.
In fact, when Ronald Fisher first developed significance testing, he only intended the p value to be used to establish whether further research was justified, far from being “yes or no” conclusive evidence².
Cause no. 2: The imaginary threshold of 0.05
This is another contributing factor to the misuse of the p value and hypothesis testing. The alpha level of 0.05 is a mere convention. As a result, in an experimental context, for example, it is always possible to adjust the threshold so that the hypothesis test comes out statistically significant. As if that were not enough, a phenomenon called “p value hacking” is also a common occurrence. P value hacking is conscious or subconscious data manipulation to produce a desired p value. One way to do this, for example, is simply to keep increasing the sample size of the study: with a large enough sample, even a negligible effect will produce a tiny p value.
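The sample-size effect is easy to demonstrate. The sketch below is my own illustration, assuming a one-sample z-test with a known standard deviation of 1: a fixed, negligible effect of 0.02 standard deviations drifts from wildly non-significant to “significant” purely because n grows.

```python
from math import erf, sqrt

def z_test_p_value(effect: float, n: int) -> float:
    """Two-sided p value for a one-sample z-test of a standardized
    mean difference `effect`, measured on n observations (sd = 1)."""
    z = effect * sqrt(n)
    # Standard normal tail probability via the error function,
    # doubled for a two-sided test.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# The same tiny effect, tested at ever larger sample sizes:
for n in (100, 10_000, 1_000_000):
    print(n, z_test_p_value(0.02, n))
```

Nothing about the effect changed between the three runs; only the sample size did. That is why “significant” should never be read as “large” or “important”.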
Does this mean we should abandon the p value altogether?
No. I believe the traditional approach of hypothesis testing and the p value is still valuable today. Abandoning the p value completely would create a butterfly effect across industries: it would render findings from scientific exploration unverifiable and invalidate useful scientific conclusions. Take the COVID-19 situation, for example. Prior to the vaccine roll-out, many were in a race against time to develop one. Once a vaccine was developed, we needed a way to validate its effectiveness. How do we do that? You guessed it: hypothesis testing and the p value.
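As a rough sketch of what such a validation can look like, here is a pooled two-proportion z-test comparing infection rates in a vaccinated group and a placebo group. The counts below are invented purely for illustration and do not come from any real trial:

```python
from math import erf, sqrt

def two_proportion_p_value(cases_a: int, n_a: int,
                           cases_b: int, n_b: int) -> float:
    """Two-sided p value for H0: the infection rate is the same in
    both groups, using the pooled two-proportion z-test."""
    p_a, p_b = cases_a / n_a, cases_b / n_b
    pooled = (cases_a + cases_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical numbers: 8 infections among 20,000 vaccinated
# participants vs 160 among 20,000 who received a placebo.
print(two_proportion_p_value(8, 20_000, 160, 20_000))
```

With counts this lopsided, the p value is vanishingly small, which is the kind of result that lets us reject the null hypothesis that the vaccine does nothing.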
The COVID-19 application is just one of many successful implementations of the classical approach. If that is the case, does it mean we should disregard the concern that sparked the debate? Absolutely not.
What can we do?
And that raises the question: what can we do? I believe balance is the key.
To prevent further misuse of hypothesis testing and the p value, it is important to integrate both the frequentist and the Bayesian approach. Frequentists think in terms of “black or white”: either yes or no. Bayesians, on the other hand, think in terms of “black and white”: a neutral grey. The following article gives a comprehensive explanation of the difference between the two schools of thought³. In our context, integrating the two schools of thought can look like the following:
- using the p value and hypothesis testing only as a quick litmus test for your hypothesis, i.e. to decide whether to change the null hypothesis or dive deeper into the findings
- involving Subject Matter Experts (SMEs) in every step along the way
- using other metrics for conclusive significance testing
- complementing standard Null Hypothesis Significance Testing with Bayesian hypothesis testing⁴
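As a concrete illustration of that last point, here is a minimal Bayesian counterpart to the coin-flip test: a Bayes factor comparing the fair-coin hypothesis against an alternative with a uniform prior on the coin’s bias. This is a standard beta-binomial calculation; the function name and numbers are my own.

```python
from math import comb

def bayes_factor_01(heads: int, flips: int) -> float:
    """Bayes factor comparing H0 (fair coin, p = 0.5) against H1
    (bias unknown, uniform prior on p), via marginal likelihoods.
    Values greater than 1 favour the fair-coin hypothesis."""
    marginal_h0 = comb(flips, heads) * 0.5**flips
    # Under a uniform prior on the bias, every head count 0..n is
    # equally likely a priori, so the marginal likelihood is 1/(n+1).
    marginal_h1 = 1 / (flips + 1)
    return marginal_h0 / marginal_h1

# 60 heads in 100 flips: the two-sided p value sits near 0.05,
# yet the Bayes factor still slightly favours the fair coin.
print(bayes_factor_01(60, 100))
```

Interestingly, for 60 heads in 100 flips the Bayes factor comes out a little above 1, mildly favouring the fair coin even though the frequentist p value hovers near the 0.05 threshold, a small-scale version of the disagreement between the two schools of thought.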