p-value
What It Is
The p-value is derived from the results of a statistical test that assumes the null hypothesis is true. A test statistic is calculated from the collected data using the formula for that statistical test. The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
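To make the definition concrete, here is a minimal sketch of turning a test statistic into a p-value. It assumes the test statistic follows a standard normal distribution when the null hypothesis is true (a z-test) and that "at least as extreme" means extreme in either direction (a two-sided test). The function name two_sided_p_from_z is made up for this illustration; other tests use other distributions.

```python
import math


def two_sided_p_from_z(z: float) -> float:
    """Probability that a standard normal value is at least as far from 0 as z.

    Assumption: the test statistic is standard normal when the null
    hypothesis is true, and both tails count as "extreme".
    """
    # For Z ~ N(0, 1), P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    for z in (0.5, 1.96, 3.0):
        print(f"test statistic z = {z:4.2f}  ->  p-value = {two_sided_p_from_z(z):.4f}")
```

A test statistic of about 1.96 sits right at the traditional 0.05 cutoff; values farther from zero give smaller p-values.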
In classical statistical testing, if the p-value is small (the traditional cutoff being below 0.05, or 5%), we conclude that the observed test statistic would be unusual if the null hypothesis were true. In these cases, investigators may decide to reject the null hypothesis.
The p-value is a conditional probability: it is calculated under the assumption that the null hypothesis is true. It does not, repeat, does not indicate the probability that the null hypothesis is true.
The graph below shows what to expect when the null hypothesis is true and the test statistic follows a standard normal distribution with mean = 0 and standard deviation = 1 (the thick black line). Most of the time the statistical test will return values close to zero (i.e., the “hump” in the middle). Sometimes, though, the statistical test will return a value “distant” from zero, for example, to the left of the blue line at -2 on the x-axis or to the right of the orange line at 2. The null hypothesis is still true; we just got an atypical result. If we get one of those atypical results when the null hypothesis is true but decide the null hypothesis is false, we have made a type I error.
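How often do those atypical results turn up? The sketch below works it out two ways for the cutoffs at -2 and 2: an exact calculation from the standard normal distribution, and a simulation that draws many test statistics while the null hypothesis is true. The function names, the number of trials, and the random seed are illustrative choices, not part of any standard procedure.

```python
import math
import random


def tail_probability(cutoff: float) -> float:
    """Exact P(|Z| > cutoff) for a standard normal Z (null hypothesis true)."""
    return math.erfc(cutoff / math.sqrt(2))


def simulated_type_i_rate(cutoff: float, n_trials: int = 100_000, seed: int = 0) -> float:
    """Share of simulated null-hypothesis test statistics that land beyond +/- cutoff."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    beyond = sum(1 for _ in range(n_trials) if abs(rng.gauss(0.0, 1.0)) > cutoff)
    return beyond / n_trials


if __name__ == "__main__":
    print(f"exact chance of landing beyond +/-2: {tail_probability(2.0):.4f}")
    print(f"simulated share beyond +/-2:         {simulated_type_i_rate(2.0):.4f}")
```

Both numbers come out a little under 5%. If we reject the null hypothesis whenever the test statistic falls outside -2 to 2, that is roughly how often we would make a type I error when the null hypothesis is true.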
There’s another complication with the test statistic and its p-value: they are not influenced only by the difference in outcomes between your study groups. The inherent variability of the data partly determines the value of the test statistic, as does the number of people in each group. In particular, by collecting data on large enough numbers of people (or objects), you can all but guarantee that your test statistic will be extreme and the p-value very small, even if the difference between your study groups is too small to be relevant.
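A small sketch shows how group size alone can drive the p-value. It holds the observed difference between groups fixed at a tiny 0.02 standard deviations and only changes the number of people per group. It assumes a two-sample z-test with a known, common standard deviation and equal group sizes; the specific numbers and function names are invented for illustration.

```python
import math


def two_sample_z(diff: float, sd: float, n_per_group: int) -> float:
    """z statistic for a difference in means, assuming a known common sd
    and the same number of people in each group."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return diff / standard_error


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


if __name__ == "__main__":
    # Same tiny observed difference every time; only the group size changes.
    for n in (50, 500, 5_000, 50_000, 500_000):
        z = two_sample_z(diff=0.02, sd=1.0, n_per_group=n)
        print(f"n per group = {n:>7}  z = {z:6.2f}  p-value = {two_sided_p(z):.3g}")
```

The difference never changes, yet the p-value slides from "nothing to see here" to vanishingly small as the groups grow.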
On the other hand, if you collect data on too small a number of people, it will be very difficult for your test statistic to be extreme and your p-value small, even if there is an important difference between study groups. You may observe a fairly large difference between groups and still get a large p-value. That presents a quandary because your results aren’t precise (read: easily repeatable). Another study with larger numbers of people might produce a much smaller difference between study groups and once again a “large” p-value, or it might show a similar difference and this time a very small p-value. There’s no way to tell based on the first study.
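To see the imprecision of small studies, the sketch below repeats the same imaginary study many times, first with 10 people per group and then with 200. In both cases the true difference between groups is set to half a standard deviation. Everything here is an assumption made for illustration: normally distributed data, a known standard deviation of 1 (so a z-test applies), the effect size, the number of repeated studies, and the random seed.

```python
import math
import random
import statistics


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_studies(n_per_group, true_diff=0.5, sd=1.0, n_studies=2000, seed=1):
    """Repeat one two-group study many times.

    Returns (smallest observed difference, largest observed difference,
    share of studies with p < 0.05).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    diffs = []
    significant = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        diffs.append(diff)
        if two_sided_p(diff / standard_error) < 0.05:
            significant += 1
    return min(diffs), max(diffs), significant / n_studies


if __name__ == "__main__":
    for n in (10, 200):
        lowest, highest, share = simulate_studies(n)
        print(f"n per group = {n:>3}: observed differences from {lowest:.2f} "
              f"to {highest:.2f}; share with p < 0.05 = {share:.2f}")
```

With 10 people per group, the observed difference swings widely from one study to the next and most studies miss the 0.05 cutoff even though the difference is real. With 200 per group, the observed differences cluster near the true value and nearly every study produces a small p-value.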
Why It’s Important
It may seem like the p-value isn’t much good to us, since it is calculated for a situation where the null hypothesis is true, and whether the null hypothesis is true is exactly what we’re trying to find out. It would be great if a statistical test would just give us the answer to that question. Unfortunately, it doesn’t work that way. We are looking at only a part (sample) of all the evidence (data) that is out there, so there will always be some uncertainty built into the analysis.
The statistical test and its p-value help to bring some objectivity to our investigation. People see what they want to see in data. The statistical test and its p-value allow us to set up a rule ahead of time, and then apply it to the data collected. They give you and the readers of your study’s results a basis for interpreting those results through the same lens. Setting your p-value cutoff low also helps to limit the risk of a type I error, that is, deciding the null hypothesis is false when it’s really true.
Of course, there are shenanigans that go on with the p-value, but that’s not the p-value’s fault. It’s perhaps unfortunate that we call these statistical processes “tests,” as if they provide a final or one right answer. It’s better to think of the p-value as a “clue” found in your data, or a piece of the puzzle.