Common pitfalls with p-values and semi-solutions.

Requirement: This blog assumes you have at least heard of hypothesis testing and p-values before. If you have never heard of either, this is a good place to start. The Wikipedia article on p-value would also help.


The simplified common procedure in a hypothesis test is the following:

  1. Define the null and alternative hypotheses.
  2. Calculate the test statistic and p-value based on the data.
  3. Compare the p-value against a chosen Type I error rate (commonly 0.05) to decide whether the null hypothesis should be rejected or not.

If the p-value is smaller than 0.05, then the null hypothesis is rejected.

If the p-value is bigger than 0.05, then there is not sufficient evidence to reject the null hypothesis.
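
As a rough illustration of the three steps and the two outcomes above, here is a minimal Python sketch. It assumes a two-sided one-sample z-test with a known standard deviation, which this post does not specify. The sample values, `mu0`, `sigma`, and the helper names `z_test_p_value` and `decide` are all made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sigma.

    Returns the test statistic z and the two-sided p-value.
    """
    n = len(sample)
    sample_mean = sum(sample) / n
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    # Probability, under H0, of a statistic at least this extreme in either tail.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def decide(p_value, alpha=0.05):
    """Step 3: compare the p-value with the chosen Type I error rate."""
    if p_value < alpha:
        return "reject the null hypothesis"
    # Note the wording: this is not the same as accepting the null hypothesis.
    return "fail to reject the null hypothesis"


# Hypothetical measurements; H0 says the population mean is 5.0.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.7, 5.2, 5.9]
z, p = z_test_p_value(sample, mu0=5.0, sigma=0.5)
print(f"z = {z:.3f}, p-value = {p:.4f}: {decide(p)}")
```

With a different test (a t-test, for instance) only the computation of the p-value changes; the decision step stays the same.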

Notice the wording for the case where the p-value is bigger than 0.05: I never state that the null hypothesis is accepted; I only state that there is not sufficient evidence to reject it. This statement is commonly rephrased as “fail to reject the null hypothesis”.

Yes. Yes. It is so confusing and frustrating. The statement makes us feel like we have done nothing going through all that trouble. I admit this statement has a pessimistic feeling to it, but I digress.

There is a sensible reason for that pessimistic statement.

“Fail to reject” and “accept” are quite different if you really think about it. Treating a failure to reject as acceptance is a basic logical fallacy called “argument from ignorance” (see the Wikipedia article for an explanation of argument from ignorance).

Is there anything we could do to determine if we could actually accept the null hypothesis with a certain level of confidence? Yes. It is called the power of the statistical test (sensitivity). 

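The power of the statistical test (sensitivity) is the probability that we reject the null hypothesis when the null hypothesis is false, that is, when the effect is really there. If a test with high power still fails to reject, the absence of an effect becomes a much more credible explanation.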

The calculation is simple: you repeat the statistical test many times on data generated with a false null hypothesis, that is, with a specific effect built in. Then you count the number of times the test rejects the null hypothesis and divide by the total number of tests you ran. Remember that this calculation depends on the choice of 0.05 as the Type I error rate. You could raise that rate to improve the power of your test, at the cost of more false rejections when the null hypothesis is true.
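
A Monte Carlo version of this repeated-testing calculation might look like the sketch below. It assumes a two-sided z-test on normally distributed data with known standard deviation. The effect size of 0.5, the sample size of 25, the number of trials, and the function names are all hypothetical choices for illustration. It estimates the rejection rate when the null hypothesis is true (which should land near the Type I error rate) and when it is false (the power in the usual sense).

```python
import random
from math import sqrt
from statistics import NormalDist


def rejects_null(sample, mu0, sigma, alpha):
    """Run one two-sided z-test and report whether H0: mean == mu0 is rejected."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def rejection_rate(true_mean, mu0=0.0, sigma=1.0, n=25, alpha=0.05,
                   trials=4000, seed=0):
    """Fraction of simulated data sets on which the test rejects H0."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if rejects_null(sample, mu0, sigma, alpha):
            rejections += 1
    return rejections / trials


# Null hypothesis true: the rejection rate estimates the Type I error rate.
type_i_rate = rejection_rate(true_mean=0.0)
# Null hypothesis false (true mean 0.5): the rejection rate estimates the power.
power = rejection_rate(true_mean=0.5)
# Same data, looser Type I error rate: power goes up.
power_loose = rejection_rate(true_mean=0.5, alpha=0.10)

print(f"estimated Type I error rate: {type_i_rate:.3f}")
print(f"estimated power at alpha=0.05: {power:.3f}")
print(f"estimated power at alpha=0.10: {power_loose:.3f}")
```

Because the same seed is reused, the two power estimates are computed on identical simulated data, so the comparison between the two Type I error rates is not blurred by simulation noise.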

How does the power of the statistical test (sensitivity) relate to Type I error? We will talk about ROC curves in the next blog with some examples in R code.

If you also know something about machine learning, then you will understand why some people use AUC (Area Under the Curve) as the performance metric instead of accuracy in classification. The curve referred to in Area Under the Curve is the ROC curve.

Here is a peek at what an ROC curve generally looks like.
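
To get a feel for the shape numerically, the sketch below traces an ROC curve for a hypothesis test by sweeping the Type I error rate from 0 to 1 and computing the matching power. It assumes a one-sided z-test, where the power has a closed form. The `shift` value of 1.5 (effect size times the square root of the sample size, divided by the standard deviation) and all function names are made-up choices for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided(alpha, shift):
    """Power of a one-sided z-test at Type I error rate alpha.

    shift = effect * sqrt(n) / sigma is how far the alternative moves
    the mean of the test statistic away from zero.
    """
    if alpha <= 0.0:
        return 0.0  # never reject: no false positives, no detections
    if alpha >= 1.0:
        return 1.0  # always reject
    critical = STD_NORMAL.inv_cdf(1 - alpha)
    return 1 - STD_NORMAL.cdf(critical - shift)


def roc_points(shift, steps=200):
    """(Type I error rate, power) pairs on an even grid of alpha values."""
    alphas = [i / steps for i in range(steps + 1)]
    return [(a, power_one_sided(a, shift)) for a in alphas]


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


points = roc_points(shift=1.5)
for alpha, power in points[::40]:
    print(f"Type I error rate {alpha:.2f} -> power {power:.3f}")
print(f"AUC: {auc(points):.3f}")
```

With `shift=0.0` the test has no ability to detect anything, the curve collapses to the diagonal, and the AUC is 0.5; larger shifts bend the curve toward the top-left corner.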



Added note: there are some common misunderstandings about p-values. See this blog to read more on p-value.


(Three days later I realized that I had made a mistake: the sensitivity should be the true positive rate, that is, the probability of rejecting the null hypothesis when the effect is there. I have been using the phrase “power of statistical test” more loosely as the probability of detection if the effect is there, and some of the stationarity tests I used recently have the reversed null hypothesis, which led to my mistake here. I apologize for any inconvenience.)