# 17 Statistical Hypothesis Tests in Python (Cheat Sheet)


#### Quick-reference guide to the 17 statistical hypothesis tests that you need in applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

• The name of the test.
• What the test is checking.
• The key assumptions of the test.
• How the test result is interpreted.
• Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.


Let’s get started.

• Update Nov/2018: Added a better overview of the tests covered.
• Update Nov/2019: Added complete working examples of each test. Added time series tests.

Statistical Hypothesis Tests in Python Cheat Sheet
Photo by davemichuda, some rights reserved.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

1. Normality Tests
    1. Shapiro-Wilk Test
    2. D’Agostino’s K^2 Test
    3. Anderson-Darling Test
2. Correlation Tests
    1. Pearson’s Correlation Coefficient
    2. Spearman’s Rank Correlation
    3. Kendall’s Rank Correlation
    4. Chi-Squared Test
3. Stationary Tests
    1. Augmented Dickey-Fuller
    2. Kwiatkowski-Phillips-Schmidt-Shin
4. Parametric Statistical Hypothesis Tests
    1. Student’s t-test
    2. Paired Student’s t-test
    3. Analysis of Variance Test (ANOVA)
    4. Repeated Measures ANOVA Test
5. Nonparametric Statistical Hypothesis Tests
    1. Mann-Whitney U Test
    2. Wilcoxon Signed-Rank Test
    3. Kruskal-Wallis H Test
    4. Friedman Test

## 1. Normality Tests

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

### Shapiro-Wilk Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

• Observations in each sample are independent and identically distributed (iid).

Interpretation

• H0: the sample has a Gaussian distribution.
• H1: the sample does not have a Gaussian distribution.

Python Code
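
A minimal sketch of the Shapiro-Wilk test using `scipy.stats.shapiro`. The sample values are made up for illustration, and the 0.05 threshold is a conventional choice rather than part of the test.

```python
# Shapiro-Wilk normality test (illustrative data, not real measurements)
from scipy.stats import shapiro

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

stat, p = shapiro(data)
print("stat=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably Gaussian")
else:
    print("Probably not Gaussian")
```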

### D’Agostino’s K^2 Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

• Observations in each sample are independent and identically distributed (iid).

Interpretation

• H0: the sample has a Gaussian distribution.
• H1: the sample does not have a Gaussian distribution.

Python Code
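
A minimal sketch of D’Agostino’s K^2 test using `scipy.stats.normaltest`. The sample is made up, and it holds 20 values because SciPy warns that the kurtosis part of the test is only valid for 20 or more observations.

```python
# D'Agostino's K^2 normality test (illustrative data, not real measurements)
from scipy.stats import normaltest

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869,
        0.412, -0.334, 1.205, -0.718, 0.096, 1.641, -0.527, 0.283, -1.102, 0.749]

stat, p = normaltest(data)
print("stat=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably Gaussian")
else:
    print("Probably not Gaussian")
```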

### Anderson-Darling Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

• Observations in each sample are independent and identically distributed (iid).

Interpretation

• H0: the sample has a Gaussian distribution.
• H1: the sample does not have a Gaussian distribution.

Python Code
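
A minimal sketch of the Anderson-Darling test using `scipy.stats.anderson`. Unlike the two tests above, the result is read here by comparing the statistic with a critical value at each significance level rather than from a single p-value. The sample values are made up.

```python
# Anderson-Darling normality test (illustrative data, not real measurements)
from scipy.stats import anderson

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

result = anderson(data)  # tests against the normal distribution by default
print("stat=%.3f" % result.statistic)

# H0 is rejected at a given level when the statistic exceeds the critical value
for level, critical in zip(result.significance_level, result.critical_values):
    if result.statistic < critical:
        print("Probably Gaussian at the %.1f%% level" % level)
    else:
        print("Probably not Gaussian at the %.1f%% level" % level)
```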

## 2. Correlation Tests

This section lists statistical tests that you can use to check if two samples are related.

### Pearson’s Correlation Coefficient

Tests whether two samples have a linear relationship.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample are normally distributed.
• Observations in each sample have the same variance.

Interpretation

• H0: the two samples are independent.
• H1: there is a dependency between the samples.

Python Code
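
A minimal sketch of Pearson’s correlation test using `scipy.stats.pearsonr`. Both samples are made up and must have the same length, since the observations are matched by position.

```python
# Pearson's correlation test (illustrative data, not real measurements)
from scipy.stats import pearsonr

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

stat, p = pearsonr(data1, data2)
print("r=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably independent")
else:
    print("Probably dependent")
```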

### Spearman’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.

Interpretation

• H0: the two samples are independent.
• H1: there is a dependency between the samples.

Python Code
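
A minimal sketch of Spearman’s rank correlation test using `scipy.stats.spearmanr`, with made-up samples of equal length.

```python
# Spearman's rank correlation test (illustrative data, not real measurements)
from scipy.stats import spearmanr

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

stat, p = spearmanr(data1, data2)
print("rho=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably independent")
else:
    print("Probably dependent")
```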

### Kendall’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.

Interpretation

• H0: the two samples are independent.
• H1: there is a dependency between the samples.

Python Code
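
A minimal sketch of Kendall’s rank correlation test using `scipy.stats.kendalltau`, with made-up samples of equal length.

```python
# Kendall's rank correlation test (illustrative data, not real measurements)
from scipy.stats import kendalltau

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [0.353, 3.517, 0.125, -7.545, -0.555, -1.536, 3.350, -1.578, -3.537, -1.579]

stat, p = kendalltau(data1, data2)
print("tau=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably independent")
else:
    print("Probably dependent")
```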

### Chi-Squared Test

Tests whether two categorical variables are related or independent.

Assumptions

• Observations used in the calculation of the contingency table are independent.
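• Enough examples in each cell of the contingency table (a common rule of thumb is an expected count of at least 5 per cell; 25 or more is a conservative target).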

Interpretation

• H0: the two samples are independent.
• H1: there is a dependency between the samples.

Python Code
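
A minimal sketch of the Chi-Squared test using `scipy.stats.chi2_contingency`. The contingency table is made up: each row is one category of the first variable and each column is one category of the second.

```python
# Chi-squared test of independence (illustrative counts, not real measurements)
from scipy.stats import chi2_contingency

table = [[10, 20, 30],
         [6, 9, 17]]

stat, p, dof, expected = chi2_contingency(table)
print("stat=%.3f, p=%.3f, dof=%d" % (stat, p, dof))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably independent")
else:
    print("Probably dependent")
```

The returned `expected` array holds the expected counts under independence, which is a convenient place to check the cell-size assumption.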

## 3. Stationary Tests

This section lists statistical tests that you can use to check if a time series is stationary or not.

### Augmented Dickey-Fuller Unit Root Test

Tests whether a time series has a unit root, e.g. has a trend or more generally is autoregressive.

Assumptions

• Observations are temporally ordered.

Interpretation

• H0: a unit root is present (series is non-stationary).
• H1: a unit root is not present (series is stationary).

Python Code

### Kwiatkowski-Phillips-Schmidt-Shin

Tests whether a time series is trend stationary or not.

Assumptions

• Observations are temporally ordered.

Interpretation

• H0: the time series is trend-stationary.
• H1: the time series is not trend-stationary (a unit root is present).

Python Code

## 4. Parametric Statistical Hypothesis Tests

This section lists statistical tests that you can use to compare data samples.

### Student’s t-test

Tests whether the means of two independent samples are significantly different.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample are normally distributed.
• Observations in each sample have the same variance.

Interpretation

• H0: the means of the samples are equal.
• H1: the means of the samples are unequal.

Python Code
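
A minimal sketch of Student’s t-test using `scipy.stats.ttest_ind`, with made-up samples. By default the function assumes equal variances; passing `equal_var=False` switches to Welch’s variant, which drops that assumption.

```python
# Student's t-test for two independent samples (illustrative data)
from scipy.stats import ttest_ind

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

stat, p = ttest_ind(data1, data2)
print("t=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably different means")
```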

### Paired Student’s t-test

Tests whether the means of two paired samples are significantly different.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample are normally distributed.
• Observations in each sample have the same variance.
• Observations across each sample are paired.

Interpretation

• H0: the means of the samples are equal.
• H1: the means of the samples are unequal.

Python Code
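
A minimal sketch of the paired Student’s t-test using `scipy.stats.ttest_rel`. The made-up samples must have the same length, because the value at each position in one list is paired with the value at the same position in the other.

```python
# Paired Student's t-test (illustrative data, not real measurements)
from scipy.stats import ttest_rel

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

stat, p = ttest_rel(data1, data2)
print("t=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably different means")
```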

### Analysis of Variance Test (ANOVA)

Tests whether the means of two or more independent samples are significantly different.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample are normally distributed.
• Observations in each sample have the same variance.

Interpretation

• H0: the means of the samples are equal.
• H1: one or more of the means of the samples are unequal.

Python Code
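
A minimal sketch of a one-way ANOVA using `scipy.stats.f_oneway`, with three made-up samples. The function accepts any number of samples as separate arguments.

```python
# One-way ANOVA across three independent samples (illustrative data)
from scipy.stats import f_oneway

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

stat, p = f_oneway(data1, data2, data3)
print("F=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably one or more different means")
```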

### Repeated Measures ANOVA Test

Tests whether the means of two or more paired samples are significantly different.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample are normally distributed.
• Observations in each sample have the same variance.
• Observations across each sample are paired.

Interpretation

• H0: the means of the samples are equal.
• H1: one or more of the means of the samples are unequal.

Python Code

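SciPy does not provide a repeated measures ANOVA. The statsmodels library offers one through its `AnovaRM` class, which expects balanced data in long format.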
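
One option outside SciPy is the `AnovaRM` class in statsmodels. The sketch below assumes statsmodels and pandas are installed and uses made-up scores. The column names `subject`, `condition`, and `score` are hypothetical labels chosen for this example, and each subject contributes exactly one score per condition, which is what `AnovaRM` requires.

```python
# Repeated measures ANOVA with statsmodels (illustrative data, hypothetical column names)
import pandas as pd
from statsmodels.stats.anova import AnovaRM

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

# Reshape to long format: one row per (subject, condition) pair
samples = {"a": data1, "b": data2, "c": data3}
rows = [
    {"subject": i, "condition": name, "score": value}
    for name, values in samples.items()
    for i, value in enumerate(values)
]
frame = pd.DataFrame(rows)

result = AnovaRM(frame, depvar="score", subject="subject", within=["condition"]).fit()
table = result.anova_table
print(table)

# The p-value for the within-subject factor sits in the "Pr > F" column
p = table["Pr > F"].iloc[0]
if p > 0.05:
    print("Probably the same mean")
else:
    print("Probably one or more different means")
```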

## 5. Nonparametric Statistical Hypothesis Tests

### Mann-Whitney U Test

Tests whether the distributions of two independent samples are equal or not.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.

Interpretation

• H0: the distributions of both samples are equal.
• H1: the distributions of both samples are not equal.

Python Code
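
A minimal sketch of the Mann-Whitney U test using `scipy.stats.mannwhitneyu`, with made-up samples. The `alternative="two-sided"` argument is passed explicitly because the default has differed between SciPy versions.

```python
# Mann-Whitney U test for two independent samples (illustrative data)
from scipy.stats import mannwhitneyu

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

stat, p = mannwhitneyu(data1, data2, alternative="two-sided")
print("U=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same distribution")
else:
    print("Probably different distributions")
```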

### Wilcoxon Signed-Rank Test

Tests whether the distributions of two paired samples are equal or not.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.
• Observations across each sample are paired.

Interpretation

• H0: the distributions of both samples are equal.
• H1: the distributions of both samples are not equal.

Python Code
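
A minimal sketch of the Wilcoxon signed-rank test using `scipy.stats.wilcoxon`. The made-up samples have equal length, and the test works on the differences between values at matching positions.

```python
# Wilcoxon signed-rank test for two paired samples (illustrative data)
from scipy.stats import wilcoxon

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]

stat, p = wilcoxon(data1, data2)
print("W=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same distribution")
else:
    print("Probably different distributions")
```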

### Kruskal-Wallis H Test

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.

Interpretation

• H0: the distributions of all samples are equal.
• H1: the distributions of one or more samples are not equal.

Python Code
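
A minimal sketch of the Kruskal-Wallis H test using `scipy.stats.kruskal`, with made-up samples. Like `f_oneway`, the function accepts two or more samples as separate arguments.

```python
# Kruskal-Wallis H test across independent samples (illustrative data)
from scipy.stats import kruskal

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

stat, p = kruskal(data1, data2, data3)
print("H=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same distribution")
else:
    print("Probably different distributions")
```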

### Friedman Test

Tests whether the distributions of two or more paired samples are equal or not.

Assumptions

• Observations in each sample are independent and identically distributed (iid).
• Observations in each sample can be ranked.
• Observations across each sample are paired.

Interpretation

• H0: the distributions of all samples are equal.
• H1: the distributions of one or more samples are not equal.

Python Code
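
A minimal sketch of the Friedman test using `scipy.stats.friedmanchisquare`, with made-up samples. SciPy requires at least three samples of equal length for this function, with values at matching positions treated as repeated measurements of the same subject.

```python
# Friedman test across three paired samples (illustrative data)
from scipy.stats import friedmanchisquare

data1 = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
data2 = [1.142, -0.432, -0.938, -0.729, -0.846, -0.157, 0.500, 1.183, -1.075, -0.169]
data3 = [-0.208, 0.696, 0.928, -1.148, -0.213, 0.229, 0.137, 0.269, -0.870, -1.204]

stat, p = friedmanchisquare(data1, data2, data3)
print("stat=%.3f, p=%.3f" % (stat, p))

# Fail to reject H0 when p is above the chosen significance level
if p > 0.05:
    print("Probably the same distribution")
else:
    print("Probably different distributions")
```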


## Summary

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

• The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
• The key assumptions for each test and how to interpret the test result.
• How to implement the test using the Python API.

Do you have any questions?

Did I miss an important statistical test or key assumption for one of the listed tests?
Let me know in the comments below.


### 41 Responses to 17 Statistical Hypothesis Tests in Python (Cheat Sheet)

1. Jonathan dunne August 17, 2018 at 7:17 am #

Hi, the list looks good. A few omissions: Fisher's exact test and Barnard's test (potentially more power than Fisher's exact test).

One note on the Anderson-Darling test: the use of p-values to determine goodness of fit has been discouraged in some fields.

• Jason Brownlee August 17, 2018 at 7:43 am #

Excellent note, thanks Jonathan.

Indeed, I think it was a journal of psychology that has adopted “estimation statistics” instead of hypothesis tests in reporting results.

2. Hitesh August 17, 2018 at 3:19 pm #

Very Very Good and Useful Article

• Jason Brownlee August 18, 2018 at 5:32 am #

Thanks, I’m happy to hear that.

3. Barrie August 17, 2018 at 9:38 pm #

Hi, thanks for this nice overview.

Some of these tests, like friedmanchisquare, expect the number of observations in the group to remain the same over time. But in practice this is not always the case.

Let's say there are 4 observations on a group of 100 people, but the size of the response from this group changes over time, with n1=100, n2=95, n3=98, n4=60 respondents.
n4 is smaller because of some external factor like bad weather.
What would be your advice on how to tackle these different respondent counts over time?

• Jason Brownlee August 18, 2018 at 5:36 am #

Good question.

Perhaps check the literature for corrections to the degrees of freedom for this situation?

4. Fredrik August 21, 2018 at 5:44 am #

Shouldn’t it say that Pearson correlation measures the linear relationship between variables? I would say that monotonic suggests, a not necessarily linear, “increasing” or “decreasing” relationship.

• Jason Brownlee August 21, 2018 at 6:23 am #

Right, Pearson is a linear relationship, nonparametric methods like Spearmans are monotonic relationships.

Thanks, fixed.

• Fredrik August 23, 2018 at 8:59 pm #

No problem. Thank you for a great blog! It has introduced me to so many interesting and useful topics.

• Jason Brownlee August 24, 2018 at 6:07 am #

Happy to hear that!

5. Anthony The Koala August 22, 2018 at 2:47 am #

Two points/questions on testing for normality of data:
(1) In the Shapiro/Wilk, D’Agostino and Anderson/Darling tests, do you use all three to be sure that your data is likely to be normally distributed? Or to put it another way, what if only one or two of the three tests indicate that the data may be Gaussian?

(2) What about using graphical means such as a histogram of the data – is it symmetrical? What about normal plots https://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm if the line is straight, then with the statistical tests described in (1), you can assess that the data may well come from a gaussian distribution.

Thank you,
Anthony of Sydney

6. Tej Yadav August 26, 2018 at 4:07 pm #

Thanks for sharing Jason.

• Jason Brownlee August 27, 2018 at 6:10 am #

I’m happy it helps!

7. Nithin November 7, 2018 at 11:23 pm #

Thanks a lot, Jason! You’re the best. I’ve been scouring the internet for a piece on practical implementation of Inferential statistics in Machine Learning for some time now!
Lots of articles with the same theory stuff going over and over again but none like this.

• Jason Brownlee November 8, 2018 at 6:08 am #

• Nithin November 8, 2018 at 11:12 pm #

Hi Jason, Statsmodels is another module that has got lots to offer but very little info on how to go about it on the web. The documentation is not as comprehensive either compared to scipy. Have you written anything on Statsmodels ? A similar article would be of great help.

8. Thomas March 29, 2019 at 10:02 pm #

Hey Jason, thank you for your awesome blog. Gave me some good introductions into unfamiliar topics!

If you're seeking completeness on easily applicable hypothesis tests like those, I suggest adding the Kolmogorov-Smirnov test, which is not that different from the Shapiro-Wilk.

• Jason Brownlee March 30, 2019 at 6:27 am #

Thanks for the suggestion Thomas.

9. Paresh April 16, 2019 at 5:17 pm #

Which methods fit classification or regression data sets? Which statistical tests are good for semi-supervised or unsupervised data sets?

• Jason Brownlee April 17, 2019 at 6:55 am #

10. Luc May 1, 2019 at 10:01 pm #

Hello,
Thank you very much for your blog !

I’m wondering how to check that “observations in each sample have the same variance” … Is there a test to check that ?

• Jason Brownlee May 2, 2019 at 8:03 am #

Great question.

You can calculate the mean and standard deviation for each interval.

You can also plot the series and visually look for increasing variance.

11. João Antônio Martins June 2, 2019 at 4:39 am #

Is there a test similar to the friedman test? which has the same characteristics “whether the distributions of two or more paired samples are equal or not”.

• Jason Brownlee June 2, 2019 at 6:45 am #

Yes, the paired student’s t-test.

12. MIAO June 27, 2019 at 3:37 pm #

Hi Jason, thank you for your nice blog. I have one question. I have two samples with different sizes (one is 102, the other is 2482), and their variances are also different. Which statistical hypothesis test is appropriate? Thank you.

• Jason Brownlee June 28, 2019 at 5:57 am #

That is a very big difference.

The test depends on the nature of the question you’re trying to answer.

13. MIAO June 28, 2019 at 5:50 pm #

Thank you. Jason. The problem I process is that: I have results of two groups, 102 features for patient group and 2482 features for healthy group, and I would like to take a significant test for the features of two groups to test if the feature is appropriate for differentiate the two groups. I am not sure which method is right for this case. Could you give me some suggestions? Thank you.

• Jason Brownlee June 29, 2019 at 6:37 am #

Sounds like you want a classification (discrimination) model, not a statistical test?

• MIAO July 1, 2019 at 10:52 am #

Yeah, I think you are right. I will use SVM to classify the features. Thank you.

14. Veetee August 6, 2019 at 1:04 am #

Hi Jason, thanks for the very useful post. Is there a variant of Friedman’s test for only two sets of measurements? I have an experiment in which two conditions were tested on the same people. I expect a semi-constant change between the two conditions, such that the ranks within blocks are expected to stay very similar.

• Jason Brownlee August 6, 2019 at 6:40 am #

Yes: Wilcoxon Signed-Rank Test

15. wishy September 6, 2019 at 10:09 pm #

Dear Sir,

I have one question. If we take a subset of a huge data set, then according to the central limit theorem the sample averages follow a normal distribution. In that case, should we consider nonparametric or parametric statistical hypothesis tests?

• Jason Brownlee September 7, 2019 at 5:29 am #

Generally nonparametric stats use ranking instead of gaussians.

16. gopal jamnal September 28, 2019 at 10:43 pm #

What is A/B testing, and how can it be useful in machine learning? Is it different from hypothesis testing?

• Jason Brownlee September 29, 2019 at 6:12 am #

More on a/b testing:
https://en.wikipedia.org/wiki/A/B_testing

It is not related to machine learning.

Instead, in machine learning, we will evaluate the performance of different machine learning algorithms, and compare the samples of performance estimates to see if the difference in performance between algorithms is significant or not.

Does that help?

17. Peiran November 14, 2019 at 8:57 am #

You can’t imagine how happy I am to find a cheat sheet like this! Thank you for the links too.

• Jason Brownlee November 14, 2019 at 1:43 pm #

Thanks, I’m happy it helps!

18. Chris Winsor December 3, 2019 at 2:23 pm #

Hi Jason –

Thank you for helping to bring the theory of statistics to everyday application !

I’m wishing you had included an example of a t-test for equivalence. This is slightly different from the standard t-test and there are many applications – for example – demonstrating version 2.0 of the ml algorithm matches version 1.0. That is actually super important for customers that don’t want to re-validate their instruments, or manufacturers that would need to answer why/if those versions perform the same as one-another.

I observe a library at
http://www.statsmodels.org/0.9.0/generated/statsmodels.stats.weightstats.ttost_paired.html#statsmodels.stats.weightstats.ttost_paired
but it doesn’t explain how to establish reasonable low and high limits.

Anyway thank you for the examples !

• Jason Brownlee December 4, 2019 at 5:28 am #

Great suggestion, thanks Chris!