#### Quick-reference guide to the 15 statistical hypothesis tests that you need in applied machine learning, with sample code in Python.

Although there are hundreds of statistical hypothesis tests that you could use, there is only a small subset that you may need to use in a machine learning project.

In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API.

Each statistical test is presented in a consistent way, including:

- The name of the test.
- What the test is checking.
- The key assumptions of the test.
- How the test result is interpreted.
- Python API for using the test.

Note, when it comes to assumptions such as the expected distribution of data or sample size, the results of a given test are likely to degrade gracefully rather than become immediately unusable if an assumption is violated.

Generally, data samples need to be representative of the domain and large enough to expose their distribution to analysis.

In some cases, the data can be corrected to meet the assumptions, such as correcting a nearly normal distribution to be normal by removing outliers, or using a correction to the degrees of freedom in a statistical test when samples have differing variance, to name two examples.
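As a rough sketch of the second example, SciPy’s `ttest_ind` accepts `equal_var=False`, which runs Welch’s t-test and adjusts the degrees of freedom instead of assuming equal variances. The two samples below are synthetic and only for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples (illustrative only): same mean, very different spread.
rng = np.random.default_rng(1)
narrow = rng.normal(loc=50.0, scale=1.0, size=30)
wide = rng.normal(loc=50.0, scale=10.0, size=30)

# Student's t-test assumes both samples have the same variance.
stat_student, p_student = ttest_ind(narrow, wide)

# Welch's t-test drops that assumption by correcting the degrees of freedom.
stat_welch, p_welch = ttest_ind(narrow, wide, equal_var=False)

print(f'Student: stat={stat_student:.3f}, p={p_student:.3f}')
print(f'Welch:   stat={stat_welch:.3f}, p={p_welch:.3f}')
```

With equal sample sizes the two statistics coincide and only the degrees of freedom (and so the p-value) change.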

Finally, there may be multiple tests for a given concern, e.g. normality. We cannot get crisp answers to questions with statistics; instead, we get probabilistic answers. As such, we can arrive at different answers to the same question by considering the question in different ways. Hence the need for multiple different tests for some questions we may have about data.

Let’s get started.

**Update Nov/2018**: Added a better overview of the tests covered.

## Tutorial Overview

This tutorial is divided into four parts; they are:

**Normality Tests**

- Shapiro-Wilk Test
- D’Agostino’s K^2 Test
- Anderson-Darling Test

**Correlation Tests**

- Pearson’s Correlation Coefficient
- Spearman’s Rank Correlation
- Kendall’s Rank Correlation
- Chi-Squared Test

**Parametric Statistical Hypothesis Tests**

- Student’s t-test
- Paired Student’s t-test
- Analysis of Variance Test (ANOVA)
- Repeated Measures ANOVA Test

**Nonparametric Statistical Hypothesis Tests**

- Mann-Whitney U Test
- Wilcoxon Signed-Rank Test
- Kruskal-Wallis H Test
- Friedman Test

## 1. Normality Tests

This section lists statistical tests that you can use to check if your data has a Gaussian distribution.

### Shapiro-Wilk Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

```python
from scipy.stats import shapiro
data = ...  # placeholder: your data sample
stat, p = shapiro(data)
```
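A minimal runnable sketch of the test and its interpretation follows. The two samples are synthetic, and the 0.05 significance level is an assumption chosen for illustration, not something the test prescribes.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic samples (illustrative only).
rng = np.random.default_rng(7)
samples = {
    'gaussian': rng.normal(loc=0.0, scale=1.0, size=200),
    'skewed': rng.exponential(scale=1.0, size=200),
}

alpha = 0.05  # assumed significance level
results = {}
for name, sample in samples.items():
    stat, p = shapiro(sample)
    results[name] = (stat, p)
    # p > alpha: no evidence against H0 (the sample looks Gaussian).
    verdict = 'fail to reject H0' if p > alpha else 'reject H0'
    print(f'{name}: stat={stat:.3f}, p={p:.3g}, {verdict}')
```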

More Information

### D’Agostino’s K^2 Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

```python
from scipy.stats import normaltest
data = ...  # placeholder: your data sample
stat, p = normaltest(data)
```

More Information

### Anderson-Darling Test

Tests whether a data sample has a Gaussian distribution.

Assumptions

- Observations in each sample are independent and identically distributed (iid).

Interpretation

- H0: the sample has a Gaussian distribution.
- H1: the sample does not have a Gaussian distribution.

Python Code

```python
from scipy.stats import anderson
data = ...  # placeholder: your data sample
result = anderson(data)
```
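Unlike the other normality tests, `anderson` does not return a single p-value here. The sketch below assumes the result exposes `statistic`, `critical_values` and `significance_level` (as in the SciPy releases this was checked against) and compares the statistic with each critical value. The data is synthetic and deliberately non-Gaussian.

```python
import numpy as np
from scipy.stats import anderson

# Synthetic, deliberately skewed sample (illustrative only).
rng = np.random.default_rng(7)
data = rng.exponential(scale=1.0, size=200)

result = anderson(data)  # tests against the normal distribution by default
print(f'stat={result.statistic:.3f}')

# significance_level is given in percent; reject H0 at a level when the
# statistic exceeds the matching critical value.
for level, critical in zip(result.significance_level, result.critical_values):
    verdict = 'reject H0' if result.statistic > critical else 'fail to reject H0'
    print(f'alpha={level:.1f}%: critical={critical:.3f}, {verdict}')
```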

More Information

## 2. Correlation Tests

This section lists statistical tests that you can use to check if two samples are related.

### Pearson’s Correlation Coefficient

Tests whether two samples have a linear relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

```python
from scipy.stats import pearsonr
data1, data2 = ...  # placeholder: two samples of equal length
corr, p = pearsonr(data1, data2)
```
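For a concrete feel, here is a small sketch on synthetic data with a linear trend plus Gaussian noise. The slope, noise level and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (illustrative only): linear trend plus noise.
rng = np.random.default_rng(3)
data1 = np.arange(50, dtype=float)
data2 = 2.0 * data1 + 1.0 + rng.normal(scale=5.0, size=50)

corr, p = pearsonr(data1, data2)
print(f'corr={corr:.3f}, p={p:.3g}')

# A small p-value suggests rejecting H0 (independence).
if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
```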

More Information

### Spearman’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

```python
from scipy.stats import spearmanr
data1, data2 = ...  # placeholder: two samples of equal length
corr, p = spearmanr(data1, data2)
```
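The difference between a monotonic and a linear relationship can be sketched with a synthetic, strictly increasing but nonlinear pair of samples. Spearman’s coefficient works on ranks, so it reports a perfect relationship, while Pearson’s coefficient on the same data is lower.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic data (illustrative only): data2 grows exponentially with data1,
# so the relationship is monotonic but not linear.
data1 = np.arange(1, 21, dtype=float)
data2 = np.exp(data1 / 4.0)

rho, p_rho = spearmanr(data1, data2)
r, p_r = pearsonr(data1, data2)

# The ranks of the two samples agree exactly, so rho is 1 (up to rounding).
print(f'Spearman rho={rho:.3f}')
print(f'Pearson r={r:.3f}')
```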

More Information

### Kendall’s Rank Correlation

Tests whether two samples have a monotonic relationship.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

```python
from scipy.stats import kendalltau
data1, data2 = ...  # placeholder: two samples of equal length
corr, p = kendalltau(data1, data2)
```

More Information

### Chi-Squared Test

Tests whether two categorical variables are related or independent.

Assumptions

- Observations used in the calculation of the contingency table are independent.
- 25 or more examples in each cell of the contingency table.

Interpretation

- H0: the two samples are independent.
- H1: there is a dependency between the samples.

Python Code

```python
from scipy.stats import chi2_contingency
table = ...  # placeholder: contingency table of observed counts
stat, p, dof, expected = chi2_contingency(table)
```
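A minimal sketch with a made-up 2x3 contingency table follows. The counts are hypothetical: rows are two groups and columns are three categories.

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 2 groups (rows) x 3 categories (columns).
table = [
    [60, 35, 25],
    [30, 55, 35],
]

stat, p, dof, expected = chi2_contingency(table)
print(f'stat={stat:.3f}, p={p:.3g}, dof={dof}')
print(expected)  # counts expected under H0 (independence)

if p > 0.05:
    print('Probably independent')
else:
    print('Probably dependent')
```

The degrees of freedom are (rows - 1) * (columns - 1), which is 2 for this table.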

More Information

## 3. Parametric Statistical Hypothesis Tests

This section lists statistical tests that you can use to compare data samples.

### Student’s t-test

Tests whether the means of two independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

```python
from scipy.stats import ttest_ind
data1, data2 = ...  # placeholder: two independent samples
stat, p = ttest_ind(data1, data2)
```

More Information

### Paired Student’s t-test

Tests whether the means of two paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: the means of the samples are unequal.

Python Code

```python
from scipy.stats import ttest_rel
data1, data2 = ...  # placeholder: two paired samples of equal length
stat, p = ttest_rel(data1, data2)
```
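To make the pairing concrete, the sketch below uses hypothetical before/after scores for the same ten subjects. It also shows that the paired test gives the same result as a one-sample t-test on the per-subject differences, which is one way to think about what “paired” means.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Hypothetical scores for the same 10 subjects (illustrative only).
before = np.array([72, 68, 75, 80, 66, 71, 77, 69, 74, 70], dtype=float)
after = np.array([75, 70, 79, 83, 68, 75, 80, 71, 78, 73], dtype=float)

stat, p = ttest_rel(before, after)
print(f'paired: stat={stat:.3f}, p={p:.3g}')

# Equivalent view: test whether the mean per-subject difference is zero.
stat_diff, p_diff = ttest_1samp(before - after, 0.0)
print(f'one-sample on differences: stat={stat_diff:.3f}, p={p_diff:.3g}')
```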

More Information

### Analysis of Variance Test (ANOVA)

Tests whether the means of two or more independent samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

```python
from scipy.stats import f_oneway
data1, data2, ... = ...  # placeholder: two or more independent samples
stat, p = f_oneway(data1, data2, ...)
```
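Since `f_oneway` takes each sample as a separate argument, a list of samples can be unpacked with `*`. The sketch below uses three synthetic groups, one of which has a shifted mean. The group means and seed are arbitrary.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic groups (illustrative only): the third group has a higher mean.
rng = np.random.default_rng(5)
groups = [
    rng.normal(loc=0.0, scale=1.0, size=30),
    rng.normal(loc=0.0, scale=1.0, size=30),
    rng.normal(loc=3.0, scale=1.0, size=30),
]

# Unpack the list so each group is passed as its own argument.
stat, p = f_oneway(*groups)
print(f'stat={stat:.3f}, p={p:.3g}')

if p > 0.05:
    print('Probably the same means')
else:
    print('At least one mean probably differs')
```

A significant result does not say which group differs; that needs a follow-up comparison.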

More Information

### Repeated Measures ANOVA Test

Tests whether the means of two or more paired samples are significantly different.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
- Observations across each sample are paired.

Interpretation

- H0: the means of the samples are equal.
- H1: one or more of the means of the samples are unequal.

Python Code

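Not available in SciPy. The statsmodels library provides a repeated measures ANOVA through the `AnovaRM` class in `statsmodels.stats.anova`.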

More Information

## 4. Nonparametric Statistical Hypothesis Tests

### Mann-Whitney U Test

Tests whether the distributions of two independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

```python
from scipy.stats import mannwhitneyu
data1, data2 = ...  # placeholder: two independent samples
stat, p = mannwhitneyu(data1, data2)
```
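One detail worth a sketch: older SciPy releases defaulted to a one-sided p-value for this test, so passing `alternative='two-sided'` explicitly keeps the result consistent with the H0/H1 stated above. The two samples are made up and chosen so that they do not overlap.

```python
from scipy.stats import mannwhitneyu

# Hypothetical samples (illustrative only): every value in data2 is larger
# than every value in data1.
data1 = [1.2, 2.3, 0.8, 3.1, 1.9, 2.7, 0.5, 1.4, 2.0, 3.4]
data2 = [4.1, 5.6, 3.9, 6.2, 4.8, 7.3, 5.0, 4.4, 6.7, 5.9]

# Ask for the two-sided test explicitly rather than relying on the default.
stat, p = mannwhitneyu(data1, data2, alternative='two-sided')
print(f'stat={stat:.3f}, p={p:.3g}')

if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
```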

More Information

### Wilcoxon Signed-Rank Test

Tests whether the distributions of two paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of both samples are equal.
- H1: the distributions of both samples are not equal.

Python Code

```python
from scipy.stats import wilcoxon
data1, data2 = ...  # placeholder: two paired samples of equal length
stat, p = wilcoxon(data1, data2)
```
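A small sketch with hypothetical paired measurements follows. Each subject’s second measurement is the first plus a positive gain, so all paired differences point the same way.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired measurements for 10 subjects (illustrative only).
before = np.array([72.0, 68.5, 75.2, 80.1, 66.3, 71.8, 77.4, 69.9, 74.6, 70.7])
gain = np.array([1.1, 2.3, 0.7, 3.2, 1.8, 2.9, 0.4, 1.5, 2.6, 3.5])
after = before + gain

# The test ranks the absolute paired differences and compares the rank sums
# of the positive and negative differences.
stat, p = wilcoxon(before, after)
print(f'stat={stat:.3f}, p={p:.3g}')

if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
```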

More Information

### Kruskal-Wallis H Test

Tests whether the distributions of two or more independent samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

```python
from scipy.stats import kruskal
data1, data2, ... = ...  # placeholder: two or more independent samples
stat, p = kruskal(data1, data2, ...)
```

More Information

### Friedman Test

Tests whether the distributions of two or more paired samples are equal or not.

Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample can be ranked.
- Observations across each sample are paired.

Interpretation

- H0: the distributions of all samples are equal.
- H1: the distributions of one or more samples are not equal.

Python Code

```python
from scipy.stats import friedmanchisquare
data1, data2, ... = ...  # placeholder: paired samples of equal length
stat, p = friedmanchisquare(data1, data2, ...)
```
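A minimal sketch with three hypothetical paired conditions follows. Note that SciPy’s `friedmanchisquare` appears to require at least three samples, so the example uses exactly three. The values are made up so that every subject ranks the conditions in the same order.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical measurements for the same 10 subjects under three conditions
# (illustrative only).
baseline = np.array([5.1, 6.3, 4.8, 7.2, 5.9, 6.6, 5.4, 7.0, 6.1, 4.5])
treatment_a = baseline + 1.0
treatment_b = baseline + 2.0

# Each argument is one condition; position i refers to the same subject.
stat, p = friedmanchisquare(baseline, treatment_a, treatment_b)
print(f'stat={stat:.3f}, p={p:.3g}')

if p > 0.05:
    print('Probably the same distribution')
else:
    print('Probably different distributions')
```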

More Information

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Normality Tests in Python
- How to Use Correlation to Understand the Relationship Between Variables
- How to Use Parametric Statistical Significance Tests in Python
- A Gentle Introduction to Statistical Hypothesis Tests

## Summary

In this tutorial, you discovered the key statistical hypothesis tests that you may need to use in a machine learning project.

Specifically, you learned:

- The types of tests to use in different circumstances, such as normality checking, relationships between variables, and differences between samples.
- The key assumptions for each test and how to interpret the test result.
- How to implement the test using the Python API.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Did I miss an important statistical test or key assumption for one of the listed tests?

Let me know in the comments below.

Hi, the list looks good. A few omissions: Fisher’s exact test and Barnard’s test (potentially more powerful than Fisher’s exact test).

One note on the Anderson-Darling test: the use of p-values to determine goodness of fit (GoF) has been discouraged in some fields.

Excellent note, thanks Jonathan.

Indeed, I think it was a journal of psychology that has adopted “estimation statistics” instead of hypothesis tests in reporting results.

Very Very Good and Useful Article

Thanks, I’m happy to hear that.

Hi, thanks for this nice overview.

Some of these tests, like friedmanchisquare, expect the number of observations in the group to remain the same over time. But in practice this is not always the case.

Let’s say there are 4 observations on a group of 100 people, but the size of the response from this group changes over time, with n1=100, n2=95, n3=98, n4=60 respondents.

n4 is smaller because of some external factor, like bad weather.

What would be your advice on how to tackle these different respondent sizes over time?

Good question.

Perhaps check the literature for corrections to the degrees of freedom for this situation?

Shouldn’t it say that Pearson correlation measures the linear relationship between variables? I would say that monotonic suggests, a not necessarily linear, “increasing” or “decreasing” relationship.

Right, Pearson measures a linear relationship; nonparametric methods like Spearman’s measure monotonic relationships.

Thanks, fixed.

No problem. Thank you for a great blog! It has introduced me to so many interesting and useful topics.

Happy to hear that!

Two points/questions on testing for normality of data:

(1) In the Shapiro-Wilk, D’Agostino and Anderson-Darling tests, do you use all three to be sure that your data is likely to be normally distributed? Or, to put it another way, what if only one or two of the three tests indicate that the data may be Gaussian?

(2) What about using graphical means such as a histogram of the data: is it symmetrical? What about normal probability plots (https://www.itl.nist.gov/div898/handbook/eda/section3/normprpl.htm)? If the line is straight, then together with the statistical tests described in (1), you can assess that the data may well come from a Gaussian distribution.

Thank you,

Anthony of Sydney

More on what normality tests to use here (graphical and otherwise):

https://machinelearningmastery.com/a-gentle-introduction-to-normality-tests-in-python/

Wow.. this is what I was looking for. Ready made thing for ready reference.

Thanks for sharing Jason.

I’m happy it helps!

Thanks a lot, Jason! You’re the best. I’ve been scouring the internet for a piece on practical implementation of Inferential statistics in Machine Learning for some time now!

Lots of articles with the same theory stuff going over and over again but none like this.

Thanks, I’m glad it helped.

Hi Jason, Statsmodels is another module that has got lots to offer but very little info on how to go about it on the web. The documentation is not as comprehensive either compared to scipy. Have you written anything on Statsmodels ? A similar article would be of great help.

Yes, I have many tutorials showing how to use statsmodels for time series:

https://machinelearningmastery.com/start-here/#timeseries

and statsmodels for general statistics:

https://machinelearningmastery.com/start-here/#statistical_methods

Hey Jason, thank you for your awesome blog. Gave me some good introductions into unfamiliar topics!

If you’re seeking completeness on easily applicable hypothesis tests like those, I suggest adding the Kolmogorov-Smirnov test, which is not that different from the Shapiro-Wilk.

– https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

– https://www.researchgate.net/post/Whats_the_difference_between_Kolmogorov-Smirnov_test_and_Shapiro-Wilk_test

Thanks for the suggestion Thomas.

Which methods fit classification or regression data sets? Which statistical tests are good for semi-supervised or unsupervised data sets?

This post will help:

https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/

Hello,

Thank you very much for your blog !

I’m wondering how to check that “observations in each sample have the same variance” … Is there a test to check that ?

Great question.

You can calculate the mean and standard deviation for each interval.

You can also plot the series and visually look for increasing variance.

Is there a test similar to the friedman test? which has the same characteristics “whether the distributions of two or more paired samples are equal or not”.

Yes, the paired student’s t-test.

Hi Jason, thank you for your nice blog. I have one question. I have two samples of different sizes (one is 102, the other is 2482), and their variances are also different. Which statistical hypothesis test is appropriate? Thank you.

That is a very big difference.

The test depends on the nature of the question you’re trying to answer.

Thank you, Jason. The problem I am working on is this: I have results for two groups, 102 features for the patient group and 2482 features for the healthy group, and I would like to run a significance test on the features of the two groups to check whether a feature is appropriate for differentiating the two groups. I am not sure which method is right for this case. Could you give me some suggestions? Thank you.

Sounds like you want a classification (discrimination) model, not a statistical test?

Yeah, I think you are right. I will use SVM to classify the features. Thank you.

Hi Jason, thanks for the very useful post. Is there a variant of Friedman’s test for only two sets of measurements? I have an experiment in which two conditions were tested on the same people. I expect a semi-constant change between the two conditions, such that the ranks within blocks are expected to stay very similar.

Yes: Wilcoxon Signed-Rank Test