# How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers


The choice of a statistical hypothesis test is a challenging open problem for interpreting machine learning results.

In his widely cited 1998 paper, Thomas Dietterich recommended McNemar’s test for cases where it is expensive or impractical to train multiple copies of classifier models.

This describes the current situation with deep learning models, which are very large and are trained and evaluated on large datasets, often requiring days or weeks to train a single model.

In this tutorial, you will discover how to use the McNemar’s statistical hypothesis test to compare machine learning classifier models on a single test dataset.

After completing this tutorial, you will know:

• The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
• How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
• How to calculate the McNemar’s test in Python and interpret and report the result.


Let’s get started.

## Tutorial Overview

This tutorial is divided into five parts; they are:

1. Statistical Hypothesis Tests for Deep Learning
2. Contingency Table
3. McNemar’s Test Statistic
4. Interpret the McNemar’s Test for Classifiers
5. McNemar’s Test in Python


## Statistical Hypothesis Tests for Deep Learning

In his important and widely cited 1998 paper on the use of statistical hypothesis tests to compare classifiers, titled “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms”, Thomas Dietterich recommends the use of McNemar’s test.

Specifically, the test is recommended in those cases where the algorithms that are being compared can only be evaluated once, e.g. on one test set, as opposed to repeated evaluations via a resampling technique, such as k-fold cross-validation.

For algorithms that can be executed only once, McNemar’s test is the only test with acceptable Type I error.

Specifically, Dietterich’s study was concerned with the evaluation of different statistical hypothesis tests, some operating upon the results from resampling methods. The concern of the study was low Type I error, that is, the statistical test reporting an effect when in fact no effect was present (false positive).

Statistical tests that can compare models based on a single test set are an important consideration for modern machine learning, specifically in the field of deep learning.

Deep learning models are often large and operate on very large datasets. Together, these factors can mean that the training of a model can take days or even weeks on fast modern hardware.

This precludes the practical use of resampling methods to compare models and suggests the need to use a test that can operate on the results of evaluating trained models on a single test dataset.

The McNemar’s test may be a suitable test for evaluating these large and slow-to-train deep learning models.

## Contingency Table

The McNemar’s test operates upon a contingency table.

Before we dive into the test, let’s take a moment to understand how the contingency table for two classifiers is calculated.

A contingency table is a tabulation or count of two categorical variables. In the case of the McNemar’s test, we are interested in binary variables correct/incorrect or yes/no for a control and a treatment or two cases. This is called a 2×2 contingency table.

The contingency table may not be intuitive at first glance. Let’s make it concrete with a worked example.

Consider that we have two trained classifiers. Each classifier makes a binary class prediction for each of the 10 examples in a test dataset. The predictions are evaluated and determined to be correct or incorrect.

We can then summarize these results in a table with one row per test instance and a Yes/No column per classifier recording whether its prediction was correct.

We can see that Classifier1 got 6 correct, or an accuracy of 60%, and Classifier2 got 5 correct, or 50% accuracy on the test set.
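The per-instance results can also be held as two lists of flags. The following is a minimal sketch: the exact Yes/No pattern is an illustrative assumption, chosen only so that it matches the totals stated in the text (6 correct for Classifier1, 5 for Classifier2, and instance 5 as the first one both classifiers got right). The names `clf1_correct`, `clf2_correct`, and `accuracy` are hypothetical.

```python
# Illustrative per-instance results (True = prediction was correct).
# The specific pattern is an assumption chosen to match the totals in the text.
clf1_correct = [True, False, False, False, True, True, True, False, True, True]
clf2_correct = [False, False, True, False, True, True, True, False, False, True]


def accuracy(correct_flags):
    """Fraction of test instances that were predicted correctly."""
    return sum(correct_flags) / len(correct_flags)


print('Classifier1 accuracy: %.0f%%' % (100 * accuracy(clf1_correct)))
print('Classifier2 accuracy: %.0f%%' % (100 * accuracy(clf2_correct)))
```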

The table can now be reduced to a contingency table.

The contingency table relies on the fact that both classifiers were trained on exactly the same training data and evaluated on exactly the same test data instances.

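The contingency table has the following structure, where each cell is labeled with the Classifier1/Classifier2 outcome it counts:

| | Classifier2 Correct | Classifier2 Incorrect |
|---|---|---|
| Classifier1 Correct | Yes/Yes | Yes/No |
| Classifier1 Incorrect | No/Yes | No/No |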

In the case of the first cell in the table, we must sum the total number of test instances that Classifier1 got correct and Classifier2 got correct. For example, the first instance that both classifiers predicted correctly was instance number 5. The total number of instances that both classifiers predicted correctly was 4.

Another more programmatic way to think about this is to sum each combination of Yes/No in the results table above.
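As a sketch of that programmatic view, the function below counts each Yes/No combination from two lists of per-instance flags. The function name `contingency_table` and the input lists are hypothetical; the list values are an illustrative assignment consistent with the totals in the worked example, not data taken from the original table.

```python
def contingency_table(correct1, correct2):
    """Count the four Yes/No combinations for two classifiers.

    Rows index Classifier1 (correct, incorrect) and
    columns index Classifier2 (correct, incorrect).
    """
    table = [[0, 0], [0, 0]]
    for c1, c2 in zip(correct1, correct2):
        row = 0 if c1 else 1
        col = 0 if c2 else 1
        table[row][col] += 1
    return table


# Illustrative flags (True = correct); the pattern is an assumption.
clf1_correct = [True, False, False, False, True, True, True, False, True, True]
clf2_correct = [False, False, True, False, True, True, True, False, False, True]

table = contingency_table(clf1_correct, clf2_correct)
print(table)
```

The cell `table[0][1]` holds the Yes/No count and `table[1][0]` holds the No/Yes count, which are the only two cells the test statistic uses.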

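The results organized into a contingency table are as follows. Both classifiers were correct on 4 instances, so Classifier1 alone was correct on 6 - 4 = 2, Classifier2 alone was correct on 5 - 4 = 1, and the remaining 3 instances were missed by both:

| | Classifier2 Correct | Classifier2 Incorrect |
|---|---|---|
| Classifier1 Correct | 4 | 2 |
| Classifier1 Incorrect | 1 | 3 |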

## McNemar’s Test Statistic

McNemar’s test is a paired nonparametric or distribution-free statistical hypothesis test.

It is also less intuitive than some other statistical hypothesis tests.

The McNemar’s test is checking if the disagreements between two cases match. Technically, this is referred to as the homogeneity of the contingency table (specifically the marginal homogeneity). Therefore, the McNemar’s test is a type of homogeneity test for contingency tables.

The test is widely used in medicine to compare the effect of a treatment against a control.

In terms of comparing two binary classification algorithms, the test is commenting on whether the two models disagree in the same way (or not). It is not commenting on whether one model is more or less accurate or error prone than another. This is clear when we look at how the statistic is calculated.

The McNemar’s test statistic is calculated as:

$$
\text{statistic} = \frac{(\text{Yes/No} - \text{No/Yes})^2}{\text{Yes/No} + \text{No/Yes}}
$$

Where Yes/No is the count of test instances that Classifier1 got correct and Classifier2 got incorrect, and No/Yes is the count of test instances that Classifier1 got incorrect and Classifier2 got correct.

This calculation of the test statistic assumes that each cell in the contingency table used in the calculation has a count of at least 25. The test statistic has a Chi-Squared distribution with 1 degree of freedom.
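A minimal sketch of this calculation follows, using only the standard library. It computes the squared difference of the two disagreement counts divided by their sum, then the upper-tail probability of a Chi-Squared distribution with 1 degree of freedom. The function names and the example counts (40 and 25) are hypothetical, and no continuity correction is applied here, so results may differ from a library routine that applies one by default.

```python
from math import erfc, sqrt


def mcnemar_statistic(yes_no, no_yes):
    """Uncorrected McNemar statistic from the two disagreement cells."""
    return (yes_no - no_yes) ** 2 / (yes_no + no_yes)


def chi2_sf_1dof(x):
    """Upper-tail probability of a Chi-Squared distribution with 1 degree of freedom.

    For 1 degree of freedom, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(x / 2.0))


# Hypothetical disagreement counts, both at least 25.
yes_no, no_yes = 40, 25
stat = mcnemar_statistic(yes_no, no_yes)
p = chi2_sf_1dof(stat)
print('statistic=%.3f, p-value=%.3f' % (stat, p))
```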

We can see that only two elements of the contingency table are used, specifically that the Yes/Yes and No/No elements are not used in the calculation of the test statistic. As such, we can see that the statistic is reporting on the different correct or incorrect predictions between the two models, not the accuracy or error rates. This is important to understand when making claims about the finding of the statistic.

The default assumption, or null hypothesis, of the test is that the two cases disagree to the same amount. If the null hypothesis is rejected, there is evidence that the cases disagree in different ways, that is, that the disagreements are skewed.

Given the selection of a significance level, the p-value calculated by the test can be interpreted as follows:

• p > alpha: fail to reject H0, no difference in the disagreement (e.g. treatment had no effect).
• p <= alpha: reject H0, significant difference in the disagreement (e.g. treatment had an effect).
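The decision rule above can be written as a small helper. This is only a sketch: the function name `interpret` and the default significance level of 0.05 are assumptions made for illustration.

```python
def interpret(p_value, alpha=0.05):
    """Map a p-value to the decision rule for McNemar's test."""
    if p_value > alpha:
        return 'fail to reject H0: no difference in the disagreement'
    return 'reject H0: significant difference in the disagreement'


print(interpret(0.30))
print(interpret(0.01))
```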

## Interpret the McNemar’s Test for Classifiers

It is important to take a moment to clearly understand how to interpret the result of the test in the context of two machine learning classifier models.

The two terms used in the calculation of the McNemar’s test, the No/Yes and Yes/No cells in the contingency table, capture the errors made by one model but not the other. The test checks if there is a significant difference between the counts in these two cells. That is all.

If these cells have counts that are similar, it shows us that both models make errors in much the same proportion, just on different instances of the test set. In this case, the result of the test would not be significant and the null hypothesis would not be rejected.

Under the null hypothesis, the two algorithms should have the same error rate …

If these cells have counts that are not similar, it shows that both models not only make different errors, but in fact have a different relative proportion of errors on the test set. In this case, the result of the test would be significant and we would reject the null hypothesis.

So we may reject the null hypothesis in favor of the hypothesis that the two algorithms have different performance when trained on the particular training set.

We can summarize this as follows:

• Fail to Reject Null Hypothesis: Classifiers have a similar proportion of errors on the test set.
• Reject Null Hypothesis: Classifiers have a different proportion of errors on the test set.

After performing the test and finding a significant result, it may be useful to report an effect size measure in order to quantify the finding. For example, a natural choice would be to report the odds ratio, or the contingency table itself, although both of these assume a sophisticated reader.

It may be useful to report the difference in error between the two classifiers on the test set. In this case, be careful with your claims, as the significance test does not report on the difference in error between the models, only the relative difference in the proportion of error between the models.
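As a rough sketch of such a report, the helper below derives each classifier's error rate and their difference from the contingency table, along with the ratio of the two disagreement counts, which is one common way to express an odds ratio for paired data. The function name `report_effect` is hypothetical, the choice of odds ratio is an assumption rather than something prescribed by the text, and the table is the one from the worked example.

```python
def report_effect(table):
    """Summarize a 2x2 table laid out as [[Yes/Yes, Yes/No], [No/Yes, No/No]]."""
    (yes_yes, yes_no), (no_yes, no_no) = table
    n = yes_yes + yes_no + no_yes + no_no
    # Classifier1 is wrong in the second row; Classifier2 is wrong in the second column.
    error1 = (no_yes + no_no) / n
    error2 = (yes_no + no_no) / n
    # Ratio of the disagreement cells (assumed effect measure for paired data).
    odds_ratio = yes_no / no_yes if no_yes else float('inf')
    return error1, error2, error1 - error2, odds_ratio


error1, error2, diff, odds_ratio = report_effect([[4, 2], [1, 3]])
print('error1=%.2f, error2=%.2f, difference=%.2f, odds ratio=%.2f'
      % (error1, error2, diff, odds_ratio))
```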

Finally, in using the McNemar’s test, Dietterich highlights two important limitations that must be considered. They are:

### 1. No Measure of Training Set or Model Variability

Generally, model behavior varies based on the specific training data used to fit the model.

This is due to both the interaction of the model with specific training instances and the use of randomness during learning. Fitting the model on multiple different training datasets and evaluating the skill, as is done with resampling methods, provides a way to measure the variance of the model.

The test is appropriate if the sources of variability are small.

Hence, McNemar’s test should only be applied if we believe these sources of variability are small.

### 2. Less Direct Comparison of Models

The two classifiers are evaluated on a single test set, and the test set is expected to be smaller than the training set.

This is different from hypothesis tests that make use of resampling methods as more, if not all, of the dataset is made available as a test set during evaluation (which introduces its own problems from a statistical perspective).

This provides less of an opportunity to compare the performance of the models. It requires that the test set is appropriately representative of the domain, often meaning that the test dataset is large.

## McNemar’s Test in Python

The McNemar’s test can be implemented in Python using the mcnemar() Statsmodels function.

The function takes the contingency table as an argument and returns a result object that holds the calculated test statistic and p-value in its statistic and pvalue attributes.

There are two ways to use the statistic depending on the amount of data.

If a cell in the table that is used in the calculation of the test statistic has a count of less than 25, then a modified version of the test is used that calculates an exact p-value using a binomial distribution. This is the default usage of the test, and corresponds to calling mcnemar() with exact=True.

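Alternately, if all cells used in the calculation of the test statistic in the contingency table have a value of 25 or more, then the standard calculation of the test can be used by calling mcnemar() with exact=False; its correction argument controls whether a continuity correction is applied.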
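To make the exact version concrete, here is a minimal standard-library sketch of a two-sided binomial test on the two disagreement cells, with a success probability of 0.5 under the null hypothesis. It illustrates the idea and is not the library's own code; the function name `mcnemar_exact_pvalue` is hypothetical.

```python
from math import comb


def mcnemar_exact_pvalue(yes_no, no_yes):
    """Two-sided exact binomial p-value from the two disagreement counts."""
    n = yes_no + no_yes
    k = min(yes_no, no_yes)
    # Probability of seeing k or fewer of the n disagreements in the smaller cell
    # when each disagreement is equally likely to fall in either cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double the one-sided tail and cap at 1.
    return min(1.0, 2.0 * tail)


# Disagreement counts from the worked example: Yes/No = 2, No/Yes = 1.
print('p-value=%.3f' % mcnemar_exact_pvalue(2, 1))
```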

We can calculate the McNemar’s test on the example contingency table described above. This contingency table has a small count in both the disagreement cells and as such the exact method must be used.

The complete example is listed below.
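The original listing is not reproduced here, so the following is a minimal sketch of what such an example could look like. It assumes statsmodels is installed and uses the counts derived in the worked example (4 both correct, 2 only Classifier1 correct, 1 only Classifier2 correct, 3 both incorrect). The significance level of 0.05 is an assumption.

```python
# Example of calculating McNemar's test with statsmodels
from statsmodels.stats.contingency_tables import mcnemar

# Contingency table: rows are Classifier1 (correct, incorrect),
# columns are Classifier2 (correct, incorrect).
table = [[4, 2],
         [1, 3]]

# The disagreement cells are small, so use the exact binomial version.
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))

# Interpret the p-value against an assumed significance level.
alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')
```

With disagreement counts of only 2 and 1, the exact p-value comes out far above 0.05, so the first branch is taken.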

Running the example calculates the statistic and p-value on the contingency table and prints the results.

We can see that the test finds no evidence of a difference in the disagreements between the two cases. The null hypothesis is not rejected.

As we are using the test to compare classifiers, we state that there is no statistically significant difference in the disagreements between the two models.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

• Find a research paper in machine learning that makes use of the McNemar’s statistical hypothesis test.
• Update the code example such that the contingency table shows a significant difference in disagreement between the two cases.
• Implement a function that will use the correct version of the McNemar’s test based on the provided contingency table.

If you explore any of these extensions, I’d love to know.


## Summary

In this tutorial, you discovered how to use the McNemar’s statistical hypothesis test to compare machine learning classifier models on a single test dataset.

Specifically, you learned:

• The recommendation of the McNemar’s test for models that are expensive to train, which suits large deep learning models.
• How to transform prediction results from two classifiers into a contingency table and how the table is used to calculate the statistic in the McNemar’s test.
• How to calculate the McNemar’s test in Python and interpret and report the result.

Do you have any questions?


### 52 Responses to How to Calculate McNemar’s Test to Compare Two Machine Learning Classifiers

1. Elie Kawerk July 26, 2018 at 5:56 pm #

Hi Jason,

Thanks for this nice post. Any practical python post about 5×2 CV + paired t-test coming soon?

Best,
Elie

2. Maryam Poortarigh July 27, 2018 at 11:30 pm #

Thank you Jason.
I’ve learnt a lot from you

• Jason Brownlee July 28, 2018 at 6:36 am #

3. Luke August 8, 2018 at 2:23 am #

Thank you for the post. I wanted to ask you about K x K contingency tables with K>2, i.e. non-binary classifiers. Can this test be applied to those, or is that a restriction for which a generalised test like Cochran’s Q should be used?

• Jason Brownlee August 8, 2018 at 6:22 am #

Not this test I believe.

• Darya January 11, 2019 at 9:59 am #

Could you reduce your multi-class labels to “correct”/“incorrect” by saying e.g. correct = the right label is in the top 3 and then use McNemar’s test like you described?

• Jason Brownlee January 12, 2019 at 5:34 am #

Perhaps, I’m not sure that the findings will be valid/sensible.

4. Erick May 28, 2019 at 9:57 am #

First of all, thank you Jason for your article. It’s helped me a lot! In addition, I wanted ask you a question:

Using the statsmodels library, could I change the condition value Alpha to 0.1, for instance, and evaluate if the pvalue is greater than or lesser than this new Alpha Value to reject or not the H0?

• Jason Brownlee May 28, 2019 at 2:43 pm #

You can use any alpha you wish; it is not coded in the statsmodels library, it is in our code.

5. JOAO ANTONIO MARTINS May 30, 2019 at 7:28 am #

Is McNemar’s Test just for comparing 2 machine learning classifier models? Or can it be used to compare more than 2?

• Jason Brownlee May 30, 2019 at 9:08 am #

Yes, just pair-wise comparisons.

• João Antônio Martins June 2, 2019 at 3:48 am #

I am developing a scientific work in the area of machine learning, but I am having difficulty finding statistical tests for machine learning classifiers that compare more than two models.

In my research I found the Friedman test, which meets the requirements. If it is not too much trouble, do you know of any other test that meets this requirement?

Thank you!

6. Wonga June 29, 2019 at 1:06 am #

Hello Jason, kindly help. How do I figure out f11, f12, f21, and f22 from my confusion matrix below. I know that in Remote Sensing, many authors have reduced a multi-class confusion matrix into a 2-by-2 matrix, but I don’t know how. See my R-code below.

classes = c("Maize","Grass","Urban","Bare_soil","Water","Forest")
Maize=c(130,13,12,0,0,12); Grass=c(40,4490,68,92,112,129); Urban=c(7,60,114,2,100,68)
Bare_soil=c(0,51,0,11,0,0); Water=c(0,5,3,4,1474,0); Forest=c(50,156,350,0,51,2396)

CM1 <- matrix(c(Maize, Grass, Urban, Bare_soil, Water, Forest), nrow = 6,
ncol = 6, byrow=TRUE, dimnames=list(classes,classes) )

Maize2=c(226,0,1,0,0,0); Grassland2=c(6,4870,4,1,0,1); Urban2=c(1,0,526,1,0,0)
Bare_soil2=c(0,2,4,137,0,0); Water2=c(0,0,1,0,1691,0); Forest2=c(0,0,0,0,0,2528)

CM2 <- matrix(c(Maize2, Grassland2, Urban2, Bare_soil2, Water2, Forest2), nrow = 6,
ncol = 6, byrow=TRUE, dimnames=list(classes,classes) )

I would like to figure out the mcnemar's 2-by-2 input matrix from this data, so I can do a statistical significance test between the 2 matrices (model predictions).

• Jason Brownlee June 29, 2019 at 6:58 am #

Sorry, I cannot debug your code for you, perhaps try posting to Stack Overflow?

I don’t see how you could reduce a n-class result to a 2×2 matrix, unless you had multiple pairwise matrices.

7. Chen Wang July 13, 2019 at 7:32 am #

Hi Jason, thank you for this wonderful post! Truly helped me a lot.

• Jason Brownlee July 14, 2019 at 7:57 am #

You’re welcome, I’m happy to hear that.

8. Salomon August 8, 2019 at 1:21 am #

Hello Jason, you truly know how to explain clearly and concisely. Out of all the articles/videos I saw explaining McNemar’s test, yours gets the prize. Thank you so much!

• Jason Brownlee August 8, 2019 at 6:35 am #

9. zeinab August 15, 2019 at 11:31 pm #

Is the data in the contingency table filled from the validation results or the test results?

• Jason Brownlee August 16, 2019 at 7:54 am #

Test results.

10. zeinab August 15, 2019 at 11:34 pm #

What should I do when the “fail to reject H0” occur?

• Jason Brownlee August 16, 2019 at 7:55 am #

Try another model or config?

11. zeinab August 16, 2019 at 1:04 am #

Which case is better: “reject H0” or “fail to reject H0”?

• Jason Brownlee August 16, 2019 at 7:56 am #

Better for what?

12. zeinab August 16, 2019 at 4:47 am #

I work on a regression problem

I try using the paired t-test using the ttest_rel() on validation loss the result is “fail to reject H0”
however, when I use the ttest_rel() on validation correlation coefficient, the result is “reject H0”

Which one should I use; the validation loss or the validation correlation?

13. zeinab August 16, 2019 at 12:29 pm #

Thus, we can not compare two models, if their statistical test is “fail to reject H0”?

• zeinab August 16, 2019 at 12:44 pm #

Does “reject H0” mean that any difference between the two models is because the two models really are different?

• Jason Brownlee August 16, 2019 at 2:13 pm #

Yes, but it is a probabilistic answer, not crisp. E.g. still chance of 5% that results are not different.

• Jason Brownlee August 16, 2019 at 2:12 pm #

No. Models are compared and a failure to reject null suggests no statistical difference between the results.

14. zeinab August 16, 2019 at 12:49 pm #

I observed that two models with a large difference in their accuracy give “reject H0”.

However, models with small differences in their accuracy results give “fail to reject H0”.

Does it mean that models must have big differences in their accuracy results in order to compare them?

• Jason Brownlee August 16, 2019 at 2:13 pm #

No, only that the difference must be statistically significant. Smaller differences may require large data samples.

15. zeinab August 16, 2019 at 9:35 pm #

Is there any reason to set alpha = 0.05?

• Jason Brownlee August 17, 2019 at 5:42 am #

Yes, to have 95% probability of no statistical fluke.

16. zeinab August 16, 2019 at 9:45 pm #

sorry, but i am confused between “reject H0” and “failure reject”. which one means that the two models are different?

• Jason Brownlee August 17, 2019 at 5:43 am #

fail to reject H0 suggests that the results are the same distribution, no difference.

Reject H0 suggests they are different.

17. zeinab August 16, 2019 at 11:15 pm #

When the test result is “fail to reject”, should I run one model again with a different seed until the result becomes “reject H0” in order to compare two models?

• Jason Brownlee August 17, 2019 at 5:44 am #

No. This does not make any sense.

Perhaps email me directly and outline what you are trying to achieve:
https://machinelearningmastery.com/contact/

18. zeinab August 17, 2019 at 5:08 pm #

Hi Jason,

plz check the correctness of the following statement:

“In order to compare two regressors, they must have the same Gaussian distribution.”

• Jason Brownlee August 18, 2019 at 6:39 am #

Where is that written exactly?

19. zeinab August 18, 2019 at 12:31 am #

I need to compare more than one regression model.

Can you please review with me the steps for performing this comparison:
1- Use a statistical test on each pair of models. If the result is “fail to reject”, then the performance of the two models is the same.
Otherwise, if the result is “reject H0”, then the performance of the two models is different, thus I have to compare them using the MSE.

• Jason Brownlee August 18, 2019 at 6:46 am #

The statistical tests are performed on the MSE scores for each model.

A McNemar’s test would not be appropriate, consider a modified paired student’s test.

20. ntinos August 26, 2019 at 1:35 am #

This post is fantastic! Well done!!!
Simple question…..Say that two observers classify 100 images into 3 classes: cat, dog, deer. How can we statistically compare the agreement between the two observers?…I guess McNemar could be used to test separately (i.e. for each class) the observers agreement, but what about the overall agreement??

• Jason Brownlee August 26, 2019 at 6:21 am #

Good question. Perhaps a chi-squared test, or the distance between two discrete distributions, perhaps a cross entropy score?

• ntinos August 27, 2019 at 10:13 pm #

I found an answer, so to help out the community:
1. For categorical variables (like in the example given), Cohen’s kappa is a suitable test.
2. For ordinal variables (e.g. low, medium. high), Weighted kappa (a Cohen’s kappa variation).
3. To compare >2 observers, Fleiss’ kappa (either for ordinal or categorical variables).

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5654219/

• ntinos August 27, 2019 at 10:51 pm #

Finally, for >2 observers and ORDINAL variables, some people say that ‘Kendall coefficient of concordance’ is more suitable than Fleiss’ kappa.

• Jason Brownlee August 28, 2019 at 6:35 am #

Thanks for sharing.

21. anne xue November 20, 2019 at 3:04 am #

Hello Jason, thanks for the post. Is it applicable to compare and select regressors? do you have a post of hypothesis test for regression tasks? thanks

• Jason Brownlee November 20, 2019 at 6:20 am #

No, this test is for classification only.

For regression, you can use any of the tests for comparing sample means. E.g. the student t-test.