A Gentle Introduction to the Chi-Squared Test for Machine Learning

By Jason Brownlee on October 31, 2019 in Statistics

A common problem in applied machine learning is determining whether input features are relevant to the outcome to be predicted.

This is the problem of feature selection.

In the case of classification problems where input variables are also categorical, we can use statistical tests to determine whether the output variable is dependent on or independent of the input variables. If the two are independent, then the input variable is a candidate for a feature that may be irrelevant to the problem and can be removed from the dataset.

Pearson’s chi-squared statistical hypothesis test is an example of a test for independence between categorical variables.

In this tutorial, you will discover the chi-squared statistical hypothesis test for quantifying the independence of pairs of categorical variables.

After completing this tutorial, you will know:

Pairs of categorical variables can be summarized using a contingency table.
The chi-squared test can compare an observed contingency table to an expected table and determine if the categorical variables are independent.
How to calculate and interpret the chi-squared test for categorical variables in Python.

Kick-start your project with my new book Statistics for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Update Jun/2018: Minor typo fix in the interpretation of the critical values from the test (thanks Andrew).
Update Oct/2019: Fixed language around factor/levels (thanks Marc)

Tutorial Overview

This tutorial is divided into 3 parts; they are:

Contingency Table
Pearson’s Chi-Squared Test
Example Chi-Squared Test

Contingency Table

A categorical variable is a variable that may take on one of a set of labels.

An example might be sex, which may be summarized as male or female. The variable or factor is ‘sex‘ and the labels or levels of the variable are ‘male‘ and ‘female‘ in this case.

We may wish to look at a summary of a categorical variable as it pertains to another categorical variable. For example, sex and interest, where interest may have the labels ‘science‘, ‘math‘, or ‘art‘. We can collect observations from people with regard to these two categorical variables; for example:

Sex,	Interest
Male,	Art
Female,	Math
Male, 	Science
Male,	Math
...

We can summarize the collected observations in a table with one variable corresponding to columns and another variable corresponding to rows. Each cell in the table corresponds to the count or frequency of observations that correspond to the row and column categories.

Historically, a table summarization of two categorical variables in this form is called a contingency table.

For example, the Sex=rows and Interest=columns table with contrived counts might look as follows:

          Science,  Math,  Art
Male           20,    30,   15
Female         20,    15,   30
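
As a minimal sketch of how such a table can be tallied from raw observations, the snippet below counts (sex, interest) pairs with Python's collections.Counter. The six observations and the helper name contingency_table are made up for illustration and are not part of the original tutorial.

```python
from collections import Counter

# Hypothetical raw observations as (sex, interest) pairs; made-up data.
observations = [
    ('Male', 'Art'), ('Female', 'Math'), ('Male', 'Science'),
    ('Male', 'Math'), ('Female', 'Art'), ('Female', 'Math'),
]

def contingency_table(pairs, row_levels, col_levels):
    """Count the observations for every (row level, column level) combination."""
    counts = Counter(pairs)
    return [[counts[(r, c)] for c in col_levels] for r in row_levels]

rows = ['Male', 'Female']
cols = ['Science', 'Math', 'Art']
table = contingency_table(observations, rows, cols)
# one row per sex, one column per interest
print(table)
```

Each inner list is one row of the table (a level of Sex), and each position within it is one column (a level of Interest).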

Karl Pearson called this a contingency table because the intent is to help determine whether one variable is contingent upon, or depends upon, the other variable. For example, does an interest in math or science depend on sex, or are they independent?

This is challenging to determine from the table alone; instead, we can use a statistical method called the Pearson’s Chi-Squared test.

Pearson’s Chi-Squared Test

The Pearson’s Chi-Squared test, or just Chi-Squared test for short, is named for Karl Pearson, although there are variations on the test.

The Chi-Squared test is a statistical hypothesis test that assumes (the null hypothesis) that the observed frequencies for a categorical variable match the expected frequencies for the categorical variable. The test calculates a statistic that, under the null hypothesis, approximately follows a chi-squared distribution, named for the Greek letter chi (χ), pronounced “ki” as in kite.

Given the Sex/Interest example above, the number of observations for a category (such as male and female) may or may not be the same. Nevertheless, we can calculate the expected frequency of observations in each Interest group and see whether the partitioning of interests by Sex results in similar or different frequencies.

The Chi-Squared test does this for a contingency table, first calculating the expected frequencies for the groups, then determining whether the division of the groups, called the observed frequencies, matches the expected frequencies.
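
Under independence, the expected count for a cell is its row total multiplied by its column total, divided by the grand total. A minimal sketch of that calculation for the contrived Sex/Interest counts follows; the helper name expected_frequencies is hypothetical and only used for illustration.

```python
def expected_frequencies(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # expected count = row total * column total / grand total
    return [[r * c / total for c in col_totals] for r in row_totals]

# contrived Sex (rows) by Interest (columns) counts
observed = [[20, 30, 15], [20, 15, 30]]
expected = expected_frequencies(observed)
print(expected)
```

Because both rows have the same total here, both rows of the expected table are identical, even though the observed rows differ.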

The result of the test is a test statistic that has a chi-squared distribution and can be interpreted to reject or fail to reject the assumption or null hypothesis that the observed and expected frequencies are the same.

When observed frequency is far from the expected frequency, the corresponding term in the sum is large; when the two are close, this term is small. Large values of X^2 indicate that observed and expected frequencies are far apart. Small values of X^2 mean the opposite: observeds are close to expecteds. So X^2 does give a measure of the distance between observed and expected frequencies.

— Page 525, Statistics, Fourth Edition, 2007.
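
The sum described in the quote can be written out directly: for each cell, square the difference between the observed and expected counts and divide by the expected count. The sketch below does this for the contrived Sex/Interest counts. The function name chi_squared_statistic is an assumption for illustration, and no continuity correction is applied.

```python
def chi_squared_statistic(observed):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # expected count for this cell under independence
            exp = row_totals[i] * col_totals[j] / total
            stat += (obs - exp) ** 2 / exp
    return stat

# contrived Sex (rows) by Interest (columns) counts
observed = [[20, 30, 15], [20, 15, 30]]
stat = chi_squared_statistic(observed)
print('stat=%.3f' % stat)
```

The Science column contributes nothing to the sum because its observed counts equal the expected counts; all of the distance comes from the Math and Art columns.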

The variables are considered independent if the observed and expected frequencies are similar; that is, the levels of one variable do not interact with, or depend on, the levels of the other.

The chi-square test of independence works by comparing the categorically coded data that you have collected (known as the observed frequencies) with the frequencies that you would expect to get in each cell of a table by chance alone (known as the expected frequencies).

— Page 162, Statistics in Plain English, Third Edition, 2010.

We can interpret the test statistic in the context of the chi-squared distribution with the requisite number of degrees of freedom as follows:

If Statistic >= Critical Value: significant result, reject null hypothesis (H0), dependent.
If Statistic < Critical Value: not significant result, fail to reject null hypothesis (H0), independent.

The degrees of freedom for the chi-squared distribution is calculated based on the size of the contingency table as:

degrees of freedom: (rows - 1) * (cols - 1)

In terms of a p-value and a chosen significance level (alpha), the test can be interpreted as follows:

If p-value <= alpha: significant result, reject null hypothesis (H0), dependent.
If p-value > alpha: not significant result, fail to reject null hypothesis (H0), independent.
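
The two interpretations are linked: the p-value is the area of the chi-squared distribution to the right of the statistic, so comparing the p-value to alpha gives the same decision as comparing the statistic to the critical value. A small sketch using SciPy's chi2.sf() and chi2.ppf() follows; the statistic and degrees of freedom are illustrative values, not the result of a new experiment.

```python
from scipy.stats import chi2

# illustrative values: a small statistic with 2 degrees of freedom
stat = 0.272
dof = 2
alpha = 0.05

# p-value: probability of a statistic at least this large under H0
p = chi2.sf(stat, dof)
# critical value: the point with alpha of the distribution to its right
critical = chi2.ppf(1.0 - alpha, dof)
print('p=%.3f, critical=%.3f' % (p, critical))

# both rules lead to the same decision
same_decision = (p <= alpha) == (stat >= critical)
print(same_decision)
```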

For the test to be effective, at least five observations are required in each cell of the contingency table; a common rule of thumb applies this minimum to the expected frequencies.
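
One way to check this guideline before trusting a result is to inspect the table of expected frequencies returned by SciPy's chi2_contingency(). The sketch below assumes the minimum of five is applied to the expected counts, which is a common convention rather than something the function enforces, and it uses a small contrived table.

```python
from scipy.stats import chi2_contingency

# small contrived contingency table
table = [[10, 20, 30], [6, 9, 17]]
stat, p, dof, expected = chi2_contingency(table)

# smallest expected count across all cells
min_expected = expected.min()
if min_expected < 5:
    print('Warning: smallest expected count is %.2f' % min_expected)
else:
    print('OK: smallest expected count is %.2f' % min_expected)
```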

Next, let’s look at how we can calculate the chi-squared test.

Example Chi-Squared Test

The Pearson’s chi-squared test for independence can be calculated in Python using the chi2_contingency() SciPy function.

The function takes an array as input representing the contingency table for the two categorical variables. It returns the calculated statistic and p-value for interpretation as well as the calculated degrees of freedom and table of expected frequencies.

stat, p, dof, expected = chi2_contingency(table)

We can interpret the statistic by retrieving the critical value from the chi-squared distribution for the probability and number of degrees of freedom.

For example, a probability of 95% can be used, in which case the critical value is the point below which 95% of test statistics would fall if the variables were in fact independent. If the statistic is less than the critical value, we fail to reject the assumption of independence; otherwise, it can be rejected.

# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
if abs(stat) >= critical:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

We can also interpret the p-value by comparing it to a chosen significance level, which would be 5%, calculated by inverting the 95% probability used in the critical value interpretation.

# interpret p-value
alpha = 1.0 - prob
if p <= alpha:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

We can tie all of this together and demonstrate the chi-squared significance test using a contrived contingency table.

A contingency table is defined below that has a different number of observations for each population (row), but a similar proportion across each group (column). Given the similar proportions, we would expect the test to find that the groups are similar and that the variables are independent (fail to reject the null hypothesis, or H0).

table = [	[10, 20, 30],
			[6,  9,  17]]

The complete example is listed below.

# chi-squared test with similar proportions
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# contingency table
table = [	[10, 20, 30],
			[6,  9,  17]]
print(table)
stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)
# interpret test-statistic
prob = 0.95
critical = chi2.ppf(prob, dof)
print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))
if abs(stat) >= critical:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')
# interpret p-value
alpha = 1.0 - prob
print('significance=%.3f, p=%.3f' % (alpha, p))
if p <= alpha:
	print('Dependent (reject H0)')
else:
	print('Independent (fail to reject H0)')

Running the example first prints the contingency table. The test is calculated and the degrees of freedom (dof) is reported as 2, which makes sense given:

degrees of freedom: (rows - 1) * (cols - 1)
degrees of freedom: (2 - 1) * (3 - 1)
degrees of freedom: 1 * 2
degrees of freedom: 2

Next, the calculated expected frequency table is printed, and an eyeball check of the numbers suggests that the observed contingency table does indeed closely match it.

The critical value is calculated and interpreted, finding that indeed the variables are independent (fail to reject H0). The interpretation of the p-value makes the same finding.

[[10, 20, 30], [6, 9, 17]]

dof=2

[[10.43478261 18.91304348 30.65217391]
 [ 5.56521739 10.08695652 16.34782609]]

probability=0.950, critical=5.991, stat=0.272
Independent (fail to reject H0)

significance=0.050, p=0.873
Independent (fail to reject H0)
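
For contrast, the same steps can be run on a table whose rows have clearly different proportions. The sketch below uses the contrived Sex/Interest counts from the Contingency Table section. It is an added illustration rather than part of the original example, and it relies on the fact that chi2_contingency() only applies Yates' continuity correction when there is one degree of freedom.

```python
# chi-squared test with different proportions (illustrative)
from scipy.stats import chi2_contingency
from scipy.stats import chi2

# contrived Sex (rows) by Interest (columns) counts
table = [[20, 30, 15], [20, 15, 30]]
stat, p, dof, expected = chi2_contingency(table)

prob = 0.95
critical = chi2.ppf(prob, dof)
alpha = 1.0 - prob
print('dof=%d, critical=%.3f, stat=%.3f, p=%.3f' % (dof, critical, stat, p))

# both interpretations should agree
if stat >= critical and p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
```

Here the statistic is well above the critical value, so the test rejects H0 and reports the variables as dependent.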

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

Update the chi-squared test to use your own contingency table.
Write a function to report on the independence given observations from two categorical variables.
Load a standard machine learning dataset containing categorical variables and report on the independence of each.
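
As a starting point for the second extension, here is a minimal sketch of a function that takes two sequences of labels, tallies the contingency table, and reports the test result. The function name independence_report, the returned dictionary keys, and the observations are all made up for illustration.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def independence_report(x, y, alpha=0.05):
    """Run a chi-squared test of independence on two label sequences."""
    rows = sorted(set(x))
    cols = sorted(set(y))
    # count each (x label, y label) pair to build the contingency table
    counts = Counter(zip(x, y))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    stat, p, dof, expected = chi2_contingency(table)
    return {'table': table, 'stat': stat, 'p': p, 'dof': dof,
            'dependent': bool(p <= alpha)}

# made-up observations: 20 people per sex with contrived interests
sex = ['Male'] * 20 + ['Female'] * 20
interest = ['Math'] * 15 + ['Art'] * 5 + ['Math'] * 5 + ['Art'] * 15
report = independence_report(sex, interest)
print(report)
```

Note that for a 2x2 table like this one there is a single degree of freedom, so chi2_contingency() applies Yates' continuity correction by default.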

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered the chi-squared statistical hypothesis test for quantifying the independence of pairs of categorical variables.

Specifically, you learned:

Pairs of categorical variables can be summarized using a contingency table.
The chi-squared test can compare an observed contingency table to an expected table and determine if the categorical variables are independent.
How to calculate and interpret the chi-squared test for categorical variables in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

77 Responses to A Gentle Introduction to the Chi-Squared Test for Machine Learning

Elie Kawerk June 19, 2018 at 5:27 am #

Hi Jason,

Thanks for this nice post.

What statistical test should be used to test the dependence of a continuous variable on a categorical variable (ex: weight and gender).

Best,
Elie

- Jason Brownlee June 19, 2018 at 6:38 am #
  
  Good question. I have not seen a test that can do this directly.
  
  Often, the continuous variable is made discrete/ordinal and the chi-squared test is used. It will give a result, but I’m not sure how statistically valid this would be.
  
  - DearML July 2, 2019 at 8:37 pm #
    
    Is there any way to get the correlation between all the input features only but with binary values which is 0 and 1 (converted from true and false)?
    
    - Jason Brownlee July 3, 2019 at 8:33 am #
      
      Perhaps if you convert input features into nominal variables, e.g. discrete buckets?
      
      - DearML July 3, 2019 at 3:00 pm #
        
        They are discrete variables, for example:
        
        df = pd.DataFrame({
        'y1': [1,1,1,1,1,1,1,1,0,1,0,0],
        'y2': [1,1,1,1,1,1,1,1,0,1,1,0],
        'y3': [0,1,0,0,0,1,0,0,0,1,1,0],
        'y4': [0,1,1,1,0,0,1,1,0,0,1,0],
        })
        
        Here there should be a strong correlation between y1 and y2.
        These are all features (independent/input variables); none is a target variable.
        Are there any methods I can use to find correlation in them?
        Please help.
      - Jason Brownlee July 4, 2019 at 7:39 am #
        
        Chi squared might be a good start.
      - DearML July 5, 2019 at 2:24 pm #
        
        Chi squared is about input and output. isn’t it? What about cosine similarity? i think it will work.
      - Jason Brownlee July 6, 2019 at 8:22 am #
        
        Chi squared is only concerned with two categorical variables. They may or may not be inputs or outputs to a model.
        
        What about cosine similarity exactly?
      - DearML July 8, 2019 at 1:55 pm #
        
        cosine similarity can give me the similarity of two different vectors. here in my example above, it will say that y1 and y2 are related with some more than ~95%
      - Jason Brownlee July 9, 2019 at 8:04 am #
        
        Details here:
        https://en.wikipedia.org/wiki/Cosine_similarity
- Judith Vazquez June 26, 2018 at 1:09 am #
  
  Hi Elie,
  
  You might try using the binning technique. Please see below
  
  http://www.saedsayad.com/binning.htm
  
  Hope it helps 🙂
  
  - Olutobi Adeyemi October 15, 2018 at 10:42 pm #
    
    Analysis of variance will work for this.
    
- Adi June 16, 2019 at 6:24 pm #
  
  A 2 sample KS test (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ks_2samp.html)
  
  - Jason Brownlee June 17, 2019 at 8:19 am #
    
    Nice!
    
- Rishabh March 31, 2020 at 5:15 am #
  
  Independent two sample t test
  
- ana July 1, 2020 at 2:59 am #
  
  https://en.wikipedia.org/wiki/Correlation_ratio
  
Andrew V. June 21, 2018 at 3:36 am #

Hi Jason, great article! One quick thing: shouldn’t the above read: “If statistic > critical value then significant result” and “If statistic <= critical value then non-significant result"? The statistic's value and p-value should be inversely related.

- Jason Brownlee June 21, 2018 at 6:22 am #
  
  Yes, thanks. That was a typo in the explanation. Fixed.
  
Hani December 25, 2018 at 10:51 pm #

Hi ,

How can I loop the chi-squared test to check the target vs. all other variables in one step,
so that it reports the p-value, dof, etc. for each combination of the target vs. each variable?

I tried and it didn’t work out…

Thanks !

- Jason Brownlee December 26, 2018 at 6:44 am #
  
  Perhaps write a for-loop to check all variables?
  
SK January 24, 2019 at 10:12 pm #

How do we perform chi-squared test for finding terms that are the most correlated with each class ?

- Jason Brownlee January 25, 2019 at 8:44 am #
  
  Perhaps calculate the test for each term, then rank order the results?
  
Cody March 8, 2019 at 3:43 am #

Very helpful and easy to understand. Thank you very much.

- Jason Brownlee March 8, 2019 at 7:55 am #
  
  Thanks, I’m glad it helped.
  
Sachin April 20, 2019 at 5:23 am #

Such a nicely written article! Thanks for your time!

- Jason Brownlee April 20, 2019 at 7:43 am #
  
  Thanks, I’m glad it helped.
  
Vaishali Bhadwaj June 11, 2019 at 7:28 pm #

Hi ,

I have 3 categorical variables in my data set:

Happiness, Income and Degree

I need to find below :-
A survey was conducted among 2800 customers on several demographic characteristics. Working status, sex, age, age-group, race, happiness, no. of child, marital status, educational qualifications, income group etc. had been captured for that purpose. (Data set: sample_survey.csv).
a. Is there any relationship between labour force status and marital status?
b. Do you think educational qualification is somehow controlling the marital status?
c. Is happiness driven by earnings or marital status?

- Jason Brownlee June 12, 2019 at 7:55 am #
  
  Perhaps try using the chi squared test?
  
BM August 18, 2019 at 3:53 am #

How to create the contingency table in Python?

- Jason Brownlee August 18, 2019 at 6:49 am #
  
  See this:
  https://www.statsmodels.org/stable/contingency_tables.html
  
- Patrick C. September 12, 2019 at 8:44 am #
  
  You could also have a look at the pandas crosstab functions ~ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
  
  - Jason Brownlee September 12, 2019 at 1:48 pm #
    
    Thanks for the note Patrick.
    
Bruno Ambrozio September 30, 2019 at 3:15 am #

Great content! Thanks!
Doubt:
– If I understood well, with this chi-squared test you can say whether or not there are significant differences among the group categories, but if so, how do you find out which ones?
E.g.: Let’s say that you managed to reject the null hypothesis. How do you figure out which group has significant differences?
Do we have to apply a Fisher exact test to each category against each other in multiple 2×2 contingency tables?
Thanks!

- Jason Brownlee September 30, 2019 at 6:17 am #
  
  What do you mean group? Do you mean the categories for a given variable?
  
  - Bruno Ambrozio September 30, 2019 at 7:58 pm #
    
    For example: In your example you have 3 categories being tested: Science, Math and Art. Let’s say your result concludes you have enough evidence to reject the null hypothesis (the variables are dependent). But how do you know which one (or whether all of them) accounts for such a result?
    Let’s consider another example, where you have 34 categories (Degrees of Freedom = 33). You also manage to reject the null hypothesis. So, how do you know which of those 34 categories were responsible for the final result (p <= alpha)?
    
    - Jason Brownlee October 1, 2019 at 6:50 am #
      
      Yes, that is one discrete random variable that has 3 states or events.
      
      The test comments on the random variable, not the states.
      
      Does that help?
      

Yogesh Naicker October 9, 2019 at 8:24 pm #

Can someone please provide python code for the below 4 categorical variables???

The table shows the contingency table of marital status by education. Use the Chi-Square test for testing homogeneity.

View the table by executing the following Python commands:
from prettytable import PrettyTable
t = PrettyTable(['Marital Status', 'Middle school', 'High School', 'Bachelor', 'Masters', 'PhD'])
t.add_row(['Single', 18, 36, 21, 9, 6])
t.add_row(['Married', 12, 36, 45, 36, 21])
t.add_row(['Divorced', 6, 9, 9, 3, 3])
t.add_row(['Widowed', 3, 9, 9, 6, 3])
print(t)
exit()

Hypothesis

Null Hypothesis: There is no difference in distribution between the types of education level in terms of marital status.

Alternate Hypothesis: There is a Difference

Coding
1.Import chi2_contingency and chi2 from scipy.stats package.

2.Declare a 2D array with the values mentioned in the contingency table of marital status by education.

3.Calculate and print the values of

– Chi-Square Statistic
– Degree of Freedom
– P value
– Hint: Use chi2_contingency() function
4.Assume the alpha value to be 0.05

5.Compare the P value with alpha and decide whether or not to reject the null hypothesis.

– If Rejected print “Reject the Null Hypothesis”
– Else print “Failed to reject the Null Hypothesis”

Sample output 2.33 4.5 8.9 Reject the Null Hypothesis

Jason Brownlee October 10, 2019 at 6:57 am #

Looks like homework. Perhaps try posting to stackoverflow?

Sachin Ladhad October 17, 2019 at 1:22 pm #

from scipy.stats import chi2_contingency
from scipy.stats import chi2

table= [ [18,31,21,9,6],[12,36,45,36,21], [6,9,9,3,3],[3,9,9,6,3] ]

stat,p,dof,expected = chi2_contingency(table)
prob = 0.95
critical = chi2.ppf(prob, dof)

if abs(p) <= 0.05:
    print(stat, dof ,p ,'Reject the Null Hypothesis')
else:
    print(stat, dof ,p ,'Failed to reject the Null Hypothesis')

output

21.032858435297882 12 0.0499013559023993 Reject the Null Hypothesis

Help needed : Please let me know why the output is incorrect

Jason Brownlee October 17, 2019 at 1:53 pm #

I have some suggestions here that might help:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me


Mary November 29, 2019 at 7:48 am #

Your table indicates that for the “Single” row the values are 18,36,21,9,6
and for your code you have 18,*31*,21…
That 31 should be 36

from scipy.stats import chi2_contingency
from scipy.stats import chi2
table = [[18,36,21,9,6],[12,36,45,36,21],[6,9,9,3,3],[3,9,9,6,3]]
stat,p,dof,expected = chi2_contingency(table)
prob = 0.95
critical = chi2.ppf(prob, dof)
# compare the p-value (not the statistic) with alpha = 0.05
if p <= 0.05:
    print(stat, dof, p, 'Reject the Null Hypothesis')
else:
    print(stat, dof, p, 'Failed to reject the Null Hypothesis')

Jason Brownlee November 29, 2019 at 1:40 pm #

Thanks for sharing.

Marc Hansen October 31, 2019 at 7:12 am #

Thank you for the clear explanations.

In the text you say: “The variable is ‘sex‘ and the labels or factors of the variable are ‘male‘ and ‘female‘ in this case.”

Don’t you mean: “The variable or factor is ‘sex‘ and the labels or levels of the variable are ‘male‘ and ‘female‘ in this case.”

ref: https://stattrek.com/statistics/dictionary.aspx?definition=factor

- Jason Brownlee October 31, 2019 at 7:32 am #
  
  Yes, you’re right. Fixed, thanks.
  
Sandeep December 29, 2019 at 9:18 am #

Thanks Jason. Good read it is.

- Jason Brownlee December 30, 2019 at 5:54 am #
  
  You’re welcome.
  
James Tizard February 12, 2020 at 2:23 pm #

Great tutorial, thanks!
I’m wondering how to do chi2 test where survey respondents could select multiple answers.

For example: which OS do you use? A) Windows B) Linux C)Mac
Results: 1000 people take the survey, 500 say windows, 400 say Mac and 200 say linux. Total is greater than the number of respondents.

Can I compare windows and mac by creating the following contingency table and running the test?

OS, Not OS
Mac 400, 600
Windows 500, 500

- Jason Brownlee February 13, 2020 at 5:36 am #
  
  Good question, I’m not sure off the cuff when it comes to multiple answers. It messes up the contingency table.
  
  You might have to hit the books or ping a statistician / post on crossvalidated.
  
Eric Ren June 6, 2020 at 2:30 am #

Hi Jason,

Very nice article, clearly explained the Chi2 test. I have one question to ask. When reading the sklearn feature selection using the Chi2 test here: https://scikit-learn.org/stable/modules/feature_selection.html, I am confused by the example, in which the Iris data is used to demo the Chi 2 test on non categorical data, which is not even frequency or count. Is it wrong?

- Jason Brownlee June 6, 2020 at 7:58 am #
  
  Probably.
  
Saurabh Agarwal August 12, 2020 at 5:28 pm #

A very clearly written article.

- Jason Brownlee August 13, 2020 at 6:08 am #
  
  Thank you!
  
Dhruv Modi August 20, 2020 at 8:01 pm #

Hi Jason,

Does chi-square test work well on independent variables of an imbalanced data having bad rate just 1%?

- Jason Brownlee August 21, 2020 at 6:27 am #
  
  The test requires at least five examples in each cell of the contingency table, I believe.
  
  - Hridaya Saboo June 11, 2021 at 4:04 pm #
    
    Thank you. It is a really important and very practical question. Can you please provide more insights into this?
    
    - Jason Brownlee June 12, 2021 at 5:23 am #
      
      I don’t have any more insight to give, perhaps check some of the references in the further reading section.
      
Fidan September 23, 2020 at 10:26 am #

Hi Jason.

Great article. I have one question. If we have done the chi-square test on a sample dataset and the result is that two categorical variables are dependent, can the ‘dependency’ also be said to hold for the population?

- Jason Brownlee September 23, 2020 at 1:44 pm #
  
  Yes and No. It is a statistical estimate. A probabilistic guess with some degree of confidence.
  
Kenny October 20, 2020 at 11:14 pm #

Thanks Jason for the Good and informative article.
Sometimes I get mixed up between chi-square Goodness of fit and chi-square Tests of Independence. Can we use the terms interchangeably or are they different to each other?

- Jason Brownlee October 21, 2020 at 6:40 am #
  
  Same thing I believe, different use case.
  
  - Kenny October 22, 2020 at 8:43 pm #
    
    Thanks Jason for the clarification.
    In scipy there are 2 different function for chi-square-
    1)scipy.stats.chisquare
    2)scipy.stats.chi2_contingency
    Do you mind telling which one to use for which use-case, please?
    
    - Jason Brownlee October 23, 2020 at 6:09 am #
      
      Perhaps check the API documentation to see which matches your requirements.
      
Curiously Coding Foxah November 5, 2020 at 7:31 am #

Hey JB,

Have you ever explored the reason why sklearn’s chi2 gives different values for the test statistic and p-value compared to performing the test by hand or using chi2_contingence from scipy?

I can’t seem to find a satisfactory answer, and I’m hoping the good doctor (you) might have some insight.

Cheers

- Jason Brownlee November 5, 2020 at 7:54 am #
  
  I have not.
  
Hamed November 19, 2020 at 9:57 pm #

Hey,

I am very new in this area.
I have x_1 and x_2 and y. How can I see the dependence of y to x_1 and x_2?
x=[[1,0],[1,0],[0,1],[1,1],[1,0],[1,0],[1,0],[1,1],[0,1],[0,1]]
y=[0,0,1,1,0,0,0,1,1,1]

- Jason Brownlee November 20, 2020 at 6:45 am #
  
  Why not use the above tutorial to calculate the dependence?
  
Rara July 10, 2021 at 3:01 am #

Hi! I’m new to data science. Would like to understand like how do you decide which test to use if chi-square or one-sample t-test, independent sample t-test, paired sample t test in A/B testing?

- Jason Brownlee July 10, 2021 at 6:12 am #
  
  Good question, see this:
  https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/
  
Daniele January 31, 2022 at 2:17 am #

Hi I don’t understand why if the numerator is big then we reject the null hypothesis. This means that there is a huge difference between expected frequency and observed frequency but it doesn’t mean that the two categorical variables are dependent. Thank you in advance.

- James Carmichael January 31, 2022 at 10:47 am #
  
  Hi Daniele,
  
  After you perform a hypothesis test, there are only two possible outcomes.
  When your p-value is less than or equal to your significance level, you reject the null hypothesis. The data favors the alternative hypothesis. …
  When your p-value is greater than your significance level, you fail to reject the null hypothesis.
  
  The following will hopefully provide additional clarity:
  
  https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/support-or-reject-null-hypothesis/
  
Makaros February 1, 2022 at 10:43 pm #

Hi thanks for this nice article.

The input to chi2_contingency needs to be the table with the occurrences / counts right?

Not the table of frequencies as percentages, right?

Thanks

- James Carmichael February 2, 2022 at 10:26 am #
  
  Hi Makaros…the following may help further clarify:
  
  https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223
  
Francisco March 24, 2022 at 2:04 pm #

Hi James,

How do you report with python code the effect size for Chi2 tests?

- James Carmichael March 25, 2022 at 1:52 pm #
  
  Hi Francisco…this resource may be of help to you:
  
  https://www.askpython.com/python/examples/chi-square-test
  
Justin May 19, 2022 at 10:49 am #

Hi,
One question for categorical variable testing,

I have a model trained with some data, it has a categorical variable say

purpose = [‘radio_television’, ‘new_car’, ‘furniture_equipment’, ‘used_car’, ‘business’]

For retraining, the new data has an new value in the column purpose which is “education”

purpose = [‘radio_television’, ‘new_car’, ‘furniture_equipment’, ‘used_car’, ‘business’, ‘education’]

What test should I do to say the data is not having same distribution i.e Data Drifted.

Thanks in advance

- Justin May 19, 2022 at 10:56 am #
  
  Hi,
  
  For Binary Classification Models,
  
  The existing model with data, for retraining the same model with new data,
  
  On Numeric fields
  eg: Salary Column, we can do Hypothesis test on Salary Column between old data and new data to check the distribution
  
  May I know what test to be conducted on Categorical Column of old data and new data eg: “Purpose” Column?
  
  Thank you very much
  
Yunhao June 28, 2022 at 1:05 am #

Hi Jason,

May I ask why ‘observed contingency table and expected contingency table is same’, then it means that variables are independent?

