Garage or Not? Housing Insights Through the Chi-Squared Test for Ames, Iowa

By Vinod Chugani on March 28, 2024 in Data Science

The Chi-squared test for independence is a statistical procedure for assessing whether two categorical variables are associated or independent. In real estate, where a property’s visual appeal often affects its valuation, this question becomes particularly interesting: how often do you associate a house’s external allure with functional features like a garage? Using the Ames housing dataset, this post examines whether there is a statistically significant association between the external quality of a house and the presence of a garage.

Let’s get started.

Photo by Damir Kopezhanov. Some rights reserved.

Overview

This post is divided into four parts; they are:

Understanding the Chi-Squared Test
How the Chi-Squared Test Works
Unraveling the Association Between External Quality and Garage Presence
Important Caveats

Understanding the Chi-Squared Test

The Chi-squared ($\chi^2$) test is useful because of its ability to test for associations between categorical variables. It’s particularly valuable when working with nominal or ordinal data, where the variables are divided into categories or groups. The primary purpose of the Chi-squared test is to determine whether there is a statistically significant association between two categorical variables. In other words, it helps to answer questions such as:

Are two categorical variables independent of each other?
- If the variables are independent, changes in one variable are not related to changes in the other. There is no association between them.

Is there a significant association between the two categorical variables?
- If the variables are associated, changes in one variable are related to changes in the other. The Chi-squared test helps to quantify whether this association is statistically significant.

In your study, you focus on the external quality of a house (categorized as “Great” or “Average”) and its relation to the presence or absence of a garage. For the results of the Chi-squared test to be valid, the following conditions must be satisfied:

Independence: The observations must be independent, meaning the occurrence of one outcome shouldn’t affect another. Our dataset satisfies this as each entry represents a distinct house.
Sample Size: The data should ideally be a random sample and large enough to be representative. Our data, sourced from Ames, Iowa, covers 2,579 houses, which is ample for a table with only four cells.
Expected Frequency: Every cell in the contingency table should have an expected frequency of at least 5. This is vital for the test’s reliability, as the Chi-squared test relies on a large sample approximation. You will demonstrate this condition below by creating and visualizing the expected frequencies.

How the Chi-Squared Test Works

The Chi-squared test compares the observed frequencies in a contingency table to the frequencies you would expect if the two variables were independent. The contingency table is a cross-tabulation of the two categorical variables, showing how many observations fall into each combination of categories.

Null Hypothesis ($H_0$): The two variables are independent, i.e., the proportion of houses with a garage is the same regardless of the house’s quality group, so the observed frequencies should be close to the expected ones.
Alternative Hypothesis ($H_1$): The two variables are associated, i.e., the proportion of houses with a garage differs depending on the value of the other variable (the quality of the house).

The test statistic is calculated by comparing the observed and expected frequencies in each cell of the contingency table. The larger the differences between observed and expected frequencies, the larger the Chi-squared statistic becomes. The test then produces a p-value: the probability of seeing an association at least as strong as the one observed if the two variables were in fact independent. If the p-value is below a chosen significance level $\alpha$ (commonly 0.05), the null hypothesis of independence is rejected, suggesting a significant association.
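To make the calculation concrete, here is a minimal sketch that works through it by hand, using only the Python standard library. Under independence, the expected count in each cell is $E_{ij} = R_i C_j / N$ (row total times column total, divided by the grand total), and the statistic is $\chi^2 = \sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$. The counts are the observed frequencies printed later in this post; the variable names (`chi2_plain`, `chi2_yates`, `p_yates`) are illustrative only. Note that SciPy's `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, so the corrected value is the one to compare with the SciPy result.

```python
from math import erfc, sqrt

# Observed counts from the output shown later in this post
# (rows: Average, Great; columns: No Garage, With Garage).
observed = [[121, 1544],
            [8, 906]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson's statistic: sum of (O - E)^2 / E over all cells
chi2_plain = sum((o - e) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# Yates' continuity correction (2x2 tables): shrink each |O - E| by 0.5.
# This simple form assumes every |O - E| exceeds 0.5, which holds here.
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e
                 for o_row, e_row in zip(observed, expected)
                 for o, e in zip(o_row, e_row))

# With 1 degree of freedom, the upper-tail probability has a closed form
p_yates = erfc(sqrt(chi2_yates / 2))

print("Expected:", [[round(e, 1) for e in row] for row in expected])
print(f"Pearson statistic:         {chi2_plain:.4f}")
print(f"Yates-corrected statistic: {chi2_yates:.4f}")
print(f"p-value (corrected, 1 dof): {p_yates:.4e}")
```

The closed-form p-value works only because a 2×2 table has a single degree of freedom; for larger tables you would rely on a library routine for the Chi-squared distribution instead.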

Unraveling the Association Between External Quality and Garage Presence

Using the Ames housing dataset, you set out to determine whether there’s an association between a house’s external quality and the presence or absence of a garage. Let’s delve into the specifics of our analysis, supported by the corresponding Python code.

# Importing the essential libraries
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Extracting the relevant columns
exterqual_garagefinish_data = Ames[['ExterQual', 'GarageFinish']].copy()

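# Filling missing values in the 'GarageFinish' column with 'No Garage'
# (assigning the result back avoids the chained in-place fillna,
# which newer pandas versions warn about)
exterqual_garagefinish_data['GarageFinish'] \
    = exterqual_garagefinish_data['GarageFinish'].fillna('No Garage')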

# Grouping 'GarageFinish' into 'With Garage' and 'No Garage'
exterqual_garagefinish_data['Garage Group'] \
    = exterqual_garagefinish_data['GarageFinish'] \
      .apply(lambda x: 'With Garage' if x != 'No Garage' else 'No Garage')

# Grouping 'ExterQual' into 'Great' and 'Average'
exterqual_garagefinish_data['Quality Group'] \
    = exterqual_garagefinish_data['ExterQual'] \
      .apply(lambda x: 'Great' if x in ['Ex', 'Gd'] else 'Average')

# Constructing the simplified contingency table
simplified_contingency_table \
    = pd.crosstab(exterqual_garagefinish_data['Quality Group'],
                  exterqual_garagefinish_data['Garage Group'])

# Printing the Observed Frequencies
print("Observed Frequencies:")
observed_df = pd.DataFrame(simplified_contingency_table,
                           index=["Average", "Great"],
                           columns=["No Garage", "With Garage"])
print(observed_df)
print()

# Performing the Chi-squared test
chi2_stat, p_value, _, expected_freq = chi2_contingency(simplified_contingency_table)

# Printing the Expected Frequencies
print("Expected Frequencies:")
print(pd.DataFrame(expected_freq,
                   index=["Average", "Great"],
                   columns=["No Garage", "With Garage"]).round(1))
print()

# Printing the results of the test
print(f"Chi-squared Statistic: {chi2_stat:.4f}")
print(f"p-value: {p_value:.4e}")

The output should be:

Observed Frequencies:
         No Garage  With Garage
Average        121         1544
Great            8          906

Expected Frequencies:
         No Garage  With Garage
Average       83.3       1581.7
Great         45.7        868.3

Chi-squared Statistic: 49.4012
p-value: 2.0862e-12

The code above performs three steps:

Data Loading & Preparation:

You began by loading the dataset and extracting the pertinent columns: ExterQual (Exterior Quality) and GarageFinish (Garage Finish).
Recognizing the missing values in GarageFinish, you sensibly imputed them with the label "No Garage", indicating houses devoid of garages.

Data Grouping for Simplification:

You further categorized the GarageFinish data into two groups: “With Garage” (for houses with any kind of garage) and “No Garage”.
Similarly, you grouped the ExterQual data into “Great” (houses rated excellent or good, i.e., 'Ex' or 'Gd') and “Average” (houses with any other rating, such as average or fair).

Chi-squared Test:

With the data aptly prepared, you constructed a contingency table of the observed frequencies for the newly formed categories. This is the first table printed in the output; the second table holds the expected frequencies under independence, as returned by the test.
You then performed a Chi-squared test on this contingency table using SciPy. The p-value of about $2.1 \times 10^{-12}$ is far below $\alpha = 0.05$, so you reject the null hypothesis of independence: there is a statistically significant association between a house’s external quality and the presence of a garage in this dataset.
The expected frequencies satisfy the third condition of the Chi-squared test: the smallest is 45.7, well above the required minimum of 5 in each cell.

Through this analysis, you not only refined and simplified the data to make it more interpretable but also provided statistical evidence of an association between two categorical variables of interest.
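Rather than checking the expected frequencies by eye, you can test the condition in code. The sketch below is one way to do it, assuming NumPy and SciPy are installed; it re-enters the observed counts from the output above so that it runs without `Ames.csv`, and the name `min_expected` is illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the output above
# (rows: Average, Great; columns: No Garage, With Garage).
observed = np.array([[121, 1544],
                     [8, 906]])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the table of expected frequencies under independence
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# The large-sample approximation calls for at least 5 expected per cell
min_expected = expected.min()
condition_met = bool((expected >= 5).all())

print(f"Degrees of freedom: {dof}")
print(f"Smallest expected frequency: {min_expected:.1f}")
print(f"All expected frequencies >= 5: {condition_met}")
```

A check like this is most useful when a table has many categories, where a sparse cell is easy to overlook.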

Important Caveats

The Chi-squared test, despite its utility, has its limitations:

No Causation: While the test can determine association, it doesn’t infer causation. So, even though there’s a significant link between a house’s external quality and its garage presence, you can’t conclude that one causes the other.
Directionality: The test indicates an association but doesn’t specify its direction. However, our data suggests that houses labeled as “Great” in terms of external quality are more likely to have garages than those labeled as “Average”: 906 of 914 (about 99%) versus 1,544 of 1,665 (about 93%).
Magnitude: The test doesn’t measure the relationship’s strength. With thousands of observations, even a weak association can produce a very small p-value. Other metrics, like Cramér’s V, would be more informative in this regard.
External Validity: Our conclusions are specific to the Ames dataset. Caution is advised when generalizing these findings to other regions.
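The direction and magnitude caveats can both be explored with a few lines of standard-library Python. This is a sketch under stated assumptions: it computes Cramér’s V as $V = \sqrt{\chi^2 / (N \cdot (\min(r, c) - 1))}$ from the uncorrected Pearson statistic, which is a common convention but not the only one, and the names `garage_share` and `cramers_v` are illustrative.

```python
from math import sqrt

# Observed counts from the output earlier in this post
observed = {
    "Average": {"No Garage": 121, "With Garage": 1544},
    "Great": {"No Garage": 8, "With Garage": 906},
}
rows = list(observed)
cols = ["No Garage", "With Garage"]

# Direction: share of houses with a garage within each quality group
garage_share = {
    group: counts["With Garage"] / sum(counts.values())
    for group, counts in observed.items()
}

# Totals needed for the expected frequencies under independence
row_tot = {r: sum(observed[r].values()) for r in rows}
col_tot = {c: sum(observed[r][c] for r in rows) for c in cols}
n = sum(row_tot.values())

# Uncorrected Pearson statistic: sum of (O - E)^2 / E over all cells
chi2 = sum(
    (observed[r][c] - row_tot[r] * col_tot[c] / n) ** 2
    / (row_tot[r] * col_tot[c] / n)
    for r in rows for c in cols
)

# Magnitude: Cramér's V ranges from 0 (no association) to 1 (perfect)
cramers_v = sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

for group, share in garage_share.items():
    print(f"{group}: {share:.1%} with garage")
print(f"Cramér's V: {cramers_v:.3f}")
```

For these counts, V comes out at roughly 0.14, which common rules of thumb would read as a fairly weak association despite the tiny p-value.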

Summary

In this post, you delved into the Chi-squared test and its application on the Ames housing dataset. You discovered a significant association between a house’s external quality and the presence of a garage.

Specifically, you learned:

The fundamentals and practicality of the Chi-squared test.
The Chi-squared test revealed a significant association between a house’s external quality and the presence of a garage in the Ames dataset. Houses with a “Great” external quality rating showed a higher likelihood of having a garage when compared to those with an “Average” rating, a trend that was statistically significant.
The vital caveats and limitations of the Chi-squared test.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
