Revealing the Invisible: Visualizing Missing Values in Ames Housing

The digital age has ushered in an era where data-driven decision-making is pivotal in various domains, real estate being a prime example. Comprehensive datasets, like the one concerning properties in Ames, offer a treasure trove for data enthusiasts. Through meticulous exploration and analysis of such datasets, one can uncover patterns, gain insights, and make informed decisions.

Starting from this post, you will embark on a captivating journey through the intricate lanes of Ames properties, focusing primarily on Data Science techniques.

Let’s get started.

Revealing the Invisible: Visualizing Missing Values in Ames Housing
Photo by Joakim Honkasalo. Some rights reserved

Overview

This post is divided into three parts; they are:

  • The Ames Properties Dataset
  • Loading & Sizing Up the Dataset
  • Uncovering & Visualizing Missing Values

The Ames Properties Dataset

Every dataset has a story to tell, and understanding its background can offer invaluable context. While the Ames Housing Dataset is widely known in academic circles, the dataset we’re analyzing today, Ames.csv, is a more comprehensive collection of property details from Ames.

Dr. Dean De Cock, a dedicated academician, recognized the need for a new, robust dataset in the domain of real estate. He meticulously compiled the Ames Housing Dataset, which has since become a cornerstone for budding data scientists and researchers. This dataset stands out due to its comprehensive details, capturing myriad facets of real estate properties. It has been a foundation for numerous predictive modeling exercises and offers a rich landscape for exploratory data analysis.

The Ames Housing Dataset was envisioned as a modern alternative to the older Boston Housing Dataset. Covering residential sales in Ames, Iowa between 2006 and 2010, it presents a diverse array of variables, setting the stage for advanced regression techniques.

This time frame is particularly significant in U.S. history. The period leading up to 2007-2008 saw the dramatic inflation of housing prices, fueled by speculative frenzy and subprime mortgages. This culminated in the devastating collapse of the housing bubble in late 2007, an event vividly captured in narratives like “The Big Short.” The aftermath of this collapse rippled across the nation, leading to the Great Recession. Housing prices plummeted, foreclosures skyrocketed, and many Americans found themselves underwater on their mortgages. The Ames dataset provides a glimpse into this turbulent period, capturing property sales in the midst of national economic upheaval.

Kick-start your project with my book The Beginner’s Guide to Data Science. It provides self-study tutorials with working code.

Loading & Sizing Up the Dataset

For those who are venturing into the realm of data science, having the right tools in your arsenal is paramount. If you require some help to set up your Python environment, this comprehensive guide is an excellent resource.

Dataset Dimensions: Before diving into intricate analyses, it’s essential to familiarize yourself with the dataset’s basic structure and data types. This step provides a roadmap for subsequent exploration and ensures you tailor your analyses based on the data’s nature. With the environment in place, let’s load and gauge the dataset’s extent in terms of rows (representing individual properties) and columns (representing attributes of these properties).  

Data Types: Recognizing the datatype of each attribute helps shape our analysis approach. Numerical attributes might be summarized using measures like mean or median, while mode (most frequent value) is apt for categorical attributes.  

The Data Dictionary: A data dictionary, often accompanying comprehensive datasets, is a handy resource. It offers detailed descriptions of each feature, specifying its meaning, possible values, and sometimes even the logic behind its collection. For datasets like the Ames properties, which encompass a wide range of features, a data dictionary can be a beacon of clarity. By referring to the attached data dictionary, analysts, data scientists, and even domain experts can gain a deeper understanding of the nuances of each feature. Whether you’re deciphering the meaning behind an unfamiliar feature or discerning the significance of particular values, the data dictionary serves as a comprehensive guide. It bridges the gap between raw data and actionable insights, ensuring that the analyses and decisions are well-informed.

Ground Living Area and Sale Price are numerical (int64) data types, while Sale Condition (object, which is string type in this example) is a categorical data type.

Uncovering & Visualizing Missing Values

Real-world datasets seldom arrive perfectly curated, often presenting analysts with the challenge of missing values. These gaps in data can arise due to various reasons, such as errors in data collection, system limitations, or the absence of information. Addressing missing values is not merely a technical necessity but a critical step that significantly impacts the integrity and reliability of subsequent analyses.

Understanding the patterns of missing values is essential for informed data analysis. This insight guides the selection of appropriate imputation methods, which fill in missing data based on available information, thereby influencing the accuracy and interpretability of results. Additionally, assessing missing value patterns informs decisions on feature selection; features with extensive missing data may be excluded to enhance model performance and focus on more reliable information. In essence, grasping the patterns of missing values ensures robust and trustworthy data analyses, guiding imputation strategies and optimizing feature inclusion for more accurate insights.

NaN or None?: In pandas, the isnull() function is used to detect missing values in a DataFrame or Series. Specifically, it identifies the following types or missing data:

  • np.nan (Not a Number), often used to denote missing numerical data
  • None, which is Python’s built-in object to denote the absence of a value or a null value

Both nan and NaN are just different ways to refer to NumPy’s np.nan, and isnull() will identify them as missing values. Here is a quick example.

Visualizing Missing Values: When it comes to visualizing missing data, tools like DataFrames, missingno, matplotlib, and seaborn come in handy. By sorting the features based on the percentage of missing values and placing them into a DataFrame, you can easily rank the features most affected by missing data.   

The missingno package facilitates a swift, graphical representation of missing data. The visualization’s white lines or gaps denote missing values. However, it will only accommodate up to 50 labeled variables. Past that range, labels begin to overlap or become unreadable, and by default, large displays omit them.  

Visual representation of missing values using missingno.matrix().

Using the msno.bar() visual after extracting the top 15 features with the most missing values provides a crisp illustration by column.   

Using missingno.bar() to visualize features with missing data.

The illustration above denotes that Pool Quality, Miscellaneous Feature, and the type of Alley access to the property are the three features with the highest number of missing values. 

Using seaborn horizontal bar plots to visualize missing data.

A horizontal bar plot using seaborn allows you to list features with the highest missing values in a vertical format, adding both readability and aesthetic value.  

Handling missing values is more than just a technical requirement; it’s a significant step that can influence the quality of your machine learning models. Understanding and visualizing these missing values are the first steps in this intricate dance.

Further Reading

This section provides more resources on the topic if you want to go deeper.

Tutorials

Papers

Resources

Summary

In this tutorial, you embarked on an exploration of the Ames Properties dataset, a comprehensive collection of housing data tailored for data science applications. 

Specifically, you learned: 

  • About the context of the Ames dataset, including the pioneers and academic importance behind it. 
  • How to extract dataset dimensions, data types, and missing values. 
  • How to use packages like missingno, Matplotlib, and Seaborn to quickly visualize your missing data. 

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

Get Started on The Beginner's Guide to Data Science!

The Beginner's Guide to Data Science

Learn the mindset to become successful in data science projects

...using only minimal math and statistics, acquire your skill through short examples in Python

Discover how in my new Ebook:
The Beginner's Guide to Data Science

It provides self-study tutorials with all working code in Python to turn you from a novice to an expert. It shows you how to find outliers, confirm the normality of data, find correlated features, handle skewness, check hypotheses, and much more...all to support you in creating a narrative from a dataset.

Kick-start your data science journey with hands-on exercises


See What's Inside

4 Responses to Revealing the Invisible: Visualizing Missing Values in Ames Housing

  1. Avatar
    Abdulsalam January 27, 2024 at 7:40 am #

    This is a very good eye openers into visualizing missing values. Though, I do use missingno to run visualisation of my missing values, but plotting using seaborn I have not done that before. Will have to try that! Thank you very much!

    • Avatar
      James Carmichael January 27, 2024 at 10:29 am #

      You are very welcome Abdulsalam!

  2. Avatar
    Princess Leja January 29, 2024 at 8:48 pm #

    Many thanks Vinod for this article. I was glad to run it and see the importance of missing values. I am now going on to studying your next link on Ames housing.

    • Vinod Chugani
      Vinod Chugani January 30, 2024 at 4:19 pm #

      You are very welcome! I am glad to hear that you found the article helpful.

Leave a Reply