One Hot Encoding: Understanding the “Hot” in Data

By Vinod Chugani on February 28, 2025 in Intermediate Data Science

Preparing categorical data correctly is a fundamental step in machine learning, particularly when using linear models. One Hot Encoding stands out as a key technique, enabling the transformation of categorical variables into a machine-understandable format. This post explains why you cannot use a categorical variable directly and demonstrates the use of One Hot Encoding in our search for the most predictive categorical features for linear regression.

Let’s get started.

Photo by sutirta budiman. Some rights reserved.

Overview

This post is divided into three parts; they are:

What is One Hot Encoding?
Identifying the Most Predictive Categorical Feature
Evaluating Individual Features’ Predictive Power

What is One Hot Encoding?

In data preprocessing for linear models, “One Hot Encoding” is a crucial technique for managing categorical data. Each category is represented as a binary vector in which “hot” (one) signifies the category’s presence and “cold” (zero) signals its absence.

In terms of levels of measurement, categorical data are nominal: even if we used numbers as labels (e.g., 1 for male and 2 for female), operations such as addition and subtraction would not make sense. And if the labels are not numbers, you cannot do any math with them at all.

One hot encoding separates each category of a variable into its own feature, preventing categorical data from being misinterpreted as having ordinal significance in linear regression and other linear models. After encoding, each number has a clear meaning (1 for presence, 0 for absence) and can be used directly in a mathematical equation.

For instance, consider a categorical feature like “Color” with the values Red, Blue, and Green. One Hot Encoding translates this into three binary features (“Color_Red,” “Color_Blue,” and “Color_Green”), each indicating the presence (1) or absence (0) of a color for each observation. Such a representation clarifies to the model that these categories are distinct, with no inherent order.

Why does this matter? Many machine learning models, including linear regression, operate on numerical data and assume a numerical relationship between values. Directly encoding categories as numbers (e.g., Red=1, Blue=2, Green=3) could imply a non-existent hierarchy or quantitative relationship, potentially skewing predictions. One Hot Encoding sidesteps this issue, preserving the categorical nature of the data in a form that models can accurately interpret.
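To make the risk concrete, here is a minimal sketch on made-up data in which “Blue” items are priced higher than both “Red” and “Green” ones. The prices and the label mapping are assumptions chosen purely for illustration. A linear model fed integer labels must fit a straight line through 1, 2, 3 and so cannot capture this pattern, while the one hot encoded version can assign each color its own level.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: Blue is expensive, Red and Green are cheap
df = pd.DataFrame({
    "Color": ["Red", "Blue", "Green"] * 4,
    "Price": [100, 300, 100] * 4,
})
y = df["Price"]

# Option 1: arbitrary integer labels, which imply Red < Blue < Green
label_map = {"Red": 1, "Blue": 2, "Green": 3}
X_int = df["Color"].map(label_map).to_frame()
r2_int = LinearRegression().fit(X_int, y).score(X_int, y)

# Option 2: one hot encoding, which treats each color as its own feature
X_hot = OneHotEncoder(drop="first").fit_transform(df[["Color"]]).toarray()
r2_hot = LinearRegression().fit(X_hot, y).score(X_hot, y)

print(f"R² with integer labels: {r2_int:.3f}")
print(f"R² with one hot encoding: {r2_hot:.3f}")
```

On this toy data, the integer-label model explains essentially none of the variation in price, while the one hot model reproduces each color's price exactly. Real data is rarely this extreme, but the mechanism is the same.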

Let’s apply this technique to the Ames dataset, demonstrating the transformation process with an example:

# Load only categorical columns without missing values from the Ames dataset
import pandas as pd
Ames = pd.read_csv("Ames.csv").select_dtypes(include=["object"]).dropna(axis=1)
print(f"The shape of the DataFrame before One Hot Encoding is: {Ames.shape}")

# Import OneHotEncoder and apply it to Ames:
from sklearn.preprocessing import OneHotEncoder

# Note: the "sparse" argument was renamed to "sparse_output" in scikit-learn 1.2
encoder = OneHotEncoder(sparse_output=False)
Ames_One_Hot = encoder.fit_transform(Ames)

# Convert the encoded result back to a DataFrame
Ames_encoded_df = pd.DataFrame(Ames_One_Hot, columns=encoder.get_feature_names_out(Ames.columns))

# Display the new DataFrame and its expanded shape
print(Ames_encoded_df.head())
print(f"The shape of the DataFrame after One Hot Encoding is: {Ames_encoded_df.shape}")

This will output:

The shape of the DataFrame before One Hot Encoding is: (2579, 27)

   MSZoning_A (agr)  ...  SaleCondition_Partial
0               0.0  ...                    0.0
1               0.0  ...                    0.0
2               0.0  ...                    0.0
3               0.0  ...                    0.0
4               0.0  ...                    0.0
[5 rows x 188 columns]

The shape of the DataFrame after One Hot Encoding is: (2579, 188)

As seen, the Ames dataset’s categorical columns are converted into 188 distinct features, illustrating the expanded complexity and detailed representation that One Hot Encoding provides. This expansion, while increasing the dimensionality of the dataset, is a crucial preprocessing step when modeling the relationship between categorical features and the target variable in linear regression.
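The size of the expansion is predictable: with no columns dropped, One Hot Encoding creates one new column per distinct category, so the encoded width equals the sum of unique values across the original columns. The sketch below checks this on a small hypothetical DataFrame; on the Ames data, `Ames.nunique().sum()` should match the 188 columns reported above in the same way.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "Color" has 3 categories, "Size" has 2
toy = pd.DataFrame({
    "Color": ["Red", "Blue", "Green", "Blue"],
    "Size": ["S", "L", "S", "S"],
})

# Expected number of encoded columns: one per unique category
expected_columns = int(toy.nunique().sum())

# Actual number of columns produced by the encoder
encoded = OneHotEncoder().fit_transform(toy)

print(f"Expected columns: {expected_columns}")
print(f"Encoded shape: {encoded.shape}")
```

Running a check like this before encoding a large dataset gives an early warning when a high-cardinality column is about to add hundreds of features.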

Identifying the Most Predictive Categorical Feature

Having covered the basic premise and application of One Hot Encoding in linear models, the next step in our analysis is to identify which categorical feature contributes most to predicting our target variable. In the code snippet below, we iterate through each categorical feature in our dataset, apply One Hot Encoding to that feature alone, and evaluate its predictive power using a linear regression model with cross-validation. Here, the drop="first" parameter of the OneHotEncoder constructor plays a vital role:

# Building on the code above to identify top categorical feature
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Set 'SalePrice' as the target variable
y = pd.read_csv("Ames.csv")["SalePrice"]

# Dictionary to store feature names and their corresponding mean CV R² scores
feature_scores = {}

for feature in Ames.columns:
    encoder = OneHotEncoder(drop="first")
    X_encoded = encoder.fit_transform(Ames[[feature]])

    # Initialize the linear regression model
    model = LinearRegression()

    # Perform 5-fold cross-validation and calculate R^2 scores
    scores = cross_val_score(model, X_encoded, y, cv=5)
    mean_score = scores.mean()

    # Store the mean R^2 score
    feature_scores[feature] = mean_score

# Sort features based on their mean CV R² scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda item: item[1], reverse=True)
print("Feature selected for highest predictability:", sorted_features[0][0])

The drop="first" parameter is used to mitigate perfect collinearity. By dropping the first category (encoding it implicitly as zeros across all other categories for a feature), we reduce redundancy and the number of input variables without losing any information. This practice simplifies the model, making it easier to interpret and often improving its performance. The code above will output:

Feature selected for highest predictability: Neighborhood

Our analysis reveals that “Neighborhood” is the categorical feature with the highest predictability in our dataset. This finding highlights the significant impact of location on housing prices within the Ames dataset.
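The collinearity that drop="first" addresses can be seen directly in the design matrix. In the sketch below, which uses a hypothetical toy column rather than the Ames data, the full set of one hot columns always sums to 1 in every row, which duplicates the intercept column of a linear model. Comparing the matrix rank to the number of columns shows the redundancy, and shows that dropping one category removes it. The helper name `design_matrix_with_intercept` is made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data for illustration
colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue", "Red"]})

def design_matrix_with_intercept(drop):
    """One hot encode `colors` and prepend a column of ones (the intercept)."""
    encoded = OneHotEncoder(drop=drop).fit_transform(colors).toarray()
    intercept = np.ones((encoded.shape[0], 1))
    return np.hstack([intercept, encoded])

full = design_matrix_with_intercept(None)        # keep all three categories
reduced = design_matrix_with_intercept("first")  # drop the first category

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)

print(f"All categories: {full.shape[1]} columns, rank {full_rank}")
print(f"drop='first':   {reduced.shape[1]} columns, rank {reduced_rank}")
```

With all categories kept, the matrix has more columns than its rank, so one column is a linear combination of the others. With drop="first", the column count and rank agree, and the dropped category becomes the baseline against which the remaining coefficients are measured.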

Evaluating Individual Features’ Predictive Power

Having identified the most predictive categorical feature, we now expand our analysis to the top five categorical features by predictive power for housing prices. This step helps us focus on the features that offer the most value in forecasting outcomes. Each feature’s mean cross-validated R² score measures how well that feature alone predicts the sale price, which also gives insight into how different aspects of a property relate to its overall valuation.

Let’s delve into this evaluation:

# Building on the code above to determine the performance of top 5 categorical features
print("Top 5 Categorical Features:")
for feature, score in sorted_features[0:5]:
    print(f"{feature}: Mean CV R² = {score:.4f}")

The output from our analysis presents a revealing snapshot of the factors that play pivotal roles in determining housing prices:

Top 5 Categorical Features:
Neighborhood: Mean CV R² = 0.5407
ExterQual: Mean CV R² = 0.4651
KitchenQual: Mean CV R² = 0.4373
Foundation: Mean CV R² = 0.2547
HeatingQC: Mean CV R² = 0.1892

This result accentuates the importance of the feature “Neighborhood” as the top predictor, reinforcing the idea that location significantly influences housing prices. Following closely are “ExterQual” (Exterior Material Quality) and “KitchenQual” (Kitchen Quality), which highlight the premium buyers place on the quality of construction and finishes. “Foundation” and “HeatingQC” (Heating Quality and Condition) also emerge as significant, albeit with lower predictive power, suggesting that structural integrity and comfort features are critical considerations for home buyers.

Summary

In this post, we focused on the critical process of preparing categorical data for linear models. Starting with an explanation of One Hot Encoding, we showed how this technique makes categorical data interpretable for linear regression by creating binary vectors. Our analysis identified “Neighborhood” as the categorical feature with the highest impact on housing prices, underscoring location’s pivotal role in real estate valuation.

Specifically, you learned:

One Hot Encoding’s role in converting categorical data to a format usable by linear models, preventing the algorithm from misinterpreting the data’s nature.
The importance of the drop='first' parameter in One Hot Encoding to avoid perfect collinearity in linear models.
How to evaluate the predictive power of individual categorical features and rank their performance within the context of linear models.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
