Tips for Effective Feature Engineering in Machine Learning

By Bala Priya C on April 21, 2025 in Machine Learning Resources

Feature engineering is an important step in the machine learning pipeline. It is the process of transforming data in its native format into meaningful features to help the machine learning model learn better from the data.

If done right, feature engineering can significantly enhance the performance of machine learning algorithms. Beyond the basics of understanding your data and preprocessing, effective feature engineering involves creating interaction terms, generating indicator variables, and binning features into buckets.

These techniques help extract relevant information from the data and help build robust machine learning solutions. In this guide, we’ll explore these feature engineering techniques by spinning up a sample housing dataset.

Note: You can code along to this tutorial in your preferred Jupyter notebook environment. You can also follow along with the Google Colab notebook for this tutorial.

1. Understand Your Data

Before jumping into feature engineering, you should first thoroughly understand your data. This includes:

Exploring and visualizing your dataset to get an idea of the distributions of individual variables and the relationships between them
Knowing the types of features you have (categorical, numerical, datetime objects, and more) and understanding their significance in your analysis
Using domain knowledge to understand what each feature represents and how it might interact with other features. This insight can guide you in creating meaningful new features

Let’s create a sample housing dataset to work with:

import pandas as pd
import numpy as np

# Set random seed for reproducibility
np.random.seed(42)

# Create sample data
n_samples = 1000
data = {
	'price': np.random.normal(200000, 50000, n_samples).astype(int),
	'size': np.random.normal(1500, 500, n_samples).astype(int),
	'num_rooms': np.random.randint(2, 8, n_samples),
	'num_bathrooms': np.random.randint(1, 4, n_samples),
	'age': np.random.randint(0, 40, n_samples),
	'neighborhood': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples),
	'income': np.random.normal(60000, 15000, n_samples).astype(int)
}

df = pd.DataFrame(data)

print(df.head())

In addition to getting basic information on the dataset, you can generate distribution plots and count plots for numeric and categorical variables, respectively. The following code snippets show basic exploratory data analysis on the dataset.

First, we get some basic information on the dataframe:

# Basic data exploration on the entire dataset
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())

You can try to visualize the distribution of numeric features ‘size’ and ‘income’ in the dataset:

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize distributions using histplot for 'size' and 'income'
plt.figure(figsize=(8, 6))
sns.histplot(df['size'], kde=True)
plt.title('Distribution of House Sizes')
plt.xlabel('Size')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(8, 6))
sns.histplot(df['income'], kde=True)
plt.title('Distribution of Household Income')
plt.xlabel('Income')
plt.ylabel('Frequency')
plt.show()

For categorical variables, a count plot can help you understand how the different values are distributed:

# Count plot for 'neighborhood'
plt.figure(figsize=(8, 6))
sns.countplot(x='neighborhood', data=df, order=df['neighborhood'].value_counts().index)
plt.title('Count of Houses per Neighborhood')
plt.xlabel('Neighborhood')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

By understanding your data, you can identify key features and relationships between features that will inform the subsequent feature engineering steps. This step ensures that your preprocessing and feature creation efforts are grounded in a thorough understanding of the dataset.
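Two quick checks often complement the plots above: grouping columns by type, and looking at pairwise correlations between the numeric features and the target. The following is a minimal sketch on a small stand-in dataframe; `df_demo` and its reduced column set are assumptions made for illustration, not part of the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in for the housing dataframe (hypothetical, reduced column set)
rng = np.random.default_rng(42)
n = 200
df_demo = pd.DataFrame({
    'price': rng.normal(200000, 50000, n).astype(int),
    'size': rng.normal(1500, 500, n).astype(int),
    'num_rooms': rng.integers(2, 8, n),
    'neighborhood': rng.choice(['A', 'B', 'C', 'D', 'E'], n),
})

# Separate columns by type so each group gets the right kind of analysis
numeric_cols = df_demo.select_dtypes(include='number').columns.tolist()
categorical_cols = df_demo.select_dtypes(exclude='number').columns.tolist()
print('Numeric:', numeric_cols)
print('Categorical:', categorical_cols)

# Pairwise Pearson correlations between the numeric columns
corr = df_demo[numeric_cols].corr()
print(corr['price'].sort_values(ascending=False))
```

Because the sample data is drawn independently at random, the correlations here will be close to zero; on real data, this table is a quick way to spot features worth combining or transforming.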

2. Preprocess the Data Effectively

Effective preprocessing involves handling missing values and outliers, scaling numerical features, and encoding categorical variables. The choice of preprocessing techniques also depends on the data and the requirements of the machine learning algorithms.

We don’t have any missing values in the example dataframe. For most real-world datasets, you can handle missing values using suitable imputation strategies.
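As a rough illustration of one imputation strategy, the sketch below counts missing values and fills them with the training-set median using pandas only. The `train` and `test` frames and their values are made up for the example; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in the 'income' column
train = pd.DataFrame({'income': [52000.0, np.nan, 61000.0, 75000.0, np.nan]})
test = pd.DataFrame({'income': [np.nan, 58000.0]})

# Count missing values per column
print(train.isna().sum())

# Learn the fill value from the training set only, then apply it to both sets
income_median = train['income'].median()
train['income'] = train['income'].fillna(income_median)
test['income'] = test['income'].fillna(income_median)

print(train)
print(test)
```

Computing the median on the training set alone keeps information from the test set out of the preprocessing step.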

Before you go ahead with preprocessing, split the dataset into train and test sets:

from sklearn.model_selection import train_test_split

# Split data into features X and target label y
X = df.drop('price', axis=1)
y = df['price']

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

To bring all numeric features to the same scale, you can use min-max or standard scaling. Here’s a generic code snippet to impute missing values and scale numeric features:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Handling missing values: fit the imputer on the training set only, then apply it to both sets
imputer = SimpleImputer(strategy='mean')
X_train[['features_to_impute']] = imputer.fit_transform(X_train[['features_to_impute']])
X_test[['features_to_impute']] = imputer.transform(X_test[['features_to_impute']])

# Scaling features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train[['features_to_scale']] = scaler.fit_transform(X_train[['features_to_scale']])
X_test[['features_to_scale']] = scaler.transform(X_test[['features_to_scale']])

Replace ‘features_to_impute’ and ‘features_to_scale’ with the names of the columns you’d like to impute and scale, respectively. We’ll also look at creating more representative features from the existing features in the next sections.
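To make the difference between the two scaling options concrete, here is a small sketch comparing MinMaxScaler and StandardScaler on made-up house sizes. The arrays `train_sizes` and `test_sizes` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical house sizes; the test value lies outside the training range
train_sizes = np.array([[1000.0], [1500.0], [2000.0]])
test_sizes = np.array([[2500.0]])

# Fit both scalers on the training data only
minmax = MinMaxScaler().fit(train_sizes)
standard = StandardScaler().fit(train_sizes)

# Min-max scaling maps the training range to [0, 1]
train_minmax = minmax.transform(train_sizes).ravel()   # [0.0, 0.5, 1.0]
test_minmax = minmax.transform(test_sizes).ravel()     # [1.5], outside [0, 1]

# Standard scaling centers on the training mean with unit variance
train_standard = standard.transform(train_sizes).ravel()

print(train_minmax, test_minmax)
print(train_standard)
```

Note that a test value beyond the training maximum lands outside the [0, 1] range under min-max scaling, which is expected when the scaler is fitted on the training set only.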

In summary, effective preprocessing prepares your data for all downstream tasks by ensuring consistency and addressing any issues with the raw data. This step is essential for getting accurate and reliable results from your machine learning models.
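One way to keep these steps consistent between the train and test sets is to bundle them with scikit-learn's Pipeline and ColumnTransformer. This is a sketch under stated assumptions rather than the tutorial's own method: the toy frames `X_train_demo` and `X_test_demo`, the median imputation strategy, and the choice of one-hot encoding are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values and an unseen test category ('D')
X_train_demo = pd.DataFrame({
    'size': [1200.0, 1500.0, np.nan, 2100.0],
    'income': [50000.0, 62000.0, 58000.0, np.nan],
    'neighborhood': ['A', 'B', 'A', 'C'],
})
X_test_demo = pd.DataFrame({
    'size': [1800.0, np.nan],
    'income': [np.nan, 70000.0],
    'neighborhood': ['B', 'D'],
})

numeric_cols = ['size', 'income']
categorical_cols = ['neighborhood']

# Numeric columns: impute with the median, then standardize
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Apply the numeric steps and the one-hot encoder to their respective columns
preprocessor = ColumnTransformer([
    ('num', numeric_steps, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

# Fit on the training set only; reuse the fitted transformer on the test set
X_train_prepared = preprocessor.fit_transform(X_train_demo)
X_test_prepared = preprocessor.transform(X_test_demo)
print(X_train_prepared.shape, X_test_prepared.shape)
```

Both outputs have five columns here: two scaled numeric features plus one column for each of the three neighborhoods seen during fitting.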

3. Create Interaction Terms

Creating interaction terms involves generating new features that capture the interactions between existing features.

For our example dataset, we’ll generate interaction terms for ‘size’ and ‘num_rooms’ using PolynomialFeatures from scikit-learn:

from sklearn.preprocessing import PolynomialFeatures

# Creating interaction features: fit on the training set, then transform both sets
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_terms_train = poly.fit_transform(X_train[['size', 'num_rooms']])
interaction_terms_test = poly.transform(X_test[['size', 'num_rooms']])

# Keep the original row index so the new columns align with X_train and X_test
feature_names = poly.get_feature_names_out(['size', 'num_rooms'])
interaction_df_train = pd.DataFrame(interaction_terms_train, columns=feature_names, index=X_train.index)
interaction_df_test = pd.DataFrame(interaction_terms_test, columns=feature_names, index=X_test.index)

# Add only the new interaction column; 'size' and 'num_rooms' are already present
X_train = pd.concat([X_train, interaction_df_train[['size num_rooms']]], axis=1)
X_test = pd.concat([X_test, interaction_df_test[['size num_rooms']]], axis=1)

Creating interaction terms can improve your model by capturing relationships that depend on the combined effect of two or more features.

4. Create Indicator Variables

You can create indicator variables to flag certain conditions or mark thresholds in your data. These variables take on values of 0 or 1, indicating the absence or presence of a particular value.

For example, suppose you have a dataset to predict loan default with a large number of defaults on student loans. It can be helpful to create an ‘is_student’ feature from the ‘professions’ categorical column.
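A minimal sketch of the loan example follows. The `loans` dataframe and its values are hypothetical, and the code assumes that student status is recorded as the string 'student' in any letter case.

```python
import pandas as pd

# Hypothetical loan data; note the inconsistent casing and a missing value
loans = pd.DataFrame({
    'professions': ['student', 'engineer', 'Student', 'teacher', None],
    'defaulted': [1, 0, 1, 0, 0],
})

# Normalize the text, compare against the category of interest, and cast to 0/1
loans['is_student'] = (
    loans['professions'].fillna('').str.lower().eq('student').astype(int)
)

print(loans)
```

Missing professions are treated as "not a student" here; whether that is appropriate depends on how the data was collected.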

In the housing dataset, we can create an indicator variable to denote if the houses are over 30 years old and create a count plot on the indicator variable ‘age_indicator’:

import seaborn as sns
import matplotlib.pyplot as plt

# Creating an indicator variable for houses older than 30 years
X_train['age_indicator'] = (X_train['age'] > 30).astype(int)
X_test['age_indicator'] = (X_test['age'] > 30).astype(int)

# Visualize the indicator variables
plt.figure(figsize=(10, 6))
sns.countplot(x='age_indicator', data=X_train)
plt.title('Count of Houses Based on Age Indicator (>30 years)')
plt.xlabel('Age Indicator')
plt.ylabel('Count')
plt.show()

You can create an indicator variable from the number of rooms, the ‘num_rooms’ column, as well. As the example shows, indicator variables can encode additional information for machine learning models.
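For instance, an indicator on 'num_rooms' might flag houses with many rooms. In this sketch, the threshold of 5 rooms and the column name `many_rooms` are arbitrary choices made for illustration.

```python
import pandas as pd

# Hypothetical toy data
rooms = pd.DataFrame({'num_rooms': [2, 3, 5, 7, 4, 6]})

# Flag houses with 5 or more rooms (the threshold is an assumption)
rooms['many_rooms'] = (rooms['num_rooms'] >= 5).astype(int)

print(rooms)
```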

5. Create More Representative Features with Binning

Binning features into buckets involves grouping continuous variables into discrete intervals. Sometimes grouping features like age and income into bins can help find patterns that are hard to identify within continuous data.

For the example housing dataset, let’s bin the age of the house and income of the household into different bins with descriptive labels. You can use the cut() function in pandas to bin features into equal-width intervals, and the qcut() function to bin them into quantile-based intervals, like so:

# Creating age bins (equal-width intervals)
X_train['age_bin'] = pd.cut(X_train['age'], bins=3, labels=['new', 'moderate', 'old'])
X_test['age_bin'] = pd.cut(X_test['age'], bins=3, labels=['new', 'moderate', 'old'])

# Creating income bins (quartiles; the q argument belongs to qcut(), not cut())
X_train['income_bin'] = pd.qcut(X_train['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])
X_test['income_bin'] = pd.qcut(X_test['income'], q=4, labels=['low', 'medium', 'high', 'very_high'])

Binning continuous features into discrete intervals simplifies their representation and can help a model pick up non-linear, threshold-like patterns, though it also discards the variation within each bin.
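When bins are computed separately on the train and test sets, the same label can end up covering different value ranges in each set. One possible refinement, sketched below on hypothetical income values, is to learn the bin edges from the training data and reuse them for the test data. Widening the outer edges to infinity is an assumption made so that out-of-range test values still receive a label.

```python
import numpy as np
import pandas as pd

# Hypothetical income values
train_income = pd.Series([30000, 45000, 52000, 60000, 68000, 75000, 90000, 120000])
test_income = pd.Series([20000, 55000, 150000])

labels = ['low', 'medium', 'high', 'very_high']

# Learn quartile edges on the training data; retbins=True also returns the edges
train_bins, edges = pd.qcut(train_income, q=4, labels=labels, retbins=True)

# Widen the outer edges so test values outside the training range still get a bin
edges = np.asarray(edges, dtype=float).copy()
edges[0], edges[-1] = -np.inf, np.inf

# Apply the training edges to the test data
test_bins = pd.cut(test_income, bins=edges, labels=labels)

print(train_bins.tolist())
print(test_bins.tolist())   # ['low', 'medium', 'very_high']
```

With this approach, 'low' means the same income range in both sets, which keeps the binned feature consistent at prediction time.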

Summary

In this guide, we went over the following tips for effective feature engineering:

Perform EDA and use visualizations to understand your data.
Preprocess effectively by handling missing values, encoding categorical variables, removing outliers, and ensuring a proper train-test split.
Create interaction terms that combine features to capture meaningful interactions.
Create indicator variables as needed, based on thresholds and specific values, to capture key categorical information.
Bin features into buckets or discrete intervals to create more representative features.

Be sure to test out these feature engineering tips in your next machine learning project. Happy feature engineering!

2 Responses to Tips for Effective Feature Engineering in Machine Learning

Jose Martinez August 10, 2024 at 11:39 pm #

Hi Bala!
Great article! Thank you so much for sharing…
In your point 5. Create More Representative Features with Binning:
where is written:
# Creating income bins
X_train[‘age_bin’] = pd.cut(X_train[‘age’], bins=3, labels=[‘new’, ‘moderate’, ‘old’])
X_test[‘age_bin’] = pd.cut(X_test[‘age’], bins=3, labels=[‘new’, ‘moderate’, ‘old’])
I believe there is a typo, as you meant: “# Creating age bins”.
Also, your instruction: “X_train[‘age_bin’] = pd.cut(X_train[‘age’], bins=3, labels=[‘new’, ‘moderate’, ‘old’])” raised an error in Pandas: “putmask: first argument must be an array”.
Thank you again.
Jose

- James Carmichael August 11, 2024 at 6:02 am #
  
  Hi Jose…You’re correct that the comment should be about creating “age bins” rather than “income bins.” Thank you for pointing that out. The error you’re encountering might be due to the use of non-ASCII characters for the quotes in the code. Pandas expects standard ASCII quotes for specifying column names.
  
  Here’s the corrected code:
  
  # Creating age bins
  X_train['age_bin'] = pd.cut(X_train['age'], bins=3, labels=['new', 'moderate', 'old'])
  X_test['age_bin'] = pd.cut(X_test['age'], bins=3, labels=['new', 'moderate', 'old'])
  
  This should work without raising any errors. Make sure you’re using straight single quotes (') instead of any stylized or curly quotes (‘’), as this can cause issues in Python.
  
  If you continue to experience issues, double-check that the age column exists in your X_train and X_test DataFrames and that the data type of the age column is numeric.
