Dealing with Missing Data Strategically: Advanced Imputation Techniques in Pandas and Scikit-learn

By Iván Palomares Carrascosa on June 7, 2025 in Data Science

Image by Author | Ideogram

Introduction

Missing values are common in real-world datasets. Instances may lack values in one or several attributes for various reasons, such as human error, corrupted data, or incomplete data collection processes (for example, surveys with optional fields). Basic strategies exist for dealing with instances or attributes that contain missing values, like removing rows or columns entirely, or imputing missing values with a default value (typically the mean or median of the attribute), but these strategies are sometimes not sufficient.
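
For reference, the basic strategies mentioned above can be sketched in a few lines. The tiny DataFrame and its column names (age, income) are made up for illustration and are not the dataset used in the rest of this article.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not the article's dataset)
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30000.0, 42000.0, np.nan, 50000.0],
})

# Strategy 1: drop every row that contains at least one missing value
dropped = toy.dropna()

# Strategy 2: fill each column with its own median using Pandas
median_filled = toy.fillna(toy.median())

# Same idea with Scikit-learn, here using the column mean
mean_imputer = SimpleImputer(strategy="mean")
mean_filled = pd.DataFrame(
    mean_imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

print(dropped.shape)
print(median_filled)
print(mean_filled)
```

Dropping rows discards half of this toy dataset, while mean and median imputation ignore any relationship between age and income. Those are the limitations the advanced techniques below try to address.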

This article presents some advanced strategies to handle missing data, namely imputation techniques made possible through the combined use of the Pandas and Scikit-learn libraries in Python.

Using a Synthetic Employees Dataset

To demonstrate some advanced strategies to impute missing values depending on the specific context and problem needs, we will use a synthetically created dataset (by me!), which you can easily load from the URL specified in the code below:

import pandas as pd

url = 'https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/employees_dataset_with_missing.csv'
df = pd.read_csv(url)
print(f"Loaded dataset shape: {df.shape}")
print(f"Missing values per column:\n{df.isnull().sum()}")
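
Before imputing, it can also help to look at the share of missing values per column and at the column types, since the Scikit-learn imputers used in this article operate on numeric data. The sketch below uses a small stand-in DataFrame with hypothetical columns rather than the dataset loaded above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded DataFrame
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [30000.0, np.nan, np.nan, 50000.0],
    "department": ["sales", "hr", None, "it"],
})

# Percentage of missing values per column
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))

# Columns that numeric imputers can process directly
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

Non-numeric columns, if a dataset has any, would need to be encoded or handled separately before being passed to these imputers.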

Multiple Imputation by Chained Equations

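Multiple Imputation by Chained Equations (MICE) is an iterative imputation method: each attribute with missing values is modeled as a function of the other attributes, and the estimates are refined by cycling through the attributes over several rounds. Scikit-learn's IterativeImputer, which is inspired by MICE, accepts a variety of estimators for this purpose, such as random forest or Bayesian ridge regression. By default, Bayesian ridge regression is used.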

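# IterativeImputer is experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Note: this assumes every column in df is numeric
iterative_imputer = IterativeImputer(random_state=42, max_iter=10)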
df_iterative = pd.DataFrame(
    iterative_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index
)

print("\n1. Iterative Imputation (MICE):")
print(f"Full dataset shape: {df_iterative.shape}")
print(f"Number of missing values: {df_iterative.isnull().sum().sum()}")

The output shows that no missing values remain: all of them have been imputed.
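
A single call to IterativeImputer returns one completed dataset. Scikit-learn's documentation notes that something closer to multiple imputation can be obtained by setting sample_posterior=True and repeating the imputation with different random seeds. The following is a minimal sketch of that idea on synthetic data; the columns x and y and the choice of three draws are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data (hypothetical): y depends roughly linearly on x
rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=200)
y = 2.0 * x + rng.normal(0, 5, size=200)
toy = pd.DataFrame({"x": x, "y": y})
toy.loc[toy.index[:20], "y"] = np.nan  # hide the first 20 values of y

# Draw several plausible completed datasets by sampling from the
# estimator's predictive distribution with different seeds
completed = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
    completed.append(
        pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)
    )

# Spread of the imputed values across draws reflects imputation uncertainty
imputed_y = np.column_stack([d.loc[toy.index[:20], "y"] for d in completed])
print(imputed_y.std(axis=1).round(2))
```

Each completed dataset can then be analyzed separately and the results pooled, which is the usual motivation for multiple imputation.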

You can specify which estimator to use instead of Bayesian ridge, for instance, random forest regressors, as follows:

from sklearn.ensemble import RandomForestRegressor

rf_iterative_imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    random_state=42,
    max_iter=5
)
df_rf_iterative = pd.DataFrame(
    rf_iterative_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index
)

df_rf_iterative.head()

Output: [Figure: dataset sample with imputed missing values]

K-Nearest Neighbor Imputation

Like the standard K-nearest neighbors (K-NN) algorithm, this approach relies on similarity among samples: the missing value of an attribute in a given instance is estimated from the values of that attribute in the most similar instances. Neighbors can be weighted by distance, and custom distance metrics can also be used.

This example uses K=5: for each missing value, the imputer looks for the five nearest instances that have a known value for the affected attribute. With weights='distance', the contribution of each neighbor to the estimate is inversely proportional to its distance from the target instance:

from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5, weights='distance')
df_knn = pd.DataFrame(
    knn_imputer.fit_transform(df),
    columns=df.columns,
    index=df.index
)

print("\n2. KNN Imputation:")
print(f"Using {knn_imputer.n_neighbors} nearest neighbors")
print(f"Remaining missing values: {df_knn.isnull().sum().sum()}")

Alternatively, setting weights='uniform' gives all selected neighbors (ten in this example) equal weight when estimating each missing value:

knn_uniform = KNNImputer(n_neighbors=10, weights='uniform')
df_knn_uniform = pd.DataFrame(
    knn_uniform.fit_transform(df),
    columns=df.columns,
    index=df.index
)

print(f"Remaining missing values: {df_knn_uniform.isnull().sum().sum()}")
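
To make the difference between the two weighting schemes concrete, here is a small worked example on a made-up array (not the employees dataset), using K=2 so the arithmetic can be checked by hand. The last row is missing its second feature, and its two nearest donors on the first feature are the rows with values 10 and 20.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy array: the last row is missing its second feature
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [4.0, 40.0],
    [1.2, np.nan],
])

# With K=2 the donors are rows 0 and 1 (closest on the first feature)
uniform = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)
distance = KNNImputer(n_neighbors=2, weights="distance").fit_transform(X)

print(uniform[3, 1])   # plain average of 10 and 20
print(distance[3, 1])  # pulled toward 10, the value of the closer neighbor
```

With uniform weights the estimate is the plain average, 15. With distance weights the two donors are weighted in the ratio 1/0.2 to 1/0.8 (their gaps on the first feature), which gives (10 × 5 + 20 × 1.25) / 6.25 = 12.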

Imputation With Multiple Estimators (Ensemble)

Another strategy is to build multiple imputing estimators of different types, each yielding a different version of the full dataset with imputed values. Then, by inspecting each dataset and focusing on the most critical attributes that contained missing values, we can pick one version or aggregate two or more of them, depending on which estimator (or estimators) provides the most realistic or consistent imputations for the specific context of the data.

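from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import BayesianRidge

imputers = {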
    'bayesian_ridge': IterativeImputer(estimator=BayesianRidge(), random_state=42),
    'extra_trees': IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=42), random_state=42),
    'rf_regressor': IterativeImputer(estimator=RandomForestRegressor(n_estimators=10, random_state=42), random_state=42)
}

imputed_datasets = {}
for name, imputer in imputers.items():
    imputed_datasets[name] = pd.DataFrame(
        imputer.fit_transform(df), 
        columns=df.columns,
        index=df.index
    )

print("\n3. Imputed Dataset Versions based on Different Estimators:")
for name, dataset in imputed_datasets.items():
    print(f"{name}: Mean income = ${dataset['income'].mean():.2f}")
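
The aggregation step mentioned above can be sketched as an element-wise average of the imputed versions. This is only one possible aggregation, shown here on synthetic data with hypothetical column names and with two imputers chosen for brevity; it is not a method prescribed by Scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

# Synthetic stand-in data (hypothetical column names)
rng = np.random.default_rng(42)
age = rng.uniform(20, 60, size=100)
income = 1000 * age + rng.normal(0, 2000, size=100)
toy = pd.DataFrame({"age": age, "income": income})
toy.loc[toy.index[::10], "income"] = np.nan  # hide every 10th income
was_missing = toy["income"].isnull()

versions = {
    "iterative": IterativeImputer(random_state=42),
    "knn": KNNImputer(n_neighbors=5),
}
imputed = {
    name: pd.DataFrame(imp.fit_transform(toy), columns=toy.columns, index=toy.index)
    for name, imp in versions.items()
}

# Element-wise mean across versions: observed cells stay unchanged,
# imputed cells become the average of the candidate imputations
stacked = np.stack([d.to_numpy() for d in imputed.values()])
aggregated = pd.DataFrame(stacked.mean(axis=0), columns=toy.columns, index=toy.index)

# How much the versions disagree on the cells that were missing
disagreement = pd.Series(stacked.std(axis=0)[:, 1], index=toy.index)[was_missing]
print(aggregated.loc[was_missing, "income"].head())
print(disagreement.describe())
```

Large disagreement on a particular cell is a hint that its imputed value is uncertain and may deserve a closer look.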

Wrapping Up

This table summarizes the main features of each of the three approaches explored, suggesting when to use (or to avoid) each of them.

[Table: the main features of the three imputation approaches explored, and when to use one imputation strategy or another]

In summary, KNN imputation works well for smaller numerical datasets, but it becomes computationally expensive as datasets grow. Ensemble estimators tend to provide the best quality, at the cost of being the most complex and computationally expensive approach. MICE is often a balanced choice suitable for a variety of scenarios.
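
This guidance is general, and one way to check it on a specific problem is to hide some known values, impute them, and measure the error against the truth. The sketch below does this on synthetic data; the columns, the 20% masking rate, and the use of RMSE as the score are all illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Synthetic, fully observed data (hypothetical columns)
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=300)
income = 1000 * age + rng.normal(0, 3000, size=300)
full = pd.DataFrame({"age": age, "income": income})

# Hide about 20% of the income values at random, keeping the truth for scoring
mask = rng.random(300) < 0.2
holed = full.copy()
holed.loc[mask, "income"] = np.nan

candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5, weights="distance"),
    "iterative": IterativeImputer(random_state=42, max_iter=10),
}

# Root mean squared error on the hidden cells only
rmse = {}
for name, imputer in candidates.items():
    filled = imputer.fit_transform(holed)
    errors = filled[mask, 1] - full.loc[mask, "income"].to_numpy()
    rmse[name] = float(np.sqrt(np.mean(errors ** 2)))

for name, score in rmse.items():
    print(f"{name}: RMSE = {score:.0f}")
```

On real data the ranking may differ from one attribute to another, so it is worth scoring the attributes that matter most for the task at hand.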
