From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles

By Vinod Chugani on February 28, 2025 in Intermediate Data Science

This post explores tree-based models, focusing on decision trees, bagging, and random forests applied to the Ames Housing dataset. It begins with preprocessing, the step that puts the data in the form these models require. Moving from a single decision tree to an ensemble of trees then shows how much multiple trees can improve predictive performance. Along the way, we cover model evaluation and practical strategies for refining real estate price predictions.

Let’s get started.

Photo by Steven Kamenar. Some rights reserved.

Overview

This post is divided into four parts; they are:

Laying the Groundwork: Preprocessing Techniques for Tree-Based Models
Assessing the Basics: Decision Tree Regressor Evaluation
Improving Predictions: Introduction to Bagging with Decision Trees
Advanced Ensembles: Comparing Bagging and Random Forest Regressors

Laying the Groundwork: Preprocessing Techniques for Tree-Based Models

Preprocessing is crucial in any data science workflow, especially when dealing with tree-based models. The first part of this post brings together essential techniques covered in earlier discussions (such as ordinal encoding from the post Decision Trees and Ordinal Encoding: A Practical Guide, one-hot encoding, and various imputation methods) to prepare our dataset for tree-based modeling. To illustrate these principles in action, let’s walk through a practical example that applies these preprocessing techniques to the Ames Housing dataset.

# Import necessary libraries for preprocessing
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Convert the below numeric features to categorical features
Ames['MSSubClass'] = Ames['MSSubClass'].astype('object')
Ames['YrSold'] = Ames['YrSold'].astype('object')
Ames['MoSold'] = Ames['MoSold'].astype('object')

# Exclude 'PID' and 'SalePrice' from features and specifically handle the 'Electrical' column
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns.difference(['Electrical'])
electrical_feature = ['Electrical']

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],  # Electrical system
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # General shape of property
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],  # Type of utilities available
    'LandSlope': ['Sev', 'Mod', 'Gtl'],  # Slope of property
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the quality of the material on the exterior
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the present condition of the material on the exterior
    'BsmtQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Height of the basement
    'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # General condition of the basement
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],  # Walkout or garden level basement walls
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of basement finished area
    'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of second basement finished area
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Heating quality and condition
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Kitchen quality
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],  # Home functionality
    'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Fireplace quality
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],  # Interior finish of the garage
    'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage quality
    'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage condition
    'PavedDrive': ['N', 'P', 'Y'],  # Paved driveway
    'PoolQC': ['None', 'Fa', 'TA', 'Gd', 'Ex'],  # Pool quality
    'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']  # Fence quality
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feature for feature in ordinal_features if feature != 'Electrical']

# Helper function to fill 'None' for missing categorical data
def fill_none(X):
    return X.fillna("None")

# Pipeline for 'Electrical': Fill missing value with mode then apply ordinal encoding
electrical_transformer = Pipeline(steps=[
    ('impute_electrical', SimpleImputer(strategy='most_frequent')),
    ('ordinal_electrical', OrdinalEncoder(categories=[ordinal_order['Electrical']]))
])

# Pipeline for numeric features: Impute missing values using mean
numeric_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean'))
])

# Pipeline for ordinal features: Fill missing values with 'None' then apply ordinal encoding
ordinal_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[feature] for feature in ordinal_except_electrical]))
])

# Pipeline for nominal categorical features: Fill missing values with 'None' then apply one-hot encoding
nominal_features = [feature for feature in categorical_features if feature not in ordinal_features]
categorical_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ('electrical', electrical_transformer, ['Electrical']),
        ('num', numeric_transformer, numeric_features),
        ('ordinal', ordinal_transformer, ordinal_except_electrical),
        ('nominal', categorical_transformer, nominal_features)
])

# Apply the preprocessing pipeline to Ames
transformed_data = preprocessor.fit_transform(Ames).toarray()

# Generate column names for the one-hot encoded features
onehot_features = preprocessor.named_transformers_['nominal'].named_steps['onehot'].get_feature_names_out()

# Combine all feature names
all_feature_names = ['Electrical'] + list(numeric_features) + list(ordinal_except_electrical) + list(onehot_features)

# Convert the transformed array to a DataFrame
transformed_df = pd.DataFrame(transformed_data, columns=all_feature_names)

With our data loaded and initial transformations in place, we now have a structured approach to handle missing values and encode our categorical variables appropriately. The following summary outlines the key preprocessing tasks we have accomplished, setting a solid foundation for the upcoming modeling stages.

Data Categorization:
- Convert “MSSubClass”, “YrSold”, and “MoSold” from numeric to categorical data types to reflect their actual data characteristics.
Exclusion of Irrelevant Features:
- Exclude “PID” (a unique identifier) and “SalePrice” (the prediction target) from the feature set so that only predictors remain.
Handling Missing Values:
- Numeric features: Impute missing values with the column mean, which preserves each feature’s average.
- Categorical features: Fill in missing values with the string “None” for all categorical features except “Electrical”, based on guidance provided by the data dictionary.
- Electrical feature: Use the mode to impute the one missing value, based on the guidance provided by the data dictionary.
Encoding Categorical Data:
- Ordinal features: Encode with a predefined order that respects the inherent ranking in the data (like “ExterQual” from poor to excellent).
- Nominal features: Apply one-hot encoding to transform these into a format suitable for modeling, creating binary columns for each category.
Pipelines for Streamlined Processing:
- Separate pipelines for numeric, ordinal, and nominal features, plus one for “Electrical”, to streamline transformations and ensure consistent application across the dataset.
Combined Preprocessing:
- Use a ColumnTransformer to apply all pipelines in a single step, enhancing the efficiency and manageability of the data transformation process.
Transformation Application and Result Inspection:
- Apply the preprocessing pipeline to the dataset, convert the transformed array back to a DataFrame, and systematically name the columns, especially after one-hot encoding, for easy identification and analysis.

Observing the transformed DataFrame below gives us a clear view of how our preprocessing steps have altered the data. This transformation ensures that each feature is appropriately formatted and ready for the next steps in our analysis. Notice how each category and numerical feature has been handled to retain the most information possible.

# # Optional command for expanded view
# pd.set_option('display.max_columns', None)

# View the transformation
print(transformed_df)

      Electrical  GrLivArea  LotFrontage  ...  YrSold_2008  YrSold_2009  YrSold_2010
0            4.0      856.0    68.510628  ...          0.0          0.0          1.0
1            4.0     1049.0    42.000000  ...          0.0          1.0          0.0
2            4.0     1001.0    60.000000  ...          0.0          0.0          0.0
3            4.0     1039.0    80.000000  ...          0.0          1.0          0.0
4            4.0     1665.0    70.000000  ...          0.0          1.0          0.0
...          ...        ...          ...  ...          ...          ...          ...
2574         2.0      952.0    68.510628  ...          0.0          1.0          0.0
2575         3.0     1733.0    68.510628  ...          0.0          1.0          0.0
2576         3.0     2002.0    82.000000  ...          0.0          0.0          0.0
2577         4.0     1842.0    68.510628  ...          0.0          0.0          0.0
2578         4.0     1911.0    80.000000  ...          0.0          0.0          0.0

[2579 rows x 2819 columns]

The original dataset is now expanded to 2819 columns. We can perform the quick calculation below to cross-check the number of columns after transformation.

# Quick way to cross-check number of columns after preprocessing
print(len(numeric_features) + len(ordinal_features) + Ames[nominal_features].fillna("None").nunique().sum())

This quick validation shows us the total number of features after preprocessing, confirming that all transformations have been applied correctly.

2819

Ensuring the integrity of our data at this stage is crucial for building reliable models.
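The same column-count check can be tried on a small scale. The sketch below uses a hypothetical three-column frame and a simplified two-branch ColumnTransformer (not the full preprocessor above) to show why the width equals the number of numeric features plus the number of distinct categories across the nominal features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Toy frame (hypothetical values) with one numeric and two nominal columns
toy = pd.DataFrame({
    'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
    'Street': ['Pave', 'Grvl', 'Pave', 'Pave'],
    'Alley': ['None', 'Grvl', 'Pave', 'None'],
})
numeric_cols = ['LotArea']
nominal_cols = ['Street', 'Alley']

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='mean'), numeric_cols),
    ('nominal', OneHotEncoder(handle_unknown='ignore'), nominal_cols),
])
transformed = toy_preprocessor.fit_transform(toy)

# Expected width: one column per numeric feature plus one per nominal category
expected_width = len(numeric_cols) + toy[nominal_cols].nunique().sum()
print(transformed.shape[1], expected_width)
```

If the two numbers ever disagree, a column has probably been routed to the wrong transformer or dropped by the ColumnTransformer.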

Assessing the Basics: Decision Tree Regressor Evaluation

In the second part of this post, we focus on evaluating the performance of a basic Decision Tree model by building on our foundation above:

# Build on previous blocks of code
# Import additional necessary libraries for modeling and evaluation
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Define the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(random_state=42))
])

# Evaluate the model using cross-validation
scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Output the result
print("Decision Tree Regressor Mean CV R²:", round(scores.mean(),4))

By applying cross-validation, we obtain a benchmark for comparing the more complex models in the later parts of this post:

Decision Tree Regressor Mean CV R²: 0.7663

An R² score of 0.7663 indicates that our model explains approximately 77% of the variability in housing prices, which is a good (but not great) starting point. This foundational performance will help us appreciate the incremental benefits offered by more sophisticated ensemble methods that we will explore next.
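For reference, when no scoring argument is passed, cross_val_score uses the regressor's own score method, which is R², and averages it over five folds by default. The sketch below computes R² by hand on a few hypothetical prices (not Ames data); the function name r_squared is our own.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical sale prices and predictions, for illustration only
actual = [100_000, 150_000, 200_000, 250_000]
predicted = [110_000, 140_000, 210_000, 240_000]
score = r_squared(actual, predicted)
print(round(score, 4))
```

A model that always predicts the mean price scores 0, and a perfect model scores 1, which is why 0.7663 reads as "about 77% of the variability explained".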

Improving Predictions: Introduction to Bagging with Decision Trees

Building on our initial model, this part explores how Bagging can improve predictive performance. Bagging, or Bootstrap Aggregating, is an ensemble technique that improves stability and accuracy by reducing variance and curbing overfitting. Rather than cloning the same decision tree multiple times, Bagging trains each tree on a different bootstrap sample of the dataset. These samples are drawn with replacement, so each tree learns from a slightly different slice of the data, which gives the ensemble diverse perspectives. We will compare a single Decision Tree with a Bagging Regressor that uses multiple trees, demonstrating the power of ensemble learning:

# Import Bagging Regressor and build on previous blocks of code
# Compare how performance is affected by Bagging (i.e. increasing number of trees)

from sklearn.ensemble import BaggingRegressor

models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
    # Note: `estimator` replaced `base_estimator` in scikit-learn 1.2
    'Bagging Regressor (10 Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                                          n_estimators=10, random_state=42)
}

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:", results)

By leveraging multiple decision trees, Bagging raises the mean cross-validated R² by about 0.11 (from 0.7663 to 0.8781) over the single Decision Tree, demonstrating how ensemble methods can enhance model performance.

Cross-validation scores: {'Decision Tree (1 Tree)': 0.7663, 'Bagging Regressor (10 Trees)': 0.8781}
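The bootstrap sampling behind these Bagging results can be sketched directly with NumPy. This is an illustration of the idea rather than scikit-learn's internal code; it assumes each sample is as large as the original data, which matches BaggingRegressor's default settings, and the variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 2579      # number of rows in the transformed Ames data shown earlier
n_samples = 10     # one bootstrap sample per tree

unique_fractions = []
for _ in range(n_samples):
    # Draw row indices with replacement, same size as the original data
    sample_idx = rng.integers(0, n_rows, size=n_rows)
    unique_fractions.append(np.unique(sample_idx).size / n_rows)

# Average share of distinct rows seen by each tree
mean_unique = float(np.mean(unique_fractions))
print(round(mean_unique, 3))
```

Each sample contains roughly 63% of the distinct rows, with the remainder being repeats, so every tree sees a somewhat different view of the data.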

To further investigate this, we will examine how performance varies with different numbers of trees in the ensemble:

# Build on previous blocks of code
# Compare how performance is affected by Bagging in increments of 10 trees

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42)
}

# Adding Bagging models for each tree count
for n in n_trees:
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),  # named `base_estimator` before scikit-learn 1.2
        n_estimators=n,
        random_state=42
    )

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")

As we increase the number of trees in the Bagging Regressor, we observe a significant initial improvement in the model’s performance. However, the marginal gains begin to plateau beyond a certain point. For instance, while the jump in R² from 1 to 20 trees is notable (0.7663 to 0.8898), the improvement from 20 to 100 trees is much smaller (0.8898 to 0.8957).

Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Bagging Regressor 20 Trees: 0.8898
Bagging Regressor 30 Trees: 0.8911
Bagging Regressor 40 Trees: 0.8922
Bagging Regressor 50 Trees: 0.8931
Bagging Regressor 60 Trees: 0.8933
Bagging Regressor 70 Trees: 0.8936
Bagging Regressor 80 Trees: 0.895
Bagging Regressor 90 Trees: 0.8954
Bagging Regressor 100 Trees: 0.8957

This trend demonstrates diminishing returns as the ensemble grows and highlights an important consideration in machine learning: beyond a certain ensemble size, the additional computational cost may not justify the minimal gains in performance.
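To put numbers on the plateau, the short sketch below computes the gain in mean CV R² at each step, using the scores printed above. The dictionary is copied from that output; nothing is re-trained.

```python
# Mean CV R² values copied from the output above
bagging_scores = {
    10: 0.8781, 20: 0.8898, 30: 0.8911, 40: 0.8922, 50: 0.8931,
    60: 0.8933, 70: 0.8936, 80: 0.8950, 90: 0.8954, 100: 0.8957,
}
single_tree = 0.7663

# Gain over the previous configuration (the first entry compares to one tree)
gains = {}
previous = single_tree
for count, score in bagging_scores.items():
    gains[count] = round(score - previous, 4)
    previous = score

for count, gain in gains.items():
    print(f"{count:>3} trees: {gain:+.4f}")
```

The first ten trees account for almost all of the improvement; every step after 20 trees adds less than 0.002.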

Advanced Ensembles: Comparing Bagging and Random Forest Regressors

In the final part of this post, we compare two popular ensemble methods: Bagging Regressors and Random Forests. Both build on the concept of ensemble learning explored in the previous sections, but they differ in how trees are constructed and combined.

Random Forest is an extension of the Bagging technique and involves creating many decision trees during training. As in Bagging, each tree is built on a bootstrap sample of the data, but Random Forest introduces another layer of randomness by considering only a random subset of features when splitting each node. This randomness helps create more diverse trees, which generally results in a model with better generalization capabilities.

Let’s assess and compare the performance of these two methods using the Ames Housing dataset, focusing on how increasing the number of trees affects the cross-validated R² score:

# Build on previous blocks of code
# Evaluate performance of Random Forest against Bagging Regressor

from sklearn.ensemble import RandomForestRegressor

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
}

# Adding Bagging and Random Forest models for each tree count
for n in n_trees:
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),  # named `base_estimator` before scikit-learn 1.2
        n_estimators=n,
        random_state=42
    )
    models[f'Random Forest {n} Trees'] = RandomForestRegressor(
        n_estimators=n,
        random_state=42
    )

results = {}
for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")

Examining the cross-validation scores reveals interesting patterns. Both Bagging and Random Forest models show significant improvements over a single Decision Tree, highlighting the strength of ensemble methods:

Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Random Forest 10 Trees: 0.8762
Bagging Regressor 20 Trees: 0.8898
Random Forest 20 Trees: 0.8893
Bagging Regressor 30 Trees: 0.8911
Random Forest 30 Trees: 0.8897
Bagging Regressor 40 Trees: 0.8922
Random Forest 40 Trees: 0.8909
Bagging Regressor 50 Trees: 0.8931
Random Forest 50 Trees: 0.8922
Bagging Regressor 60 Trees: 0.8933
Random Forest 60 Trees: 0.8931
Bagging Regressor 70 Trees: 0.8936
Random Forest 70 Trees: 0.8932
Bagging Regressor 80 Trees: 0.895
Random Forest 80 Trees: 0.8943
Bagging Regressor 90 Trees: 0.8954
Random Forest 90 Trees: 0.8948
Bagging Regressor 100 Trees: 0.8957
Random Forest 100 Trees: 0.8954

Interestingly, as we increase the number of trees, both methods show similar performance levels: the gap between them never exceeds 0.002 in mean R². One likely reason lies in the defaults used here. In recent scikit-learn releases, RandomForestRegressor sets max_features=1.0, so every split considers all features and the model behaves much like bagged decision trees unless max_features is lowered. The characteristics of the Ames Housing dataset may also limit the benefit of additional randomization. If the dataset has a few highly predictive features, random feature selection may not significantly enhance the model’s ability to generalize compared to Bagging, which uses all features.

These insights suggest that while Random Forest typically offers improvements over Bagging by reducing correlation between trees through its feature randomization, the specific dynamics of the dataset and the problem context can sometimes negate these advantages. Therefore, in cases where computational efficiency is a concern, Bagging might be preferred due to its simplicity and similar performance levels. This comparison underscores the importance of understanding the dataset and the modeling objectives when choosing between ensemble strategies.
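One way to probe the feature-randomization effect discussed above is to vary max_features explicitly. The sketch below uses synthetic data from make_regression as a stand-in, so its scores say nothing about Ames; the three max_features values are arbitrary choices for illustration, and the printed default depends on your installed scikit-learn version.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data (a stand-in, not the Ames dataset)
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# Inspect the default: 1.0 means every split considers all features
default_max_features = RandomForestRegressor().get_params()['max_features']
print("Default max_features:", default_max_features)

# Compare mean CV R² for different amounts of feature subsampling
rf_results = {}
for max_features in [1.0, 'sqrt', 0.33]:
    forest = RandomForestRegressor(n_estimators=50, max_features=max_features,
                                   random_state=42)
    rf_results[str(max_features)] = round(
        cross_val_score(forest, X, y).mean(), 4)

print(rf_results)
```

Swapping a setting such as max_features='sqrt' into the Random Forest models of the Ames pipeline would show whether feature subsampling helps or hurts on this particular dataset.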

Summary

This blog post provides a detailed exploration of tree-based modeling techniques using the Ames Housing dataset. It starts with essential preprocessing steps like encoding and handling missing values and progresses through the evaluation and enhancement of decision tree models using bagging. The narrative culminates in a comparative analysis of bagging and random forest regressors, highlighting the incremental benefits and performance comparisons as the number of trees is varied. Each section builds upon the last, offering practical examples and insights that lead to a fuller understanding of tree-based predictive modeling.

Specifically, you learned:

Preprocessing is crucial for tree-based models, involving techniques such as categorical conversion, handling missing values, and applying appropriate encodings.
Evaluating a basic Decision Tree model with cross-validation can provide a solid benchmark for assessing the performance of more complex tree-based models.
Using Bagging and Random Forest enhances Decision Tree performance, demonstrating significant improvements in prediction accuracy through ensemble techniques.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.
