This post explores tree-based models, focusing on decision trees, bagging, and random forests, applied to the Ames Housing dataset. It begins with preprocessing, a fundamental step that prepares the data for these models. Moving from a single decision tree to an ensemble of trees then shows how much multiple trees can improve predictive performance. As we work through model evaluation and enhancement, we aim to give you practical insights and strategies for refining your approach to machine learning and real estate price prediction.
Let’s get started.

From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles
Photo by Steven Kamenar. Some rights reserved.
Overview
This post is divided into four parts; they are:
- Laying the Groundwork: Preprocessing Techniques for Tree-Based Models
- Assessing the Basics: Decision Tree Regressor Evaluation
- Improving Predictions: Introduction to Bagging with Decision Trees
- Advanced Ensembles: Comparing Bagging and Random Forest Regressors
Laying the Groundwork: Preprocessing Techniques for Tree-Based Models
Preprocessing is crucial in any data science workflow, especially when dealing with tree-based models. The first part of this post brings together essential techniques covered in earlier discussions (such as ordinal encoding from the post Decision Trees and Ordinal Encoding: A Practical Guide, one-hot encoding, and various imputation methods) to ensure our dataset is thoroughly prepared for tree-based modeling. To illustrate these principles in action, let’s walk through a practical example that applies these preprocessing techniques to the Ames Housing dataset.
# Import necessary libraries for preprocessing
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Load the dataset
Ames = pd.read_csv('Ames.csv')

# Convert the below numeric features to categorical features
Ames['MSSubClass'] = Ames['MSSubClass'].astype('object')
Ames['YrSold'] = Ames['YrSold'].astype('object')
Ames['MoSold'] = Ames['MoSold'].astype('object')

# Exclude 'PID' and 'SalePrice' from features and specifically handle the 'Electrical' column
numeric_features = Ames.select_dtypes(include=['int64', 'float64']).drop(columns=['PID', 'SalePrice']).columns
categorical_features = Ames.select_dtypes(include=['object']).columns.difference(['Electrical'])
electrical_feature = ['Electrical']

# Manually specify the categories for ordinal encoding according to the data dictionary
ordinal_order = {
    'Electrical': ['Mix', 'FuseP', 'FuseF', 'FuseA', 'SBrkr'],  # Electrical system
    'LotShape': ['IR3', 'IR2', 'IR1', 'Reg'],  # General shape of property
    'Utilities': ['ELO', 'NoSeWa', 'NoSewr', 'AllPub'],  # Type of utilities available
    'LandSlope': ['Sev', 'Mod', 'Gtl'],  # Slope of property
    'ExterQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the quality of the material on the exterior
    'ExterCond': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Evaluates the present condition of the material on the exterior
    'BsmtQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Height of the basement
    'BsmtCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # General condition of the basement
    'BsmtExposure': ['None', 'No', 'Mn', 'Av', 'Gd'],  # Walkout or garden level basement walls
    'BsmtFinType1': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of basement finished area
    'BsmtFinType2': ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # Quality of second basement finished area
    'HeatingQC': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Heating quality and condition
    'KitchenQual': ['Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Kitchen quality
    'Functional': ['Sal', 'Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'],  # Home functionality
    'FireplaceQu': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Fireplace quality
    'GarageFinish': ['None', 'Unf', 'RFn', 'Fin'],  # Interior finish of the garage
    'GarageQual': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage quality
    'GarageCond': ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],  # Garage condition
    'PavedDrive': ['N', 'P', 'Y'],  # Paved driveway
    'PoolQC': ['None', 'Fa', 'TA', 'Gd', 'Ex'],  # Pool quality
    'Fence': ['None', 'MnWw', 'GdWo', 'MnPrv', 'GdPrv']  # Fence quality
}

# Extract list of ALL ordinal features from dictionary
ordinal_features = list(ordinal_order.keys())

# List of ordinal features except Electrical
ordinal_except_electrical = [feature for feature in ordinal_features if feature != 'Electrical']

# Helper function to fill 'None' for missing categorical data
def fill_none(X):
    return X.fillna("None")

# Pipeline for 'Electrical': Fill missing value with mode then apply ordinal encoding
electrical_transformer = Pipeline(steps=[
    ('impute_electrical', SimpleImputer(strategy='most_frequent')),
    ('ordinal_electrical', OrdinalEncoder(categories=[ordinal_order['Electrical']]))
])

# Pipeline for numeric features: Impute missing values using mean
numeric_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean'))
])

# Pipeline for ordinal features: Fill missing values with 'None' then apply ordinal encoding
ordinal_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('ordinal', OrdinalEncoder(categories=[ordinal_order[feature] for feature in ordinal_features if feature in ordinal_except_electrical]))
])

# Pipeline for nominal categorical features: Fill missing values with 'None' then apply one-hot encoding
nominal_features = [feature for feature in categorical_features if feature not in ordinal_features]
categorical_transformer = Pipeline(steps=[
    ('fill_none', FunctionTransformer(fill_none, validate=False)),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combined preprocessor for numeric, ordinal, nominal, and specific electrical data
preprocessor = ColumnTransformer(
    transformers=[
        ('electrical', electrical_transformer, ['Electrical']),
        ('num', numeric_transformer, numeric_features),
        ('ordinal', ordinal_transformer, ordinal_except_electrical),
        ('nominal', categorical_transformer, nominal_features)
    ])

# Apply the preprocessing pipeline to Ames
transformed_data = preprocessor.fit_transform(Ames).toarray()

# Generate column names for the one-hot encoded features
onehot_features = preprocessor.named_transformers_['nominal'].named_steps['onehot'].get_feature_names_out()

# Combine all feature names
all_feature_names = ['Electrical'] + list(numeric_features) + list(ordinal_except_electrical) + list(onehot_features)

# Convert the transformed array to a DataFrame
transformed_df = pd.DataFrame(transformed_data, columns=all_feature_names)
With our data loaded and initial transformations in place, we now have a structured approach to handle missing values and encode our categorical variables appropriately. The following summary outlines the key preprocessing tasks we have accomplished, setting a solid foundation for the upcoming modeling stages.
- Data Categorization:
  - Convert “MSSubClass”, “YrSold”, and “MoSold” from numeric to categorical data types to reflect their actual data characteristics.
- Exclusion of Irrelevant Features:
  - Remove “PID” and “SalePrice” from the features set to focus on the predictors and avoid including the unique identifier.
- Handling Missing Values:
  - Numeric features: Impute missing values with the mean, a simple choice that keeps each feature’s mean unchanged.
  - Categorical features: Fill in missing values with the string “None” for all categorical features except “Electrical”, based on guidance provided by the data dictionary.
  - Electrical feature: Use the mode to impute the one missing value, based on the guidance provided by the data dictionary.
- Encoding Categorical Data:
  - Ordinal features: Encode with a predefined order that respects the inherent ranking in the data (like “ExterQual” from poor to excellent).
  - Nominal features: Apply one-hot encoding to transform these into a format suitable for modeling, creating binary columns for each category.
- Pipelines for Streamlined Processing:
  - Separate pipelines for numeric, ordinal, and nominal features to streamline transformations and ensure consistent application across the dataset.
- Combined Preprocessing:
  - Use a ColumnTransformer to apply all pipelines in a single step, enhancing the efficiency and manageability of the data transformation process.
- Transformation Application and Result Inspection:
  - Apply the preprocessing pipeline to the dataset, convert the transformed array back to a DataFrame, and systematically name the columns, especially after one-hot encoding, for easy identification and analysis.
Observing the transformed DataFrame below gives us a clear view of how our preprocessing steps have altered the data. This transformation ensures that each feature is appropriately formatted and ready for the next steps in our analysis. Notice how each categorical and numerical feature has been handled to retain as much information as possible.
# Optional command for expanded view
# pd.set_option('display.max_columns', None)

# View the transformation
print(transformed_df)
      Electrical  GrLivArea  LotFrontage  ...  YrSold_2008  YrSold_2009  YrSold_2010
0            4.0      856.0    68.510628  ...          0.0          0.0          1.0
1            4.0     1049.0    42.000000  ...          0.0          1.0          0.0
2            4.0     1001.0    60.000000  ...          0.0          0.0          0.0
3            4.0     1039.0    80.000000  ...          0.0          1.0          0.0
4            4.0     1665.0    70.000000  ...          0.0          1.0          0.0
...          ...        ...          ...  ...          ...          ...          ...
2574         2.0      952.0    68.510628  ...          0.0          1.0          0.0
2575         3.0     1733.0    68.510628  ...          0.0          1.0          0.0
2576         3.0     2002.0    82.000000  ...          0.0          0.0          0.0
2577         4.0     1842.0    68.510628  ...          0.0          0.0          0.0
2578         4.0     1911.0    80.000000  ...          0.0          0.0          0.0

[2579 rows x 2819 columns]
The original dataset has now expanded to 2819 columns. We can perform the quick calculation below to cross-check the number of columns after the transformation.
# Quick way to cross-check number of columns after preprocessing
print(len(numeric_features) + len(ordinal_features) + Ames[nominal_features].fillna("None").nunique().sum())
This quick validation shows us the total number of features after preprocessing, confirming that all transformations have been applied correctly.
2819
Ensuring the integrity of our data at this stage is crucial for building reliable models.
Assessing the Basics: Decision Tree Regressor Evaluation
In the second part of this post, we focus on evaluating the performance of a basic Decision Tree model by building on our foundation above:
# Build on previous blocks of code
# Import additional necessary libraries for modeling and evaluation
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Define the full model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', DecisionTreeRegressor(random_state=42))
])

# Evaluate the model using cross-validation
scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

# Output the result
print("Decision Tree Regressor Mean CV R²:", round(scores.mean(),4))
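Here cross_val_score is called with its defaults, which means 5-fold cross-validation scored with the regressor’s own score method (R²). The mean of the fold scores gives us a benchmark for comparing the more complex models in the following parts of this post: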
Decision Tree Regressor Mean CV R²: 0.7663
An R² score of 0.7663 indicates that our model explains approximately 77% of the variability in housing prices, which is a good (but not great) starting point. This foundational performance will help us appreciate the incremental benefits offered by more sophisticated ensemble methods that we will explore next.
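To see what that number measures, the sketch below computes R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The four prices and predictions are made-up values chosen for illustration, and `r_squared` is a hypothetical helper, not part of the pipeline above.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and model predictions (not from the Ames data)
actual = [200_000, 150_000, 320_000, 275_000]
predicted = [210_000, 140_000, 300_000, 280_000]

score = r_squared(actual, predicted)
baseline = r_squared(actual, [sum(actual) / len(actual)] * len(actual))

print(f"Model R²: {score:.4f}")
print(f"Predict-the-mean R²: {baseline:.4f}")
```

A model that always predicts the mean price scores 0, and a perfect model scores 1, so 0.7663 sits roughly three quarters of the way between the two.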
Improving Predictions: Introduction to Bagging with Decision Trees
Building on our initial model, this part explores the enhancement of predictive performance through Bagging. Bagging, or Bootstrap Aggregating, is an ensemble technique that aims to improve stability and accuracy by effectively reducing variance and preventing overfitting. Unlike simply cloning the same decision tree multiple times, Bagging involves creating multiple trees where each tree is trained on a different bootstrap sample of the dataset. These samples are drawn with replacement, meaning each tree learns from slightly varied slices of the data, ensuring diversity in the models’ perspectives. We will compare the effectiveness of a single Decision Tree with a Bagging Regressor that uses multiple trees, demonstrating the power of ensemble learning:
# Import Bagging Regressor and build on previous blocks of code
# Compare how performance is affected by Bagging (i.e. increasing number of trees)
from sklearn.ensemble import BaggingRegressor

# Note: the keyword is `estimator` in scikit-learn 1.2 and later;
# older releases call it `base_estimator`
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
    'Bagging Regressor (10 Trees)': BaggingRegressor(estimator=DecisionTreeRegressor(random_state=42),
                                                     n_estimators=10, random_state=42)
}

results = {}

for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:", results)
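By leveraging multiple decision trees, Bagging raises the mean cross-validated R² from 0.7663 to 0.8781, a gain of about 0.11 (roughly 11 percentage points of explained variance) over the single Decision Tree, demonstrating how ensemble methods can enhance model performance.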
Cross-validation scores: {'Decision Tree (1 Tree)': 0.7663, 'Bagging Regressor (10 Trees)': 0.8781}
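The bootstrap sampling that gives each tree its own view of the data can be sketched with the standard library alone. This is an illustration of sampling with replacement, not a reproduction of how BaggingRegressor draws its samples internally; the seed is arbitrary, and the row count simply matches the 2579 rows shown earlier.

```python
import random

rng = random.Random(42)       # arbitrary seed, for repeatability only
n_rows = 2579                 # same number of rows as the Ames data above
row_ids = list(range(n_rows))

# Draw two bootstrap samples: same size as the data, sampled with replacement
bootstrap = rng.choices(row_ids, k=n_rows)
second = rng.choices(row_ids, k=n_rows)

# Because of replacement, some rows repeat and others are left out entirely
unique_fraction = len(set(bootstrap)) / n_rows

print(f"Rows in sample: {len(bootstrap)}")
print(f"Share of distinct rows: {unique_fraction:.3f}")
print(f"Samples identical: {bootstrap == second}")
```

On average a bootstrap sample contains about 63% of the distinct rows, so every tree sees a noticeably different training set even though all samples come from the same data.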
To further investigate this, we will examine how performance varies with different numbers of trees in the ensemble:
# Build on previous blocks of code
# Compare how performance is affected by Bagging in increments of 10 trees

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42)
}

# Adding Bagging models for each tree count
# Note: the keyword is `estimator` in scikit-learn 1.2 and later;
# older releases call it `base_estimator`
for n in n_trees:
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n,
        random_state=42
    )

results = {}

for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")
As we increase the number of trees in the Bagging Regressor, we observe an initial significant improvement in the model’s performance. However, it is crucial to note that the marginal gains begin to plateau beyond a certain point. For instance, while the jump in R² score from 1 to 20 trees is notable (0.7663 to 0.8898), going from 20 to 100 trees adds less than 0.006.
Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Bagging Regressor 20 Trees: 0.8898
Bagging Regressor 30 Trees: 0.8911
Bagging Regressor 40 Trees: 0.8922
Bagging Regressor 50 Trees: 0.8931
Bagging Regressor 60 Trees: 0.8933
Bagging Regressor 70 Trees: 0.8936
Bagging Regressor 80 Trees: 0.895
Bagging Regressor 90 Trees: 0.8954
Bagging Regressor 100 Trees: 0.8957
This trend demonstrates the law of diminishing returns in model complexity and highlights an important consideration in machine learning: beyond a certain level of complexity, the additional computational cost may not justify the minimal gains in performance.
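One way to build intuition for this plateau is the standard result for averaging equally correlated predictors: if each tree’s prediction has variance σ² and every pair of trees has correlation ρ, the variance of their average is ρσ² + (1 − ρ)σ²/n. The sketch below evaluates that formula. The values σ² = 1.0 and ρ = 0.3 are assumptions picked for illustration, not quantities measured on the Ames models, and `ensemble_variance` is a hypothetical helper.

```python
def ensemble_variance(n_trees, tree_variance=1.0, correlation=0.3):
    """Variance of the average of n_trees equally correlated predictions.

    Idealized formula: rho * sigma^2 + (1 - rho) * sigma^2 / n.
    The default variance and correlation are illustrative assumptions.
    """
    return (correlation * tree_variance
            + (1 - correlation) * tree_variance / n_trees)

tree_counts = [1, 10, 20, 50, 100]
variances = [ensemble_variance(n) for n in tree_counts]

for n, v in zip(tree_counts, variances):
    print(f"{n:>3} trees: variance = {v:.3f}")
```

Under these assumptions the variance falls from 1.0 to 0.37 with 10 trees but only to about 0.307 with 100, because the ρσ² term does not shrink no matter how many trees are added. Only lowering the correlation between trees can reduce that floor.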
Advanced Ensembles: Comparing Bagging and Random Forest Regressors
In the final part of this post, we delve into a comparative analysis of two popular ensemble methods: Bagging Regressors and Random Forests. Both methods build on the concept of ensemble learning, which we explored in the previous sections, but they incorporate different approaches to how trees are constructed and combined.
Random Forest is an extension of the Bagging technique and involves creating many decision trees during training. As in Bagging, each tree is built on a bootstrap sample of the data, but Random Forest adds another layer of randomness by considering only a random subset of features when splitting each node. This randomness helps create more diverse trees, which generally results in a model with better generalization capabilities.
Let’s assess and compare the performance of these two methods using the Ames Housing dataset, focusing on how increasing the number of trees affects the cross-validated R² score:
# Build on previous blocks of code
# Evaluate performance of Random Forest against Bagging Regressor
from sklearn.ensemble import RandomForestRegressor

# Number of trees to test
n_trees = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Define the model pipelines with various regressors
models = {
    'Decision Tree (1 Tree)': DecisionTreeRegressor(random_state=42),
}

# Adding Bagging and Random Forest models for each tree count
# Note: the keyword is `estimator` in scikit-learn 1.2 and later;
# older releases call it `base_estimator`
for n in n_trees:
    models[f'Bagging Regressor {n} Trees'] = BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=n,
        random_state=42
    )
    models[f'Random Forest {n} Trees'] = RandomForestRegressor(
        n_estimators=n,
        random_state=42
    )

results = {}

for name, model in models.items():
    # Define the full model pipeline for each model
    model_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', model)
    ])

    # Perform cross-validation
    scores = cross_val_score(model_pipeline, Ames.drop(columns='SalePrice'), Ames['SalePrice'])

    # Store the mean of the scores
    results[name] = round(scores.mean(), 4)

# Output the cross-validation scores
print("Cross-validation scores:")
for name, score in results.items():
    print(f"{name}: {score}")
Examining the cross-validation scores reveals interesting patterns. Both Bagging and Random Forest models show significant improvements over a single Decision Tree, highlighting the strength of ensemble methods:
Cross-validation scores:
Decision Tree (1 Tree): 0.7663
Bagging Regressor 10 Trees: 0.8781
Random Forest 10 Trees: 0.8762
Bagging Regressor 20 Trees: 0.8898
Random Forest 20 Trees: 0.8893
Bagging Regressor 30 Trees: 0.8911
Random Forest 30 Trees: 0.8897
Bagging Regressor 40 Trees: 0.8922
Random Forest 40 Trees: 0.8909
Bagging Regressor 50 Trees: 0.8931
Random Forest 50 Trees: 0.8922
Bagging Regressor 60 Trees: 0.8933
Random Forest 60 Trees: 0.8931
Bagging Regressor 70 Trees: 0.8936
Random Forest 70 Trees: 0.8932
Bagging Regressor 80 Trees: 0.895
Random Forest 80 Trees: 0.8943
Bagging Regressor 90 Trees: 0.8954
Random Forest 90 Trees: 0.8948
Bagging Regressor 100 Trees: 0.8957
Random Forest 100 Trees: 0.8954
Interestingly, as we increase the number of trees, both methods show similar performance levels. Bagging is marginally ahead at every tree count in this run, but never by more than about 0.002 in R². One likely reason lies in the configuration rather than the data: scikit-learn’s RandomForestRegressor defaults to max_features=1.0, which considers every feature at each split, so with the default settings used here the Random Forest behaves much like a set of bagged trees. The characteristics of the Ames Housing dataset may also limit the benefits of additional randomization. If the dataset has a few highly predictive features, random feature selection may not significantly enhance the model’s ability to generalize compared to Bagging, which uses all features.
These insights suggest that while Random Forest typically offers improvements over Bagging by reducing correlation between trees through its feature randomization, that advantage depends on feature subsampling actually being enabled, and the specific dynamics of the dataset and the problem context can sometimes negate it. Therefore, when the scores are this close, Bagging might be preferred due to its simplicity. This comparison underscores the importance of understanding the dataset, the modeling objectives, and the library defaults when choosing between ensemble strategies.
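To experiment with the feature randomization itself, we can set max_features explicitly instead of relying on the default (1.0 for RandomForestRegressor in recent scikit-learn releases, meaning all features). The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from make_regression rather than the Ames dataset, and the tree count, fold count, and the one-third setting are arbitrary choices, so the scores it prints say nothing about which setting would win on Ames.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; not the Ames dataset)
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=10.0, random_state=42)

# Hypothetical settings to compare: all features per split vs. about a third
settings = {
    "max_features=1.0 (all features, bagging-like)": 1.0,
    "max_features=1/3 (random feature subsets)": 1 / 3,
}

results = {}
for label, max_features in settings.items():
    forest = RandomForestRegressor(n_estimators=30, max_features=max_features,
                                   random_state=42)
    # 3-fold cross-validated R², averaged over the folds
    scores = cross_val_score(forest, X, y, cv=3)
    results[label] = round(scores.mean(), 4)

for label, score in results.items():
    print(f"{label}: {score}")
```

Passing a smaller max_features to the RandomForestRegressor in the pipeline above would be the corresponding change for the Ames comparison; whether it helps there is something to check empirically.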
Further Reading
APIs
- sklearn.ensemble.BaggingRegressor API
- sklearn.ensemble.RandomForestRegressor API
Tutorials
- Decision Trees and Random Forests in Machine Learning by Nikola Pulev
- Bagging and Random Forests by Leslie Myint
Ames Housing Dataset & Data Dictionary
Summary
This blog post provides a detailed exploration of tree-based modeling techniques using the Ames Housing dataset. It starts with essential preprocessing steps like encoding and handling missing values and progresses through the evaluation and enhancement of decision tree models using bagging. The narrative culminates in a comparative analysis of bagging and random forest regressors, highlighting the incremental benefits and performance comparisons as the number of trees is varied. Each section builds upon the last, offering practical examples and insights culminating in a comprehensive understanding of tree-based predictive modeling.
Specifically, you learned:
- Preprocessing is crucial for tree-based models, involving techniques such as categorical conversion, handling missing values, and applying appropriate encodings.
- Evaluating a basic Decision Tree model with cross-validation can provide a solid benchmark for assessing the performance of more complex tree-based models.
- Using Bagging and Random Forest enhances Decision Tree performance, demonstrating significant improvements in prediction accuracy through ensemble techniques.
Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.