From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles

This post explores tree-based models, focusing on decision trees, bagging, and random forests applied to the Ames Housing dataset. It begins with preprocessing, a fundamental step that ensures the data is configured for the requirements of these models. The path from a single decision tree to a robust ensemble of trees highlights the impact that combining many trees can have on predictive performance. As we work through model evaluation and enhancement, we aim to equip you with practical insights and strategies to refine your approach to machine learning and real estate price prediction.

Kick-start your project with my book Next-Level Data Science. It provides self-study tutorials with working code.

Let’s get started.

From Single Trees to Forests: Enhancing Real Estate Predictions with Ensembles
Photo by Steven Kamenar. Some rights reserved.

Overview

This post is divided into four parts; they are:

  • Laying the Groundwork: Preprocessing Techniques for Tree-Based Models
  • Assessing the Basics: Decision Tree Regressor Evaluation
  • Improving Predictions: Introduction to Bagging with Decision Trees
  • Advanced Ensembles: Comparing Bagging and Random Forest Regressors

Laying the Groundwork: Preprocessing Techniques for Tree-Based Models

Preprocessing is crucial in any data science workflow, especially when dealing with tree-based models. The first part of this post brings together essential techniques covered in earlier discussions—such as ordinal encoding from the post Decision Trees and Ordinal Encoding: A Practical Guide, one-hot encoding, and various imputation methods—to ensure our dataset is thoroughly prepared for tree-based modeling. To illustrate these principles in action, let’s walk through a practical example that applies these preprocessing techniques to the Ames Housing dataset.
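
The post's original code is not reproduced here, so the sketch below shows one way to set this up with scikit-learn. The file name Ames.csv, the shortened ordinal feature list, and the single quality scale are illustrative assumptions; a full version would use the complete set of ordinal features and orderings from the data dictionary.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load the Ames Housing dataset (assumes a local "Ames.csv")
Ames = pd.read_csv("Ames.csv")

# Convert numeric-looking columns that are really categories
for col in ["MSSubClass", "YrSold", "MoSold"]:
    Ames[col] = Ames[col].astype("object")

# Separate the target and drop the unique identifier
y = Ames["SalePrice"]
X = Ames.drop(columns=["PID", "SalePrice"])

# Identify feature groups (the ordinal list is a small illustrative subset)
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
ordinal_features = ["ExterQual", "ExterCond", "BsmtQual", "KitchenQual"]
nominal_features = [
    col for col in X.select_dtypes(include=["object"]).columns
    if col not in ordinal_features + ["Electrical"]
]

# Numeric features: impute missing values with the mean
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
])

# Ordinal features: fill missing values with "None", then encode in quality order
quality_scale = ["None", "Po", "Fa", "TA", "Gd", "Ex"]  # poor -> excellent
ordinal_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
    ("encoder", OrdinalEncoder(categories=[quality_scale] * len(ordinal_features))),
])

# Nominal features: fill missing values with "None", then one-hot encode
nominal_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="None")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# "Electrical" is the exception: impute its single missing value with the mode
electrical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Combine all pipelines in a single ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("ord", ordinal_transformer, ordinal_features),
    ("nom", nominal_transformer, nominal_features),
    ("elec", electrical_transformer, ["Electrical"]),
])
```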

With our data loaded and initial transformations in place, we now have a structured approach to handle missing values and encode our categorical variables appropriately. The following summary outlines the key preprocessing tasks we have accomplished, setting a solid foundation for the upcoming modeling stages.

  • Data Categorization:
    • Convert “MSSubClass”, “YrSold”, and “MoSold” from numeric to categorical data types to reflect their actual data characteristics.
  • Exclusion of Irrelevant Features:
    • Remove “PID” (a unique identifier) and “SalePrice” (the target) from the feature set so that only predictors remain.
  • Handling Missing Values:
    • Numeric features: Impute missing values with the mean to maintain the distribution.
    • Categorical features: Fill in missing values with None for all categorical features except “Electrical”, based on guidance provided by the data dictionary.
    • Electrical feature: Use the mode to impute its single missing value.
  • Encoding Categorical Data:
    • Ordinal features: Encode with a predefined order that respects the inherent ranking in the data (like “ExterQual” from poor to excellent).
    • Nominal features: Apply one-hot encoding to transform these into a format suitable for modeling, creating binary columns for each category.
  • Pipelines for Streamlined Processing:
    • Separate pipelines for numeric, ordinal, and nominal features to streamline transformations and ensure consistent application across the dataset.
  • Combined Preprocessing:
    • Use a ColumnTransformer to apply all pipelines in a single step, enhancing the efficiency and manageability of the data transformation process.
  • Transformation Application and Result Inspection:
    • Apply the preprocessing pipeline to the dataset, convert the transformed array back to a DataFrame, and systematically name the columns (especially after one-hot encoding) for easy identification and analysis, as sketched below.
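
Continuing the sketch above (still an illustration rather than the post's exact code), the final step applies the preprocessor and rebuilds a named DataFrame. Note that get_feature_names_out assumes a reasonably recent scikit-learn (1.1 or later).

```python
from scipy import sparse

# Fit the preprocessor and transform the full feature set
X_transformed = preprocessor.fit_transform(X)

# One-hot encoding may return a sparse matrix; densify before building a DataFrame
if sparse.issparse(X_transformed):
    X_transformed = X_transformed.toarray()

# Recover readable column names after the transformations
feature_names = preprocessor.get_feature_names_out()
transformed_df = pd.DataFrame(X_transformed, columns=feature_names, index=X.index)

print(transformed_df.shape)
print(transformed_df.head())
```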

Observing the transformed DataFrame above gives us a clear view of how our preprocessing steps have altered the data. This transformation ensures that each feature is appropriately formatted and ready for the next steps in our analysis. Notice how each categorical and numeric feature has been handled to retain as much information as possible.

After these transformations, the dataset has expanded to 2,819 columns. We can perform the quick calculation below to cross-check the expected number of columns after the transformation.
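
A rough version of that cross-check, reusing the illustrative variable names from the sketches above: numeric and ordinal columns pass through one-to-one, while each nominal feature expands into one column per category. The exact total (the post reports 2,819) depends on the full ordinal feature list used.

```python
# Numeric and ordinal features keep one column each;
# each nominal feature contributes one column per category after one-hot encoding
n_numeric = len(numeric_features)
n_ordinal = len(ordinal_features)
n_onehot = sum(X[col].fillna("None").nunique() for col in nominal_features)
n_electrical = X["Electrical"].nunique()  # mode-imputed, so no extra "None" category

expected = n_numeric + n_ordinal + n_onehot + n_electrical
print(f"Expected columns after preprocessing: {expected}")
print(f"Actual columns in the transformed DataFrame: {transformed_df.shape[1]}")
```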

This quick validation shows us the total number of features after preprocessing, confirming that all transformations have been applied correctly.

Ensuring the integrity of our data at this stage is crucial for building reliable models.


Assessing the Basics: Decision Tree Regressor Evaluation

In the second part of this post, we focus on evaluating the performance of a basic Decision Tree model, building on the preprocessing foundation above.

By applying cross-validation, we aim to obtain a benchmark for comparing the more complex models in the later parts of this post:
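
The post's exact code is not shown above, so here is a minimal sketch of the baseline evaluation, reusing the preprocessor, X, and y from the earlier sketch. The 5-fold split and random_state=42 are assumptions; cross_val_score defaults to R² scoring for regressors.

```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Chain the preprocessor with a single decision tree
tree_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", DecisionTreeRegressor(random_state=42)),
])

# Cross-validated R² serves as our baseline benchmark
tree_scores = cross_val_score(tree_model, X, y, cv=5)
print(f"Decision Tree mean CV R²: {tree_scores.mean():.4f}")  # the post reports about 0.7663
```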

An R² score of 0.7663 indicates that our model explains approximately 77% of the variability in housing prices, which is a good (but not great) starting point. This foundational performance will help us appreciate the incremental benefits offered by more sophisticated ensemble methods that we will explore next.

Improving Predictions: Introduction to Bagging with Decision Trees

Building on our initial model, this part explores the enhancement of predictive performance through Bagging. Bagging, or Bootstrap Aggregating, is an ensemble technique that aims to improve stability and accuracy by effectively reducing variance and preventing overfitting. Unlike simply cloning the same decision tree multiple times, Bagging involves creating multiple trees where each tree is trained on a different bootstrap sample of the dataset. These samples are drawn with replacement, meaning each tree learns from slightly varied slices of the data, ensuring diversity in the models’ perspectives. We will compare the effectiveness of a single Decision Tree with a Bagging Regressor that uses multiple trees, demonstrating the power of ensemble learning:
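
A sketch of that comparison, reusing the objects from the previous sketches. The ensemble size of 100 trees is an assumption, and note that BaggingRegressor's estimator argument is named base_estimator in scikit-learn versions before 1.2.

```python
from sklearn.ensemble import BaggingRegressor

# Bagging: many trees, each fit on a different bootstrap sample of the training data
bagging_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", BaggingRegressor(
        estimator=DecisionTreeRegressor(random_state=42),
        n_estimators=100,
        random_state=42,
    )),
])

bagging_scores = cross_val_score(bagging_model, X, y, cv=5)
print(f"Single Decision Tree mean CV R²: {tree_scores.mean():.4f}")
print(f"Bagging Regressor mean CV R²:    {bagging_scores.mean():.4f}")
```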

By leveraging multiple decision trees, the Bagging Regressor improves on the single Decision Tree's cross-validated R² by approximately 11%, demonstrating how ensemble methods can enhance model performance.

To further investigate this, we will examine how performance varies with different numbers of trees in the ensemble:
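
One way to run that experiment is sketched below; the particular tree counts are illustrative choices, not the post's exact settings.

```python
# Sweep the ensemble size to see where the gains flatten out
for n_trees in [1, 5, 10, 20, 50, 100]:
    model = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("regressor", BaggingRegressor(n_estimators=n_trees, random_state=42)),
    ])
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_trees:>3} trees -> mean CV R²: {score:.4f}")
```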

As we increase the number of trees in the Bagging Regressor, we observe an initial significant improvement in the model’s performance. However, it is crucial to note that the marginal gains begin to plateau beyond a certain point. For instance, while the jump in R² score from 1 to 20 trees is notable, the incremental improvement beyond 20 trees is much less pronounced.

This trend demonstrates the law of diminishing returns in model complexity and highlights an important consideration in machine learning: beyond a certain level of complexity, the additional computational cost may not justify the minimal gains in performance.

Advanced Ensembles: Comparing Bagging and Random Forest Regressors

In the final part of this post, we delve into a comparative analysis of two popular ensemble methods: Bagging Regressors and Random Forests. Both build on the concept of ensemble learning explored in the previous sections, but they differ in how the trees are constructed and combined.

Random Forest is an extension of the Bagging technique that likewise builds many decision trees during training. Unlike simple Bagging, where each tree is built on a bootstrap sample of the data, Random Forest introduces another layer of randomness by considering only a random subset of features at each node split. This added randomness produces more diverse trees, which generally results in a model with better generalization.

Let’s assess and compare the performance of these two methods using the Ames Housing dataset, focusing on how increasing the number of trees affects the cross-validated R² score:
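
A sketch of the head-to-head comparison, again reusing the earlier preprocessor; the ensemble sizes are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestRegressor

# Compare Bagging and Random Forest across several ensemble sizes
for n_trees in [25, 50, 100, 200]:
    results = {}
    for name, regressor in [
        ("Bagging", BaggingRegressor(n_estimators=n_trees, random_state=42)),
        ("Random Forest", RandomForestRegressor(n_estimators=n_trees, random_state=42)),
    ]:
        model = Pipeline(steps=[("preprocessor", preprocessor), ("regressor", regressor)])
        results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{n_trees:>3} trees | Bagging R²: {results['Bagging']:.4f} | "
          f"Random Forest R²: {results['Random Forest']:.4f}")
```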

Examining the cross-validation scores reveals interesting patterns. Both Bagging and Random Forest show significant improvements over a single Decision Tree, highlighting the strength of ensemble methods.

Interestingly, as we increase the number of trees, both methods reach similar performance levels, with neither consistently outperforming the other. This similarity suggests that the characteristics of the Ames Housing dataset limit the benefit of the additional randomization introduced by Random Forest: when a dataset contains a few highly predictive features, the random feature selection in Random Forest does not substantially improve generalization over Bagging, which considers all features at every split.

These insights suggest that while Random Forest typically offers improvements over Bagging by reducing correlation between trees through its feature randomization, the specific dynamics of the dataset and the problem context can sometimes negate these advantages. Therefore, in cases where computational efficiency is a concern, Bagging might be preferred due to its simplicity and similar performance levels. This comparison underscores the importance of understanding the dataset and the modeling objectives when choosing between ensemble strategies.

Further Reading

APIs

Tutorials

Ames Housing Dataset & Data Dictionary

Summary

This blog post provides a detailed exploration of tree-based modeling techniques using the Ames Housing dataset. It starts with essential preprocessing steps such as encoding and handling missing values, then progresses through the evaluation and enhancement of decision tree models using bagging. The narrative culminates in a comparative analysis of bagging and random forest regressors, highlighting the incremental benefits and performance comparisons as the number of trees is varied. Each section builds upon the last, offering practical examples and insights that add up to a comprehensive understanding of tree-based predictive modeling.

Specifically, you learned:

  • Preprocessing is crucial for tree-based models, involving techniques such as categorical conversion, handling missing values, and applying appropriate encodings.
  • Evaluating a basic Decision Tree model with cross-validation can provide a solid benchmark for assessing the performance of more complex tree-based models.
  • Using Bagging and Random Forest enhances Decision Tree performance, demonstrating significant improvements in prediction accuracy through ensemble techniques.

Do you have any questions? Please ask your questions in the comments below, and I will do my best to answer.

Get Started on Next-Level Data Science!

Master the mindset for success in data science projects and build expertise through clear, practical examples, with minimal complex math and a focus on hands-on learning.

Discover how in my new Ebook:
Next-Level Data Science

It provides self-study tutorials designed to guide you from intermediate to advanced. Learn to optimize workflows, manage multicollinearity, refine tree-based models, handle missing data, and more, to help you achieve deeper insights and effective storytelling with data.

Advance your data science skills with real-world exercises.

See What's Inside
