Bootstrap aggregation, or bagging, is a popular ensemble method that fits a separate decision tree on each of many different bootstrap samples of the training dataset.
It is simple to implement and effective on a wide range of problems. Importantly, modest extensions to the technique produce some of the most powerful ensemble methods available, such as random forest, which perform well across a wide range of predictive modeling problems.
As such, we can generalize the bagging method to a framework for ensemble learning and compare and contrast a suite of common ensemble methods that belong to the “bagging family” of methods. We can also use this framework to explore further extensions and how the method can be further tailored to a project dataset or chosen predictive model.
In this tutorial, you will discover the essence of the bootstrap aggregation approach to machine learning ensembles.
After completing this tutorial, you will know:
- The bagging ensemble method for machine learning using bootstrap samples and decision trees.
- How to distill the essential elements from the bagging method and how popular extensions like random forest are directly related to bagging.
- How to devise new extensions to bagging by selecting new procedures for the essential elements of the method.
Kick-start your project with my new book Ensemble Learning Algorithms With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Tutorial Overview
This tutorial is divided into four parts; they are:
- Bootstrap Aggregation
- Essence of Bagging Ensembles
- Bagging Ensemble Family
  - Random Subspace Ensemble
  - Random Forest Ensemble
  - Extra Trees Ensemble
- Customized Bagging Ensembles
Bootstrap Aggregation
Bootstrap Aggregation, or bagging for short, is an ensemble machine learning algorithm.
The technique involves creating a bootstrap sample of the training dataset for each ensemble member, training a decision tree model on each sample, and then combining the predictions directly using a statistic like the average of the predictions.
Breiman’s bagging (short for Bootstrap Aggregation) algorithm is one of the earliest and simplest, yet effective, ensemble-based algorithms.
— Page 12, Ensemble Machine Learning, 2012.
The sample of the training dataset is created using the bootstrap method, which involves selecting examples randomly with replacement.
Replacement means that each selected example is effectively returned to the pool of candidate rows and may be selected again, possibly many times, within a single sample of the training dataset. It is also possible that some examples in the training dataset are not selected at all for a given bootstrap sample.
Some original examples appear more than once, while some original examples are not present in the sample.
— Page 48, Ensemble Methods, 2012.
The bootstrap method has the desired effect of making each sample of the dataset quite different, or usefully different for creating an ensemble.
A decision tree is then fit on each sample of data. Each tree will be a little different given the differences in the training dataset. Typically, the decision tree is configured to have perhaps an increased depth or to not use pruning. This can make each tree more specialized to the training dataset and, in turn, further increase the differences between the trees.
Differences in trees are desirable as they increase the “diversity” of the ensemble, meaning they produce ensemble members with lower correlation in their predictions or prediction errors. It is generally accepted that ensembles composed of members that are both skillful and diverse (skillful in different ways, or making different errors) perform better.
The diversity in the ensemble is ensured by the variations within the bootstrapped replicas on which each classifier is trained, as well as by using a relatively weak classifier whose decision boundaries measurably vary with respect to relatively small perturbations in the training data.
— Page 12, Ensemble Machine Learning, 2012.
A benefit of bagging is that it generally does not overfit the training dataset, and the number of ensemble members can continue to be increased until performance on a holdout dataset stops improving.
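To make this concrete, below is a minimal sketch of a bagging ensemble evaluated with cross-validation using scikit-learn's BaggingClassifier (which uses a decision tree as its default base model); the synthetic dataset and hyperparameter values are illustrative assumptions, not prescriptions.

```python
# minimal sketch: bagging ensemble on a synthetic classification dataset (illustrative values)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# synthetic dataset standing in for a real project dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# bagging: a bootstrap sample of rows per member, a decision tree per sample, predictions combined by voting
model = BaggingClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```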
This is a high-level summary of the bagging ensemble method, yet we can generalize the approach and extract the essential elements.
Essence of Bagging Ensembles
The essence of bagging is about leveraging independent models.
In this way, it might be the closest realization of the “wisdom of the crowd” metaphor, especially if we consider that performance continues to improve with the addition of independent contributors.
Unfortunately, we cannot develop truly independent models as we only have one training dataset. Instead, the bagging approach approximates independent models using randomness: specifically, randomness in the sampling of the dataset used to train each model, which forces a degree of semi-independence between the models.
Though it is practically impossible to get really independent base learners since they are generated from the same training data set, base learners with less dependence can be obtained by introducing randomness in the learning process, and a good generalization ability can be expected by the ensemble.
— Page 48, Ensemble Methods, 2012.
The structure of the bagging procedure can be divided into three essential elements; they are:
- Different Training Datasets: Create a different sample of the training dataset for each ensemble model.
- High-Variance Model: Train the same high-variance model on each sample of the training dataset.
- Average Predictions: Use statistics to combine predictions.
We can map the canonical bagging method onto these elements as follows:
- Different Training Datasets: Bootstrap sample.
- High-Variance Model: Decision tree.
- Average Predictions: Mean for regression, mode for classification.
This provides a framework where we could consider alternate methods for each essential element of the model.
For example, we could change the algorithm to another high-variance technique that has somewhat unstable learning behavior, perhaps like k-nearest neighbors with a modest value for the k hyperparameter.
Often, bagging produces a combined model that outperforms the model that is built using a single instance of the original data. […] this is true especially for unstable inducers since bagging can eliminate their instability. In this context, an inducer is considered unstable if perturbations in the learning set can produce significant changes in the constructed classifier.
— Page 28, Pattern Classification Using Ensemble Methods, 2010.
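For example, below is a sketch that swaps the decision tree for k-nearest neighbors as the base model in a bagging ensemble; the modest value of k and the other settings are illustrative assumptions.

```python
# sketch: bagging with k-nearest neighbors as the (relatively unstable) base model
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# a modest k keeps the base model sensitive to perturbations in its bootstrap sample
model = BaggingClassifier(KNeighborsClassifier(n_neighbors=3), n_estimators=50, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```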
We might also change the sampling method from the bootstrap to another sampling technique, or more generally, to a different method of preparing the training datasets entirely. In fact, this is the basis for many of the extensions of bagging described in the literature, most of which attempt to produce ensemble members that are more independent, yet remain skillful.
We know that the combination of independent base learners will lead to a dramatic decrease of errors and therefore, we want to get base learners as independent as possible.
— Page 48, Ensemble Methods, 2012.
Let’s take a closer look at other ensemble methods that may be considered a part of the bagging family.
Bagging Ensemble Family
Many ensemble machine learning techniques may be considered descendants of bagging.
As such, we can map them onto our framework of essential bagging elements. This is a helpful exercise as it both highlights the differences between methods and the uniqueness of each technique. Perhaps more importantly, it could also spark ideas for additional variations that you may want to explore on your own predictive modeling project.
Let’s take a closer look at three of the more popular ensemble methods related to bagging.
Random Subspace Ensemble
The random subspace method, or random subspace ensemble, involves selecting random subsets of the features (columns) in the training dataset for each ensemble member.
Each training dataset has all rows as it is only the columns that are randomly sampled.
- Different Training Datasets: Randomly sample columns.
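One way to approximate a random subspace ensemble is with scikit-learn's BaggingClassifier: disable the bootstrap so every member sees all rows and sample only a fraction of the columns per member; the fraction used below is an illustrative assumption.

```python
# sketch: random subspace ensemble (all rows, random subsets of columns per member)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# bootstrap=False with max_samples=1.0 keeps every row; max_features randomly samples the columns
model = BaggingClassifier(bootstrap=False, max_samples=1.0, max_features=0.5, n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```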
Random Forest Ensemble
The random forest method is perhaps one of the most successful and widely used ensemble methods, given its ease of implementation and often superior performance on a wide range of predictive modeling problems.
The method involves selecting a bootstrap sample of the training dataset and a small random subset of columns to consider when choosing each split point in each ensemble member.
In this way, it is like a combination of bagging with the random subspace method, although the random subsets of features are used within the construction of each decision tree (at each split point) rather than to sample the training dataset itself.
- Different Training Datasets: Bootstrap sample.
- High-Variance Model: Decision tree with split points on random subsets of columns.
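A sketch using scikit-learn's RandomForestClassifier is shown below; the max_features argument controls the size of the random subset of columns considered at each split point, and the values used here are illustrative.

```python
# sketch: random forest (bootstrap sample of rows plus random feature subsets at each split)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# max_features='sqrt' considers roughly sqrt(n_features) randomly chosen columns at each split
model = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```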
Extra Trees Ensemble
The extra trees ensemble uses the entire training dataset, although it configures the decision tree algorithm to select the split points at random.
- Different Training Datasets: Whole dataset.
- High-Variance Model: Decision tree with random split points.
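A sketch using scikit-learn's ExtraTreesClassifier is shown below; bootstrapping is disabled by default, so each tree is fit on the whole training dataset, and split points are chosen at random.

```python
# sketch: extra trees (whole training dataset per tree, random split points)
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# bootstrap=False is the default, so every tree sees the entire training dataset
model = ExtraTreesClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```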
Customized Bagging Ensembles
We have briefly reviewed the canonical random subspace, random forest, and extra trees methods, although there is no reason that the methods could not share more implementation details.
In fact, modern implementations of algorithms like bagging and random forest provide sufficient configuration to combine many of these features.
Rather than exhausting the literature, we can devise our own extensions that map onto the bagging framework. This may inspire you to explore a less common method or to devise your own bagging approach tailored to your dataset or choice of model.
There are perhaps tens or hundreds of extensions of bagging with minor modifications to the manner in which the training dataset for each ensemble member is prepared or the specifics of how the model is constructed from the training dataset.
The changes are built around the three main elements of the essential bagging method and often seek better performance by exploring the balance between skillful-enough ensemble members whilst maintaining enough diversity between predictions or prediction errors.
For example, we could change the sampling of the training dataset to be a random sample without replacement, instead of a bootstrap sample. This is referred to as “pasting.”
- Different Training Dataset: Random subsample of rows.
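A sketch of pasting with the BaggingClassifier is shown below, drawing half of the rows without replacement for each member; the sampling fraction is an illustrative assumption.

```python
# sketch: "pasting" (random subsamples of rows drawn without replacement)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# bootstrap=False samples rows without replacement; max_samples sets the fraction of rows per member
model = BaggingClassifier(bootstrap=False, max_samples=0.5, n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```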
We could go further and select a random subsample of rows (as in pasting) and a random subsample of columns (as in the random subspace method) for each decision tree. This is known as “random patches.”
- Different Training Dataset: Random subsample of rows and columns.
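A sketch of random patches with the same BaggingClassifier is shown below, sampling both rows and columns without replacement; the fractions are illustrative assumptions.

```python
# sketch: "random patches" (random rows and random columns for each member)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# sample both rows (max_samples) and columns (max_features) without replacement
model = BaggingClassifier(bootstrap=False, max_samples=0.5, max_features=0.5, n_estimators=100, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```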
We can also consider our own simple extensions of the idea.
For example, it is common to use feature selection techniques to choose a subset of input variables in order to reduce the complexity of a prediction problem (fewer columns) and achieve better performance (less noise). We could imagine a bagging ensemble where each model is fit on a different “view” of the training dataset selected by a different feature selection or feature importance method.
- Different Training Dataset: Columns chosen by different feature selection methods.
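There is no off-the-shelf class for this idea, but a minimal sketch is possible with a VotingClassifier over pipelines, where each member chooses its own columns with a different feature selection scoring function; the scoring functions, the value of k, and the base model are illustrative assumptions.

```python
# sketch: each member is fit on a different "view" chosen by a different feature selection method
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# each pipeline selects a different subset of columns, then fits a decision tree on that view
members = [
    ('anova', Pipeline([('fs', SelectKBest(score_func=f_classif, k=10)), ('tree', DecisionTreeClassifier(random_state=1))])),
    ('mutinfo', Pipeline([('fs', SelectKBest(score_func=mutual_info_classif, k=10)), ('tree', DecisionTreeClassifier(random_state=1))])),
]
# combine the members' predictions by majority vote
model = VotingClassifier(estimators=members, voting='hard')
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```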
It is also common to test a model with many different data transforms as part of a modeling pipeline. This is done because we cannot know beforehand which representation of the training dataset will best expose the unknown underlying structure of the dataset to the learning algorithms. We could imagine a bagging ensemble where each model is fit on a different transform of the training dataset.
- Different Training Dataset: Data transforms of the raw training dataset.
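Again, a minimal sketch is possible with a VotingClassifier over pipelines, where each member pairs a different data transform with the same base model; the transforms and the base model are illustrative assumptions.

```python
# sketch: each member is fit on a different transform of the raw training dataset
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, random_state=1)
# the same base model is paired with a different representation of the data in each member
members = [
    ('minmax', Pipeline([('t', MinMaxScaler()), ('m', KNeighborsClassifier())])),
    ('std', Pipeline([('t', StandardScaler()), ('m', KNeighborsClassifier())])),
    ('power', Pipeline([('t', PowerTransformer()), ('m', KNeighborsClassifier())])),
]
model = VotingClassifier(estimators=members, voting='hard')
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10)
print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```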
These are a few perhaps obvious examples of how the essence of the bagging method can be explored, hopefully inspiring further ideas. I would encourage you to brainstorm how you might adapt the methods to your own specific project.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Tutorials
- How to Develop a Bagging Ensemble with Python
- Bagging and Random Forest for Imbalanced Classification
- How to Create a Bagging Ensemble of Deep Learning Models in Keras
- How to Implement Bagging From Scratch With Python
- A Gentle Introduction to Machine Learning Modeling Pipelines
Books
- Pattern Classification Using Ensemble Methods, 2010.
- Ensemble Methods, 2012.
- Ensemble Machine Learning, 2012.
Summary
In this tutorial, you discovered the essence of the bootstrap aggregation approach to machine learning ensembles.
Specifically, you learned:
- The bagging ensemble method for machine learning using bootstrap samples and decision trees.
- How to distill the essential elements from the bagging method and how popular extensions like random forest are directly related to bagging.
- How to devise new extensions to bagging by selecting new procedures for the essential elements of the method.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Thanks for this great tutorial. If I understand correctly, for a decision tree that is part of a random forest, the tree is trained on a randomly picked set of training examples taken from the total batch (bootstrapping), and also, at each split in the tree, the features used for the classification are randomly picked?
Which classification algorithm is typically used in each split?
Thanks again
Not quite.
In bagging, each tree is fit on a bootstrap sample.
In RF, it will use a random subset of features at each split point.
Thanks for your answer. A final question about a tree used in a RF; is the number of features in the subset the same at every split? E.g., if each training point has 10 features in total, and I use 5 of those (randomly picked) in the first split, would the rest of the subsets in the following splits also have 5 (randomly picked) features?
Thanks!
Yes. It is specified by a hyperparameter. I recommend this tutorial:
https://machinelearningmastery.com/implement-random-forest-scratch-python/
And this:
https://machinelearningmastery.com/random-forest-ensemble-in-python/
Hi, I am doing a machine learning project, and I would like to use random forests or decision trees. I am also working with time series stock market data. I believe that the chronological order of the data is important, therefore, would it be wise to use bagging? Given that the chronological order is altered when the new data sets are formed.
I am new to ML and I could really do with your help.
Regards
Hi Kamal…The following is a great starting point related to time series forecasting:
https://machinelearningmastery.com/time-series-forecasting-python-mini-course/
thanks for the nice collection,
I have a question regarding different methods of “Average Predictions”: are there methods for statistically guided prediction that are more effective than just averaging?
Hi azadeh…The following resource may be of interest:
https://machinelearningmastery.com/weighted-average-ensemble-for-deep-learning-neural-networks/