
Essence of Boosting Ensembles for Machine Learning

Boosting is a powerful and popular class of ensemble learning techniques.

Historically, boosting algorithms were challenging to implement, and it was not until AdaBoost demonstrated how to do so effectively that the technique could be widely used. AdaBoost and modern gradient boosting work by sequentially adding models that correct the prediction errors made by the models already in the ensemble. As such, boosting methods are known to be effective, but constructing models can be slow, especially for large datasets.

More recently, extensions designed for computational efficiency have made the methods fast enough for broader adoption. Open-source implementations, such as XGBoost and LightGBM, have meant that boosting algorithms have become the preferred and often top-performing approach in machine learning competitions for classification and regression on tabular data.

In this tutorial, you will discover the essence of boosting to machine learning ensembles.

After completing this tutorial, you will know:

  • The boosting ensemble method for machine learning incrementally adds weak learners trained on weighted versions of the training dataset.
  • The essential idea that underlies all boosting algorithms and the key approach used within each boosting algorithm.
  • How the essential ideas that underlie boosting could be explored on new predictive modeling projects.

Kick-start your project with my new book Ensemble Learning Algorithms With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Essence of Boosting Ensembles for Machine Learning
Photo by Armin S Kowalski, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Boosting Ensembles
  2. Essence of Boosting Ensembles
  3. Family of Boosting Ensemble Algorithms
    1. AdaBoost Ensembles
    2. Classic Gradient Boosting Ensembles
    3. Modern Gradient Boosting Ensembles
  4. Customized Boosting Ensembles

Boosting Ensembles

Boosting is a powerful ensemble learning technique.

As such, boosting is popular and may be the most widely used ensemble technique at the time of writing.

Boosting is one of the most powerful learning ideas introduced in the last twenty years.

— Page 337, The Elements of Statistical Learning, 2016.

As an ensemble technique, boosting can read and sound more complex than sibling methods, such as bootstrap aggregation (bagging) and stacked generalization (stacking). The implementations can, in fact, be quite complicated, yet the ideas that underlie boosting ensembles are very simple.

Boosting can be understood by contrasting it to bagging.

In bagging, an ensemble is created by making multiple different samples of the same training dataset and fitting a decision tree on each. Given that each sample of the training dataset is different, each decision tree is different, in turn making slightly different predictions and prediction errors. The predictions for all of the created decision trees are combined, resulting in lower error than fitting a single tree.

Boosting operates in a similar manner. Multiple trees are fit on different versions of the training dataset and the predictions from the trees are combined using simple voting for classification or averaging for regression to result in a better prediction than fitting a single decision tree.

… boosting […] combines an ensemble of weak classifiers using simple majority voting …

— Page 13, Ensemble Machine Learning, 2012.

There are some important differences; they are:

  • Instances in the training set are assigned a weight based on difficulty.
  • Learning algorithms must pay attention to instance weights.
  • Ensemble members are added sequentially.

The first difference is that the same training dataset is used to train each decision tree. No sampling of the training dataset is performed. Instead, each example in the training dataset (each row of data) is assigned a weight based on how easy or difficult the ensemble finds that example to predict.

The main idea behind this algorithm is to give more focus to patterns that are harder to classify. The amount of focus is quantified by a weight that is assigned to every pattern in the training set.

— Pages 28-29, Pattern Classification Using Ensemble Methods, 2010.

This means that rows that are easy to predict using the ensemble have a small weight and rows that are difficult to predict correctly will have a much larger weight.

Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

— Page 322, An Introduction to Statistical Learning with Applications in R, 2014.
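To make the weighting idea concrete, the snippet below is a minimal sketch (not the exact AdaBoost update) of how instance weights might be increased for misclassified rows and decreased for correctly classified rows; the alpha value and the data are purely illustrative.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])               # predictions of the current ensemble
weights = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

misclassified = y_true != y_pred
alpha = 0.5  # illustrative contribution of the latest model
weights = weights * np.exp(alpha * np.where(misclassified, 1.0, -1.0))
weights = weights / weights.sum()                # re-normalize to sum to one

print(weights)  # misclassified rows now carry more weight than the rest
```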

The second difference from bagging is that the base learning algorithm, e.g. the decision tree, must pay attention to the weighting of the training examples. In turn, this means that boosting typically uses decision trees as the base learner, or other algorithms that support a weighting of rows when constructing the model.

The construction of the model must pay more attention to training examples proportional to their assigned weight. This means that ensemble members are constructed in a biased way to make (or work hard to make) correct predictions on heavily weighted examples.
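In scikit-learn, for example, algorithms that support instance weights expose a sample_weight argument to fit(). A minimal sketch, assuming the weights have already been produced by the boosting procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, random_state=1)
weights = np.full(len(y), 1 / len(y))  # e.g. produced by the boosting procedure

stump = DecisionTreeClassifier(max_depth=1)
stump.fit(X, y, sample_weight=weights)  # splits are chosen using the weighted examples
```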

Finally, the boosting ensemble is constructed slowly. Ensemble members are added sequentially, one, then another, and so on until the ensemble has the desired number of members.

Importantly, the weighting of the training dataset is updated based on the capability of the entire ensemble after each ensemble member is added. This ensures that each member that is subsequently added works hard to correct errors made by the whole model on the training dataset.

In boosting, however, the training dataset for each subsequent classifier increasingly focuses on instances misclassified by previously generated classifiers.

— Page 13, Ensemble Machine Learning, 2012.

The contribution of each model to the final prediction is weighted according to the performance of that model, e.g. combined via a weighted average or weighted vote.
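A small sketch of how member predictions might be combined, assuming illustrative member weights (alphas):

```python
import numpy as np

# regression: weighted average of member predictions
member_preds = np.array([2.9, 3.4, 3.1])  # one prediction per ensemble member
alphas = np.array([0.5, 0.3, 0.2])        # member weights, e.g. based on training error
print(np.sum(alphas * member_preds))      # weighted average

# classification: weighted vote over class labels {0, 1}
member_labels = np.array([1, 0, 1])
votes = np.bincount(member_labels, weights=alphas, minlength=2)
print(np.argmax(votes))                   # class with the largest total weight
```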

This incremental addition of ensemble members to correct errors on the training dataset sounds like it would eventually overfit the training dataset. In practice, boosting ensembles can overfit the training dataset, but often, the effect is subtle and overfitting is not a major problem.

Unlike bagging and random forests, boosting can overfit if [the number of trees] is too large, although this overfitting tends to occur slowly if at all.

— Page 323, An Introduction to Statistical Learning with Applications in R, 2014.

This is a high-level summary of the boosting ensemble method and closely resembles the AdaBoost algorithm, yet we can generalize the approach and extract the essential elements.
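As a rough illustration of this summary, the snippet below compares bagging and boosting (via scikit-learn's BaggingClassifier and AdaBoostClassifier) on a synthetic dataset; the exact scores will vary and the configuration is only an example, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

models = {
    'bagging': BaggingClassifier(n_estimators=100, random_state=7),
    'boosting': AdaBoostClassifier(n_estimators=100, random_state=7),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')
    print(name, scores.mean())
```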


Essence of Boosting Ensembles

The essence of boosting sounds like it might be about correcting predictions.

This is how all modern boosting algorithms are implemented, and it is an interesting and important idea. Nevertheless, correcting prediction errors might be considered an implementation detail for achieving boosting (a large and important detail) rather than the essence of the boosting ensemble approach.

The essence of boosting is the combination of multiple weak learners into a strong learner.

The strategy of boosting, and ensembles of classifiers, is to learn many weak classifiers and combine them in some way, instead of trying to learn a single strong classifier.

— Page 35, Ensemble Machine Learning, 2012.

A weak learner is a model with very modest skill, often meaning that its performance is only slightly better than a random classifier for binary classification, or than predicting the mean value for regression. Traditionally, this means a decision stump: a one-level decision tree that makes a prediction based on a single split of a single input variable.

A weak learner (WL) is a learning algorithm capable of producing classifiers with probability of error strictly (but only slightly) less than that of random guessing …

— Page 35, Ensemble Machine Learning, 2012.
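For example, a decision stump can be realized in scikit-learn as a decision tree limited to a single level; a minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=3)

stump = DecisionTreeClassifier(max_depth=1)  # one split on one input variable
stump.fit(X, y)
print(stump.score(X, y))  # weak: better than guessing, far from perfect
```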

A weak learner can be contrasted to a strong learner that performs well on a predictive modeling problem. In general, we seek a strong learner to address a classification or regression problem.

… a strong learner (SL) is able (given enough training data) to yield classifiers with arbitrarily small error probability.

— Page 35, Ensemble Machine Learning, 2012.

Although we seek a strong learner for a given predictive modeling problem, strong learners are challenging to train, whereas weak learners are very fast and easy to train.

Boosting recognizes this distinction and proposes explicitly constructing a strong learner from multiple weak learners.

Boosting is a class of machine learning methods based on the idea that a combination of simple classifiers (obtained by a weak learner) can perform better than any of the simple classifiers alone.

— Page 35, Ensemble Machine Learning, 2012.

Many approaches to boosting have been explored, but only one has been truly successful. That is the method described in the previous section, where weak learners are added sequentially to the ensemble to specifically correct the residual errors (for regression) or the class label prediction errors (for classification) made by the ensemble so far. The result is a strong learner.
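The snippet below is a minimal sketch of this idea for regression, in the spirit of gradient boosting with a squared error loss: each new tree is fit to the residual errors of the ensemble so far, and its shrunken prediction is added to the ensemble. It is an illustration rather than a full implementation; the learning rate, tree depth, and number of members are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=5)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())  # start from a constant model
trees = []

for _ in range(100):
    residuals = y - prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)      # a weak learner
    tree.fit(X, residuals)                         # fit the weak learner to the residuals
    prediction += learning_rate * tree.predict(X)  # add its shrunken correction
    trees.append(tree)

print(np.mean((y - prediction) ** 2))  # training error shrinks as members are added
```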

Let’s take a closer look at ensemble methods that may be considered a part of the boosting family.

Family of Boosting Ensemble Algorithms

There are a large number of boosting ensemble learning algorithms, although all work in generally the same way.

Namely, they involve sequentially adding simple base learner models that are trained on (re-)weighted versions of the training dataset.

The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners.

— Page 23, Ensemble Methods, 2012.

We might consider three main families of boosting methods; they are: AdaBoost, Classic Gradient Boosting, and Modern Gradient Boosting.

The division is somewhat arbitrary, as some techniques span multiple groups, and some implementations can be configured to realize an example from each group, and even bagging-based methods.

AdaBoost Ensembles

Initially, naive boosting methods explored training weak classifiers on separate samples of the training dataset and combining the predictions.

These methods were not successful compared to bagging.

Adaptive Boosting, or AdaBoost for short, was the first successful implementation of boosting.

Researchers struggled for a time to find an effective implementation of boosting theory, until Freund and Schapire collaborated to produce the AdaBoost algorithm.

— Page 204, Applied Predictive Modeling, 2013.

It was not just a successful realization of the boosting principle; it was an effective algorithm for classification.

Boosting, especially in the form of the AdaBoost algorithm, was shown to be a powerful prediction tool, usually outperforming any individual model. Its success drew attention from the modeling community and its use became widespread …

— Page 204, Applied Predictive Modeling, 2013.

Although AdaBoost was developed initially for binary classification, it was later extended for multi-class classification, regression, and a myriad of other extensions and specialized versions.

It was the first method to maintain the same training dataset across ensemble members and to introduce the weighting of training examples, along with the sequential addition of models trained to correct the prediction errors of the ensemble.
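Both the classification and regression variants are available in scikit-learn via AdaBoostClassifier and AdaBoostRegressor; a minimal sketch of their shared fit/predict usage (hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor

Xc, yc = make_classification(n_samples=500, random_state=1)
clf = AdaBoostClassifier(n_estimators=50).fit(Xc, yc)

Xr, yr = make_regression(n_samples=500, noise=5.0, random_state=1)
reg = AdaBoostRegressor(n_estimators=50).fit(Xr, yr)

print(clf.predict(Xc[:3]), reg.predict(Xr[:3]))
```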

Classic Gradient Boosting Ensembles

After the success of AdaBoost, a lot of attention was paid to boosting methods.

Gradient boosting was a generalization of the AdaBoost group of techniques that allowed each subsequent base learner to be trained using an arbitrary (differentiable) loss function.

Rather than deriving new versions of boosting for every different loss function, it is possible to derive a generic version, known as gradient boosting.

— Page 560, Machine Learning: A Probabilistic Perspective, 2012.

The “gradient” in gradient boosting refers to the gradient of the chosen loss function with respect to the model’s predictions; each new base learner is fit to the negative of this gradient, so that adding it reduces the loss.

The basic principles of gradient boosting are as follows: given a loss function (e.g., squared error for regression) and a weak learner (e.g., regression trees), the algorithm seeks to find an additive model that minimizes the loss function.

— Page 204, Applied Predictive Modeling, 2013.
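As a small worked example, for a squared error loss the negative gradient with respect to the current predictions is exactly the residual, which is why fitting the next tree to the residuals minimizes this loss:

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])    # true targets
pred = np.array([2.5, 0.0, 2.0, 8.0])  # current ensemble predictions F(x)

# squared error loss: L = 0.5 * (y - F)^2, so dL/dF = (F - y)
negative_gradient = -(pred - y)
residuals = y - pred

print(negative_gradient)  # [ 0.5 -0.5  0.  -1. ]
print(residuals)          # identical: the next tree is fit to these values
```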

After the initial reframing of AdaBoost as gradient boosting and the use of alternate loss functions, there was a lot of further innovation, such as Multiple Additive Regression Trees (MART), Tree Boosting, and Gradient Boosting Machines (GBM).

If we combine the gradient boosting algorithm with (shallow) regression trees, we get a model known as MART […] after fitting a regression tree to the residual (negative gradient), we re-estimate the parameters at the leaves of the tree to minimize the loss …

— Page 562, Machine Learning: A Probabilistic Perspective, 2012.

The technique was later extended to include regularization, in an attempt to further slow down learning, and the sampling of rows and columns for each decision tree, in order to add some independence to the ensemble members. The latter idea is borrowed from bagging and is referred to as stochastic gradient boosting.
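A sketch of these ideas using scikit-learn's GradientBoostingClassifier, where learning_rate applies shrinkage, subsample samples rows for each tree, and max_features samples columns for each split; the specific values are illustrative, not recommendations:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,  # regularization: slow down learning
    subsample=0.8,      # stochastic gradient boosting: sample rows for each tree
    max_features=0.5,   # sample columns when choosing each split
    random_state=7,
)
model.fit(X, y)
print(model.score(X, y))
```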

Modern Gradient Boosting Ensembles

Gradient boosting and variants of the method were shown to be very effective, yet were often slow to train, especially for large training datasets.

This was mainly due to the sequential training of ensemble members, which could not be parallelized. This was unfortunate, as the ability to train ensemble members in parallel, and the computational speed-up it provides, is an often-described desirable property of using ensembles.

As such, much effort was put into improving the computational efficiency of the method.

This resulted in highly optimized open-source implementations of gradient boosting that introduced innovative techniques to both accelerate the training of the model and further improve predictive performance.

Notable examples included both Extreme Gradient Boosting (XGBoost) and the Light Gradient Boosting Machine (LightGBM) projects. Both were so effective that they became de facto methods used in machine learning competitions when working with tabular data.
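Both libraries expose a scikit-learn-compatible interface; a minimal sketch, assuming the third-party xgboost and lightgbm packages are installed (hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1)
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1)

for model in (xgb, lgbm):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```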

Customized Boosting Ensembles

We have briefly reviewed the canonical types of boosting algorithms.

Modern implementations of algorithms like XGBoost and LightGBM provide sufficient configuration hyperparameters to realize many different types of boosting algorithms.

Although boosting was initially challenging to realize compared to simpler-to-implement methods like bagging, the essential ideas may be useful in exploring or extending ensemble methods on your own predictive modeling projects in the future.

The essential idea of constructing a strong learner from weak learners could be implemented in many different ways. For example, bagging with decision stumps, or other similarly weak configurations of standard machine learning algorithms, could be considered a realization of this approach (see the sketch below).

  • Bagging weak learners.
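A minimal sketch of bagging weak learners, using decision stumps as the base models in scikit-learn's BaggingClassifier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=7)

# the stump is passed positionally (the keyword is `estimator` in recent
# scikit-learn versions and `base_estimator` in older ones)
model = BaggingClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100)
model.fit(X, y)
print(model.score(X, y))
```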

It also provides a contrast to other ensemble types, such as stacking, which attempts to combine multiple strong learners into a slightly stronger learner. Even then, perhaps alternate success could be achieved on a project by stacking diverse weak learners instead (see the sketch below).

  • Stacking weak learners.
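A minimal sketch of stacking weak learners with scikit-learn's StackingClassifier; the particular weak models chosen here are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=7)

weak_learners = [
    ('stump', DecisionTreeClassifier(max_depth=1)),
    ('nb', GaussianNB()),
    ('knn', KNeighborsClassifier(n_neighbors=1)),
]
model = StackingClassifier(estimators=weak_learners, final_estimator=LogisticRegression())
model.fit(X, y)
print(model.score(X, y))
```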

The path to effective boosting involves specific implementation details of weighted training examples, models that can honor the weightings, and the sequential addition of models fit under some loss minimization method.

Nevertheless, these principles could be harnessed in ensemble models more generally.

For example, perhaps members can be added to a bagging or stacking ensemble sequentially and only kept if they result in a useful lift in skill, drop in prediction error, or change in the distribution of predictions made by the model.

  • Sequential bagging or stacking.
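The snippet below sketches one possible realization of this idea: candidate members are fit on bootstrap samples and added to a simple voting ensemble only if they improve accuracy on a holdout set. The helper function and thresholds are hypothetical and illustrative, not a prescribed method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=7)

def ensemble_accuracy(members, X_eval, y_eval):
    # majority vote over the current members
    preds = np.array([m.predict(X_eval) for m in members])
    vote = (preds.mean(axis=0) >= 0.5).astype(int)
    return (vote == y_eval).mean()

members, best = [], 0.0
rng = np.random.default_rng(7)
for i in range(20):
    # fit a candidate on a bootstrap sample (the bagging part)
    idx = rng.integers(0, len(X_train), len(X_train))
    candidate = DecisionTreeClassifier(random_state=i).fit(X_train[idx], y_train[idx])
    score = ensemble_accuracy(members + [candidate], X_val, y_val)
    if score > best:  # keep the member only if the ensemble improves
        members.append(candidate)
        best = score

print(len(members), best)
```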

In some senses, stacking offers a realization of the idea of correcting predictions of other models. The meta-model (level-1 learner) attempts to effectively combine the predictions of the base models (level-0 learners). In a sense, it is attempting to correct the predictions of those models.

Levels of the stacked model could be added to address specific requirements, such as the minimization of some or all prediction errors.

  • Deep stacking of models.

These are a few perhaps obvious examples of how the essence of the boosting method can be explored, hopefully inspiring further ideas. I would encourage you to brainstorm how you might adapt the methods to your own specific project.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • The Elements of Statistical Learning, 2016.
  • Ensemble Machine Learning, 2012.
  • Pattern Classification Using Ensemble Methods, 2010.
  • An Introduction to Statistical Learning with Applications in R, 2014.
  • Ensemble Methods, 2012.
  • Applied Predictive Modeling, 2013.
  • Machine Learning: A Probabilistic Perspective, 2012.

Summary

In this tutorial, you discovered the essence of boosting to machine learning ensembles.

Specifically, you learned:

  • The boosting ensemble method for machine learning incrementally adds weak learners trained on weighted versions of the training dataset.
  • The essential idea that underlies all boosting algorithms and the key approach used within each boosting algorithm.
  • How the essential ideas that underlie boosting could be explored on new predictive modeling projects.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
