# How to Develop a Bagging Ensemble with Python

Last Updated on April 27, 2021

Bagging is an ensemble machine learning algorithm that combines the predictions from many decision trees.

It is also easy to implement given that it has few key hyperparameters and sensible heuristics for configuring these hyperparameters.

Bagging performs well in general and provides the basis for a whole field of ensemble of decision tree algorithms such as the popular random forest and extra trees ensemble algorithms, as well as the lesser-known Pasting, Random Subspaces, and Random Patches ensemble algorithms.

In this tutorial, you will discover how to develop Bagging ensembles for classification and regression.

After completing this tutorial, you will know:

• Bagging ensemble is an ensemble created from decision trees fit on different samples of a dataset.
• How to use the Bagging ensemble for classification and regression with scikit-learn.
• How to explore the effect of Bagging model hyperparameters on model performance.

Kick-start your project with my new book Ensemble Learning Algorithms With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

• Update Aug/2020: Added a common questions section.

How to Develop a Bagging Ensemble in Python
Photo by daveynin, some rights reserved.

## Tutorial Overview

This tutorial is divided into five parts; they are:

1. Bagging Ensemble Algorithm
2. Bagging Scikit-Learn API
1. Bagging for Classification
2. Bagging for Regression
3. Bagging Hyperparameters
1. Explore Number of Trees
2. Explore Number of Samples
3. Explore Alternate Algorithm
4. Bagging Extensions
1. Pasting Ensemble
2. Random Subspaces Ensemble
3. Random Patches Ensemble
5. Common Questions

## Bagging Ensemble Algorithm

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

Specifically, it is an ensemble of decision tree models, although the bagging technique can also be used to combine the predictions of other types of models.

As its name suggests, bootstrap aggregation is based on the idea of the “bootstrap” sample.

A bootstrap sample is a sample of a dataset with replacement. Replacement means that a sample drawn from the dataset is replaced, allowing it to be selected again and perhaps multiple times in the new sample. This means that the sample may have duplicate examples from the original dataset.

The bootstrap sampling technique is used to estimate a population statistic from a small data sample. This is achieved by drawing multiple bootstrap samples, calculating the statistic on each, and reporting the mean statistic across all samples.

An example of using bootstrap sampling would be estimating the population mean from a small dataset. Multiple bootstrap samples are drawn from the dataset, the mean calculated on each, then the mean of the estimated means is reported as an estimate of the population.

Surprisingly, the bootstrap method provides a robust and accurate approach to estimating statistical quantities compared to a single estimate on the original dataset.

This same approach can be used to create an ensemble of decision tree models.

This is achieved by drawing multiple bootstrap samples from the training dataset and fitting a decision tree on each. The predictions from the decision trees are then combined to provide a more robust and accurate prediction than a single decision tree (typically, but not always).

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor. […] The multiple versions are formed by making bootstrap replicates of the learning set and using these as new learning sets

Bagging predictors, 1996.

Predictions are made for regression problems by averaging the prediction across the decision trees. Predictions are made for classification problems by taking the majority vote prediction for the classes from across the predictions made by the decision trees.

The bagged decision trees are effective because each decision tree is fit on a slightly different training dataset, which in turn allows each tree to have minor differences and make slightly different skillful predictions.

Technically, we say that the method is effective because the trees have a low correlation between predictions and, in turn, prediction errors.

Decision trees, specifically unpruned decision trees, are used as they slightly overfit the training data and have a high variance. Other high-variance machine learning algorithms can be used, such as a k-nearest neighbors algorithm with a low k value, although decision trees have proven to be the most effective.

If perturbing the learning set can cause significant changes in the predictor constructed, then bagging can improve accuracy.

Bagging predictors, 1996.

Bagging does not always offer an improvement. For low-variance models that already perform well, bagging can result in a decrease in model performance.

The evidence, both experimental and theoretical, is that bagging can push a good but unstable procedure a significant step towards optimality. On the other hand, it can slightly degrade the performance of stable procedures.

Bagging predictors, 1996.

### Want to Get Started With Ensemble Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Bagging Scikit-Learn API

Bagging ensembles can be implemented from scratch, although this can be challenging for beginners.

For an example, see the tutorial:

The scikit-learn Python machine learning library provides an implementation of Bagging ensembles for machine learning.

It is available in modern versions of the library.

First, confirm that you are using a modern version of the library by running the following script:

Running the script will print your version of scikit-learn.

Your version should be the same or higher. If not, you must upgrade your version of the scikit-learn library.

Bagging is provided via the BaggingRegressor and BaggingClassifier classes.

Both models operate the same way and take the same arguments that influence how the decision trees are created.

Randomness is used in the construction of the model. This means that each time the algorithm is run on the same data, it will produce a slightly different model.

When using machine learning algorithms that have a stochastic learning algorithm, it is good practice to evaluate them by averaging their performance across multiple runs or repeats of cross-validation. When fitting a final model, it may be desirable to either increase the number of trees until the variance of the model is reduced across repeated evaluations, or to fit multiple final models and average their predictions.

Let’s take a look at how to develop a Bagging ensemble for both classification and regression.

### Bagging for Classification

In this section, we will look at using Bagging for a classification problem.

First, we can use the make_classification() function to create a synthetic binary classification problem with 1,000 examples and 20 input features.

The complete example is listed below.

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a Bagging algorithm on this dataset.

We will evaluate the model using repeated stratified k-fold cross-validation, with three repeats and 10 folds. We will report the mean and standard deviation of the accuracy of the model across all repeats and folds.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Bagging ensemble with default hyperparameters achieves a classification accuracy of about 85 percent on this test dataset.

We can also use the Bagging model as a final model and make predictions for classification.

First, the Bagging ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our binary classification dataset.

Running the example fits the Bagging ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using Bagging for classification, let’s look at the API for regression.

### Bagging for Regression

In this section, we will look at using Bagging for a regression problem.

First, we can use the make_regression() function to create a synthetic regression problem with 1,000 examples and 20 input features.

The complete example is listed below.

Running the example creates the dataset and summarizes the shape of the input and output components.

Next, we can evaluate a Bagging algorithm on this dataset.

As we did with the last section, we will evaluate the model using repeated k-fold cross-validation, with three repeats and 10 folds. We will report the mean absolute error (MAE) of the model across all repeats and folds. The scikit-learn library makes the MAE negative so that it is maximized instead of minimized. This means that larger negative MAE are better and a perfect model has a MAE of 0.

The complete example is listed below.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the Bagging ensemble with default hyperparameters achieves a MAE of about 100.

We can also use the Bagging model as a final model and make predictions for regression.

First, the Bagging ensemble is fit on all available data, then the predict() function can be called to make predictions on new data.

The example below demonstrates this on our regression dataset.

Running the example fits the Bagging ensemble model on the entire dataset and is then used to make a prediction on a new row of data, as we might when using the model in an application.

Now that we are familiar with using the scikit-learn API to evaluate and use Bagging ensembles, let’s look at configuring the model.

## Bagging Hyperparameters

In this section, we will take a closer look at some of the hyperparameters you should consider tuning for the Bagging ensemble and their effect on model performance.

### Explore Number of Trees

An important hyperparameter for the Bagging algorithm is the number of decision trees used in the ensemble.

Typically, the number of trees is increased until the model performance stabilizes. Intuition might suggest that more trees will lead to overfitting, although this is not the case. Bagging and related ensemble of decision trees algorithms (like random forest) appear to be somewhat immune to overfitting the training dataset given the stochastic nature of the learning algorithm.

The number of trees can be set via the “n_estimators” argument and defaults to 100.

The example below explores the effect of the number of trees with values between 10 to 5,000.

Running the example first reports the mean accuracy for each configured number of decision trees.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that that performance improves on this dataset until about 100 trees and remains flat after that.

A box and whisker plot is created for the distribution of accuracy scores for each configured number of trees.

We can see the general trend of no further improvement beyond about 100 trees.

Box Plot of Bagging Ensemble Size vs. Classification Accuracy

### Explore Number of Samples

The size of the bootstrap sample can also be varied.

The default is to create a bootstrap sample that has the same number of examples as the original dataset. Using a smaller dataset can increase the variance of the resulting decision trees and could result in better overall performance.

The number of samples used to fit each decision tree is set via the “max_samples” argument.

The example below explores different sized samples as a ratio of the original dataset from 10 percent to 100 percent (the default).

Running the example first reports the mean accuracy for each sample set size.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest that performance generally improves with an increase in the sample size, highlighting that the default of 100 percent the size of the training dataset is sensible.

It might also be interesting to explore a smaller sample size with a corresponding increase in the number of trees in an effort to reduce the variance of the individual models.

A box and whisker plot is created for the distribution of accuracy scores for each sample size.

We see a general trend of increasing accuracy with sample size.

Box Plot of Bagging Sample Size vs. Classification Accuracy

### Explore Alternate Algorithm

Decision trees are the most common algorithm used in a bagging ensemble.

The reason for this is that they are easy to configure to have a high variance and because they perform well in general.

Other algorithms can be used with bagging and must be configured to have a modestly high variance. One example is the k-nearest neighbors algorithm where the k value can be set to a low value.

The algorithm used in the ensemble is specified via the “base_estimator” argument and must be set to an instance of the algorithm and algorithm configuration to use.

The example below demonstrates using a KNeighborsClassifier as the base algorithm used in the bagging ensemble. Here, the algorithm is used with default hyperparameters where k is set to 5.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Bagging ensemble with KNN and default hyperparameters achieves a classification accuracy of about 88 percent on this test dataset.

We can test different values of k to find the right balance of model variance to achieve good performance as a bagged ensemble.

The below example tests bagged KNN models with k values between 1 and 20.

Running the example first reports the mean accuracy for each k value.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the results suggest a small k value such as two to four results in the best mean accuracy when used in a bagging ensemble.

A box and whisker plot is created for the distribution of accuracy scores for each k value.

We see a general trend of increasing accuracy with sample size in the beginning, then a modest decrease in performance as the variance of the individual KNN models used in the ensemble is increased with larger k values.

Box Plot of Bagging KNN Number of Neighbors vs. Classification Accuracy

## Bagging Extensions

There are many modifications and extensions to the bagging algorithm in an effort to improve the performance of the approach.

Perhaps the most famous is the random forest algorithm.

There is a number of less famous, although still effective, extensions to bagging that may be interesting to investigate.

This section demonstrates some of these approaches, such as pasting ensemble, random subspace ensemble, and the random patches ensemble.

We are not racing these extensions on the dataset, but rather providing working examples of how to use each technique that you can copy-paste and try with your own dataset.

### Pasting Ensemble

The Pasting Ensemble is an extension to bagging that involves fitting ensemble members based on random samples of the training dataset instead of bootstrap samples.

The approach is designed to use smaller sample sizes than the training dataset in cases where the training dataset does not fit into memory.

The procedure takes small pieces of the data, grows a predictor on each small piece and then pastes these predictors together. A version is given that scales up to terabyte data sets. The methods are also applicable to on-line learning.

The example below demonstrates the Pasting ensemble by setting the “bootstrap” argument to “False” and setting the number of samples used in the training dataset via “max_samples” to a modest value, in this case, 50 percent of the training dataset size.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Pasting ensemble achieves a classification accuracy of about 84 percent on this dataset.

### Random Subspaces Ensemble

A Random Subspace Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from random subsets of the features in the training dataset.

It is similar to the random forest except the data samples are random rather than a bootstrap sample and the subset of features is selected for the entire decision tree rather than at each split point in the tree.

The classifier consists of multiple trees constructed systematically by pseudorandomly selecting subsets of components of the feature vector, that is, trees constructed in randomly chosen subspaces.

The example below demonstrates the Random Subspace ensemble by setting the “bootstrap” argument to “False” and setting the number of features used in the training dataset via “max_features” to a modest value, in this case, 10.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Random Subspace ensemble achieves a classification accuracy of about 86 percent on this dataset.

We would expect that there would be a number of features in the random subspace that provides the right balance of model variance and model skill.

The example below demonstrates the effect of using different numbers of features in the random subspace ensemble from 1 to 20.

Running the example first reports the mean accuracy for each number of features.

In this case, the results suggest that using about half the number of features in the dataset (e.g. between 9 and 13) might give the best results for the random subspace ensemble on this dataset.

A box and whisker plot is created for the distribution of accuracy scores for each random subspace size.

We see a general trend of increasing accuracy with the number of features to about 10 to 13 where it is approximately level, then a modest decreasing trend in performance after that.

Box Plot of Random Subspace Ensemble Number of Features vs. Classification Accuracy

### Random Patches Ensemble

The Random Patches Ensemble is an extension to bagging that involves fitting ensemble members based on datasets constructed from random subsets of rows (samples) and columns (features) of the training dataset.

It does not use bootstrap samples and might be considered an ensemble that combines both the random sampling of the dataset of the Pasting ensemble and the random sampling of features of the Random Subspace ensemble.

We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset.

Ensembles on Random Patches, 2012.

The example below demonstrates the Random Patches ensemble with decision trees created from a random sample of the training dataset limited to 50 percent of the size of the training dataset, and with a random subset of 10 features.

Running the example reports the mean and standard deviation accuracy of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the Random Patches ensemble achieves a classification accuracy of about 84 percent on this dataset.

## Common Questions

In this section we will take a closer look at some common sticking points you may have with the bagging ensemble procedure.

Q. What algorithm should be used in the ensemble?

The algorithm should have a moderate variance, meaning it is moderately dependent upon the specific training data.

The decision tree is the default model to use because it works well in practice. Other algorithms can be used as long as they are configured to have a moderate variance.

The chosen algorithm should be moderately stable, not unstable like a decision stump and not very stable like a pruned decision tree, typically an unpruned decision tree is used.

… it is well known that Bagging should be used with unstable learners, and generally, the more unstable, the larger the performance improvement.

— Page 52, Ensemble Methods, 2012.

Q. How many ensemble members should be used?

The performance of the model will converge with the increase of the number of decision trees to a point then remain level.

… the performance of Bagging converges as the ensemble size, i.e., the number of base learners, grows large …

— Page 52, Ensemble Methods, 2012.

Therefore, keep increasing the number of trees until the performance stabilizes on your dataset.

Q. Won’t the ensemble overfit with too many trees?

No. Bagging ensembles (do not) are very unlikely to overfit in general.

Q. How large should the bootstrap sample be?

It is good practice to make the bootstrap sample as large as the original dataset size.

That is 100% the size or an equal number of rows as the original dataset.

Q. What problems are well suited to bagging?

Generally, bagging is well suited to problems with small or modest sized datasets. But this is a rough guide.

Bagging is best suited for problems with relatively small available training datasets.

— Page 12, Ensemble Machine Learning, 2012.

Try it and see.

This section provides more resources on the topic if you are looking to go deeper.

## Summary

In this tutorial, you discovered how to develop Bagging ensembles for classification and regression.

Specifically, you learned:

• Bagging ensemble is an ensemble created from decision trees fit on different samples of a dataset.
• How to use the Bagging ensemble for classification and regression with scikit-learn.
• How to explore the effect of Bagging model hyperparameters on model performance.

Do you have any questions?

## Get a Handle on Modern Ensemble Learning!

#### Improve Your Predictions in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Ensemble Learning Algorithms With Python

It provides self-study tutorials with full working code on:
Stacking, Voting, Boosting, Bagging, Blending, Super Learner, and much more...

### 16 Responses to How to Develop a Bagging Ensemble with Python

1. Christina May 28, 2020 at 7:37 pm #

Thank you for the tutorials!!They are all valuable:))
Can I ask you, shouldn’t we take the fit method only in the train set? Why do you take it to all the dataset(X, y) ??

• Jason Brownlee May 29, 2020 at 6:29 am #

You’re welcome.

Yes, models are only fit on the training dataset.

• Gustavo June 20, 2020 at 4:43 am #

To clarify my understanding, you work with the train/test from the available data during the “ML evaluation phase.” Still, when considering to deploy the best algorithm (found in previous “phase”), you create a model trained with all available data and evaluate new/unseen examples a posteriori, right?

• Jason Brownlee June 20, 2020 at 6:18 am #

Not quite.

We use something like cross-validation to estimate the performance of each model/pipeline and choose one. Now we know how well it will perform on average on new data.

We then fit this model on all data and start using it. No need to evaluate it again.

2. Asha October 25, 2020 at 8:28 pm #

Breiman in his article written about two kinds of pasting: Rvote and Ivote.
I assume that you implemented Ivote Pasting with bootstrap=False and ‘max_samples=0.5’
But how I can implement Rvote Pasting?

• Jason Brownlee October 26, 2020 at 6:48 am #

3. Ron February 21, 2021 at 11:47 pm #

Hi Jason,

One can also use scores.mean() and scores.std() for calculating average and standard deviation of the scores without invoking numpy mean and std functions.

Ron

• Jason Brownlee February 22, 2021 at 5:01 am #

Yes, I typically for get that. Thank you for the reminder!

4. Mehdi April 26, 2021 at 2:11 pm #

Thanks for the tutorial.
I tried to apply hyper parameter tuning for number of samples for regression problem, but I ran into this error:”Supported target types are: (‘binary’, ‘multiclass’). Got ‘continuous’ instead.”.

I don’t know why, I’m using randomforestregressor function, and when the code wants to evaluate model and take the models from the defined dic, this error comes up.
Would appreciate it if you could give me a heads up.

• Jason Brownlee April 27, 2021 at 5:13 am #

Perhaps the above model does not support regression. I expect that this is the case.

5. Sheryl August 29, 2021 at 2:29 pm #

Hi why is n_informative=15 and n_redundant=5? 🙂

• Adrian Tam September 1, 2021 at 7:04 am #

These parameters are just an example. You may need to change it for your specific problem to solve.

6. lika August 29, 2021 at 9:07 pm #

can we use random forest as parameter for baggin classifier ?

• Adrian Tam September 1, 2021 at 7:19 am #

Yes, add “base_estimator=RandomForestClassifier()” to the BaggingClassier() call will make it. But this will be very slow as you are now having a lot of decision trees to fit.

7. Jane Delaney November 15, 2021 at 11:34 pm #

Don’t you need to scale your data to use KNN?

• Adrian Tam November 16, 2021 at 2:31 am #

Yes, that would be better especially if each feature is in a different order of magnitude.