Bagging and Random Forest for Imbalanced Classification

Last Updated on July 31, 2020

Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset, then combines the predictions from all models.

Random forest is an extension of bagging that also randomly selects the subset of features considered when building each tree. Both bagging and random forests have proven effective on a wide range of different predictive modeling problems.

Although effective, they are not suited to classification problems with a skewed class distribution. Nevertheless, many modifications to the algorithms have been proposed that adapt their behavior and make them better suited to a severe class imbalance.

In this tutorial, you will discover how to use bagging and random forest for imbalanced classification.

After completing this tutorial, you will know:

  • How to use Bagging with random undersampling for imbalanced classification.
  • How to use Random Forest with class weighting and random undersampling for imbalanced classification.
  • How to use the Easy Ensemble that combines bagging and boosting for imbalanced classification.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Bagging for Imbalanced Classification
    1. Standard Bagging
    2. Bagging With Random Undersampling
  2. Random Forest for Imbalanced Classification
    1. Standard Random Forest
    2. Random Forest With Class Weighting
    3. Random Forest With Bootstrap Class Weighting
    4. Random Forest With Random Undersampling
  3. Easy Ensemble for Imbalanced Classification
    1. Easy Ensemble

Bagging for Imbalanced Classification

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

It involves first selecting random samples of a training dataset with replacement, meaning that a given sample may contain zero, one, or more than one copy of examples in the training dataset. This is called a bootstrap sample. One weak learner model is then fit on each data sample. Typically, decision tree models that do not use pruning (i.e. may overfit their training set slightly) are used as weak learners. Finally, the predictions from all of the fit weak learners are combined to make a single prediction (i.e. aggregated).
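As a small illustration of what a bootstrap sample looks like, the sketch below draws one from a toy set of ten row indices. The toy data and the seed are assumptions chosen only for demonstration.

```python
import numpy as np

# Toy "dataset" of ten row indices; real data would be the rows of X and y
rng = np.random.default_rng(1)
indices = np.arange(10)

# A bootstrap sample has the same size as the dataset and is drawn with replacement
bootstrap = rng.choice(indices, size=len(indices), replace=True)

# Count how many times each original row was selected (zero, one, or more)
counts = np.bincount(bootstrap, minlength=len(indices))
print('bootstrap sample:', bootstrap)
print('times each row was drawn:', counts)
```

Rows with a count of zero are left out of this sample entirely, which is why a rare class can easily be underrepresented in any one bootstrap.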

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the bagged model’s prediction.

— Page 192, Applied Predictive Modeling, 2013.

The process of creating new bootstrap samples and fitting and adding trees to the ensemble can continue until no further improvement is seen in the ensemble’s performance on a validation dataset.

This simple procedure often results in better performance than a single well-configured decision tree algorithm.

Bagging as-is will create bootstrap samples that will not consider the skewed class distribution for imbalanced classification datasets. As such, although the technique performs well in general, it may not perform well if a severe class imbalance is present.

Standard Bagging

Before we dive into exploring extensions to bagging, let’s evaluate a standard bagged decision tree ensemble without any modification and use it as a point of comparison.

We can use the BaggingClassifier class from scikit-learn to create a bagged decision tree model with roughly the configuration described above.

First, let’s define a synthetic imbalanced binary classification problem with 10,000 examples, 99 percent of which are in the majority class and 1 percent are in the minority class.
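A dataset like this could be created with the make_classification() function from scikit-learn. The sketch below is one way to do it; the number of input features, the single cluster per class, and the random seed are assumptions made for illustration and are not stated in the text.

```python
from collections import Counter
from sklearn.datasets import make_classification

# 10,000 examples with an approximate 99:1 class distribution
# (n_features=2, n_clusters_per_class=1, flip_y=0 and random_state=4 are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Summarize the shape of the data and the class distribution
class_counts = Counter(y)
print(X.shape, class_counts)
```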

We can then define the standard bagged decision tree ensemble model ready for evaluation.

We can then evaluate this model using repeated stratified k-fold cross-validation, with three repeats and 10 folds.

We will use the mean ROC AUC score across all folds and repeats to evaluate the performance of the model.

Tying this together, the complete example of evaluating a standard bagged ensemble on the imbalanced classification dataset is listed below.
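A minimal sketch of such an example follows. It assumes the synthetic dataset is generated with make_classification() using two input features and a fixed seed (assumed values), and it uses BaggingClassifier with its default configuration.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 99:1 imbalanced dataset (feature count and seed are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Standard bagged decision trees (a decision tree is the default base model)
model = BaggingClassifier()

# Repeated stratified k-fold cross-validation: 10 folds, three repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate with ROC AUC on each fold and summarize with the mean
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```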

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieves a score of about 0.87.

Bagging With Random Undersampling

There are many ways to adapt bagging for use with imbalanced classification.

Perhaps the most straightforward approach is to apply data resampling on the bootstrap sample prior to fitting the weak learner model. This might involve oversampling the minority class or undersampling the majority class.

An easy way to overcome class imbalance problem when facing the resampling stage in bagging is to take the classes of the instances into account when they are randomly drawn from the original dataset.

— Page 175, Learning from Imbalanced Data Sets, 2018.

Oversampling the minority class in the bootstrap is referred to as OverBagging; likewise, undersampling the majority class in the bootstrap is referred to as UnderBagging, and combining both approaches is referred to as OverUnderBagging.

The imbalanced-learn library provides an implementation of UnderBagging.

Specifically, it provides a version of bagging that uses a random undersampling strategy on the majority class within a bootstrap sample in order to balance the two classes. This is provided in the BalancedBaggingClassifier class.

Next, we can evaluate a modified version of the bagged decision tree ensemble that performs random undersampling of the majority class prior to fitting each decision tree.

We would expect that the use of random undersampling would improve the performance of the ensemble.

The default number of trees (n_estimators) for this model and the previous is 10. In practice, it is a good idea to test larger values for this hyperparameter, such as 100 or 1,000.

The complete example is listed below.

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see a lift on mean ROC AUC from about 0.87 without any data resampling, to about 0.96 with random undersampling of the majority class.

This is not a true apples-to-apples comparison as we are using the same algorithm implementation from two different libraries, but it makes the general point that balancing the bootstrap prior to fitting a weak learner offers some benefit when the class distribution is skewed.

Although the BalancedBaggingClassifier class uses a decision tree by default, you can test different models, such as k-nearest neighbors and more. You can set the base_estimator argument (renamed estimator in more recent releases of the library) when defining the class to use a different weak learner classifier model.

Random Forest for Imbalanced Classification

Random forest is another ensemble of decision tree models and may be considered an improvement upon bagging.

Like bagging, random forest involves selecting bootstrap samples from the training dataset and fitting a decision tree on each. The main difference is that all features (variables or columns) are not used; instead, a small, randomly selected subset of features (columns) is considered at each split point when building a tree. This has the effect of de-correlating the decision trees (making them more independent), and in turn, improving the ensemble prediction.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest’s prediction. Since the algorithm randomly selects predictors at each split, tree correlation will necessarily be lessened.

— Page 199, Applied Predictive Modeling, 2013.

Again, random forest is very effective on a wide range of problems, but like bagging, performance of the standard algorithm is not great on imbalanced classification problems.

In learning extremely imbalanced data, there is a significant probability that a bootstrap sample contains few or even none of the minority class, resulting in a tree with poor performance for predicting the minority class.

Using Random Forest to Learn Imbalanced Data, 2004.

Standard Random Forest

Before we dive into extensions of the random forest ensemble algorithm to make it better suited for imbalanced classification, let’s fit and evaluate a random forest algorithm on our synthetic dataset.

We can use the RandomForestClassifier class from scikit-learn and use a small number of trees, in this case, 10.

The complete example of fitting a standard random forest ensemble on the imbalanced dataset is listed below.
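A minimal sketch of such an example follows, assuming the same synthetic dataset (the feature count and seed are assumed values) and a forest of 10 trees.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 99:1 imbalanced dataset (feature count and seed are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Standard random forest with a small number of trees
model = RandomForestClassifier(n_estimators=10)

# Repeated stratified k-fold cross-validation: 10 folds, three repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate with ROC AUC on each fold and summarize with the mean
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```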

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a mean ROC AUC of about 0.86.

Random Forest With Class Weighting

A simple technique for modifying a decision tree for imbalanced classification is to change the weight that each class has when calculating the “impurity” score of a chosen split point.

Impurity measures how mixed the groups of samples are for a given split in the training dataset and is typically measured with Gini or entropy. The calculation can be biased so that a mixture in favor of the minority class is favored, allowing some false positives for the majority class.

This modification of random forest is referred to as Weighted Random Forest.

Another approach to make random forest more suitable for learning from extremely imbalanced data follows the idea of cost sensitive learning. Since the RF classifier tends to be biased towards the majority class, we shall place a heavier penalty on misclassifying the minority class.

Using Random Forest to Learn Imbalanced Data, 2004.

This can be achieved by setting the class_weight argument on the RandomForestClassifier class.

This argument takes a dictionary with a mapping of each class value (e.g. 0 and 1) to the weighting. The argument value of ‘balanced’ can be provided to automatically use weights that are inversely proportional to the class frequencies in the training dataset, giving focus to the minority class.
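To make the 'balanced' heuristic concrete, the sketch below computes the weights for a 99:1 label vector with the compute_class_weight() helper from scikit-learn, which uses n_samples / (n_classes * count_per_class). The toy label array is an assumption standing in for the training labels.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector with the same 99:1 skew as the test problem (assumed stand-in)
y = np.array([0] * 9900 + [1] * 100)

# 'balanced' weights: n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

# The same mapping expressed as a dictionary, as accepted by the class_weight argument
class_weight = dict(zip([0, 1], weights))
print(class_weight)
```

The minority class receives a weight roughly 99 times larger than the majority class, mirroring the inverse of the class ratio.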

We can test this modification of random forest on our test problem. Although not specific to random forest, we would expect some modest improvement.

The complete example is listed below.
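A minimal sketch of such an example follows, assuming the same synthetic dataset (the feature count and seed are assumed values) and a forest of 10 trees.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 99:1 imbalanced dataset (feature count and seed are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Random forest with class weights computed from the whole training dataset
model = RandomForestClassifier(n_estimators=10, class_weight='balanced')

# Repeated stratified k-fold cross-validation: 10 folds, three repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate with ROC AUC on each fold and summarize with the mean
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```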

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a modest lift in mean ROC AUC from 0.86 to about 0.87.

Random Forest With Bootstrap Class Weighting

Given that each decision tree is constructed from a bootstrap sample (e.g. random selection with replacement), the class distribution in the data sample will be different for each tree.

As such, it might be interesting to change the class weighting based on the class distribution in each bootstrap sample, instead of the entire training dataset.

This can be achieved by setting the class_weight argument to the value ‘balanced_subsample’.

We can test this modification and compare the results to the ‘balanced’ case above; the complete example is listed below.
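A minimal sketch of such an example follows, assuming the same synthetic dataset (the feature count and seed are assumed values) and a forest of 10 trees.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic 99:1 imbalanced dataset (feature count and seed are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)

# Random forest with class weights recomputed from each tree's bootstrap sample
model = RandomForestClassifier(n_estimators=10, class_weight='balanced_subsample')

# Repeated stratified k-fold cross-validation: 10 folds, three repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate with ROC AUC on each fold and summarize with the mean
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv)
print('Mean ROC AUC: %.3f' % mean(scores))
```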

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a modest lift in mean ROC AUC from 0.87 to about 0.88.

Random Forest With Random Undersampling

Another useful modification to random forest is to perform data resampling on the bootstrap sample in order to explicitly change the class distribution.

The BalancedRandomForestClassifier class from the imbalanced-learn library implements this and performs random undersampling of the majority class in each bootstrap sample. This is generally referred to as Balanced Random Forest.

We would expect this to have a more dramatic effect on model performance, given the broader success of data resampling techniques.

We can test this modification of random forest on our synthetic dataset and compare the results. The complete example is listed below.

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a substantial lift in mean ROC AUC from about 0.88 to about 0.97.

Easy Ensemble for Imbalanced Classification

When considering bagged ensembles for imbalanced classification, a natural thought might be to use random resampling of the majority class to create multiple datasets with a balanced class distribution.

Specifically, a dataset can be created from all of the examples in the minority class and a randomly selected sample from the majority class. Then a model or weak learner can be fit on this dataset. The process can be repeated multiple times and the average prediction across the ensemble of models can be used to make predictions.
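The procedure described above can be sketched from scratch with scikit-learn alone. This is only an illustration under stated assumptions: the helper names fit_undersampled_ensemble() and predict_proba_ensemble() are hypothetical, the minority class is assumed to be labeled 1, plain decision trees are used as the weak learner, and a single stratified train/test split stands in for cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_undersampled_ensemble(X, y, n_models=10, seed=1):
    """Fit one tree per balanced subset (hypothetical helper, minority assumed to be class 1)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == 1)
    majority_idx = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_models):
        # Keep every minority row and draw an equal number of majority rows without replacement
        sampled = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        idx = np.concatenate([minority_idx, sampled])
        models.append(DecisionTreeClassifier(random_state=seed).fit(X[idx], y[idx]))
    return models

def predict_proba_ensemble(models, X):
    """Average the predicted probability of the positive class across all models."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Synthetic 99:1 imbalanced dataset (feature count and seed are assumed values)
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, weights=[0.99], flip_y=0,
                           random_state=4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

# Fit the ensemble on balanced subsets and score the averaged probabilities
models = fit_undersampled_ensemble(X_train, y_train)
probs = predict_proba_ensemble(models, X_test)
auc = roc_auc_score(y_test, probs)
print('ROC AUC: %.3f' % auc)
```

Because each model sees a different random draw of the majority class, the ensemble as a whole uses far more of the majority data than any single undersampled model would.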

This is exactly the approach proposed by Xu-Ying Liu, et al. in their 2008 paper titled “Exploratory Undersampling for Class-Imbalance Learning.”

The selective construction of the subsamples is seen as a type of undersampling of the majority class. The generation of multiple subsamples allows the ensemble to overcome the downside of undersampling in which valuable information is discarded from the training process.

… under-sampling is an efficient strategy to deal with class-imbalance. However, the drawback of under-sampling is that it throws away many potentially useful data.

Exploratory Undersampling for Class-Imbalance Learning, 2008.

The authors propose two variations on the approach, called the Easy Ensemble and the Balance Cascade.

Let’s take a closer look at the Easy Ensemble.

Easy Ensemble

The Easy Ensemble involves creating balanced samples of the training dataset by selecting all examples from the minority class and a subset from the majority class.

Rather than using unpruned decision trees as in bagging, boosted decision trees are used on each subset, specifically the AdaBoost algorithm.

AdaBoost works by first fitting a decision tree on the dataset, then determining the errors made by the tree and weighing the examples in the dataset by those errors so that more attention is paid to the misclassified examples and less to the correctly classified examples. A subsequent tree is then fit on the weighted dataset intended to correct the errors. The process is then repeated for a given number of decision trees.

This means that samples that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies these samples. Therefore, each iteration of the algorithm is required to learn a different aspect of the data, focusing on regions that contain difficult-to-classify samples.

— Page 389, Applied Predictive Modeling, 2013.

The EasyEnsembleClassifier class from the imbalanced-learn library provides an implementation of the easy ensemble technique.

We can evaluate the technique on our synthetic imbalanced classification problem.

Given the use of a type of random undersampling, we would expect the technique to perform well in general.

The complete example is listed below.

Running the example evaluates the model and reports the mean ROC AUC score.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the ensemble performs well on the dataset, achieving a mean ROC AUC of about 0.96, close to that achieved on this dataset with random forest with random undersampling (0.97).

Although an AdaBoost classifier is used on each subsample, alternate classifier models can be used by setting the base_estimator argument (renamed estimator in more recent releases of the library) to the model.

Summary

In this tutorial, you discovered how to use bagging and random forest for imbalanced classification.

Specifically, you learned:

  • How to use Bagging with random undersampling for imbalanced classification.
  • How to use Random Forest with class weighting and random undersampling for imbalanced classification.
  • How to use the Easy Ensemble that combines bagging and boosting for imbalanced classification.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

18 Responses to Bagging and Random Forest for Imbalanced Classification

  1. marco February 14, 2020 at 3:13 am #

    Hello Jason,
    I’ve found over the web the Pandas Profiling Tool.
    It seems helpful to perform analysis (it creates also a HTML file with graphs).
    Do you think is enough to perform a data analysis phase in order to start then data preparation and then modeling?
    Thanks

  2. Frank February 15, 2020 at 8:31 pm #

    Hello Jason,
    the sample generated dataset has a normal distribution, yes?
    What if the dataset has a skewed distribution, or is totally irregular?

    • Jason Brownlee February 16, 2020 at 6:06 am #

      Distribution does not matter much for ensembles of decision trees.

      • sst March 11, 2020 at 4:11 pm #

        In which approach do the classification models train on data sets whose distribution is modified in comparison to the distribution of the original training data set?
        bagging
        boosting
        both
        neither

        • Jason Brownlee March 12, 2020 at 8:38 am #

          Sorry, I don’t understand your question. Perhaps you can rephrase it or elaborate?

  3. Carlos February 16, 2020 at 5:31 am #

    Hello Jason,

    As someone asked when you described the case for XGBoost, is it possible to apply this method to multi-class problems?

    And another question, what is the effect of this method in terms of the model calibration? It changes the distributions of the output probabilities, right?

    Thanks,
    Carlos.
    P.S: Regarding the previous question, this kind of “profiling tool” is a new feature in pandas that creates a more detailed HTML output. It is like pandas.describe on steroids. :-)

    • Jason Brownlee February 16, 2020 at 6:16 am #

      I believe so. Try them and see.

      Yes, if you want probabilities you might want to explore calibration.

  4. Igor Franzoni April 9, 2020 at 10:19 pm #

    Hi, Jason!

    First of all thanks for your post!

    In this problem you decided to use the repeated stratified k-fold cross-validation. Is it more convenient than just a plain stratified k-fold cross-validation for an imbalanced problem?

    Best!

    • Jason Brownlee April 10, 2020 at 8:29 am #

      The repetitions can give a less biased estimate of performance in most cases.

  5. Igor Franzoni April 9, 2020 at 11:27 pm #

    Hi, Jason,

    Another question that comes up to my mind is if ROC-AUC is the appropriate measure for this problem. I see that it is increasing, but it would be interesting to check the Precision-Recall curve also, right? I am just facing a problem where ROC-AUC is high (around 0.9), but Precision-Recall area is very low (0.005)…

    Thanks!

  6. Steven Larsson May 19, 2020 at 7:54 am #

    By any chance, do you have a guide on how to write the easy ensemble from scratch?
    I want to combine sampling algorithms with XGB and then bundle it as an ensemble to have an “advanced” easy ensemble but i don’t know how i could do that.

    • Jason Brownlee May 19, 2020 at 1:23 pm #

      Sorry I don’t.

      I don’t expect it to be too challenging to implement. Let me know how you go!

  7. suyash July 23, 2020 at 6:59 pm #

    from imblearn.ensemble import BalanceCascade

    Error
    cannot import name ‘BalanceCascade’ from ‘imblearn.ensemble’

    I am using python 3.8.3. I have also installed the imblearn but unable to import Balance Cascade.
    Please help me on this issue.

  8. suyash July 24, 2020 at 9:47 pm #

    What is the difference between Balance Cascade and Balanced Bagging Classifier?

    • Jason Brownlee July 25, 2020 at 6:18 am #

      I don’t remember off hand, I recommend checking the literature.
