Bagging and Random Forest for Imbalanced Classification

Bagging is an ensemble algorithm that fits multiple models on different subsets of a training dataset, then combines the predictions from all models.

Random forest is an extension of bagging that also randomly selects subsets of features used in each data sample. Both bagging and random forests have proven effective on a wide range of different predictive modeling problems.

Although effective, they are not suited to classification problems with a skewed class distribution. Nevertheless, many modifications to the algorithms have been proposed that adapt their behavior and make them better suited to a severe class imbalance.

In this tutorial, you will discover how to use bagging and random forest for imbalanced classification.

After completing this tutorial, you will know:

  • How to use Bagging with random undersampling for imbalanced classification.
  • How to use Random Forest with class weighting and random undersampling for imbalanced classification.
  • How to use the Easy Ensemble that combines bagging and boosting for imbalanced classification.

Kick-start your project with my new book Imbalanced Classification with Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Updated Jan/2021: Updated links for API documentation.

Bagging and Random Forest for Imbalanced Classification
Photo by Don Graham, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Bagging for Imbalanced Classification
    1. Standard Bagging
    2. Bagging With Random Undersampling
  2. Random Forest for Imbalanced Classification
    1. Standard Random Forest
    2. Random Forest With Class Weighting
    3. Random Forest With Bootstrap Class Weighting
    4. Random Forest With Random Undersampling
  3. Easy Ensemble for Imbalanced Classification
    1. Easy Ensemble

Bagging for Imbalanced Classification

Bootstrap Aggregation, or Bagging for short, is an ensemble machine learning algorithm.

It involves first selecting random samples of a training dataset with replacement, meaning that a given sample may contain zero, one, or more than one copy of examples in the training dataset. This is called a bootstrap sample. One weak learner model is then fit on each data sample. Typically, decision tree models that do not use pruning (and so may overfit their training set slightly) are used as weak learners. Finally, the predictions from all of the fit weak learners are combined (e.g. aggregated) to make a single prediction.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the bagged model’s prediction.

— Page 192, Applied Predictive Modeling, 2013.

The process of creating new bootstrap samples and fitting and adding trees to the sample can continue until no further improvement is seen in the ensemble’s performance on a validation dataset.

This simple procedure often results in better performance than a single well-configured decision tree algorithm.

Bagging as-is will create bootstrap samples that will not consider the skewed class distribution for imbalanced classification datasets. As such, although the technique performs well in general, it may not perform well if a severe class imbalance is present.

Standard Bagging

Before we dive into exploring extensions to bagging, let’s evaluate a standard bagged decision tree ensemble without any modification and use it as a point of comparison.

We can use the BaggingClassifier class from scikit-learn to create a bagged decision tree model with a mostly default configuration.

First, let’s define a synthetic imbalanced binary classification problem with 10,000 examples, 99 percent of which are in the majority class and 1 percent are in the minority class.
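A minimal sketch of such a dataset definition is given below; the two-feature setup and the random seed are illustrative assumptions, not values prescribed by the text.

```python
# define a synthetic imbalanced binary classification dataset
from collections import Counter
from sklearn.datasets import make_classification
# 10,000 examples with a 99 percent / 1 percent class distribution;
# n_features=2 and random_state=4 are illustrative choices
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# summarize the class distribution, e.g. Counter({0: 9900, 1: 100})
print(Counter(y))
```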

We can then define the standard bagged decision tree ensemble model ready for evaluation.

We can then evaluate this model using repeated stratified k-fold cross-validation, with three repeats and 10 folds.

We will use the mean ROC AUC score across all folds and repeats to evaluate the performance of the model.

Tying this together, the complete example of evaluating a standard bagged ensemble on the imbalanced classification dataset is listed below.
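A minimal reconstruction of that example might look as follows; the dataset parameters and seeds are the illustrative assumptions carried over from the dataset definition above.

```python
# evaluate a standard bagged decision tree ensemble on an imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
# generate the dataset: 10,000 examples with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define a bagged decision tree ensemble with the default configuration
model = BaggingClassifier()
# evaluate with repeated stratified k-fold cross-validation: 10 folds, 3 repeats
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# report the mean ROC AUC across all folds and repeats
print('Mean ROC AUC: %.3f' % mean(scores))
```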

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieves a score of about 0.87.

Want to Get Started With Imbalanced Classification?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Bagging With Random Undersampling

There are many ways to adapt bagging for use with imbalanced classification.

Perhaps the most straightforward approach is to apply data resampling on the bootstrap sample prior to fitting the weak learner model. This might involve oversampling the minority class or undersampling the majority class.

An easy way to overcome class imbalance problem when facing the resampling stage in bagging is to take the classes of the instances into account when they are randomly drawn from the original dataset.

— Page 175, Learning from Imbalanced Data Sets, 2018.

Oversampling the minority class in the bootstrap is referred to as OverBagging; likewise, undersampling the majority class in the bootstrap is referred to as UnderBagging, and combining both approaches is referred to as OverUnderBagging.

The imbalanced-learn library provides an implementation of UnderBagging.

Specifically, it provides a version of bagging that uses a random undersampling strategy on the majority class within a bootstrap sample in order to balance the two classes. This is provided in the BalancedBaggingClassifier class.
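As a sketch, swapping in this class is a one-line change, assuming the imbalanced-learn library is installed (e.g. via pip install imbalanced-learn):

```python
# bagged decision trees with random undersampling of the majority class
from imblearn.ensemble import BalancedBaggingClassifier
model = BalancedBaggingClassifier()
```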

Next, we can evaluate a modified version of the bagged decision tree ensemble that performs random undersampling of the majority class prior to fitting each decision tree.

We would expect that the use of random undersampling would improve the performance of the ensemble.

The default number of trees (n_estimators) for this model and the previous one is 10. In practice, it is a good idea to test larger values for this hyperparameter, such as 100 or 1,000.

The complete example is listed below.
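A sketch of that complete example, reusing the same illustrative dataset and evaluation harness as above:

```python
# evaluate bagging with random undersampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import BalancedBaggingClassifier
# generate the dataset: 10,000 examples with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define the ensemble: each bootstrap sample is balanced by randomly
# undersampling the majority class before a decision tree is fit
model = BalancedBaggingClassifier()
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```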

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a lift on mean ROC AUC from about 0.87 without any data resampling, to about 0.96 with random undersampling of the majority class.

This is not a true apples-to-apples comparison as we are using the same algorithm implementation from two different libraries, but it makes the general point that balancing the bootstrap prior to fitting a weak learner offers some benefit when the class distribution is skewed.

Although the BalancedBaggingClassifier class uses a decision tree by default, you can test different models, such as k-nearest neighbors and more. You can set the base_estimator argument when defining the class to use a different weak learner classifier model.
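For example, a sketch using k-nearest neighbors as the weak learner; note that recent imbalanced-learn releases have renamed the base_estimator argument to estimator:

```python
# bagging with undersampling using k-nearest neighbors as the weak learner
from sklearn.neighbors import KNeighborsClassifier
from imblearn.ensemble import BalancedBaggingClassifier
# base_estimator applies to older imbalanced-learn releases;
# newer releases accept the same model via the estimator argument
model = BalancedBaggingClassifier(base_estimator=KNeighborsClassifier())
```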

Random Forest for Imbalanced Classification

Random forest is another ensemble of decision tree models and may be considered an improvement upon bagging.

Like bagging, random forest involves selecting bootstrap samples from the training dataset and fitting a decision tree on each. The main difference is that all features (variables or columns) are not used; instead, a small, randomly selected subset of features (columns) is chosen for each bootstrap sample. This has the effect of de-correlating the decision trees (making them more independent), and in turn, improving the ensemble prediction.

Each model in the ensemble is then used to generate a prediction for a new sample and these m predictions are averaged to give the forest’s prediction. Since the algorithm randomly selects predictors at each split, tree correlation will necessarily be lessened.

— Page 199, Applied Predictive Modeling, 2013.

Again, random forest is very effective on a wide range of problems, but like bagging, performance of the standard algorithm is not great on imbalanced classification problems.

In learning extremely imbalanced data, there is a significant probability that a bootstrap sample contains few or even none of the minority class, resulting in a tree with poor performance for predicting the minority class.

— Using Random Forest to Learn Imbalanced Data, 2004.

Standard Random Forest

Before we dive into extensions of the random forest ensemble algorithm to make it better suited for imbalanced classification, let’s fit and evaluate a random forest algorithm on our synthetic dataset.

We can use the RandomForestClassifier class from scikit-learn and use a small number of trees, in this case, 10.

The complete example of fitting a standard random forest ensemble on the imbalanced dataset is listed below.
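A minimal sketch of that example, under the same illustrative dataset assumptions as before:

```python
# evaluate a standard random forest on the imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate the dataset: 10,000 examples with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define a random forest with a small number of trees
model = RandomForestClassifier(n_estimators=10)
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```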

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a mean ROC AUC of about 0.86.

Random Forest With Class Weighting

A simple technique for modifying a decision tree for imbalanced classification is to change the weight that each class has when calculating the “impurity” score of a chosen split point.

Impurity measures how mixed the groups of samples are for a given split in the training dataset and is typically measured with Gini or entropy. The calculation can be biased so that splits that favor the minority class are preferred, allowing some false positives for the majority class.

This modification of random forest is referred to as Weighted Random Forest.

Another approach to make random forest more suitable for learning from extremely imbalanced data follows the idea of cost sensitive learning. Since the RF classifier tends to be biased towards the majority class, we shall place a heavier penalty on misclassifying the minority class.

— Using Random Forest to Learn Imbalanced Data, 2004.

This can be achieved by setting the class_weight argument on the RandomForestClassifier class.

This argument takes a dictionary with a mapping of each class value (e.g. 0 and 1) to the weighting. The argument value of 'balanced' can be provided to automatically use the inverse weighting from the training dataset, giving focus to the minority class.
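A sketch of both options; the explicit 1:100 weighting is an illustrative guess matching the 99:1 class distribution, not a tuned value:

```python
from sklearn.ensemble import RandomForestClassifier
# explicit weighting: penalize errors on the minority class (1) 100 times
# more heavily than errors on the majority class (0)
weighted = RandomForestClassifier(n_estimators=10, class_weight={0: 1, 1: 100})
# or derive weights automatically, inversely proportional to class frequency
balanced = RandomForestClassifier(n_estimators=10, class_weight='balanced')
```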

We can test this modification of random forest on our test problem. Although class weighting is not specific to random forest, we would expect some modest improvement.

The complete example is listed below.
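A sketch of the complete example with balanced class weighting, under the same illustrative dataset assumptions:

```python
# evaluate a random forest with class weighting on the imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# weight classes inversely proportional to their frequency in the training data
model = RandomForestClassifier(n_estimators=10, class_weight='balanced')
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```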

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a modest lift in mean ROC AUC from 0.86 to about 0.87.

Random Forest With Bootstrap Class Weighting

Given that each decision tree is constructed from a bootstrap sample (e.g. random selection with replacement), the class distribution in the data sample will be different for each tree.

As such, it might be interesting to change the class weighting based on the class distribution in each bootstrap sample, instead of the entire training dataset.

This can be achieved by setting the class_weight argument to the value 'balanced_subsample'.

We can test this modification and compare the results to the ‘balanced’ case above; the complete example is listed below.
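In a sketch of that example, only the class_weight value changes from the previous example:

```python
# evaluate a random forest with bootstrap class weighting
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
# generate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# compute class weights from each bootstrap sample, not the whole dataset
model = RandomForestClassifier(n_estimators=10,
	class_weight='balanced_subsample')
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```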

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a modest lift in mean ROC AUC from 0.87 to about 0.88.

Random Forest With Random Undersampling

Another useful modification to random forest is to perform data resampling on the bootstrap sample in order to explicitly change the class distribution.

The BalancedRandomForestClassifier class from the imbalanced-learn library implements this and performs random undersampling of the majority class in each bootstrap sample. This is generally referred to as Balanced Random Forest.

We would expect this to have a more dramatic effect on model performance, given the broader success of data resampling techniques.

We can test this modification of random forest on our synthetic dataset and compare the results. The complete example is listed below.
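A sketch of the complete example using the imbalanced-learn implementation, under the same illustrative assumptions:

```python
# evaluate a balanced random forest on the imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import BalancedRandomForestClassifier
# generate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# each tree is fit on a bootstrap sample with the majority class undersampled
model = BalancedRandomForestClassifier(n_estimators=10)
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```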

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a large lift in mean ROC AUC from about 0.88 to about 0.97.

Easy Ensemble for Imbalanced Classification

When considering bagged ensembles for imbalanced classification, a natural thought might be to use random resampling of the majority class to create multiple datasets with a balanced class distribution.

Specifically, a dataset can be created from all of the examples in the minority class and a randomly selected sample from the majority class. Then a model or weak learner can be fit on this dataset. The process can be repeated multiple times and the average prediction across the ensemble of models can be used to make predictions.

This is exactly the approach proposed by Xu-Ying Liu, et al. in their 2008 paper titled “Exploratory Undersampling for Class-Imbalance Learning.”

The selective construction of the subsamples is seen as a type of undersampling of the majority class. The generation of multiple subsamples allows the ensemble to overcome the downside of undersampling in which valuable information is discarded from the training process.

… under-sampling is an efficient strategy to deal with class-imbalance. However, the drawback of under-sampling is that it throws away many potentially useful data.

— Exploratory Undersampling for Class-Imbalance Learning, 2008.

The authors propose variations on the approach, such as the Easy Ensemble and the Balance Cascade.

Let’s take a closer look at the Easy Ensemble.

Easy Ensemble

The Easy Ensemble involves creating balanced samples of the training dataset by selecting all examples from the minority class and a subset from the majority class.

Rather than using pruned decision trees, boosted decision trees are used on each subset, specifically the AdaBoost algorithm.

AdaBoost works by first fitting a decision tree on the dataset, then determining the errors made by the tree and weighing the examples in the dataset by those errors so that more attention is paid to the misclassified examples and less to the correctly classified examples. A subsequent tree is then fit on the weighted dataset intended to correct the errors. The process is then repeated for a given number of decision trees.

This means that samples that are difficult to classify receive increasingly larger weights until the algorithm identifies a model that correctly classifies these samples. Therefore, each iteration of the algorithm is required to learn a different aspect of the data, focusing on regions that contain difficult-to-classify samples.

— Page 389, Applied Predictive Modeling, 2013.
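On its own, the boosted weak learner used within each subset can be sketched with scikit-learn's AdaBoostClassifier; the tree count here is an illustrative assumption:

```python
# AdaBoost: a sequence of decision trees, each correcting its predecessors
from sklearn.ensemble import AdaBoostClassifier
model = AdaBoostClassifier(n_estimators=10)
```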

The EasyEnsembleClassifier class from the imbalanced-learn library provides an implementation of the easy ensemble technique.

We can evaluate the technique on our synthetic imbalanced classification problem.

Given the use of a type of random undersampling, we would expect the technique to perform well in general.

The complete example is listed below.
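A sketch of the complete example, again under the same illustrative dataset and evaluation assumptions:

```python
# evaluate the easy ensemble on the imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.ensemble import EasyEnsembleClassifier
# generate the dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# fit an AdaBoost ensemble on each balanced, undersampled subset
model = EasyEnsembleClassifier(n_estimators=10)
# evaluate with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```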

Running the example evaluates the model and reports the mean ROC AUC score.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the ensemble performs well on the dataset, achieving a mean ROC AUC of about 0.96, close to the score achieved by the random forest with random undersampling (0.97).

Although an AdaBoost classifier is used on each subsample, alternate classifier models can be used by setting the base_estimator argument to the model.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • Using Random Forest to Learn Imbalanced Data, 2004.
  • Exploratory Undersampling for Class-Imbalance Learning, 2008.

Books

  • Applied Predictive Modeling, 2013.
  • Learning from Imbalanced Data Sets, 2018.

APIs

  • sklearn.ensemble.BaggingClassifier API.
  • sklearn.ensemble.RandomForestClassifier API.
  • imblearn.ensemble.BalancedBaggingClassifier API.
  • imblearn.ensemble.BalancedRandomForestClassifier API.
  • imblearn.ensemble.EasyEnsembleClassifier API.

Summary

In this tutorial, you discovered how to use bagging and random forest for imbalanced classification.

Specifically, you learned:

  • How to use Bagging with random undersampling for imbalanced classification.
  • How to use Random Forest with class weighting and random undersampling for imbalanced classification.
  • How to use the Easy Ensemble that combines bagging and boosting for imbalanced classification.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Imbalanced Classification!

Imbalanced Classification with Python

Develop Imbalanced Learning Models in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:
Imbalanced Classification with Python

It provides self-study tutorials and end-to-end projects on:
Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, Cost-Sensitive Algorithms
and much more...

Bring Imbalanced Classification Methods to Your Machine Learning Projects

See What's Inside

36 Responses to Bagging and Random Forest for Imbalanced Classification

  1. marco February 14, 2020 at 3:13 am

    Hello Jason,
    I’ve found over the web the Pandas Profiling Tool.
    It seems helpful to perform analysis (it creates also a HTML file with graphs).
    Do you think it is enough to perform a data analysis phase in order to then start data preparation and modeling?
    Thanks

  2. Frank February 15, 2020 at 8:31 pm

    Hello Jason,
    the sample generated dataset has a normal distribution, yes?
    What if the dataset has a skewed distribution, or is totally irregular?

    • Jason Brownlee February 16, 2020 at 6:06 am

      Distribution does not matter much for ensembles of decision trees.

      • sst March 11, 2020 at 4:11 pm

        In which approach do the classification models train on datasets whose distribution is modified in comparison to the distribution of the original training dataset:
        bagging
        boosting
        both
        neither

        • Jason Brownlee March 12, 2020 at 8:38 am

          Sorry, I don’t understand your question. Perhaps you can rephrase it or elaborate?

  3. Carlos February 16, 2020 at 5:31 am

    Hello Jason,

    As someone asked when you described the case for XGBoost, is it possible to apply this method to multi-class problems?

    And another question: what is the effect of this method in terms of model calibration? It changes the distributions of the output probabilities, right?

    Thanks,
    Carlos.
    P.S: Regarding the previous question, this kind of “profiling tool” is a new feature in pandas that creates a more detailed output HTML. It is like pandas.describe on steroids. :-)

    • Jason Brownlee February 16, 2020 at 6:16 am

      I believe so. Try them and see.

      Yes, if you want probabilities you might want to explore calibration.

  4. Igor Franzoni April 9, 2020 at 10:19 pm

    Hi, Jason!

    First of all thanks for your post!

    In this problem you decided to use the repeated stratified k-fold cross-validation. Is it more convenient than just a plain stratified k-fold cross-validation for an imbalanced problem?

    Best!

    • Jason Brownlee April 10, 2020 at 8:29 am

      The repetitions can give a less biased estimate of performance in most cases.

  5. Igor Franzoni April 9, 2020 at 11:27 pm

    Hi, Jason,

    Another question that comes up to my mind is if ROC-AUC is the appropriate measure for this problem. I see that it is increasing, but it would be interesting to check the Precision-Recall curve also, right? I am just facing a problem where ROC-AUC is high (around 0.9), but Precision-Recall area is very low (0.005)…

    Thanks!

  6. Steven Larsson May 19, 2020 at 7:54 am

    By any chance, do you have a guide on how to write the easy ensemble from scratch?
    I want to combine sampling algorithms with XGB and then bundle it as an ensemble to have an “advanced” easy ensemble, but I don’t know how I could do that.

    • Jason Brownlee May 19, 2020 at 1:23 pm

      Sorry I don’t.

      I don’t expect it to be too challenging to implement. Let me know how you go!

  7. suyash July 23, 2020 at 6:59 pm

    from imblearn.ensemble import BalanceCascade

    Error
    cannot import name ‘BalanceCascade’ from ‘imblearn.ensemble’

    I am using python 3.8.3. I have also installed the imblearn but unable to import Balance Cascade.
    Please help me on this issue.

  8. suyash July 24, 2020 at 9:47 pm

    What is the difference between Balance Cascade and Balanced Bagging Classifier?

    • Jason Brownlee July 25, 2020 at 6:18 am

      I don’t remember off hand, I recommend checking the literature.

  9. elham November 9, 2020 at 1:09 am

    I do not understand at what stage of the BalancedBaggingClassifier the undersampling occurs. When a random sample is taken from the main dataset, are the positive class and the negative class balanced, or is the main dataset balanced from the beginning and then a random sample taken?

    • Jason Brownlee November 9, 2020 at 6:13 am

      Good question.

      Bagging draws samples from your dataset many times in order to create each tree in the ensemble.

      Balanced bagging ensures that each sample that is drawn and used to train a tree is balanced.

  10. Carlos G November 22, 2020 at 12:25 am

    Hi Jason,

    Thanks for the great post! What explains the significant performance difference between a Random Forest with undersampling vs a Random Forest with balanced class weighting? I would’ve expected class weighting to achieve a similar purpose to undersampling, but without losing information by leaving data points out of the training set.

    Do you have any resource suggestions for learning more about the difference between these two approaches?

    • Jason Brownlee November 22, 2020 at 6:55 am

      You’re welcome.

      We can’t explain why one model does better than another for a given dataset. If we could – we would then be able to choose algorithms for datasets – which we cannot.

      The difference between the approaches is understood at the implementation level only – as described in the above tutorial.

      • Carlos G November 25, 2020 at 11:42 pm

        Thanks Jason. Could you kindly elaborate on your point? I understand predicting actual performance on a particular problem is nearly impossible, but can we pick the algorithms most likely to work for a particular dataset, based on our understanding of how the algorithms work?

        • Jason Brownlee November 26, 2020 at 6:35 am

          Knowing how an algorithm works (theory/implementation) does not help you configure it and does not help you choose when to use it. If it did, then academics would win every kaggle competition. They don’t. They rarely do well as they are stuck using their favorite method.

          Instead, the best process available is trial and error (experimentation) and discover what works best for a dataset. This is the field of applied machine learning.

          Perhaps this will help:
          https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

    • Tim Martin October 3, 2021 at 5:22 am

      Though we don’t know why underbagging worked better than weighting in this particular case, there is a theoretical explanation for why this sort of thing works at all. It’s explained very simply here (https://www.svds.com/tbt-learning-imbalanced-classes/), in the section titled “Bayesian argument of Wallace et al.”

      Hope that’s helpful.

  11. nabila March 23, 2021 at 3:28 pm

    Hello Jason,
    Thanks for the great post! I want to ask you: how can I do SMOTE bagging SVM and SMOTE boosting SVM in Python?

  12. Nisha November 14, 2021 at 2:30 pm

    When I tried the BalancedRandomForestClassifier from imblearn, I got the error ‘AttributeError: can’t set attribute’. This post is from Jan 2021, so I am wondering if anyone else has the same issue and, if so, how it can be resolved. Any suggestions?

    • Adrian Tam November 14, 2021 at 3:07 pm

      I’ve checked but I don’t see any error.

  13. Nisha November 15, 2021 at 1:48 am

    Could you post the version of imblearn package being used in this code?

    • Adrian Tam November 15, 2021 at 2:56 am

      imbalanced-learn 0.8.1

  14. Eva December 10, 2021 at 11:41 pm

    Hi Jason,

    I don’t understand the difference between ‘resampling’ and ‘with replacement’. Do they mean the same thing? I am using MATLAB and the function ‘fitcensemble’ to create my RF model, which has the options ‘Replace’ and ‘Resample’ to specify as ‘on’ or ‘off’, so this implies that they are different things, but I don’t understand this difference.

    Thanks in advance!

    • Adrian Tam December 15, 2021 at 5:44 am

      No. Given a deck of 52 poker cards, picking 5 from it is sampling without replacement. Picking one, putting it back, and doing this 5 times is sampling with replacement.

    • James Carmichael December 21, 2021 at 11:45 pm

      Hi Eva…Thank you for your question! The following is an excellent source for understanding these terms and their applications.

      https://web.ma.utexas.edu/users/parker/sampling/repl.htm

      Let me know if you would like more information.

      Regards,

  15. Mike April 18, 2024 at 5:39 am

    Jason, This dude here stole your content. Google the article “Random Forest for Learning Imbalanced Data” by Manish Prasad. He copied it 1:1

    • James Carmichael April 18, 2024 at 8:48 am

      Thank you Mike!
