Random Oversampling and Undersampling for Imbalanced Classification

Last Updated on August 28, 2020

Imbalanced datasets are those where there is a severe skew in the class distribution, such as 1:100 or 1:1000 examples in the minority class to the majority class.

This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem as it is typically the minority class on which predictions are most important.

One approach to addressing the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are to delete examples from the majority class, called undersampling, and to duplicate examples from the minority class, called oversampling.

In this tutorial, you will discover random oversampling and undersampling for imbalanced classification.

After completing this tutorial, you will know:

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Let’s get started.

Random Oversampling and Undersampling for Imbalanced Classification
Photo by RichardBH, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Random Resampling Imbalanced Datasets
  2. Imbalanced-Learn Library
  3. Random Oversampling Imbalanced Datasets
  4. Random Undersampling Imbalanced Datasets
  5. Combining Random Oversampling and Undersampling

Random Resampling Imbalanced Datasets

Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.

This is a simple and effective strategy for imbalanced classification problems.

Applying re-sampling strategies to obtain a more balanced data distribution is an effective solution to the imbalance problem

A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

The simplest strategy is to choose examples for the transformed dataset randomly, called random resampling.

There are two main approaches to random resampling for imbalanced classification; they are oversampling and undersampling.

  • Random Oversampling: Randomly duplicate examples in the minority class.
  • Random Undersampling: Randomly delete examples in the majority class.

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.

In the random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

Both approaches can be repeated until the desired class distribution is achieved in the training dataset, such as an equal split across the classes.
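
The two procedures described above can be sketched in a few lines of plain Python. This is only an illustration using the standard library, not the implementation used later in the tutorial; the function names random_oversample and random_undersample and the toy 1,000-to-10 dataset are assumptions made for the example.

```python
# illustrative sketch of naive random resampling using only the standard library
import random

def random_oversample(majority, minority, rng):
    """Duplicate randomly chosen minority examples (with replacement) until the classes match."""
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority, minority + extra

def random_undersample(majority, minority, rng):
    """Keep a random subset of the majority class the same size as the minority class."""
    kept = rng.sample(majority, k=len(minority))
    return kept, minority

rng = random.Random(1)
# hypothetical toy data: 1,000 majority examples and 10 minority examples
majority = [("maj", i) for i in range(1000)]
minority = [("min", i) for i in range(10)]

over_major, over_minor = random_oversample(majority, minority, rng)
under_major, under_minor = random_undersample(majority, minority, rng)
print(len(over_major), len(over_minor))    # 1000 1000
print(len(under_major), len(under_minor))  # 10 10
```

Note that the oversampled minority class contains only copies of the original 10 examples, whereas the undersampled majority class discards 990 examples outright.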

They are referred to as “naive resampling” methods because they assume nothing about the data and no heuristics are used. This makes them simple to implement and fast to execute, which is desirable for very large and complex datasets.

Both techniques can be used for two-class (binary) classification problems and multi-class classification problems with one or more majority or minority classes.

Importantly, the change to the class distribution is only applied to the training dataset. The intent is to influence the fit of the models. The resampling is not applied to the test or holdout dataset used to evaluate the performance of a model.

Generally, these naive methods can be effective, although that depends on the specifics of the dataset and models involved.

Let’s take a closer look at each method and how to use them in practice.

Imbalanced-Learn Library

In these examples, we will use the implementations provided by the imbalanced-learn Python library, which can be installed via pip, for example by running pip install imbalanced-learn on the command line.

You can confirm that the installation was successful by printing the version of the installed library:

Running the example will print the version number of the installed library.

Random Oversampling Imbalanced Datasets

Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.

Examples from the training dataset are selected randomly with replacement. This means that examples from the minority class can be chosen and added to the new “more balanced” training dataset multiple times; they are selected from the original training dataset, added to the new training dataset, and then returned or “replaced” in the original dataset, allowing them to be selected again.

This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model. This might include algorithms that iteratively learn coefficients, like artificial neural networks that use stochastic gradient descent. It can also affect models that seek good splits of the data, such as support vector machines and decision trees.

It might be useful to tune the target class distribution. In some cases, seeking a balanced distribution for a severely imbalanced dataset can cause affected algorithms to overfit the minority class, leading to increased generalization error. The effect can be better performance on the training dataset, but worse performance on the holdout or test dataset.

… the random oversampling may increase the likelihood of occurring overfitting, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.

— Page 83, Learning from Imbalanced Data Sets, 2018.

As such, to gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.

The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering the model is seeing the same examples in the training dataset again and again.

… in random over-sampling, a random set of copies of minority class examples is added to the data. This may increase the likelihood of overfitting, specially for higher over-sampling rates. Moreover, it may decrease the classifier performance and increase the computational effort.

A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

Random oversampling can be implemented using the RandomOverSampler class.

The class can be defined and takes a sampling_strategy argument that can be set to “minority” to automatically balance the minority class with majority class or classes.

For example:

This means that if the majority class had 1,000 examples and the minority class had 100, this strategy would oversample the minority class so that it has 1,000 examples.

A floating point value can be specified to indicate the ratio of minority class examples to majority class examples in the transformed dataset. For example:

This would ensure that the minority class was oversampled to have half the number of examples as the majority class, for binary classification problems. This means that if the majority class had 1,000 examples and the minority class had 100, the transformed dataset would have 500 examples of the minority class.

The class is like a scikit-learn transform object in that it is fit on a dataset, then used to generate a new or transformed dataset. Unlike the scikit-learn transforms, it will change the number of examples in the dataset, not just the values (like a scaler) or number of features (like a projection).

For example, it can be fit and applied in one step by calling the fit_resample() function:

We can demonstrate this on a simple synthetic binary classification problem with a 1:100 class imbalance.

The complete example of defining the dataset and performing random oversampling to balance the class distribution is listed below.

Running the example first creates the dataset, then summarizes the class distribution. We can see that there are nearly 10K examples in the majority class and 100 examples in the minority class.

Then the random oversample transform is defined to balance the minority class, then fit and applied to the dataset. The class distribution for the transformed dataset is reported showing that now the minority class has the same number of examples as the majority class.

This transform can be used as part of a Pipeline to ensure that it is only applied to the training dataset as part of each split in a k-fold cross validation.

A traditional scikit-learn Pipeline cannot be used; instead, a Pipeline from the imbalanced-learn library can be used. For example:

The example below provides a complete example of evaluating a decision tree on an imbalanced dataset with a 1:100 class distribution.

The model is evaluated using repeated 10-fold cross-validation with three repeats, and the oversampling is performed on the training dataset within each fold separately, ensuring that there is no data leakage as might occur if the oversampling was performed prior to the cross-validation.

Running the example evaluates the decision tree model on the imbalanced dataset with oversampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test oversampling with your dataset and learning algorithm, rather than optimally solve the synthetic dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The default oversampling strategy is used, which balances the minority classes with the majority class. The F1 score averaged across each fold and each repeat is reported.

Now that we are familiar with oversampling, let’s take a look at undersampling.

Random Undersampling Imbalanced Datasets

Random undersampling involves randomly selecting examples from the majority class to delete from the training dataset.

This has the effect of reducing the number of examples in the majority class in the transformed version of the training dataset. This process can be repeated until the desired class distribution is achieved, such as an equal number of examples for each class.

This approach may be more suitable for datasets where, despite the class imbalance, there are enough examples in the minority class that a useful model can still be fit.

A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary. Given that examples are deleted randomly, there is no way to detect or preserve “good” or more information-rich examples from the majority class.

… in random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between minority and majority instances harder to learn, resulting in a loss in classification performance.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

The random undersampling technique can be implemented using the RandomUnderSampler imbalanced-learn class.

The class can be used just like the RandomOverSampler class in the previous section, except the strategies impact the majority class instead of the minority class. For example, setting the sampling_strategy argument to “majority” will undersample the majority class determined by the class with the largest number of examples.

For example, a dataset with 1,000 examples in the majority class and 100 examples in the minority class will be undersampled such that both classes would have 100 examples in the transformed training dataset.

We can also set the sampling_strategy argument to a floating point value, which is interpreted as a ratio: the number of examples in the minority class divided by the number of examples in the majority class after resampling. For example, if we set sampling_strategy to 0.5 on an imbalanced dataset with 1,000 examples in the majority class and 100 examples in the minority class, then there would be 200 examples for the majority class in the transformed dataset (100/200 = 0.5).

This might be preferred to ensure that the resulting dataset is both large enough to fit a reasonable model, and that not too much useful information from the majority class is discarded.

In random under-sampling, one might attempt to create a balanced class distribution by selecting 90 majority class instances at random to be removed. The resulting dataset will then consist of 20 instances: 10 (randomly remaining) majority class instances and (the original) 10 minority class instances.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

The transform can then be fit and applied to a dataset in one step by calling the fit_resample() function and passing the untransformed dataset as arguments.

We can demonstrate this on a dataset with a 1:100 class imbalance.

The complete example is listed below.

Running the example first creates the dataset and reports the imbalanced class distribution.

The transform is fit and applied on the dataset and the new class distribution is reported. We can see that the majority class is undersampled to have the same number of examples as the minority class.

Judgment and empirical results will have to be used as to whether a training dataset with just 200 examples would be sufficient to train a model.

This undersampling transform can also be used in a Pipeline, like the oversampling transform from the previous section.

This allows the transform to be applied to the training dataset only using evaluation schemes such as k-fold cross-validation, avoiding any data leakage in the evaluation of a model.

We can define an example of fitting a decision tree on an imbalanced classification dataset with the undersampling transform applied to the training dataset on each split of a repeated 10-fold cross-validation.

The complete example is listed below.

Running the example evaluates the decision tree model on the imbalanced dataset with undersampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The default undersampling strategy is used, which balances the majority classes with the minority class. The F1 score averaged across each fold and each repeat is reported.

Combining Random Oversampling and Undersampling

Interesting results may be achieved by combining both random oversampling and undersampling.

For example, a modest amount of oversampling can be applied to the minority class to improve the bias towards these examples, whilst also applying a modest amount of undersampling to the majority class to reduce the bias on that class.

This can result in improved overall performance compared to using either technique in isolation.

For example, if we had a dataset with a 1:100 class distribution, we might first apply oversampling to increase the ratio to 1:10 by duplicating examples from the minority class, then apply undersampling to further improve the ratio to 1:2 by deleting examples from the majority class.

This could be implemented using imbalanced-learn by using a RandomOverSampler with sampling_strategy set to 0.1 (10%), then using a RandomUnderSampler with a sampling_strategy set to 0.5 (50%). For example:

We can demonstrate this on a synthetic dataset with a 1:100 class distribution. The complete example is listed below:

Running the example first creates the synthetic dataset and summarizes the class distribution, showing an approximate 1:100 class distribution.

Then oversampling is applied, increasing the distribution from about 1:100 to about 1:10. Finally, undersampling is applied, further improving the class distribution from 1:10 to about 1:2.

We might also want to apply this same hybrid approach when evaluating a model using k-fold cross-validation.

This can be achieved by using a Pipeline with a sequence of transforms and ending with the model that is being evaluated; for example:

We can demonstrate this with a decision tree model on the same synthetic dataset.

The complete example is listed below.

Running the example evaluates a decision tree model using repeated k-fold cross-validation where the training dataset is transformed, first using oversampling, then undersampling, for each split and repeat performed. The F1 score averaged across each fold and each repeat is reported.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test combined oversampling and undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

Summary

In this tutorial, you discovered random oversampling and undersampling for imbalanced classification.

Specifically, you learned:

  • Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
  • Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
  • Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

34 Responses to Random Oversampling and Undersampling for Imbalanced Classification

  1. Pierrick Pochelu January 15, 2020 at 11:10 am #

    Hi, how oversampling perform versus weighted loss for increase train on rare class. Experimental test have been done ?

    • Jason Brownlee January 15, 2020 at 1:40 pm #

      It differs from problem to problem.

      The best you can do is use controlled experiments on your dataset to discover what works best.

  2. marco January 16, 2020 at 4:33 am #

    Hello Jason,
    i’m trying to understand the following example.
    I’m confused about the first piece of code. It seems to me that cv = 5 is in both examples. The result is the same. So are they the same? What is the difference (to me there is no difference).

    scores = cross_val_score(xgbr, xtrain,ytrain,cv=5)
    print(“Mean cross-validation score: %.2f” % scores.mean())
    Mean cross-validataion score: 0.87

    #Cross-validation with a k-fold method can be checked as a following.
    kfold = KFold(n_splits=5, shuffle=True)
    kf_cv_scores = cross_val_score(xgbr, xtrain, ytrain, cv=kfold )
    print(“K-fold CV average score: %.2f” % kf_cv_scores.mean())
    K-fold CV average score: 0.87

    Thanks,
    Marco

  3. Markus January 20, 2020 at 3:10 am #

    Hi

    While trying out the example of evaluating a model with random oversampling and undersampling, I changed the order of oversampling and undersampling as following:

    steps = [(‘u’, under), (‘o’, over), (‘m’, DecisionTreeClassifier())]

    And then the F1 score became nan with the following warning:

    FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
    ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio.

    Do you know why?

    Thanks

    • Jason Brownlee January 20, 2020 at 8:43 am #

      Not off hand, perhaps experiment/investigate to discover the answer?

    • Abhilash August 25, 2020 at 5:03 am #

      I got the same message early on; typically this is because arrival at the specified ratio would require undersampling and not oversampling. Please check a) more instances than expected in minority class, to begin with, b) you have already done oversampling on data using a different technique and the specified step comes after that.

    • Sunix August 27, 2020 at 7:36 pm #

      After you exchanged the order and got nan, try this pipeline.fit(X,y), the you got error info, it seems you need to change the ratio.Here’s the info I got:

      ValueError: The specified ratio required to remove samples from the minority class while trying to generate new samples. Please increase the ratio.

  4. San February 7, 2020 at 5:23 am #

    Hi

    I applied random over sampling on my dataset because it has an imbalanced distribution, with sampling_strategy=’minority’. In particular I’m dealing with a multi-class classification problem & has 22 classes. After training a random forest model I’m making making predictions using it.

    According to the classification report, the model’s accuracy is very low (0.44) & always for 2 classes the precision and recall are high and for 5 classes they are medium and for the rest of the classes both the precision & recall are 0.0.

    Can I know what I’m doing wrong here?

    So far I have basically,
    1. filled the missing values
    2. applied one-hot encoding
    3. performed train_test_split
    4. applied random oversampling on the training data
    5. trained models
    6. Selected a model
    7. made predictions

    Thanks
    San

  5. San February 10, 2020 at 3:45 am #

    Hi

    The precision and recall of classes have increased a bit now. My dataset has 6 minority classes & those classes have either 1 or 2 instances. After I applied random over sampling I noticed that the instance of 1 minority class, has increased from 1 to 56, while all the other minority classes remain as it is. The precision & recall for the minority classes still remain 0.0.

    Also I’m unable to apply SMOTE or any other technique related with SMOTE, since I’m getting this error ValueError: Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6. As there are classes with less than 6 instances in my dataset.

    I want to improve the overall performance of the model, including for those minority classes. Can you suggest me an approach to solve this problem?

    Thanks
    San

    • Jason Brownlee February 10, 2020 at 6:36 am #

      Nice work.

      I wonder if you can drop the class that only has 6 examples? It does not sound like enough data.

  6. Tapan Jain April 26, 2020 at 4:33 am #

    Hi Jason,

    I understand that random oversampling leads to overfitting. But I am unable to wrap my head around how this happens mathematically while training. Could you please throw some more light on this.

    Thanks in advance.

    • Jason Brownlee April 26, 2020 at 6:20 am #

      Models can overfit when they have duplicates in the training data, as they will put too much weight on these examples at the expense of increased generalization error.

  7. Tapan Jain April 27, 2020 at 4:01 am #

    Thanks for the response Jason. I understand that the loss would be more weighted towards duplicate samples and thus biasing the weights learned.

    Could you also help me understand which class of models are impacted by class imbalance and why they predict only the majority class and not the minority class. I understand it intuitively but looking for a mathematical explanation.

    Thanks!!

    • Jason Brownlee April 27, 2020 at 5:39 am #

      Thanks for the suggestion. I might prepare something in the future.

  8. Tapan Jain April 27, 2020 at 7:11 am #

    Thanks Jason!!

  9. Meenakshi May 3, 2020 at 7:10 pm #

    Hi Jason Brownie,
    Thank you for your wonderful lessons and they are easy to follow. Can you please highlight which line you mention the specific class as minority, I mean, in case of multi-label classification problem how to mention if more than one class is imbalanced.
    Thanks in advance.

  10. Claudemir May 16, 2020 at 4:33 am #

    Hello, thanks for the article.
    Is it possible that a over and undersampling applied to a set with a 1:1000 binary class distribution actually decreases the models perfomance? I’m using the area under the precision-recall curve for scoring and the classifiers: KNeighboors, Random Forest, Logistic Regression, Stochastic GD,LinearSVM and this happens to all of them.

  11. Zina May 17, 2020 at 2:05 am #

    Hi
    Thanks for the article.
    Can I use one of the similarity measure techniques to undersample the majority data?

  12. Omar June 12, 2020 at 10:03 pm #

    Thanks Jason.

    One question, if we do random undersampling or oversampling (or even SMOTE), will accuracy give us any insight? Or in another words, do we even need to calculate accuracy after random oversampling/undersampling?

    I haven’t found any literature regarding the connection between doing random over/undersampling and using other metric like recall, precision, or F1.

  13. James Hutton July 26, 2020 at 12:30 am #

    Hi Jason,

    Any clue on the use of these techniques (under/over/hybrid sampling) to which type of imbalance datasets? kind of rule of thumb? Or it is heavily can be evaluated from computation-intensive experimentation?

    Thanks

  14. Venkatesh Gandi August 20, 2020 at 1:47 am #

    Hi Jason,

    Thanks for the great content.

    Can you please let me know why the traditional sklearn pipeline will not be used? What happens if we use sklearn pipeline?

    • Jason Brownlee August 20, 2020 at 6:49 am #

      The sklearn pipeline does not allow you to change the number of rows, the imbalanced learn pipeline does.
