Why Is Imbalanced Classification Difficult?

Imbalanced classification is primarily challenging as a predictive modeling task because of the severely skewed class distribution.

This is the cause for poor performance with traditional machine learning models and evaluation metrics that assume a balanced class distribution.

Nevertheless, there are additional properties of a classification dataset that are not only challenging for predictive modeling but also increase or compound the difficulty when modeling imbalanced datasets.

In this tutorial, you will discover data characteristics that compound the challenge of imbalanced classification.

After completing this tutorial, you will know:

  • Imbalanced classification is specifically hard because of the severely skewed class distribution and the unequal misclassification costs.
  • The difficulty of imbalanced classification is compounded by properties such as dataset size, label noise, and data distribution.
  • How to develop an intuition for the compounding effects on modeling difficulty posed by different dataset properties.

Let’s get started.

Problem Characteristics That Make Imbalanced Classification Hard
Photo by Joshua Damasio, some rights reserved.

Tutorial Overview

This tutorial is divided into four parts; they are:

  1. Why Imbalanced Classification Is Hard
  2. Compounding Effect of Dataset Size
  3. Compounding Effect of Label Noise
  4. Compounding Effect of Data Distribution

Why Imbalanced Classification Is Hard

Imbalanced classification is defined by a dataset with a skewed class distribution.

This is often exemplified by a binary (two-class) classification task where most of the examples belong to class 0 with only a few examples in class 1. The distribution may range in severity from 1:2 to 1:10, 1:100, or even 1:1000.

Because the class distribution is not balanced, most machine learning algorithms will perform poorly and require modification to avoid simply predicting the majority class in all cases. Additionally, metrics like classification accuracy lose their meaning, and alternate methods for evaluating predictions on imbalanced examples are required, such as the area under the ROC curve.

This is the foundational challenge of imbalanced classification.

  • Skewed Class Distribution
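
To see why accuracy can mislead on skewed data, consider a minimal sketch in plain Python. It assumes a hypothetical 1:100 test set of 1,000 examples and a naive model that always predicts the majority class; the numbers are illustrative and not taken from a real dataset.

```python
# Hypothetical test set: 990 majority (class 0) and 10 minority (class 1) examples.
y_true = [0] * 990 + [1] * 10

# A naive model that predicts the majority class for every example.
y_pred = [0] * len(y_true)

# Accuracy: fraction of all predictions that are correct.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall on the minority class: fraction of class 1 examples that were found.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_positives / sum(y_true)

print('Accuracy: %.3f' % accuracy)
print('Minority recall: %.3f' % recall)
```

The model scores 99 percent accuracy while finding none of the minority class examples, which is why a metric that accounts for both classes is needed.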

An additional level of complexity comes from the problem domain from which the examples were drawn.

It is common for the majority class to represent a normal case in the domain, whereas the minority class represents an abnormal case, such as a fault, fraud, outlier, anomaly, disease state, and so on. As such, the interpretation of misclassification errors may differ across the classes.

For example, misclassifying an example from the majority class as belonging to the minority class, called a false positive, is often undesirable but less critical than misclassifying an example from the minority class as belonging to the majority class, a so-called false negative.

This is referred to as cost sensitivity of misclassification errors and is a second foundational challenge of imbalanced classification.

  • Unequal Cost of Misclassification Errors
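
A small sketch can make the effect of unequal costs concrete. The costs below (1 for a false positive, 100 for a false negative) and the two confusion matrices are hypothetical values chosen only for illustration; real costs depend on the problem domain.

```python
# Hypothetical costs (assumptions for illustration only).
COST_FALSE_POSITIVE = 1    # a normal case flagged as abnormal
COST_FALSE_NEGATIVE = 100  # an abnormal case that was missed

def summarize(true_neg, false_pos, false_neg, true_pos):
    """Return (accuracy, total misclassification cost) for a confusion matrix."""
    total = true_neg + false_pos + false_neg + true_pos
    accuracy = (true_neg + true_pos) / total
    cost = false_pos * COST_FALSE_POSITIVE + false_neg * COST_FALSE_NEGATIVE
    return accuracy, cost

# Model A: always predicts the majority class (misses all 10 minority cases).
acc_a, cost_a = summarize(true_neg=990, false_pos=0, false_neg=10, true_pos=0)
# Model B: finds 8 of the 10 minority cases but raises 50 false alarms.
acc_b, cost_b = summarize(true_neg=940, false_pos=50, false_neg=2, true_pos=8)

print('Model A: accuracy=%.3f, cost=%d' % (acc_a, cost_a))
print('Model B: accuracy=%.3f, cost=%d' % (acc_b, cost_b))
```

Under these assumed costs, Model B has lower accuracy than Model A but a quarter of the total cost, which is the kind of trade-off that cost sensitivity describes.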

These two aspects, the skewed class distribution and cost sensitivity, are typically referenced when describing the difficulty of imbalanced classification.

Nevertheless, there are other characteristics of the classification problem that, when combined with these properties, compound their effect. These are general characteristics of classification predictive modeling that magnify the difficulty of the imbalanced classification task.

Class imbalance was widely acknowledged as a complicating factor for classification. However, some studies also argue that the imbalance ratio is not the only cause of performance degradation in learning from imbalanced data.

— Page 253, Learning from Imbalanced Data Sets, 2018.

There are many such characteristics, but perhaps three of the most common include:

  • Dataset Size.
  • Label Noise.
  • Data Distribution.

It is important to not only acknowledge these properties but to also specifically develop an intuition for their impact. This will allow you to select and develop techniques to address them in your own predictive modeling projects.

Understanding these data intrinsic characteristics, as well as their relationship with class imbalance, is crucial for applying existing and developing new techniques to deal with imbalance data.

— Pages 253-254, Learning from Imbalanced Data Sets, 2018.

In the following sections, we will take a closer look at each of these properties and their impact on imbalanced classification.

Compounding Effect of Dataset Size

Dataset size simply refers to the number of examples collected from the domain to fit and evaluate a predictive model.

Typically, more data is better as it provides more coverage of the domain, perhaps to a point of diminishing returns.

Specifically, more data provides better representation of combinations and variance of features in the feature space and their mapping to class labels. From this, a model can better learn and generalize a class boundary to discriminate new examples in the future.

If the ratio of examples in the majority class to the minority class is somewhat fixed, then we would expect that we would have more examples in the minority class as the size of the dataset is scaled up.

This is good if we can collect more examples.

This is often a problem because data is hard or expensive to collect, and we typically work with far less data than we would like. This can dramatically limit our ability to gather a large enough, or representative enough, sample of examples from the minority class.

A problem that often arises in classification is the small number of training instances. This issue, often reported as data rarity or lack of data, is related to the “lack of density” or “insufficiency of information”.

— Page 261, Learning from Imbalanced Data Sets, 2018.

For example, for a modest classification task with a balanced class distribution, we might be satisfied with thousands or tens of thousands of examples in order to develop, evaluate, and select a model.

A balanced binary classification with 10,000 examples would have 5,000 examples of each class. An imbalanced dataset with a 1:100 distribution with the same number of examples would only have 100 examples of the minority class.

As such, the size of the dataset dramatically impacts the imbalanced classification task, and datasets that are thought large in general are, in fact, probably not large enough when working with an imbalanced classification problem.
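
The arithmetic above can be sketched for a few class distributions. This snippet follows the convention used in this tutorial that 1:100 means roughly 1 percent minority to 99 percent majority; the specific list of ratios is an illustrative assumption.

```python
# Total number of examples in the dataset.
n_examples = 10000

# Minority class share of the dataset for each class distribution,
# treating 1:100 as roughly 1 percent minority to 99 percent majority.
minority_shares = {'1:1': 0.5, '1:10': 0.1, '1:100': 0.01, '1:1000': 0.001}

minority_counts = {}
for ratio, share in minority_shares.items():
    minority_counts[ratio] = round(n_examples * share)
    print('%s -> %d minority examples' % (ratio, minority_counts[ratio]))
```

With the dataset size held at 10,000, the minority class shrinks from 5,000 examples to just 10 as the skew increases.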

Without a sufficient large training set, a classifier may not generalize characteristics of the data. Furthermore, the classifier could also overfit the training data, with a poor performance in out-of-sample tests instances.

— Page 261, Learning from Imbalanced Data Sets, 2018.

To help, let’s make this concrete with a worked example.

We can use the make_classification() scikit-learn function to create a dataset of a given size with a ratio of about 1:100 examples (1 percent to 99 percent) in the minority class to the majority class.

We can then create a scatter plot of the dataset and color the points for each class with a separate color to get an idea of the spatial relationship for the examples.

This process can then be repeated with different dataset sizes to show how the class imbalance is impacted visually. We will compare datasets with 100, 1,000, 10,000, and 100,000 examples.

The complete example is listed below.
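
A minimal sketch of such an example, assuming scikit-learn and matplotlib are installed. The random_state value, the 2x2 subplot layout, and the output filename are illustrative assumptions; the figure is saved to a file here, and pyplot.show() could be used instead to display it interactively.

```python
from collections import Counter

from matplotlib import pyplot
from sklearn.datasets import make_classification

# Dataset sizes to compare.
sizes = [100, 1000, 10000, 100000]
counts = {}

for i, n in enumerate(sizes):
    # Two input features, about 99 percent of examples in class 0, no label noise.
    X, y = make_classification(n_samples=n, n_features=2, n_redundant=0,
                               n_clusters_per_class=1, weights=[0.99],
                               flip_y=0, random_state=1)
    # Summarize the class distribution.
    counts[n] = Counter(y)
    print('Size=%d, Ratio=%s' % (n, counts[n]))
    # One scatter plot per dataset size, with points colored by class label.
    pyplot.subplot(2, 2, i + 1)
    pyplot.title('n=%d' % n)
    pyplot.xticks([])
    pyplot.yticks([])
    for label in sorted(counts[n]):
        mask = y == label
        pyplot.scatter(X[mask, 0], X[mask, 1], label=str(label))
    pyplot.legend()

# Assumed output filename; replace with pyplot.show() for interactive use.
pyplot.savefig('dataset_size.png')
```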

Running the example creates and plots the same dataset with a 1:100 class distribution using four different sizes.

First, the class distribution is displayed for each dataset size. We can see that with a small dataset of 100 examples, we only get one example in the minority class as we might expect. Even with 100,000 examples in the dataset, we only get 1,000 examples in the minority class.

Scatter plots are created for each differently sized dataset.

We can see that it is not until very large sample sizes that the underlying structure of the class distributions becomes obvious.

These plots highlight the critical role that dataset size plays in imbalanced classification. It is hard to see how a model given 990 examples of the majority class and 10 of the minority class could hope to do well on the same problem depicted after 100,000 examples are drawn.

Scatter Plots of an Imbalanced Classification Dataset With Different Dataset Sizes

Compounding Effect of Label Noise

Label noise refers to examples that belong to one class that are assigned to another class.

This can make determining the class boundary in feature space problematic for most machine learning algorithms, and this difficulty typically increases in proportion to the percentage of noise in the labels.

Two types of noise are distinguished in the literature: feature (or attribute) and class noise. Class noise is generally assumed to be more harmful than attribute noise in ML […] class noise somehow affects the observed class values (e.g., by somehow flipping the label of a minority class instance to the majority class label).

— Page 264, Learning from Imbalanced Data Sets, 2018.

The cause is often inherent in the problem domain, such as ambiguous observations on the class boundary or even errors in the data collection that could impact observations anywhere in the feature space.

For imbalanced classification, noisy labels have an even more dramatic effect.

Given that examples in the positive class are so few, losing some to noise reduces the amount of information available about the minority class.

Additionally, having examples from the majority class incorrectly marked as belonging to the minority class can cause disjunction or fragmentation of the minority class, which is already sparse because of the lack of observations.

We can imagine that if there are examples along the class boundary that are ambiguous, we could identify and remove or correct them. Examples marked for the minority class that are in areas of the feature space that are high density for the majority class are also likely easy to identify and remove or correct.
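
One way to picture how minority labels inside a dense majority region might be found is a nearest-neighbor check. The sketch below is a simple heuristic of my own choosing, not a method prescribed by the cited book: it builds hypothetical data with NumPy, plants three mislabeled points, and uses scikit-learn's NearestNeighbors to flag minority-labeled points whose neighbors are mostly majority class. The cluster locations, k=5, and the 0.5 threshold are all assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)

# Hypothetical data: a dense majority cluster and a small, distant minority cluster.
X_major = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minor = rng.normal(loc=20.0, scale=1.0, size=(20, 2))
# Three points in the majority region that were wrongly labeled as class 1.
X_noisy = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])

X = np.vstack([X_major, X_minor, X_noisy])
y = np.array([0] * 200 + [1] * 20 + [1] * 3)

# Find the k nearest neighbors of every point.
# k + 1 neighbors are requested because each point is its own nearest neighbor.
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, neighbor_idx = nn.kneighbors(X)
neighbor_labels = y[neighbor_idx[:, 1:]]

# Flag minority-labeled points whose neighbors are mostly majority class.
majority_fraction = (neighbor_labels == 0).mean(axis=1)
suspect = np.where((y == 1) & (majority_fraction > 0.5))[0]
print('Suspect minority labels at rows:', suspect.tolist())
```

Only the three planted rows are flagged here because the two clusters are far apart; where both classes are sparse or overlapping, a check like this becomes much less reliable.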

This problem becomes particularly difficult, in general and especially for imbalanced classification, when observations for both classes are sparse in the feature space. In these situations, unmodified machine learning algorithms will define the class boundary in favor of the majority class at the expense of the minority class.

Mislabeled minority class instances will contribute to increase the perceived imbalance ratio, as well as introduce mislabeled noisy instances inside the class region of the minority class. On the other hand, mislabeled majority class instances may lead the learning algorithm, or imbalanced treatment methods, focus on wrong areas of input space.

— Page 264, Learning from Imbalanced Data Sets, 2018.

We can develop an example to give a flavor of this challenge.

We can hold the dataset size constant as well as the 1:100 class ratio and vary the amount of label noise. This can be achieved by setting the “flip_y” argument to the make_classification() function, which specifies the fraction of examples whose class label is assigned at random.

We will explore values of 0 percent, 1 percent, 5 percent, and 7 percent.

The complete example is listed below.
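
A minimal sketch of such an example, assuming scikit-learn and matplotlib are installed and a fixed dataset size of 1,000 examples. The random_state value, the subplot layout, and the output filename are illustrative assumptions, and exact class counts for the noisy datasets may vary between library versions.

```python
from collections import Counter

from matplotlib import pyplot
from sklearn.datasets import make_classification

# Fractions of examples whose class label is assigned at random.
noise_levels = [0, 0.01, 0.05, 0.07]
counts = {}

for i, noise in enumerate(noise_levels):
    # Same size and 1:100 class ratio each time; only flip_y changes.
    X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                               n_clusters_per_class=1, weights=[0.99],
                               flip_y=noise, random_state=1)
    # Summarize the class distribution.
    counts[noise] = Counter(y)
    print('Noise=%d%%, Ratio=%s' % (round(noise * 100), counts[noise]))
    # One scatter plot per noise level, with points colored by class label.
    pyplot.subplot(2, 2, i + 1)
    pyplot.title('noise=%d%%' % round(noise * 100))
    pyplot.xticks([])
    pyplot.yticks([])
    for label in sorted(counts[noise]):
        mask = y == label
        pyplot.scatter(X[mask, 0], X[mask, 1], label=str(label))
    pyplot.legend()

# Assumed output filename; replace with pyplot.show() for interactive use.
pyplot.savefig('label_noise.png')
```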

Running the example creates and plots the same dataset with a 1:100 class distribution using four different amounts of label noise.

First, the class distribution is printed for each dataset with differing amounts of label noise. We can see that, as we might expect, as the noise is increased, the number of examples in the minority class is increased, most of which are incorrectly labeled.

We might expect these additional 30 examples in the minority class with 7 percent label noise to be quite damaging to a model trying to define a crisp class boundary in the feature space.

Scatter plots are created for each dataset with the differing label noise.

In this specific case, we don’t see many examples of confusion on the class boundary. Instead, we can see that as the label noise is increased, the number of minority class examples inside the mass of the majority class (orange points in the blue area) increases, representing false positives that really should be identified and removed from the dataset prior to modeling.

Scatter Plots of an Imbalanced Classification Dataset With Different Label Noise

Compounding Effect of Data Distribution

Another important consideration is the distribution of examples in feature space.

If we think about feature space spatially, we might like all examples in one class to be located on one part of the space, and those from the other class to appear in another part of the space.

If this is the case, we have good class separability and machine learning models can draw crisp class boundaries and achieve good classification performance. This holds on datasets with a balanced or imbalanced class distribution.

This is rarely the case, and it is more likely that each class has multiple “concepts” resulting in multiple different groups or clusters of examples in feature space.

… it is common that the “concept” beneath a class is split into several sub-concepts, spread over the input space.

— Page 255, Learning from Imbalanced Data Sets, 2018.

These groups are formally referred to as “disjuncts,” a term that comes from the field of rule-based systems, where it describes a rule that covers a group of cases comprised of sub-concepts. A small disjunct is one that relates to or “covers” few examples in the training dataset.

Systems that learn from examples do not usually succeed in creating a purely conjunctive definition for each concept. Instead, they create a definition that consists of several disjuncts, where each disjunct is a conjunctive definition of a subconcept of the original concept.

Concept Learning And The Problem Of Small Disjuncts, 1989.

This grouping makes class separability hard, requiring each group or cluster to be identified and included in the definition of the class boundary, implicitly or explicitly.

In the case of imbalanced datasets, this is a particular problem if the minority class has multiple concepts or clusters in the feature space. This is because the density of examples in this class is already sparse and it is difficult to discern separate groupings with so few examples. It may look like one large sparse grouping.

This lack of homogeneity is particularly problematic in algorithms based on the strategy of dividing-and-conquering […] where the sub-concepts lead to the creation of small disjuncts.

— Page 255, Learning from Imbalanced Data Sets, 2018.

For example, we might consider data that describes whether a patient is healthy (majority class) or sick (minority class). The data may capture many different types of illnesses, and there may be groups of similar illnesses, but if there are so few cases, then any grouping or concepts within the class may not be apparent and may look like a diffuse set mixed in with healthy cases.

To make this concrete, we can look at an example.

We can use the number of clusters in the dataset as a proxy for “concepts” and compare a dataset with one cluster of examples per class to a second dataset with two clusters per class.

This can be achieved by varying the “n_clusters_per_class” argument for the make_classification() function used to create the dataset.

We would expect that in an imbalanced dataset, such as one with a 1:100 class distribution, the increase in the number of clusters is obvious for the majority class, but not so for the minority class.

The complete example is listed below.
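
A minimal sketch of such an example, assuming scikit-learn and matplotlib are installed. The dataset size of 10,000 examples, the random_state value, and the output filename are illustrative assumptions.

```python
from collections import Counter

from matplotlib import pyplot
from sklearn.datasets import make_classification

# Number of clusters per class to compare.
cluster_settings = [1, 2]
counts = {}

for i, n_clusters in enumerate(cluster_settings):
    # Same size and 1:100 class ratio each time; only the cluster count changes.
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
                               n_clusters_per_class=n_clusters, weights=[0.99],
                               flip_y=0, random_state=1)
    # Summarize the class distribution.
    counts[n_clusters] = Counter(y)
    print('Clusters=%d, Ratio=%s' % (n_clusters, counts[n_clusters]))
    # Side-by-side scatter plots, with points colored by class label.
    pyplot.subplot(1, 2, i + 1)
    pyplot.title('Clusters=%d' % n_clusters)
    pyplot.xticks([])
    pyplot.yticks([])
    for label in sorted(counts[n_clusters]):
        mask = y == label
        pyplot.scatter(X[mask, 0], X[mask, 1], label=str(label))
    pyplot.legend()

# Assumed output filename; replace with pyplot.show() for interactive use.
pyplot.savefig('clusters_per_class.png')
```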

Running the example creates and plots the same dataset with a 1:100 class distribution using two different numbers of clusters.

In the first scatter plot (left), we can see one cluster per class. The majority class (blue) quite clearly has one cluster, whereas the structure of the minority class (orange) is less obvious. In the second plot (right), we can again clearly see that the majority class has two clusters, and again the structure of the minority class (orange) is diffuse and it is not apparent that samples were drawn from two clusters.

This highlights the relationship between the size of the dataset and its ability to expose the underlying density or distribution of examples in the minority class. With so few examples, generalization by machine learning models is challenging, if not very problematic.

Scatter Plots of an Imbalanced Classification Dataset With Different Numbers of Clusters

Summary

In this tutorial, you discovered data characteristics that compound the challenge of imbalanced classification.

Specifically, you learned:

  • Imbalanced classification is specifically hard because of the severely skewed class distribution and the unequal misclassification costs.
  • The difficulty of imbalanced classification is compounded by properties such as dataset size, label noise, and data distribution.
  • How to develop an intuition for the compounding effects on modeling difficulty posed by different dataset properties.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

24 Responses to Why Is Imbalanced Classification Difficult?

  1. marco February 18, 2020 at 2:40 am

    Hello Jason,
    I have a question: I’m using time series for prediction (using FB Prophet).
    Is an algorithm that uses a time series a regression? What type of regression is it?
    Thanks,
    Marco

    • Jason Brownlee February 18, 2020 at 6:23 am

      It is time series forecasting which is a special type of regression where the observations are ordered by time.

      • Daphne Cooper December 16, 2021 at 8:17 pm

        Thanks, this article helped me a lot!

        • Adrian Tam December 17, 2021 at 7:24 am

          Thanks. Glad you liked it.

  2. marco February 18, 2020 at 8:20 pm

    Hello Jason,
    given a time series are there algorithms to forecast production (except fbprophet – Facebook’s Time Series Forecasting)?
    Is it possible with Keras or Scikit-Learn?
    Thanks

  3. Madhava February 20, 2020 at 10:41 am

    Hi Jason, awesome article.
    Do you think that uncertainty and active learning is a good solution to this problem, allowing the ability to use lower numbers of training data and try to balance out the dataset in a stepwise iterative fashion? Have you tried that?

    • Jason Brownlee February 20, 2020 at 11:28 am

      Thanks!

      There is no silver bullet. We must test many different approaches for a given task to discover what works best.

  4. Abid Saber February 21, 2020 at 11:12 pm

    Thank you, Jason, for this article, which was very clear.
    What do you think of resampling techniques?
    Are they a good solution to this problem?

  5. Rima March 11, 2020 at 4:30 pm

    Hello Jason,
    I have a dataset, the ratio is 1003 for the minority class and 2921 for the total sample
    Do you think it is imbalanced?

  6. JG May 6, 2020 at 4:57 pm

    Hi Jason,

    Great tutorial with immediate applications for many problems, such as the diagnosis of new diseases (e.g. Covid-19) from chest X-ray images, where fortunately it is still a minority class versus other classical lung disease classifications.

    Could you list or summarize for us your best machine learning tutorials that explain the most efficient techniques for approaching these imbalanced datasets?
    I would say nowadays the imbalance ratio is typically between 1:10 and 1:100!

  7. Ruhi May 18, 2020 at 9:45 pm

    Hi Jason,

    To calculate the imbalance ratio, I use Shannon entropy, which gives a score between 0 (highly imbalanced data) and 1 (balanced data). However, it doesn’t capture the effect of data size, i.e., the effect of 1:10, 10:100, 100:1000 datasets.

    What is the best way to capture it in the score?

    Thanks!

    • Jason Brownlee May 19, 2020 at 6:04 am

      There is no best score, just different approaches to the same question.

      I have seen many different attempts at scoring, perhaps check the literature.

  8. Lokesh Rathi July 8, 2020 at 9:03 am

    1. Scaling the data.
    2. Using cross validation with stratifiedkfold, with let’s say 10 cv’s could solve the problem?

    • Jason Brownlee July 8, 2020 at 1:42 pm

      Solves what problem? Sorry, I don’t follow. Perhaps you could elaborate on your question?

  9. Kevin November 18, 2020 at 1:53 am

    Dear Dr. Jason,
    How can we use scatter plot to visualize the data distribution for real world datasets from repositories such as Kaggle or UCI? For example, in a row of a pandas dataframe, what are supposed to be the X-axis and Y-axis when the real world datasets has a Y value but many X instances in a row of a dataframe? How about when we are dealing with multiple rows of the whole dataset? How should we process the datasets to visualize the imbalance available in the dataset?

    • Jason Brownlee November 18, 2020 at 6:45 am

      You can use a scatter plot for each pair of variables and color dots by class label, but you may have many plots if you have many input variables.

      A better approach is to summarise the number of examples in each class as a count and as a percentage of the size of the dataset.

      • Kevin November 19, 2020 at 12:19 pm

        Hi Dr. Jason, thank you for your reply. Can I get an example of your suggested approach or any reference that demonstrate this approach. I hope to learn more because I do not understand your suggested approach very well. For binary classification, we have two classes. Do you mean we could plot the target variable’s positive and negative counts vs %size of dataset? How does it work?

  10. shahriar Sourav July 3, 2021 at 4:15 pm

    Hi,
    To handle imbalance classification, can we collect historical real life minority class data which is from near past?
