Imbalanced Classification With Python (7-Day Mini-Course)

Imbalanced Classification Crash Course.
Get on top of imbalanced classification in 7 days.

Classification predictive modeling is the task of assigning a label to an example.

Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal.

Working through an imbalanced classification project in practice requires a suite of specialized techniques, including data preparation methods, learning algorithms, and performance metrics.

In this crash course, you will discover how you can get started and confidently work through an imbalanced classification project with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

  • Updated Jan/2021: Updated links for API documentation.
Photo by Arches National Park, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

  • You know your way around basic Python for programming.
  • You may know some basic NumPy for array manipulation.
  • You may know some basic scikit-learn for modeling.

You do NOT need to be:

  • A math wiz!
  • A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can navigate an imbalanced classification project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:


Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with imbalanced classification in Python:

  • Lesson 01: Challenge of Imbalanced Classification
  • Lesson 02: Intuition for Imbalanced Data
  • Lesson 03: Evaluate Imbalanced Classification Models
  • Lesson 04: Undersampling the Majority Class
  • Lesson 05: Oversampling the Minority Class
  • Lesson 06: Combine Data Undersampling and Oversampling
  • Lesson 07: Cost-Sensitive Algorithms

Each lesson could take you anywhere from 60 seconds up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on the algorithms and the best-of-breed tools in Python. (Hint: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Imbalanced Classification with Python.”

Lesson 01: Challenge of Imbalanced Classification

In this lesson, you will discover the challenge of imbalanced classification problems.

Imbalanced classification problems pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

This results in models that have poor predictive performance, specifically for the minority class. This is a problem because, typically, the minority class is more important, and therefore the problem is more sensitive to classification errors on the minority class than on the majority class.

  • Majority Class: More than half of the examples belong to this class, often the negative or normal case.
  • Minority Class: Less than half of the examples belong to this class, often the positive or abnormal case.

A classification problem may be a little skewed, such as if there is a slight imbalance. Alternately, the classification problem may have a severe imbalance where there might be hundreds or thousands of examples in one class and tens of examples in another class for a given training dataset.

  • Slight Imbalance. Where the distribution of examples is uneven by a small amount in the training dataset (e.g. 4:6).
  • Severe Imbalance. Where the distribution of examples is uneven by a large amount in the training dataset (e.g. 1:100 or more).

Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.

As such, it is surprising that imbalanced classification does not get more attention.

Your Task

For this lesson, you must list five general examples of problems that inherently have a class imbalance.

One example might be fraud detection; another might be intrusion detection.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to develop an intuition for skewed class distributions.

Lesson 02: Intuition for Imbalanced Data

In this lesson, you will discover how to develop a practical intuition for imbalanced classification datasets.

A challenge for beginners working with imbalanced classification problems is understanding what a specific skewed class distribution means. For example, what are the difference and implications of a 1:10 vs. a 1:100 class ratio?

The make_classification() scikit-learn function can be used to define a synthetic dataset with a desired class imbalance. The “weights” argument specifies the proportion of examples assigned to each class, e.g. [0.99, 0.01] means that 99 percent of the examples will belong to the majority class and the remaining 1 percent will belong to the minority class.

Once defined, we can summarize the class distribution using a Counter object to get an idea of exactly how many examples belong to each class.

We can also create a scatter plot of the dataset because there are only two input variables. The dots can then be colored by each class. This plot provides a visual intuition for what exactly a 99 percent vs. 1 percent majority/minority class imbalance looks like in practice.

The complete example of creating and summarizing an imbalanced classification dataset is listed below.
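A minimal sketch along those lines is below; the specific configuration (10,000 rows, two input variables, a [0.99, 0.01] weighting) is just one reasonable choice, and the plot assumes Matplotlib is installed:

# create and summarize a synthetic imbalanced classification dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
# define a dataset with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples, colored by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()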

Your Task

For this lesson, you must run the example and review the plot.

For bonus points, you can test different class ratios and review the results.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to evaluate models for imbalanced classification.

Lesson 03: Evaluate Imbalanced Classification Models

In this lesson, you will discover how to evaluate models on imbalanced classification problems.

Prediction accuracy is the most common metric for classification tasks, although it is inappropriate and potentially dangerously misleading when used on imbalanced classification tasks.

The reason is that if 98 percent of the data belongs to the negative class, you can achieve 98 percent accuracy on average by simply predicting the negative class all the time, achieving a score that naively looks good but in practice reflects no skill.

Instead, alternate performance metrics must be adopted.

Popular alternatives are the precision and recall scores that allow the performance of the model to be considered by focusing on the minority class, called the positive class.

Precision is the ratio of correctly predicted positive examples to the total number of examples predicted as positive. Maximizing precision will minimize false positives.

  • Precision = TruePositives / (TruePositives + FalsePositives)

Recall is the ratio of correctly predicted positive examples to the total number of positive examples that could have been predicted. Maximizing recall will minimize false negatives.

  • Recall = TruePositives / (TruePositives + FalseNegatives)

The performance of a model can be summarized by a single score that combines precision and recall via their harmonic mean, called the F-measure. Maximizing the F-measure balances precision and recall at the same time.

  • F-measure = (2 * Precision * Recall) / (Precision + Recall)

The example below fits a logistic regression model on an imbalanced classification problem and calculates the accuracy, which can then be compared to the precision, recall, and F-measure.
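A minimal sketch along those lines is below, assuming a synthetic dataset like the one from Lesson 02 and a simple stratified train/test split; the split size and model settings are illustrative only:

# fit a logistic regression model and compare accuracy to minority-focused metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# define an imbalanced dataset
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train and test sets, preserving the class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# fit the model and make predictions on the test set
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
# accuracy can look impressive while the minority-class metrics tell a different story
print('Accuracy: %.3f' % accuracy_score(y_test, yhat))
print('Precision: %.3f' % precision_score(y_test, yhat))
print('Recall: %.3f' % recall_score(y_test, yhat))
print('F-measure: %.3f' % f1_score(y_test, yhat))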

Your Task

For this lesson, you must run the example and compare the classification accuracy to the other metrics, such as precision, recall, and F-measure.

For bonus points, try other metrics such as Fbeta-measure and ROC AUC scores.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to undersample the majority class.

Lesson 04: Undersampling the Majority Class

In this lesson, you will discover how to undersample the majority class in the training dataset.

A simple approach to using standard machine learning algorithms on an imbalanced dataset is to change the training dataset to have a more balanced class distribution.

This can be achieved by deleting examples from the majority class, referred to as “undersampling.” A possible downside is that examples from the majority class that are helpful during modeling may be deleted.

The imbalanced-learn library provides implementations of many undersampling algorithms. The library can be installed easily using pip; for example:
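pip install imbalanced-learn

(Depending on how your Python environment is set up, you may need to use pip3 or add a sudo prefix.)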

A fast and reliable approach is to randomly delete examples from the majority class to reduce the imbalance to a less severe ratio, or even so that the class distribution is balanced.

The example below creates a synthetic imbalanced classification dataset, then uses the RandomUnderSampler class to change the class distribution from a 1:100 minority-to-majority ratio to a less severe 1:2.
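A minimal sketch along those lines is below; sampling_strategy=0.5 asks the undersampler to keep only enough majority class examples for a 1:2 minority-to-majority ratio:

# randomly undersample the majority class
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before undersampling: %s' % Counter(y))
# configure the undersampler for a 1:2 ratio and apply it
undersample = RandomUnderSampler(sampling_strategy=0.5)
X_under, y_under = undersample.fit_resample(X, y)
print('After undersampling: %s' % Counter(y_under))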

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after undersampling the majority class.

For bonus points, try other undersampling ratios or even try other undersampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to oversample the minority class.

Lesson 05: Oversampling the Minority Class

In this lesson, you will discover how to oversample the minority class in the training dataset.

An alternative to deleting examples from the majority class is to add new examples to the minority class.

This can be achieved by simply duplicating examples in the minority class, but these examples do not add any new information. Instead, new examples for the minority class can be synthesized from existing examples in the training dataset. These new examples will be “close” to existing examples in the feature space, but different in small, random ways.

The SMOTE algorithm is a popular approach for oversampling the minority class. This technique can be used to reduce the imbalance or to make the class distribution even.

The example below demonstrates using the SMOTE class provided by the imbalanced-learn library on a synthetic dataset. The initial class distribution is 1:100 and the minority class is oversampled to a 1:2 distribution.
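A minimal sketch along those lines is below; as in the previous lesson, sampling_strategy=0.5 requests a 1:2 minority-to-majority ratio after resampling:

# oversample the minority class with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before oversampling: %s' % Counter(y))
# synthesize new minority class examples until the ratio is 1:2
oversample = SMOTE(sampling_strategy=0.5)
X_over, y_over = oversample.fit_resample(X, y)
print('After oversampling: %s' % Counter(y_over))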

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after oversampling the minority class.

For bonus points, try other oversampling ratios, or even try other oversampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to combine undersampling and oversampling techniques.

Lesson 06: Combine Data Undersampling and Oversampling

In this lesson, you will discover how to combine data undersampling and oversampling on a training dataset.

Data undersampling will delete examples from the majority class, whereas data oversampling will add examples to the minority class. These two approaches can be combined and used on a single training dataset.

Given that there are so many different data sampling techniques to choose from, it can be confusing as to which methods to combine. Thankfully, there are common combinations that have been shown to work well in practice; some examples include:

  • Random Undersampling with SMOTE oversampling.
  • Tomek Links Undersampling with SMOTE oversampling.
  • Edited Nearest Neighbors Undersampling with SMOTE oversampling.

These combinations can be applied manually to a given training dataset by first applying one sampling algorithm, then another. Thankfully, the imbalanced-learn library provides implementations of common combined data sampling techniques.

The example below demonstrates how to use the SMOTEENN class, which combines SMOTE oversampling of the minority class with Edited Nearest Neighbors undersampling of the majority class.
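A minimal sketch along those lines is below; note that, unlike the previous lessons, the final class distribution also depends on how many examples the Edited Nearest Neighbors step removes:

# combine SMOTE oversampling with Edited Nearest Neighbors undersampling
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before resampling: %s' % Counter(y))
# apply SMOTE oversampling, then ENN cleaning
resample = SMOTEENN(sampling_strategy=0.5)
X_res, y_res = resample.fit_resample(X, y)
print('After resampling: %s' % Counter(y_res))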

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after the data sampling.

For bonus points, try other combined data sampling techniques or even try manually applying oversampling followed by undersampling on the dataset.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Lesson 07: Cost-Sensitive Algorithms

In this lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Most machine learning algorithms assume that all misclassification errors made by a model are equal. This is often not the case for imbalanced classification problems, where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. Many machine learning algorithms can be updated to be cost-sensitive, where the model is penalized more for misclassification errors on one class than on the other, such as the minority class.

The scikit-learn library provides this capability for a range of algorithms via the class_weight argument specified when defining the model. A weighting can be specified that is inversely proportional to the class distribution.

If the class distribution were 0.99 to 0.01 for the majority and minority classes, then the class_weight argument could be defined as a dictionary that applies a penalty of 0.01 to errors made on the majority class and a penalty of 0.99 to errors made on the minority class, e.g. {0:0.01, 1:0.99}.

This is a useful heuristic and can be configured automatically by setting the class_weight argument to the string 'balanced'.

The example below demonstrates how to define and fit a cost-sensitive logistic regression model on an imbalanced classification dataset.
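A minimal sketch along those lines is below; it uses the 'balanced' heuristic for class_weight (an explicit dictionary such as {0:0.01, 1:0.99} could be passed instead) and reports the F-measure, although any of the metrics from Lesson 03 could be used:

# fit a cost-sensitive logistic regression model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# define an imbalanced dataset and split it
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# penalize errors on the minority class more heavily than on the majority class
model = LogisticRegression(solver='liblinear', class_weight='balanced')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print('F-measure: %.3f' % f1_score(y_test, yhat))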

Your Task

For this lesson, you must run the example and review the performance of the cost-sensitive model.

For bonus points, compare the performance to the cost-insensitive version of logistic regression.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson of the mini-course.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

  • The challenge of imbalanced classification is the lack of examples for the minority class and the difference in importance of classification errors across the classes.
  • How to develop a spatial intuition for imbalanced classification datasets that might inform data preparation and algorithm selection.
  • The failure of classification accuracy and how alternate metrics like precision, recall, and the F-measure can better summarize model performance on imbalanced datasets.
  • How to delete examples from the majority class in the training dataset, referred to as data undersampling.
  • How to synthesize new examples in the minority class in the training dataset, referred to as data oversampling.
  • How to combine data oversampling and undersampling techniques on the training dataset, and common combinations that result in good performance.
  • How to use cost-sensitive modified versions of machine learning algorithms to improve performance on imbalanced classification datasets.

Take the next step and check out my book on Imbalanced Classification with Python.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.


147 Responses to Imbalanced Classification With Python (7-Day Mini-Course)

  1. Ken Jones January 17, 2020 at 8:32 am

A real-world example I am working on right now is predicting no-shows for medical providers. In the data set I am working with, about 5% of appointments are no-shows. The balance are completed appointments. The interesting class is the no-shows.

  2. Mark Littlewood January 17, 2020 at 6:43 pm

    A real world example I am concerned with is predicting the winners of horse races. Typically each race has on average around 11 runners and ten of those will have a ‘0’ meaning they did not win the race whilst one row will be the winner with a ‘1’. Funnily enough, using GBM I have not found balancing to be helpful but maybe I am coming at it from a wrong perspective with my balancing technique

    • Jason Brownlee January 18, 2020 at 8:38 am

      Interesting problem.

      I would recommend looking at rating systems.

  3. Kate Strydom January 17, 2020 at 6:50 pm

    I work on lead data, predicting call centre sales on various telecommunication and lead generation campaigns where the responses are always imbalanced. That is, sales versus no sale, and hot lead versus no lead. I usually just take a random sample of the negative response equivalent to the positive response in my sample prior to pulling the data into Python. Call centre data responses are always imbalanced due to the nature of the business. I would be keen to learn other ways to balance the responses.

  4. Mark Littlewood January 17, 2020 at 7:36 pm

    With flip_y set to zero the 1 and 0s are created in a pretty distinct manner, there is little overlap. They also appear to be pretty linear in relation to the predictors

  5. Mark Littlewood January 17, 2020 at 8:41 pm

The F-beta score is interesting as a combination of precision and recall that you can weight towards precision or recall. With betting, precision is perhaps more important, as bets that lose cost you money, whereas false negatives are annoying but not financially damaging.

  6. Alexander Binkovsky January 17, 2020 at 8:57 pm

    A real-world example of imbalanced classification I’m working on right now is anomaly detection in monitoring data of a huge distributed application.

  7. Ciprian Saramet January 18, 2020 at 2:07 am

    a real-world example of imbalanced data is server monitoring logs and trying to predict service failures

  8. Joy January 18, 2020 at 6:17 pm

    Hi Jason. Thank you for the insightful article!!
    Your code runs fine Jason. But I cannot understand this piece:

    for label, _ in counter.items(): (<— this is for iterating over the dictionary of counter.items()

    row_ix = where(y == label)[0] (<— Does this mean row_ix is equal to x values of only those counter items belonging to class 0?)

    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) (<— plotting the points of both classes 0 and 1)

    Can you please explain a bit? I am not that proficient in Python

    • Jason Brownlee January 19, 2020 at 7:14 am

      Thanks.

      We iterate over the labels.

      We can get the row indexes for all the examples in y that have the label.

      We then plot the points with those row indexes.

  9. Vijay Pal January 18, 2020 at 6:56 pm

    Customer churn and customer chargeback are two classical cases.

  10. Moi khalile January 18, 2020 at 8:17 pm

    a real-world example of imbalanced data is medical images classification

  11. Shay January 18, 2020 at 11:07 pm

    Malware detection in the real world is inherently severely imbalanced.

  12. Damir Zunic January 20, 2020 at 6:57 am

    I worked on predicting risk for type 2 diabetes using 3 classes per A1C levels: no-diabetes (77%), pre-diabetes (6%) and diabetes(17%).

  13. Sanchit Bhavsar January 21, 2020 at 2:39 am

    The highly imbalanced data problem I worked on was to predict user ad clicks with a majority of non-clicks (99%) and clicks (1%).

  14. James January 21, 2020 at 1:26 pm

    Lesson 1: Five general examples of class imbalance
    1. Populations of rare and/or endangered species
    2. Incidence of rare diseases
    3. Extreme weather patterns
    4. Excessive spending for non-essential items
    5. Mechanical problems leading to highly probable breakdowns

  15. Nwe January 21, 2020 at 3:59 pm

I think that the misclassification error cost in imbalanced data classification is not the same for each class because the numbers of training samples are not the same. So, the performance metrics may depend on all of the training classes.

Please give me your suggestions on this.

    • Jason Brownlee January 22, 2020 at 6:18 am

      Yes, although it will depend on the specifics of the dataset.

  16. Sachin Prabhu January 23, 2020 at 1:42 am

    Answer for Lesson 01: Challenge of Imbalanced Classification

    1. Cancer cell prediction
    2. Spam/Ham classification
    3. Earthquake prediction
    4. Page Blocks Classification
    5. Glass Identification

  17. Sudhansu R. Lenka January 24, 2020 at 7:52 pm

Should the dataset be balanced on the original data, or do we need to split into train and test sets first and then balance only the training set?

  18. stanislav January 26, 2020 at 6:01 am

Examples only from the social field:
1. number of rich/poor
2. number of wealthy/ill
3. number of births/deaths
4. number of buyers/sellers
5. number of desires (wishes)/achievements

  19. Vinod Kumar February 6, 2020 at 1:30 pm

    Sir what are the techniques to fix the imbalance in data set

    • Jason Brownlee February 6, 2020 at 1:48 pm

      There are many:

      – choose the right metric
      – try data over/undersampling
      – try cost sensitive models
      – try threshold moving
      – try one class models
      – try specialized ensembles
      – …

      Compare everything to default models to demonstrate they add value/skill.

  20. Animesh February 9, 2020 at 4:36 am

    One more real time Example can be Sales return prediction for an online portal…

  21. Cheyu February 18, 2020 at 3:00 pm

    Great course, it is really helpful.

    Here I have two questions about the proposed weights for the majority and minority classes.

    (1) As in the example, suppose that IR is given 99 (label 0 is 0.99, label 1 is 0.01) and you suggest to give weight 0.01 for the majority class and give weight 0.99 for the minority class. the results will be balanced after multiplying with the weights. Are there any references and papers to support?

    (2) If the cost for misclassification error is pre-defined (for both FP & FN) but no rewards are given (for both TP & TN), is it a good way to follow the cost-sensitivity manipulation (i.e., assign weights) still?

  22. Luis M. February 21, 2020 at 4:51 pm

    One example of imbalanced classes is present in biometric recognition.
    An usual dataset has a number of Nu individuals/users and Ns_u samples per individual.
    From the genuine pairs and impostor pairs of a given dataset, we obtain two classes of matching scores: genuine scores and impostor scores.
    This is Nu*(Ns_u-1)*(Ns_u)/2 genuine scores and (Ns_u^2)*Nu*(Nu-1)/2 impostor scores.
    Let us say that you have 100 individuals and 10 sample per individual.
    Then, you would have 4500 genuine scores and 495000 impostor scores.
    This is an important problem for multi-biometric recognition where you need to train a classifier for score fusion.

  23. elli February 21, 2020 at 11:55 pm

    Five examples of imbalanced classes might be: cancer detection in cells, detecting students with learning differences, calls to a call center which are about an unexpected topic, loan defaults, or detecting a rare disease from patient medical records.

  24. Nitish Khairnar February 26, 2020 at 6:23 pm

    Hi Jason,
sharing my examples for the task of Day 1 of Imbalanced Classification.

    1- detection of patients having high level of blood sugar
    2- lottery winning ratio
    3- call drop due to technical glitches
4- getting a right swipe on Tinder (severely imbalanced)
    5- chances of raining in the month of February (here in India it is very very rare)
    6- getting a cashback after scratching a gift card in Google Pay(folks using this app for payment can connect well with this example)

    Thank you

  25. Naresh Sharma March 4, 2020 at 7:25 pm

I’m working on predicting whether a person with certain demographic features is likely to become a customer based on a mail campaign. The dataset shows that only 1.3% of people become customers.

  26. Retsilisitsoe Raymond Moholisa March 7, 2020 at 1:42 am

Hi Jason, sharing my examples for Day 1 on Imbalanced Classification
    1.Detecting patients with rare genetic diseases in medical databases
    2.Predicting customer churn behavior in telemarketing
    3.Predicting multiview face recognition
    4.Predicting skin cancer from medical images
    5.Leopard/Cheetah Classification

  27. Saman Tamkeen March 27, 2020 at 5:51 pm

    Task for Lesson 01:

    I am trying to model a l2r algorithm. The measure of relevance is that, given a list of items, the items that were clicked are more relevant as compared to those that were seen and not clicked.

    In my example dataset, only 6.5% of items are relevant while remaining 93.5% are not

  28. milad May 1, 2020 at 2:01 pm

1 - How to handle imbalanced data for multi-label classification in Keras?
I’m dealing with a multi-label classification problem with an imbalanced dataset. I want to oversample with SMOTE. However, I don’t know how to achieve it since the label is like [0,1,0,0,1,0,1,1,0,0,0,0].

2 - How to set class_weight for a multi-label dataset in Keras?

  29. JG May 7, 2020 at 2:49 am

    Hi Jason,

In addition to these SMOTE oversampling variants (such as SMOTEENN, which also includes ENN undersampling) or just RandomUnderSampler, I recently learned that Keras also provides a tool as an argument to the .fit method, called class_weight, where you can specify, as a dictionary, the weighting to be applied to the loss function for each integer class label. See reference: https://keras.io/models/sequential/

Do you have any tutorial on this Keras technique? Is it a basic undersampling/oversampling technique, such as eliminating some data during training? What are the main differences between this Keras fit argument and the methods explained here (over- and under-sampling)?

    thanks in advance
    regards

  30. JG May 8, 2020 at 3:49 am

    Thank you very much Jason!

    You are not only a great ML/DL computer Engineer but also an excellent Professor that achieve incredible outreach performance and the best one value, at least for me, a great person!

    regards,
    JG

  31. John Sammut May 8, 2020 at 6:04 am

    Hi Jason, thanks for this course.

    Here’s the list that I thought of:

    1. road anomaly vs good road surface
    2. normal seismic activity sensor data vs volcano eruption seismic activity
    3. healthy lung x-ray image vs lung cancer x-ray image
    4. legitimate vs spam email
    5. normal ocean waves sensor data vs tsunami waves

  32. John Sammut May 8, 2020 at 8:22 am

    Lesson 5 – Comparing class distribution after the oversampling

    from collections import Counter
    from numpy import where
    from matplotlib import pyplot as plt
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01],
        flip_y=0, random_state=11)

    # summarize class distribution
    counter = Counter(y)
    print('Before oversampling:', counter)

    fig, ax = plt.subplots(1, 2, figsize=(15, 5), constrained_layout=True)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y == label)[0]
        ax[0].scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))

    # define oversample strategy
    oversample = SMOTE(sampling_strategy=0.5)

    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)

    # summarize class distribution
    counter = Counter(y_over)
    print('After oversampling:', counter)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y_over == label)[0]
        ax[1].scatter(X_over[row_ix, 0], X_over[row_ix, 1], label=str(label))

    ax[0].legend()
    ax[0].set_title('Before oversampling', fontsize=14)
    ax[1].legend()
    ax[1].set_title('After oversampling', fontsize=14)
    plt.show()

  33. Skylar May 20, 2020 at 4:36 am

    Very nice course, thank you Jason! It is especially clear for Python newbie

  34. Kristina June 11, 2020 at 7:51 am

    But these oversampling and undersampling methods don’t work, if you have sklearn 0.20+…
    ModuleNotFoundError: No module named ‘sklearn.externals.six’

    • Jason Brownlee June 11, 2020 at 1:29 pm

      They work fine, but you must update to at least sklearn v0.23.1 and imblearn v0.6.2.

  35. Kondor June 13, 2020 at 6:43 am

    Examples of imbalanced problems:
    1. Any anomaly detection (predicting Oscar winner among all movies, picking the stock that will raise 1,000% in next three months, hurricane or tsunami prediction, etc.)
    2. Contraband detection at the border
    3. Automatic detection of defects in mass-produced products
    4. Picking future high-school valedictorians among first grade students

    In fact, imbalanced problems == anomaly detection, so 2-4 can be seen as more examples of 1

  36. Tony June 17, 2020 at 12:38 pm

    An example I am currently working on is land cover classification. Classes are generally imbalanced with 1:100 differences between the least and most common classes (i.e. severe imbalance) being commonplace.

  37. hana June 18, 2020 at 5:34 pm

    Fraud Detection.
    Claim Prediction
    Default Prediction.
    Churn Prediction.
    Spam Detection.
    Anomaly Detection.
    Outlier Detection.
    Intrusion Detection
    Conversion Prediction.

  38. Mo June 20, 2020 at 1:34 am

    Hi Jason,
    Thanks for your courses!
    I have a quick question. I am working on a classification problem where I had the imbalance data set and to resolve the issue, I simply collected more data and the imbalance is solved for the training set. I built a classifier and test it with a very good accuracy, precision, and recall. Now I want to evaluate it with a newly collected data (unseen) , but I have imbalance issue with this evaluation set. This set is used to test the model for deployment. It seems to me that this imbalance issue will be forever. What should I do to make sure my model is ready for deployment? Thanks

  39. Justin July 4, 2020 at 4:03 am

    Lesson 4: Undersampling
    I’m working on a project to predict tennis match upsets. my dataset has roughly 1:4 upsets to not upsets. After undersampling non-upsets, I saw improvement in my Logistic Regression model.

    Before resampling: 734 non-upsets, 279 upsets.
    [[139 1]
    [ 61 2]]
    precision recall f1-score support

    0 0.69 0.99 0.82 140
    1 0.67 0.03 0.06 63

    accuracy 0.69 203
    macro avg 0.68 0.51 0.44 203
    weighted avg 0.69 0.69 0.58 203

    ROC AUC: mean 0.668 (sd 0.051)

    After undersampling: 558 non-upsets, 279 upsets
    [[99 8]
    [39 22]]
    precision recall f1-score support

    0 0.72 0.93 0.81 107
    1 0.73 0.36 0.48 61

    accuracy 0.72 168
    macro avg 0.73 0.64 0.65 168
    weighted avg 0.72 0.72 0.69 168

    ROC AUC: mean 0.728 (0.084)

    That’s a big improvement in recall for the positive case! I also see a higher ROC AUC.

    Thanks Jason!

  40. Justin July 8, 2020 at 4:34 am

    Lesson 5: Oversampling
    Compared to my random undersampling, the oversampling method resulted in a higher ROC AUC and f1 score. (By random experimentation I found that the sweet spot was sampling_strategy=0.62 resulted in the highest for both metrics.)

    Undersampling results: ROC AUC = 0.728 (0.084) f1 = 0.48 for positive case
    Oversampling results: ROC AUC = 0.763 (0.056) f1 = 0.56 for positive case

    An interesting result, when I set sampling_strategy = 1, meaning it would balance exactly the two cases, the ROC AUC and f1 scores dropped drastically. I imagine this is because I’ve added so much randomness to the underrepresented side that it obscures the information in those cases.

    Another thing I need to learn: when evaluating f1 score, I’ve been looking At the score for the case I care about (the undersampled side). My goal is to be able to predict when that will happen with few false positives. Is there a better score for me to evaluate?

    Thanks Jason!

  41. Alexandre K July 8, 2020 at 1:11 pm

An example of imbalanced data is the occurrence of severe events while drilling/casing an oil well. My job is to predict casing problems given earlier drilling issues.

  42. Richik Majumder July 29, 2020 at 4:39 pm

    Hi sir. Been a follower of your tutorials and they have proven useful to me many-a-times. So first of all thanks for that.

    I had a question here, that is, what is your experience of using oversampling / undersampling / combinations like SMOTEENN and cost sensitive learning? What I mean is how do you choose one method over the other? I now have an intuition about how to decide the under vs over vs combined sampling. But these are seemingly one option altogether, that is, sampling. And cost sensitive learning seems a totally different method. So how to decide when and why to go for which method at different scenarios?

    (My mind has made 2 clusters. sampling and cost-sensitive learning!!)

  43. Safa Bouguezzi August 23, 2020 at 7:23 pm

    An example I am currently working on is Fake Job Posting classification
    Class 0 (Not fraudulent) : 17014
Class 1 (Fraudulent) : 866

  44. Saurabh Sawhney August 27, 2020 at 12:18 pm

    First up, many thanks for all the work you put in, Jason.

An example of imbalanced learning that I thought of immediately pertains to the diagnosis of rare diseases. That led me to thinking that, in fact, trying to detect anything that is rare, by definition, leads to imbalanced classification problems. So the list can be extended to finding rare things in a sea of commonality, for example, detecting fake currency notes, predicting a hole-in-one in golf, or classifying a popular OS version as buggy or bug-free.

  45. Kari September 5, 2020 at 8:49 am

    An example of an imbalance class would be whether or not an applicant is accepted by medical programs at colleges like the University of New Mexico, University of Washington, and Florida State University.

  46. Teni September 6, 2020 at 11:51 am

    Hi Jason,
    Thanks for always providing great tutorial. I tried to sign up for the free class but never received the email. Can you please help. Thanks

  47. Soniya September 14, 2020 at 7:27 pm

Hi Jason, I work for a Telco client and most of the problems I see seem to be cases of imbalanced classification:
    1.Telco Churn problem
    2.Port Out classification
    3.Email optout prediction
    4.Device AAL/Upgrade

As per my understanding, an imbalanced problem is one where there is a ratio of 1:100 or more between the majority and minority classes. When we give such skewed data to a model, the model is unable to learn the minority class properly.

However, my question is: what if my data is skewed at a 1:100 ratio, but I have a huge volume of data, and say the minority class has more than 500K samples? Do we still need to balance the data, or are this many minority-class records enough for the model to learn from?

  48. Kingshuk October 21, 2020 at 10:20 am

    Hi Jason,

    Few examples of imbalanced classification

    1. Airport security profiling travelers for terrorist threat. (TSA pre-check etc.)
    2. Unusual credit card transactions – fraud detection
    3. Weed detection from plant attributes.
    4. Datacenter malfunction conditions of servers.
    5. COVID-19 exposure detection.

    Kingshuk

  49. Shalini December 22, 2020 at 3:02 pm

    Hi Jason,
    Thanks so much for the email tutorials….I have a few questions:

I was doing modelling on a bank loans dataset. There is a class imbalance of 1:10 in the target class. I understand from your lessons that in this case the recall score is important because minimizing false positives is important (the bank is interested in attracting, through a targeted campaign, those who are likely to buy loans and I think won’t mind targeting even customers who are not likely to buy, but still a balance is required).

    So my first question is :
1. Should I think about only maximizing the recall score here and not use the F1 score, which is more of a balance between the two?

2. If I however think of balancing between recall and precision, I understand that I should choose the F1 score as well as the AUC for the precision-recall curve to evaluate and compare model performance with others. Is my understanding right?

3. I have a question regarding the F1 score vs. precision-recall AUC. Are these two scores equivalent, so I can use either of them for comparing and evaluating different models, or does the AUC-PR hold more information than the F1 score? If so, what is it, and how do I interpret it?

    Thanks
    Shalini

  50. Mallikharjuna Rao K January 13, 2021 at 10:18 pm

    C-Section prediction
    Email classification,
    Movie ratings,
    Tumor prediction/analysis,
    Diabetes analysis
    Agriculture Crop predictions

  51. Mallikharjuna Rao K January 18, 2021 at 2:32 am

    Hi Jason,

    In my observation
    the classification is imbalanced
990 ‘0’ points – Majority Class
10 ‘1’ points – Minority Class

  52. Chandana Kithalagama January 21, 2021 at 1:15 pm

Here are some problems that lead to class imbalance.
    – Identify defects of a product during the manufacturing process.
    – Detect cracks in mobile phone screens for claiming insurance.
– Detect rare diseases from a chest x-ray or MRI scan image.
    – Detecting failure conditions in a nuclear reactor.
    – Detect suspicious behaviour of a bank customer.

  53. Chandana Kithalagama January 21, 2021 at 4:37 pm

I ran the Lesson 2 code a couple of times and got this observation. This happens when the 1s and 0s are completely overlapping. I wish I could show an image here.

    Accuracy: 0.987
    Precision: 0.000
    Recall: 0.000
    F-measure: 0.000

  54. Lorenzo Pesce January 22, 2021 at 1:04 am

    Real world examples:
1 - Who, after contracting Covid, will develop Acute Respiratory Distress Syndrome (ARDS, ~1/20)
2 - Who, after being intubated for Covid, will likely die and is therefore a candidate for more desperate measures (~1/5)
3 - Which teenagers are more at risk of suicide given social media postings and behavior (percentage unknown, but smaller than 1 in 100)
4 - Which mammograms contain sufficient evidence of invasive cancer (~0.5%)
5 - Which subset of girls (or boys) is likely to make a person happy if they become her/his partner (estimated 10E-7).

    • Jason Brownlee January 22, 2021 at 7:22 am

      Well done!

      • Oladoja Ilobekemen Perpetual January 22, 2021 at 7:07 pm

        Five general examples of imbalance classification problems :

        .
        Spam Detection.
        Anomaly Detection.
        Outlier Detection.
        Intrusion Detection
        Conversion Prediction.

  55. Anis Ben Ben Aicha January 25, 2021 at 6:44 am

    I am a researcher and I am currently working in many fields involving imbalanced data such as:
    1- earlier detection of precancerous lesions
    2- speech spoofing detection for human/machine interfaces
    3- Urban sound analysis

  56. Jorrit Voigt January 25, 2021 at 8:47 pm

    Task 3:
    In case of overlap and weights=0.9, I had the following results:

    Accuracy: 0.898
    Precision: 0.491
    Recall: 0.540
    F-measure: 0.514
    ROC-AUC: 0.739
    PR-AUC: 0.538

    Here the ROC-AUC seems to overestimate the classifier. When is it advisable to give the precision recall AUC to avoid overestimation?

  57. Sourabh Agarwal January 28, 2021 at 10:03 pm

    Anomaly Detection
    Spam/Ham Detection
    Electricity theft
    Churn Rate
    Engine Rating Prediction (0/1)

  58. Gamze January 29, 2021 at 4:05 pm

    Dear Jason,

    I do thank you so much for sharing.

    I am trying to make predictions by using an ML model fitted on the imbalanced dataset. I just want to ask how can I exactly reproduce the training distribution in prediction distribution? This is very important for my study.

    Thank you in advance.
    Kind Regards

    • Jason Brownlee January 30, 2021 at 6:31 am

      We use a random sample or stratified random sample of the data for training and evaluating the model.

      Most models assume the samples are iid:
      https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables

      • Gamze January 30, 2021 at 7:36 am

        So, could not we reproduce exactly the distribution of classes for predictions?

        • Jason Brownlee January 30, 2021 at 8:03 am

          Sorry, I don’t follow – I think we are talking past each other. Perhaps you can elaborate on your question?

          • Gamze January 30, 2021 at 8:32 am

            I am so sorry for the misunderstanding.

            For example, I have a training dataset with two classes (35 % class 1 and 65% class 2).

            I will make a prediction with a fitted model on a prediction dataset. I must produce the same class distribution (35 % class 1 and 65% class 2) for this dataset.

            Training dataset- 100 samples – 35 % class 1 and 65% class 2
            Prediction dataset – 500 samples – 35 % class 1 and 65% class 2

            Is it possible?

          • Jason Brownlee January 30, 2021 at 12:34 pm

            Yes, this is called a stratified sample, you can learn more here:
            https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/

  59. kolliboina samba April 12, 2021 at 8:35 pm

While using SMOTEENN, it was taking a very long time to resample.

    • Jason Brownlee April 13, 2021 at 6:07 am

      Perhaps try using less data until you see if it helps or not?

  60. Faten April 16, 2021 at 10:45 pm

    Thank you for this tutorial!
Can the code in Lesson 7 (cost-sensitive learning) be used in a sequential model, not only logistic regression?

  61. Begovega July 28, 2021 at 7:37 pm

    Class imbalance examples:
    ->Conversion rate for customers who receive an email and purchase an expensive product
    ->Any type of disease or disorder detection
->Customers spending more than 80% of the average expenditure (high-value customer detection)
    ->Customer Churn prediction

  62. Victor August 13, 2021 at 2:42 pm

    Hi Jason, brief and clear explanation of basic methods to tackle imbalance!

    How can we effectively use them in keras? Or there are some special tools?
Of course, we may over/under-sample the whole dataset before feeding it into the neural network, but this approach looks rough…
Should we put the rebalancing into some kind of generator that updates inputs before each epoch, or even at the batch level?

    • Adrian Tam August 14, 2021 at 2:36 am

No special tools. There might be some handy functions from scikit-learn to help you preprocess the data, but what is most important is to know the concept and apply it before you feed data into the network. Which tool you use is not so crucial.

  63. Ismael Miranda August 27, 2021 at 9:29 am

I need to predict the failure of a process in industry. It is a rare situation, but in aggregate it has a great impact.

After finishing the crash course, one doubt remains.

How do I use the under- and over-sampled data to train my model?

Should I create X_train, X_test, y_train, y_test from my new under- and over-sampled data, or should I just under- and over-sample the X_train and y_train from my original data?

    • Adrian Tam August 28, 2021 at 3:57 am

      I think creating from original base is simpler. Did you see any problem with this approach?

  64. Ismael Miranda August 28, 2021 at 5:07 am

    Actually, I’ve been trying to see it by myself.

    https://github.com/IsmaelMiranda11/cienciadedados/blob/main/classes%20desbalanceadas

  65. Neha October 21, 2021 at 8:19 pm

    Class imbalance can be seen in
    1. Sentiment analysis where we see more positive reviews
    2. Email is spam or not – Most emails are spam
    3. Service required for a product after warranty expiry – Most cases will be yes
    4. Symptoms after vaccination – Most cases will be yes
    5. Screen guard bought with mobile – Most cases yes.

    I have a question regarding True Positive (TP) cases. Let me take an example of the Titanic Kaggle problem where the challenge was to predict whether a person will survive or die. If Dead was 0 and survive was 1. In this case, TP is the one who survived and was reported to survive. If in the same scenario, dead is 1 and survived is 0. Then, in this case, TP is one who died and was reported dead.

    • Adrian Tam October 22, 2021 at 4:11 am

      You’re correct. Whether a label is positive or negative is a subjective design choice. But sometimes we prefer to call something positive because we want to focus on that (e.g., disease, we care about the infected and assume not infected is normal and pay less attention)

  66. Divine November 11, 2021 at 8:54 am

    Hello Jason,

    Thank you for the great course!

I have a question regarding using SMOTE to balance data with a rare event. My data have an event rate of about 7%. I used SMOTE to increase the event rate to about 40%. Now I want to use the model trained with SMOTE to predict the risk for a new patient. How do I adjust or correct the risk probability for the new patient to reflect the risk probability of the original data (7%, without SMOTE)?
The risk probability for the new data is 30% – 50% with SMOTE, and the feedback I get is that the risk shouldn’t be far from 7%.

    I hope my question makes sense.

    Thank you!

    • Adrian Tam November 14, 2021 at 1:52 pm

I am afraid you misunderstood something here. SMOTE is to help you generate more data so that you help your model training (since otherwise, blindly guessing the majority class automatically gives you an accuracy of 1 - 7% = 93%); but once your model is trained and you apply it to the original data before SMOTE, you should see approximately how it performs in practice. You don’t need to do anything else.

  67. Lily June 6, 2022 at 7:06 am

    Thanks a lot, Dr. Jason for this great short course, it answered many questions I had on the imbalanced classifications.
I need your advice, as I am working on a multi-class model which is severely imbalanced. The output labels are 1, 2, 3 and they are (class 1: 94%, class 2: 3% and class 3: 3%). As you might suspect, capturing classes 2 and 3 is more important than class 1. I am using multinomial logistic regression; I tried oversampling but the model performance is low, for example the F1-score is 45%. I tried class weights as a dictionary but I didn’t find the right combination, and my problem is that most articles treat imbalanced classification for binary output, not multiple classes. Do you have any advice, or an article/resource I can refer to?

    Many thanks

  68. Lily June 7, 2022 at 7:04 pm

    Thank you, James, the problem with the ensemble is that I can only use the logistic regression as a contributing model in the VotingClassifier because the multinomial logistic regression is the only model I could find which predicts multi-classes (not binomial only). do you have any advice?
    Thanks a lot

  69. Fikret July 27, 2022 at 11:39 pm

    Hi Jason,
    Some examples of imbalanced classification:
    – predicting power outages from power quality analyser data
    – conditional monitoring for detecting equipment failures
    – medical diagnosis image classification
    – fraud detection
    – detecting anomalies

    • James Carmichael July 28, 2022 at 5:40 am

      Absolutely Fikret! Keep up the great work!
