Imbalanced Classification With Python (7-Day Mini-Course)

Last Updated on August 18, 2020

Imbalanced Classification Crash Course.
Get on top of imbalanced classification in 7 days.

Classification predictive modeling is the task of assigning a label to an example.

Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal.

Practical imbalanced classification requires the use of a suite of specialized data preparation techniques, learning algorithms, and performance metrics.

In this crash course, you will discover how you can get started and confidently work through an imbalanced classification project with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Photo by Arches National Park, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

  • You know your way around basic Python for programming.
  • You may know some basic NumPy for array manipulation.
  • You may know some basic scikit-learn for modeling.

You do NOT need to be:

  • A math wiz!
  • A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can navigate an imbalanced classification project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

Want to Get Started With Imbalanced Classification?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with imbalanced classification in Python:

  • Lesson 01: Challenge of Imbalanced Classification
  • Lesson 02: Intuition for Imbalanced Data
  • Lesson 03: Evaluate Imbalanced Classification Models
  • Lesson 04: Undersampling the Majority Class
  • Lesson 05: Oversampling the Minority Class
  • Lesson 06: Combine Data Undersampling and Oversampling
  • Lesson 07: Cost-Sensitive Algorithms

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help about the algorithms and the best-of-breed tools in Python. (Hint: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Imbalanced Classification with Python.”

Lesson 01: Challenge of Imbalanced Classification

In this lesson, you will discover the challenge of imbalanced classification problems.

Imbalanced classification problems pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

This results in models that have poor predictive performance, specifically for the minority class. This is a problem because typically, the minority class is more important and therefore the problem is more sensitive to classification errors for the minority class than the majority class.

  • Majority Class: More than half of the examples belong to this class, often the negative or normal case.
  • Minority Class: Less than half of the examples belong to this class, often the positive or abnormal case.

A classification problem may be only a little skewed, such as when there is a slight imbalance. Alternatively, the classification problem may have a severe imbalance, where there might be hundreds or thousands of examples in one class and tens of examples in the other class in a given training dataset.

  • Slight Imbalance. Where the distribution of examples is uneven by a small amount in the training dataset (e.g. 4:6).
  • Severe Imbalance. Where the distribution of examples is uneven by a large amount in the training dataset (e.g. 1:100 or more).

Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.

As such, it is surprising that imbalanced classification does not get more attention.

Your Task

For this lesson, you must list five general examples of problems that inherently have a class imbalance.

One example might be fraud detection; another might be intrusion detection.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to develop an intuition for skewed class distributions.

Lesson 02: Intuition for Imbalanced Data

In this lesson, you will discover how to develop a practical intuition for imbalanced classification datasets.

A challenge for beginners working with imbalanced classification problems is understanding what a specific skewed class distribution means. For example, what is the difference, and what are the implications, of a 1:10 vs. a 1:100 class ratio?

The make_classification() scikit-learn function can be used to define a synthetic dataset with a desired class imbalance. The “weights” argument specifies the proportion of examples assigned to each class, e.g. [0.99, 0.01] means that 99 percent of the examples will belong to the majority class and the remaining 1 percent will belong to the minority class.

Once defined, we can summarize the class distribution using a Counter object to get an idea of exactly how many examples belong to each class.

We can also create a scatter plot of the dataset because there are only two input variables. The dots can then be colored by each class. This plot provides a visual intuition for what exactly a 99 percent vs. 1 percent majority/minority class imbalance looks like in practice.

The complete example of creating and summarizing an imbalanced classification dataset is listed below.
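
A minimal sketch of such an example follows; the specific settings (e.g. n_samples=1000, flip_y=0, random_state=1) are illustrative choices.

from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
# define a synthetic binary dataset with a 99:1 class distribution
X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples, colored by class label
for label, _ in counter.items():
	row_ix = where(y == label)[0]
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()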

Your Task

For this lesson, you must run the example and review the plot.

For bonus points, you can test different class ratios and review the results.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to evaluate models for imbalanced classification.

Lesson 03: Evaluate Imbalanced Classification Models

In this lesson, you will discover how to evaluate models on imbalanced classification problems.

Prediction accuracy is the most common metric for classification tasks, although it is inappropriate and potentially dangerously misleading when used on imbalanced classification tasks.

The reason is that if 98 percent of the data belongs to the negative class, you can achieve 98 percent accuracy on average by simply predicting the negative class all the time, achieving a score that naively looks good but in practice reflects no skill.

Instead, alternate performance metrics must be adopted.

Popular alternatives are the precision and recall scores, which allow the performance of the model to be considered with a focus on the minority class, called the positive class.

Precision is the number of correctly predicted positive examples divided by the total number of examples predicted as positive. Maximizing the precision will minimize the false positives.

  • Precision = TruePositives / (TruePositives + FalsePositives)

Recall is the number of correctly predicted positive examples divided by the total number of positive examples that could have been predicted. Maximizing recall will minimize false negatives.

  • Recall = TruePositives / (TruePositives + FalseNegatives)

The performance of a model can be summarized by a single score that balances both the precision and the recall, called the F-measure (the harmonic mean of the two). Maximizing the F-measure will maximize both the precision and recall at the same time.

  • F-measure = (2 * Precision * Recall) / (Precision + Recall)

The example below fits a logistic regression model on an imbalanced classification problem and calculates the accuracy, which can then be compared to the precision, recall, and F-measure.
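
A sketch of such an example is given next; the dataset settings and the 50/50 stratified train/test split are illustrative choices, not the only way to do it.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# define a synthetic dataset with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train and test sets, preserving the class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# fit a logistic regression model on the training set
model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)
# evaluate on the test set
yhat = model.predict(X_test)
print('Accuracy: %.3f' % accuracy_score(y_test, yhat))
print('Precision: %.3f' % precision_score(y_test, yhat))
print('Recall: %.3f' % recall_score(y_test, yhat))
print('F-measure: %.3f' % f1_score(y_test, yhat))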

Your Task

For this lesson, you must run the example and compare the classification accuracy to the other metrics, such as precision, recall, and F-measure.

For bonus points, try other metrics such as Fbeta-measure and ROC AUC scores.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to undersample the majority class.

Lesson 04: Undersampling the Majority Class

In this lesson, you will discover how to undersample the majority class in the training dataset.

A simple approach to using standard machine learning algorithms on an imbalanced dataset is to change the training dataset to have a more balanced class distribution.

This can be achieved by deleting examples from the majority class, referred to as “undersampling.” A possible downside is that examples from the majority class that are helpful during modeling may be deleted.

The imbalanced-learn library provides implementations of many undersampling algorithms. This library can be installed easily using pip; for example:
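
Assuming a standard Python environment, something like the following should work (you may need pip3 or sudo depending on your setup):

pip install imbalanced-learn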

A fast and reliable approach is to randomly delete examples from the majority class to reduce the imbalance to a less severe ratio, or even until the class distribution is balanced.

The example below creates a synthetic imbalanced classification dataset, then uses the RandomUnderSampler class to change the class distribution from a 1:100 minority-to-majority ratio to a less severe 1:2.
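
A minimal sketch of this example, with illustrative dataset settings:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution before undersampling
print(Counter(y))
# define the undersampling strategy
undersample = RandomUnderSampler(sampling_strategy=0.5)
# fit and apply the transform
X_under, y_under = undersample.fit_resample(X, y)
# summarize the class distribution after undersampling
print(Counter(y_under))

Setting sampling_strategy=0.5 means the minority class will be half the size of the majority class after resampling, i.e. a 1:2 ratio.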

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after undersampling the majority class.

For bonus points, try other undersampling ratios or even try other undersampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to oversample the minority class.

Lesson 05: Oversampling the Minority Class

In this lesson, you will discover how to oversample the minority class in the training dataset.

An alternative to deleting examples from the majority class is to add new examples from the minority class.

This can be achieved by simply duplicating examples in the minority class, although these examples do not add any new information. Instead, new examples from the minority class can be synthesized using existing examples in the training dataset. These new examples will be “close” to existing examples in the feature space, but different in small and random ways.

The SMOTE algorithm is a popular approach for oversampling the minority class. This technique can be used to reduce the imbalance or to make the class distribution even.

The example below demonstrates using the SMOTE class provided by the imbalanced-learn library on a synthetic dataset. The initial class distribution is 1:100 and the minority class is oversampled to a 1:2 distribution.
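
A minimal sketch of this example, again with illustrative dataset settings:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution before oversampling
print(Counter(y))
# define the oversampling strategy: oversample the minority class to a 1:2 ratio
oversample = SMOTE(sampling_strategy=0.5)
# fit and apply the transform
X_over, y_over = oversample.fit_resample(X, y)
# summarize the class distribution after oversampling
print(Counter(y_over))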

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after oversampling the minority class.

For bonus points, try other oversampling ratios, or even try other oversampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to combine undersampling and oversampling techniques.

Lesson 06: Combine Data Undersampling and Oversampling

In this lesson, you will discover how to combine data undersampling and oversampling on a training dataset.

Data undersampling will delete examples from the majority class, whereas data oversampling will add examples to the minority class. These two approaches can be combined and used on a single training dataset.

Given that there are so many different data sampling techniques to choose from, it can be confusing as to which methods to combine. Thankfully, there are common combinations that have been shown to work well in practice; some examples include:

  • Random Undersampling with SMOTE oversampling.
  • Tomek Links Undersampling with SMOTE oversampling.
  • Edited Nearest Neighbors Undersampling with SMOTE oversampling.

These combinations can be applied manually to a given training dataset by first applying one sampling algorithm, then another. Thankfully, the imbalanced-learn library provides implementations of common combined data sampling techniques.

The example below demonstrates how to use the SMOTEENN class, which combines both SMOTE oversampling of the minority class and Edited Nearest Neighbors undersampling of the majority class.
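
A minimal sketch of this example, using the default SMOTEENN settings on an illustrative synthetic dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution before sampling
print(Counter(y))
# define the combined sampling strategy (SMOTE oversampling, then ENN cleaning)
sample = SMOTEENN()
# fit and apply the transform
X_res, y_res = sample.fit_resample(X, y)
# summarize the class distribution after sampling
print(Counter(y_res))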

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after the data sampling.

For bonus points, try other combined data sampling techniques or even try manually applying oversampling followed by undersampling on the dataset.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Lesson 07: Cost-Sensitive Algorithms

In this lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Most machine learning algorithms assume that all misclassification errors made by a model are equal. This is often not the case for imbalanced classification problems, where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. Many machine learning algorithms can be updated to be cost-sensitive, where the model is penalized for misclassification errors from one class more than the other, such as the minority class.

The scikit-learn library provides this capability for a range of algorithms via the class_weight attribute specified when defining the model. A weighting can be specified that is inversely proportional to the class distribution.

If the class distribution were 0.99 to 0.01 for the majority and minority classes, then the class_weight argument could be defined as a dictionary that imposes a penalty of 0.01 for errors made on the majority class and a penalty of 0.99 for errors made on the minority class, e.g. {0:0.01, 1:0.99}.

This is a useful heuristic and can be configured automatically by setting the class_weight argument to the string 'balanced'.

The example below demonstrates how to define and fit a cost-sensitive logistic regression model on an imbalanced classification dataset.
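
A minimal sketch of this example, evaluated with repeated stratified k-fold cross-validation and ROC AUC as illustrative choices:

from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
	n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# define the cost-sensitive model
# (a dictionary such as {0:0.01, 1:0.99} could be used instead of 'balanced')
model = LogisticRegression(solver='lbfgs', class_weight='balanced')
# evaluate the model with repeated stratified k-fold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))

Using class_weight='balanced' weights each class inversely proportional to its frequency in the training data, which matches the heuristic described above.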

Your Task

For this lesson, you must run the example and review the performance of the cost-sensitive model.

For bonus points, compare the performance to the cost-insensitive version of logistic regression.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson of the mini-course.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

  • The challenge of imbalanced classification is the lack of examples for the minority class and the difference in importance of classification errors across the classes.
  • How to develop a spatial intuition for imbalanced classification datasets that might inform data preparation and algorithm selection.
  • The failure of classification accuracy and how alternate metrics like precision, recall, and the F-measure can better summarize model performance on imbalanced datasets.
  • How to delete examples from the majority class in the training dataset, referred to as data undersampling.
  • How to synthesize new examples in the minority class in the training dataset, referred to as data oversampling.
  • How to combine data oversampling and undersampling techniques on the training dataset, and common combinations that result in good performance.
  • How to use cost-sensitive modified versions of machine learning algorithms to improve performance on imbalanced classification datasets.

Take the next step and check out my book on Imbalanced Classification with Python.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.

Get a Handle on Imbalanced Classification!

Imbalanced Classification with Python

Develop Imbalanced Learning Models in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Imbalanced Classification with Python

It provides self-study tutorials and end-to-end projects on:
Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, Cost-Sensitive Algorithms
and much more...

Bring Imbalanced Classification Methods to Your Machine Learning Projects

See What's Inside

96 Responses to Imbalanced Classification With Python (7-Day Mini-Course)

  1. Ken Jones January 17, 2020 at 8:32 am #

    A real-world example I am working on right now is predicting no shows for medical providers. In the data set I am working with, about 5% of appointments are no shows. The balance are shows or occurred appointments. The interesting class are the no shows.

  2. Mark Littlewood January 17, 2020 at 6:43 pm #

    A real world example I am concerned with is predicting the winners of horse races. Typically each race has on average around 11 runners and ten of those will have a ‘0’ meaning they did not win the race whilst one row will be the winner with a ‘1’. Funnily enough, using GBM I have not found balancing to be helpful but maybe I am coming at it from a wrong perspective with my balancing technique

    • Jason Brownlee January 18, 2020 at 8:38 am #

      Interesting problem.

      I would recommend looking at rating systems.

  3. Kate Strydom January 17, 2020 at 6:50 pm #

    I work on lead data, predicting call centre sales on various telecommunication and lead generation campaigns where the responses are always imbalanced. That is, sales versus no sale, and hot lead versus no lead. I usually just take a random sample of the negative response equivalent to the positive response in my sample prior to pulling the data into Python. Call centre data responses are always imbalanced due to the nature of the business. I would be keen to learn other ways to balance the responses.

  4. Mark Littlewood January 17, 2020 at 7:36 pm #

    With flip_y set to zero the 1 and 0s are created in a pretty distinct manner, there is little overlap. They also appear to be pretty linear in relation to the predictors

  5. Mark Littlewood January 17, 2020 at 8:41 pm #

    The FBeta-score is interesting as a combination of precision and recall but you can weight it towards precision or recall. With betting precision is more important perhaps as bets that lose cost you money where as false negatives are annoying but not financially damaging.

  6. Alexander Binkovsky January 17, 2020 at 8:57 pm #

    A real-world example of imbalanced classification I’m working on right now is anomaly detection in monitoring data of a huge distributed application.

  7. Ciprian Saramet January 18, 2020 at 2:07 am #

    a real-world example of imbalanced data is server monitoring logs and trying to predict service failures

  8. Joy January 18, 2020 at 6:17 pm #

    Hi Jason. Thank you for the insightful article!!
    Your code runs fine Jason. But I cannot understand this piece:

    for label, _ in counter.items(): (<— this is for iterating over the dictionary of counter.items()

    row_ix = where(y == label)[0] (<— Does this mean row_ix is equal to x values of only those counter items belonging to class 0?)

    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) (<— plotting the points of both classes 0 and 1)

    Can you please explain a bit? I am not that proficient in Python

    • Jason Brownlee January 19, 2020 at 7:14 am #

      Thanks.

      We iterate over the labels.

      We can get the row indexes for all the examples in y that have the label.

      We then plot the points with those row indexes.

  9. Vijay Pal January 18, 2020 at 6:56 pm #

    Customer churn and customer chargeback are two classical cases.

  10. Moi khalile January 18, 2020 at 8:17 pm #

    a real-world example of imbalanced data is medical images classification

  11. Shay January 18, 2020 at 11:07 pm #

    Malware detection in the real world is inherently severely imbalanced.

  12. Damir Zunic January 20, 2020 at 6:57 am #

    I worked on predicting risk for type 2 diabetes using 3 classes per A1C levels: no-diabetes (77%), pre-diabetes (6%) and diabetes(17%).

  13. Sanchit Bhavsar January 21, 2020 at 2:39 am #

    The highly imbalanced data problem I worked on was to predict user ad clicks with a majority of non-clicks (99%) and clicks (1%).

  14. James January 21, 2020 at 1:26 pm #

    Lesson 1: Five general examples of class imbalance
    1. Populations of rare and/or endangered species
    2. Incidence of rare diseases
    3. Extreme weather patterns
    4. Excessive spending for non-essential items
    5. Mechanical problems leading to highly probable breakdowns

  15. Nwe January 21, 2020 at 3:59 pm #

    I think that the misclassification error cost in imbalanced data classification is not the same for each class because the number of training samples is not the same. So, the performance metrics may depend on all training classes.

    Please give me a suggestion on my opinion.

    • Jason Brownlee January 22, 2020 at 6:18 am #

      Yes, although it will depend on the specifics of the dataset.

  16. Sachin Prabhu January 23, 2020 at 1:42 am #

    Answer for Lesson 01: Challenge of Imbalanced Classification

    1. Cancer cell prediction
    2. Spam/Ham classification
    3. Earthquake prediction
    4. Page Blocks Classification
    5. Glass Identification

  17. Sudhansu R. Lenka January 24, 2020 at 7:52 pm #

    Should the dataset be balanced on the original dataset, or do we need to split into train and test first and then balance only the training set?

  18. stanislav January 26, 2020 at 6:01 am #

    Coming examples only from social field:
    1. number of rich/poor
    2. —“– wealthy/ill
    3. —“– birth/death
    4. —“– buyer/seller
    5. —“– desires (wishes)/achievements

  19. Vinod Kumar February 6, 2020 at 1:30 pm #

    Sir what are the techniques to fix the imbalance in data set

    • Jason Brownlee February 6, 2020 at 1:48 pm #

      There are many:

      – choose the right metric
      – try data over/undersampling
      – try cost sensitive models
      – try threshold moving
      – try one class models
      – try specialized ensembles
      – …

      Compare everything to default models to demonstrate they add value/skill.

  20. Animesh February 9, 2020 at 4:36 am #

    One more real time Example can be Sales return prediction for an online portal…

  21. Cheyu February 18, 2020 at 3:00 pm #

    Great course, it is really helpful.

    Here I have two questions about the proposed weights for the majority and minority classes.

    (1) As in the example, suppose that IR is given 99 (label 0 is 0.99, label 1 is 0.01) and you suggest to give weight 0.01 for the majority class and give weight 0.99 for the minority class. the results will be balanced after multiplying with the weights. Are there any references and papers to support?

    (2) If the cost for misclassification error is pre-defined (for both FP & FN) but no rewards are given (for both TP & TN), is it a good way to follow the cost-sensitivity manipulation (i.e., assign weights) still?

  22. Luis M. February 21, 2020 at 4:51 pm #

    One example of imbalanced classes is present in biometric recognition.
    An usual dataset has a number of Nu individuals/users and Ns_u samples per individual.
    From the genuine pairs and impostor pairs of a given dataset, we obtain two classes of matching scores: genuine scores and impostor scores.
    This is Nu*(Ns_u-1)*(Ns_u)/2 genuine scores and (Ns_u^2)*Nu*(Nu-1)/2 impostor scores.
    Let us say that you have 100 individuals and 10 sample per individual.
    Then, you would have 4500 genuine scores and 495000 impostor scores.
    This is an important problem for multi-biometric recognition where you need to train a classifier for score fusion.

  23. elli February 21, 2020 at 11:55 pm #

    Five examples of imbalanced classes might be: cancer detection in cells, detecting students with learning differences, calls to a call center which are about an unexpected topic, loan defaults, or detecting a rare disease from patient medical records.

  24. Nitish Khairnar February 26, 2020 at 6:23 pm #

    Hi Jason,
    sharing my examples as task of Day -1 of Imbalance Classification.

    1- detection of patients having high level of blood sugar
    2- lottery winning ratio
    3- call drop due to technical glitches
    4- getting a right swipe on tinder (severely imbalanced)
    5- chances of raining in the month of February (here in India it is very very rare)
    6- getting a cashback after scratching a gift card in Google Pay(folks using this app for payment can connect well with this example)

    Thank you

  25. Naresh Sharma March 4, 2020 at 7:25 pm #

    I’m working on predicting whether a person with certain demographic features is likely to become a customer based on a mail campaign. The dataset shows that only 1.3% of folks become customers.

  26. Retsilisitsoe Raymond Moholisa March 7, 2020 at 1:42 am #

    Hi Jason, sharing my examples for Day 1 on Imbalance Classification
    1.Detecting patients with rare genetic diseases in medical databases
    2.Predicting customer churn behavior in telemarketing
    3.Predicting multiview face recognition
    4.Predicting skin cancer from medical images
    5.Leopard/Cheetah Classification

  27. Saman Tamkeen March 27, 2020 at 5:51 pm #

    Task for Lesson 01:

    I am trying to model a l2r algorithm. The measure of relevance is that, given a list of items, the items that were clicked are more relevant as compared to those that were seen and not clicked.

    In my example dataset, only 6.5% of items are relevant while remaining 93.5% are not

  28. milad May 1, 2020 at 2:01 pm #

    1-How to Handling Imbalanced data for multi-label classification in Keras?
    I’m dealing with a multi-label classification problem with an imbalanced dataset. I want to oversample with SMOTE. However, I don’t know how to achieve it since the label is like [0,1,0,0,1,0,1,1,0,0,0,0].

    2-how to set class_weight for multi label dataset in keras?

  29. JG May 7, 2020 at 2:49 am #

    Hi Jason,

    In addition to these SMOTE oversampling techniques or variants (such as SMOTEENN, which includes ENN undersampling), or just RandomUnderSampler, I have recently learned that Keras also provides a tool as an argument to the “.fit” method, called “class_weight”, where you can specify as a dictionary the integer class label vs. the weighting to be applied to the loss function for each class label. See reference: https://keras.io/models/sequential/

    Do you have any tutorial on this Keras technique? Is it a basic undersampling/oversampling technique, such as eliminating some data during training? What are the main differences between this Keras fit argument and the methods explained here (over- and undersampling)?

    thanks in advance
    regards

  30. JG May 8, 2020 at 3:49 am #

    Thank you very much Jason!

    You are not only a great ML/DL computer Engineer but also an excellent Professor that achieve incredible outreach performance and the best one value, at least for me, a great person!

    regards,
    JG

  31. John Sammut May 8, 2020 at 6:04 am #

    Hi Jason, thanks for this course.

    Here’s the list that I thought of:

    1. road anomaly vs good road surface
    2. normal seismic activity sensor data vs volcano eruption seismic activity
    3. healthy lung x-ray image vs lung cancer x-ray image
    4. legitimate vs spam email
    5. normal ocean waves sensor data vs tsunami waves

  32. John Sammut May 8, 2020 at 8:22 am #

    Lesson 5 – Comparing class distribution after the oversampling

    from collections import Counter
    from numpy import where
    from matplotlib import pyplot as plt
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01],
        flip_y=0, random_state=11)

    # summarize class distribution
    counter = Counter(y)
    print('Before oversampling:', counter)

    fig, ax = plt.subplots(1, 2, figsize=(15, 5), constrained_layout=True)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y == label)[0]
        ax[0].scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))

    # define oversample strategy
    oversample = SMOTE(sampling_strategy=0.5)

    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)

    # summarize class distribution
    counter = Counter(y_over)
    print('After oversampling:', counter)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y_over == label)[0]
        ax[1].scatter(X_over[row_ix, 0], X_over[row_ix, 1], label=str(label))

    ax[1].legend()
    ax[1].set_title('After oversampling', fontsize=14)
    ax[0].legend()
    ax[0].set_title('Before oversampling', fontsize=14)
    plt.show()

  33. Skylar May 20, 2020 at 4:36 am #

    Very nice course, thank you Jason! It is especially clear for Python newbie

  34. Kristina June 11, 2020 at 7:51 am #

    But these oversampling and undersampling methods don’t work, if you have sklearn 0.20+…
    ModuleNotFoundError: No module named ‘sklearn.externals.six’

    • Jason Brownlee June 11, 2020 at 1:29 pm #

      They work fine, but you must update to at least sklearn v0.23.1 and imblearn v0.6.2.

  35. Kondor June 13, 2020 at 6:43 am #

    Examples of imbalanced problems:
    1. Any anomaly detection (predicting Oscar winner among all movies, picking the stock that will raise 1,000% in next three months, hurricane or tsunami prediction, etc.)
    2. Contraband detection at the border
    3. Automatic detection of defects in mass-produced products
    4. Picking future high-school valedictorians among first grade students

    In fact, imbalanced problems == anomaly detection, so 2-4 can be seen as more examples of 1

  36. Tony June 17, 2020 at 12:38 pm #

    An example I am currently working on is land cover classification. Classes are generally imbalanced with 1:100 differences between the least and most common classes (i.e. severe imbalance) being commonplace.

  37. hana June 18, 2020 at 5:34 pm #

    Fraud Detection.
    Claim Prediction
    Default Prediction.
    Churn Prediction.
    Spam Detection.
    Anomaly Detection.
    Outlier Detection.
    Intrusion Detection
    Conversion Prediction.

  38. Mo June 20, 2020 at 1:34 am #

    Hi Jason,
    Thanks for your courses!
    I have a quick question. I am working on a classification problem where I had the imbalance data set and to resolve the issue, I simply collected more data and the imbalance is solved for the training set. I built a classifier and test it with a very good accuracy, precision, and recall. Now I want to evaluate it with a newly collected data (unseen) , but I have imbalance issue with this evaluation set. This set is used to test the model for deployment. It seems to me that this imbalance issue will be forever. What should I do to make sure my model is ready for deployment? Thanks

  39. Justin July 4, 2020 at 4:03 am #

    Lesson 4: Undersampling
    I’m working on a project to predict tennis match upsets. my dataset has roughly 1:4 upsets to not upsets. After undersampling non-upsets, I saw improvement in my Logistic Regression model.

    Before resampling: 734 non-upsets, 279 upsets.
    [[139 1]
    [ 61 2]]
    precision recall f1-score support

    0 0.69 0.99 0.82 140
    1 0.67 0.03 0.06 63

    accuracy 0.69 203
    macro avg 0.68 0.51 0.44 203
    weighted avg 0.69 0.69 0.58 203

    ROC AUC: mean 0.668 (sd 0.051)

    After undersampling: 558 non-upsets, 279 upsets
    [[99 8]
    [39 22]]
    precision recall f1-score support

    0 0.72 0.93 0.81 107
    1 0.73 0.36 0.48 61

    accuracy 0.72 168
    macro avg 0.73 0.64 0.65 168
    weighted avg 0.72 0.72 0.69 168

    ROC AUC: mean 0.728 (0.084)

    That’s a big improvement in recall for the positive case! I also see a higher ROC AUC.

    Thanks Jason!

  40. Justin July 8, 2020 at 4:34 am #

    Lesson 5: Oversampling
    Compared to my random undersampling, the oversampling method resulted in a higher ROC AUC and f1 score. (By random experimentation I found that the sweet spot was sampling_strategy=0.62 resulted in the highest for both metrics.)

    Undersampling results: ROC AUC = 0.728 (0.084) f1 = 0.48 for positive case
    Oversampling results: ROC AUC = 0.763 (0.056) f1 = 0.56 for positive case

    An interesting result, when I set sampling_strategy = 1, meaning it would balance exactly the two cases, the ROC AUC and f1 scores dropped drastically. I imagine this is because I’ve added so much randomness to the underrepresented side that it obscures the information in those cases.

    Another thing I need to learn: when evaluating f1 score, I’ve been looking At the score for the case I care about (the undersampled side). My goal is to be able to predict when that will happen with few false positives. Is there a better score for me to evaluate?

    Thanks Jason!

  41. Alexandre K July 8, 2020 at 1:11 pm #

    An example of unbalanced data is the occurrence of severe events while drilling/casing an oil well. My job is to predict casing problems since I had drilling issues.

  42. Richik Majumder July 29, 2020 at 4:39 pm #

    Hi sir. Been a follower of your tutorials and they have proven useful to me many-a-times. So first of all thanks for that.

    I had a question here, that is, what is your experience of using oversampling / undersampling / combinations like SMOTEENN and cost sensitive learning? What I mean is how do you choose one method over the other? I now have an intuition about how to decide the under vs over vs combined sampling. But these are seemingly one option altogether, that is, sampling. And cost sensitive learning seems a totally different method. So how to decide when and why to go for which method at different scenarios?

    (My mind has made 2 clusters. sampling and cost-sensitive learning!!)

  43. Safa Bouguezzi August 23, 2020 at 7:23 pm #

    An example I am currently working on is Fake Job Posting classification
    Class 0 (Not fraudulent) : 17014
    Cass 1 (Fraudulent) : 866

  44. Saurabh Sawhney August 27, 2020 at 12:18 pm #

    First up, many thanks for all the work you put in, Jason.

    An example of imbalanced learning that I thought of immediately pertains to the diagnosis of rare diseases. That led me to thinking that in fact, trying to detect anything that is rare, by definition, leads to imbalanced classification problems. So the list can be extending to finding rare things in a sea of commonality, for example, detecting fake currency notes, predicting a hole-in-one in golf, or classifying a popular OS version as buggy or bug-free.

  45. Kari September 5, 2020 at 8:49 am #

    An example of an imbalance class would be whether or not an applicant is accepted by medical programs at colleges like the University of New Mexico, University of Washington, and Florida State University.

  46. Teni September 6, 2020 at 11:51 am #

    Hi Jason,
    Thanks for always providing great tutorial. I tried to sign up for the free class but never received the email. Can you please help. Thanks

  47. Soniya September 14, 2020 at 7:27 pm #

    Hi Jason, I work for a Telco client and most of the problems I see seem to be cases of imbalanced classification:
    1.Telco Churn problem
    2.Port Out classification
    3.Email optout prediction
    4.Device AAL/Upgrade

    As per my understanding, an imbalanced problem is one where there is a ratio of 1:100 or more between the majority and minority class. When we give such skewed data to a model, the model is unable to learn the minority class properly.

    However, my question is: what if my data is skewed at a 1:100 ratio, but I have a huge volume of data, and say the minority class has more than 500K samples? Do we still need to balance our data, or are this many minority class records enough for the model to learn from?
