Imbalanced Classification With Python (7-Day Mini-Course)

Imbalanced Classification Crash Course.
Get on top of imbalanced classification in 7 days.

Classification predictive modeling is the task of assigning a label to an example.

Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal.

Working through an imbalanced classification project in practice requires a suite of specialized techniques, including data preparation methods, learning algorithms, and performance metrics.

In this crash course, you will discover how you can get started and confidently work through an imbalanced classification project with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

  • Updated Jan/2021: Updated links for API documentation.
Photo by Arches National Park, some rights reserved.

Who Is This Crash-Course For?

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

  • You know your way around basic Python for programming.
  • You may know some basic NumPy for array manipulation.
  • You may know some basic scikit-learn for modeling.

You do NOT need to be:

  • A math wiz!
  • A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can navigate an imbalanced classification project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:


Crash-Course Overview

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with imbalanced classification in Python:

  • Lesson 01: Challenge of Imbalanced Classification
  • Lesson 02: Intuition for Imbalanced Data
  • Lesson 03: Evaluate Imbalanced Classification Models
  • Lesson 04: Undersampling the Majority Class
  • Lesson 05: Oversampling the Minority Class
  • Lesson 06: Combine Data Undersampling and Oversampling
  • Lesson 07: Cost-Sensitive Algorithms

Each lesson could take you anywhere from 60 seconds up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on the algorithms and the best-of-breed tools in Python. (Hint: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

Note: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Imbalanced Classification with Python.”

Lesson 01: Challenge of Imbalanced Classification

In this lesson, you will discover the challenge of imbalanced classification problems.

Imbalanced classification problems pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

This results in models that have poor predictive performance, specifically for the minority class. This is a problem because, typically, the minority class is more important, and therefore the problem is more sensitive to classification errors on the minority class than on the majority class.

  • Majority Class: More than half of the examples belong to this class, often the negative or normal case.
  • Minority Class: Less than half of the examples belong to this class, often the positive or abnormal case.

A classification problem may be a little skewed, such as if there is a slight imbalance. Alternately, the classification problem may have a severe imbalance where there might be hundreds or thousands of examples in one class and tens of examples in another class for a given training dataset.

  • Slight Imbalance. Where the distribution of examples is uneven by a small amount in the training dataset (e.g. 4:6).
  • Severe Imbalance. Where the distribution of examples is uneven by a large amount in the training dataset (e.g. 1:100 or more).

Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.

As such, it is surprising that imbalanced classification does not get more attention.

Your Task

For this lesson, you must list five general examples of problems that inherently have a class imbalance.

One example might be fraud detection; another might be intrusion detection.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to develop an intuition for skewed class distributions.

Lesson 02: Intuition for Imbalanced Data

In this lesson, you will discover how to develop a practical intuition for imbalanced classification datasets.

A challenge for beginners working with imbalanced classification problems is understanding what a specific skewed class distribution means. For example, what are the difference and implications of a 1:10 vs. a 1:100 class ratio?

The make_classification() scikit-learn function can be used to define a synthetic dataset with a desired class imbalance. The “weights” argument specifies the proportion of examples assigned to each class, e.g. [0.99, 0.01] means that 99 percent of the examples will belong to the majority class and the remaining 1 percent will belong to the minority class.

Once defined, we can summarize the class distribution using a Counter object to get an idea of exactly how many examples belong to each class.

We can also create a scatter plot of the dataset because there are only two input variables. The dots can then be colored by each class. This plot provides a visual intuition for what exactly a 99 percent vs. 1 percent majority/minority class imbalance looks like in practice.

The complete example of creating and summarizing an imbalanced classification dataset is listed below.
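A minimal sketch along those lines is below; the specific configuration (10,000 rows, two input variables, a [0.99, 0.01] weighting) is just one reasonable choice, and the plot assumes Matplotlib is installed:

# create and summarize a synthetic imbalanced classification dataset
from collections import Counter
from numpy import where
from matplotlib import pyplot
from sklearn.datasets import make_classification
# define a dataset with a 99:1 class distribution
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples, colored by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()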

Your Task

For this lesson, you must run the example and review the plot.

For bonus points, you can test different class ratios and review the results.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to evaluate models for imbalanced classification.

Lesson 03: Evaluate Imbalanced Classification Models

In this lesson, you will discover how to evaluate models on imbalanced classification problems.

Prediction accuracy is the most common metric for classification tasks, although it is inappropriate and potentially dangerously misleading when used on imbalanced classification tasks.

The reason is that if 98 percent of the data belongs to the negative class, you can achieve 98 percent accuracy on average by simply predicting the negative class all the time, achieving a score that naively looks good but in practice reflects no skill.

Instead, alternate performance metrics must be adopted.

Popular alternatives are the precision and recall scores that allow the performance of the model to be considered by focusing on the minority class, called the positive class.

Precision is the ratio of correctly predicted positive examples to the total number of examples predicted as positive. Maximizing precision will minimize false positives.

  • Precision = TruePositives / (TruePositives + FalsePositives)

Recall is the ratio of correctly predicted positive examples to the total number of positive examples that could have been predicted. Maximizing recall will minimize false negatives.

  • Recall = TruePositives / (TruePositives + FalseNegatives)

The performance of a model can be summarized by a single score that combines precision and recall via their harmonic mean, called the F-measure. Maximizing the F-measure balances precision and recall at the same time.

  • F-measure = (2 * Precision * Recall) / (Precision + Recall)

The example below fits a logistic regression model on an imbalanced classification problem and calculates the accuracy, which can then be compared to the precision, recall, and F-measure.
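A minimal sketch along those lines is below, assuming a synthetic dataset like the one from Lesson 02 and a simple stratified train/test split; the split size and model settings are illustrative only:

# fit a logistic regression model and compare accuracy to minority-focused metrics
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# define an imbalanced dataset
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train and test sets, preserving the class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# fit the model and make predictions on the test set
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
# accuracy can look impressive while the minority-class metrics tell a different story
print('Accuracy: %.3f' % accuracy_score(y_test, yhat))
print('Precision: %.3f' % precision_score(y_test, yhat))
print('Recall: %.3f' % recall_score(y_test, yhat))
print('F-measure: %.3f' % f1_score(y_test, yhat))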

Your Task

For this lesson, you must run the example and compare the classification accuracy to the other metrics, such as precision, recall, and F-measure.

For bonus points, try other metrics such as Fbeta-measure and ROC AUC scores.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to undersample the majority class.

Lesson 04: Undersampling the Majority Class

In this lesson, you will discover how to undersample the majority class in the training dataset.

A simple approach to using standard machine learning algorithms on an imbalanced dataset is to change the training dataset to have a more balanced class distribution.

This can be achieved by deleting examples from the majority class, referred to as “undersampling.” A possible downside is that examples from the majority class that are helpful during modeling may be deleted.

The imbalanced-learn library provides implementations of many undersampling algorithms. The library can be installed easily using pip; for example:
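pip install imbalanced-learn

(Depending on how your Python environment is set up, you may need to use pip3 or add a sudo prefix.)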

A fast and reliable approach is to randomly delete examples from the majority class to reduce the imbalance to a less severe ratio, or even so that the class distribution is balanced.

The example below creates a synthetic imbalanced classification dataset, then uses the RandomUnderSampler class to change the class distribution from a 1:100 minority-to-majority ratio to a less severe 1:2.
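A minimal sketch along those lines is below; sampling_strategy=0.5 asks the undersampler to keep only enough majority class examples for a 1:2 minority-to-majority ratio:

# randomly undersample the majority class
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before undersampling: %s' % Counter(y))
# configure the undersampler for a 1:2 ratio and apply it
undersample = RandomUnderSampler(sampling_strategy=0.5)
X_under, y_under = undersample.fit_resample(X, y)
print('After undersampling: %s' % Counter(y_under))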

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after undersampling the majority class.

For bonus points, try other undersampling ratios or even try other undersampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to oversample the minority class.

Lesson 05: Oversampling the Minority Class

In this lesson, you will discover how to oversample the minority class in the training dataset.

An alternative to deleting examples from the majority class is to add new examples to the minority class.

This can be achieved by simply duplicating examples in the minority class, but these examples do not add any new information. Instead, new examples for the minority class can be synthesized from existing examples in the training dataset. These new examples will be “close” to existing examples in the feature space, but different in small, random ways.

The SMOTE algorithm is a popular approach for oversampling the minority class. This technique can be used to reduce the imbalance or to make the class distribution even.

The example below demonstrates using the SMOTE class provided by the imbalanced-learn library on a synthetic dataset. The initial class distribution is 1:100 and the minority class is oversampled to a 1:2 distribution.
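A minimal sketch along those lines is below; as in the previous lesson, sampling_strategy=0.5 requests a 1:2 minority-to-majority ratio after resampling:

# oversample the minority class with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before oversampling: %s' % Counter(y))
# synthesize new minority class examples until the ratio is 1:2
oversample = SMOTE(sampling_strategy=0.5)
X_over, y_over = oversample.fit_resample(X, y)
print('After oversampling: %s' % Counter(y_over))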

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after oversampling the minority class.

For bonus points, try other oversampling ratios, or even try other oversampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to combine undersampling and oversampling techniques.

Lesson 06: Combine Data Undersampling and Oversampling

In this lesson, you will discover how to combine data undersampling and oversampling on a training dataset.

Data undersampling will delete examples from the majority class, whereas data oversampling will add examples to the minority class. These two approaches can be combined and used on a single training dataset.

Given that there are so many different data sampling techniques to choose from, it can be confusing as to which methods to combine. Thankfully, there are common combinations that have been shown to work well in practice; some examples include:

  • Random Undersampling with SMOTE oversampling.
  • Tomek Links Undersampling with SMOTE oversampling.
  • Edited Nearest Neighbors Undersampling with SMOTE oversampling.

These combinations can be applied manually to a given training dataset by first applying one sampling algorithm, then another. Thankfully, the imbalanced-learn library provides implementations of common combined data sampling techniques.

The example below demonstrates how to use the SMOTEENN class, which combines SMOTE oversampling of the minority class with Edited Nearest Neighbors undersampling of the majority class.
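A minimal sketch along those lines is below; note that, unlike the previous lessons, the final class distribution also depends on how many examples the Edited Nearest Neighbors step removes:

# combine SMOTE oversampling with Edited Nearest Neighbors undersampling
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
# define a dataset with a 1:100 class distribution
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
print('Before resampling: %s' % Counter(y))
# apply SMOTE oversampling, then ENN cleaning
resample = SMOTEENN(sampling_strategy=0.5)
X_res, y_res = resample.fit_resample(X, y)
print('After resampling: %s' % Counter(y_res))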

Your Task

For this lesson, you must run the example and note the change in the class distribution before and after the data sampling.

For bonus points, try other combined data sampling techniques or even try manually applying oversampling followed by undersampling on the dataset.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Lesson 07: Cost-Sensitive Algorithms

In this lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Most machine learning algorithms assume that all misclassification errors made by a model are equal. This is often not the case for imbalanced classification problems, where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. Many machine learning algorithms can be updated to be cost-sensitive, where the model is penalized more for misclassification errors on one class than on the other, such as the minority class.

The scikit-learn library provides this capability for a range of algorithms via the class_weight argument specified when defining the model. A weighting can be specified that is inversely proportional to the class distribution.

If the class distribution were 0.99 to 0.01 for the majority and minority classes, then the class_weight argument could be defined as a dictionary that applies a penalty of 0.01 to errors made on the majority class and a penalty of 0.99 to errors made on the minority class, e.g. {0:0.01, 1:0.99}.

This is a useful heuristic and can be configured automatically by setting the class_weight argument to the string 'balanced'.

The example below demonstrates how to define and fit a cost-sensitive logistic regression model on an imbalanced classification dataset.
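A minimal sketch along those lines is below; it uses the 'balanced' heuristic for class_weight (an explicit dictionary such as {0:0.01, 1:0.99} could be passed instead) and reports the F-measure, although any of the metrics from Lesson 03 could be used:

# fit a cost-sensitive logistic regression model
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
# define an imbalanced dataset and split it
X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], flip_y=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# penalize errors on the minority class more heavily than on the majority class
model = LogisticRegression(solver='liblinear', class_weight='balanced')
model.fit(X_train, y_train)
yhat = model.predict(X_test)
print('F-measure: %.3f' % f1_score(y_test, yhat))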

Your Task

For this lesson, you must run the example and review the performance of the cost-sensitive model.

For bonus points, compare the performance to the cost-insensitive version of logistic regression.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson of the mini-course.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

  • The challenge of imbalanced classification is the lack of examples for the minority class and the difference in importance of classification errors across the classes.
  • How to develop a spatial intuition for imbalanced classification datasets that might inform data preparation and algorithm selection.
  • The failure of classification accuracy and how alternate metrics like precision, recall, and the F-measure can better summarize model performance on imbalanced datasets.
  • How to delete examples from the majority class in the training dataset, referred to as data undersampling.
  • How to synthesize new examples in the minority class in the training dataset, referred to as data oversampling.
  • How to combine data oversampling and undersampling techniques on the training dataset, and common combinations that result in good performance.
  • How to use cost-sensitive modified versions of machine learning algorithms to improve performance on imbalanced classification datasets.

Take the next step and check out my book on Imbalanced Classification with Python.

Summary

How did you do with the mini-course?
Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.


147 Responses to Imbalanced Classification With Python (7-Day Mini-Course)

  1. Ken Jones January 17, 2020 at 8:32 am

A real-world example I am working on right now is predicting no-shows for medical providers. In the data set I am working with, about 5% of appointments are no-shows. The balance are completed appointments. The interesting class is the no-shows.

  2. Mark Littlewood January 17, 2020 at 6:43 pm

    A real world example I am concerned with is predicting the winners of horse races. Typically each race has on average around 11 runners and ten of those will have a ‘0’ meaning they did not win the race whilst one row will be the winner with a ‘1’. Funnily enough, using GBM I have not found balancing to be helpful but maybe I am coming at it from a wrong perspective with my balancing technique

    • Jason Brownlee January 18, 2020 at 8:38 am

      Interesting problem.

      I would recommend looking at rating systems.

  3. Kate Strydom January 17, 2020 at 6:50 pm

    I work on lead data, predicting call centre sales on various telecommunication and lead generation campaigns where the responses are always imbalanced. That is, sales versus no sale, and hot lead versus no lead. I usually just take a random sample of the negative response equivalent to the positive response in my sample prior to pulling the data into Python. Call centre data responses are always imbalanced due to the nature of the business. I would be keen to learn other ways to balance the responses.

  4. Mark Littlewood January 17, 2020 at 7:36 pm

    With flip_y set to zero the 1 and 0s are created in a pretty distinct manner, there is little overlap. They also appear to be pretty linear in relation to the predictors

  5. Mark Littlewood January 17, 2020 at 8:41 pm

The F-beta score is interesting as a combination of precision and recall that you can weight towards precision or recall. With betting, precision is perhaps more important, as bets that lose cost you money, whereas false negatives are annoying but not financially damaging.

  6. Alexander Binkovsky January 17, 2020 at 8:57 pm

    A real-world example of imbalanced classification I’m working on right now is anomaly detection in monitoring data of a huge distributed application.

  7. Ciprian Saramet January 18, 2020 at 2:07 am

    a real-world example of imbalanced data is server monitoring logs and trying to predict service failures

  8. Joy January 18, 2020 at 6:17 pm

    Hi Jason. Thank you for the insightful article!!
    Your code runs fine Jason. But I cannot understand this piece:

    for label, _ in counter.items(): (<— this is for iterating over the dictionary of counter.items()

    row_ix = where(y == label)[0] (<— Does this mean row_ix is equal to x values of only those counter items belonging to class 0?)

    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label)) (<— plotting the points of both classes 0 and 1)

    Can you please explain a bit? I am not that proficient in Python

    • Jason Brownlee January 19, 2020 at 7:14 am

      Thanks.

      We iterate over the labels.

      We can get the row indexes for all the examples in y that have the label.

      We then plot the points with those row indexes.

  9. Vijay Pal January 18, 2020 at 6:56 pm

    Customer churn and customer chargeback are two classical cases.

  10. Moi khalile January 18, 2020 at 8:17 pm

    a real-world example of imbalanced data is medical images classification

  11. Shay January 18, 2020 at 11:07 pm

    Malware detection in the real world is inherently severely imbalanced.

  12. Damir Zunic January 20, 2020 at 6:57 am

    I worked on predicting risk for type 2 diabetes using 3 classes per A1C levels: no-diabetes (77%), pre-diabetes (6%) and diabetes(17%).

  13. Sanchit Bhavsar January 21, 2020 at 2:39 am

    The highly imbalanced data problem I worked on was to predict user ad clicks with a majority of non-clicks (99%) and clicks (1%).

  14. James January 21, 2020 at 1:26 pm

    Lesson 1: Five general examples of class imbalance
    1. Populations of rare and/or endangered species
    2. Incidence of rare diseases
    3. Extreme weather patterns
    4. Excessive spending for non-essential items
    5. Mechanical problems leading to highly probable breakdowns

  15. Nwe January 21, 2020 at 3:59 pm

I think that the misclassification error cost in imbalanced data classification is not the same for each class because the numbers of training samples are not the same. So, the performance metrics may depend on all of the training classes.

Please give me your suggestions on this.

    • Jason Brownlee January 22, 2020 at 6:18 am

      Yes, although it will depend on the specifics of the dataset.

  16. Sachin Prabhu January 23, 2020 at 1:42 am

    Answer for Lesson 01: Challenge of Imbalanced Classification

    1. Cancer cell prediction
    2. Spam/Ham classification
    3. Earthquake prediction
    4. Page Blocks Classification
    5. Glass Identification

  17. Sudhansu R. Lenka January 24, 2020 at 7:52 pm

Should the dataset be balanced on the original data, or do we need to split into train and test sets first and then balance only the training set?

  18. stanislav January 26, 2020 at 6:01 am

Examples only from the social field:
1. number of rich/poor
2. number of wealthy/ill
3. number of births/deaths
4. number of buyers/sellers
5. number of desires (wishes)/achievements

  19. Vinod Kumar February 6, 2020 at 1:30 pm

    Sir what are the techniques to fix the imbalance in data set

    • Jason Brownlee February 6, 2020 at 1:48 pm

      There are many:

      – choose the right metric
      – try data over/undersampling
      – try cost sensitive models
      – try threshold moving
      – try one class models
      – try specialized ensembles
      – …

      Compare everything to default models to demonstrate they add value/skill.

  20. Animesh February 9, 2020 at 4:36 am

    One more real time Example can be Sales return prediction for an online portal…

  21. Cheyu February 18, 2020 at 3:00 pm

    Great course, it is really helpful.

    Here I have two questions about the proposed weights for the majority and minority classes.

    (1) As in the example, suppose that IR is given 99 (label 0 is 0.99, label 1 is 0.01) and you suggest to give weight 0.01 for the majority class and give weight 0.99 for the minority class. the results will be balanced after multiplying with the weights. Are there any references and papers to support?

    (2) If the cost for misclassification error is pre-defined (for both FP & FN) but no rewards are given (for both TP & TN), is it a good way to follow the cost-sensitivity manipulation (i.e., assign weights) still?

  22. Luis M. February 21, 2020 at 4:51 pm

    One example of imbalanced classes is present in biometric recognition.
    An usual dataset has a number of Nu individuals/users and Ns_u samples per individual.
    From the genuine pairs and impostor pairs of a given dataset, we obtain two classes of matching scores: genuine scores and impostor scores.
    This is Nu*(Ns_u-1)*(Ns_u)/2 genuine scores and (Ns_u^2)*Nu*(Nu-1)/2 impostor scores.
    Let us say that you have 100 individuals and 10 sample per individual.
    Then, you would have 4500 genuine scores and 495000 impostor scores.
    This is an important problem for multi-biometric recognition where you need to train a classifier for score fusion.

  23. elli February 21, 2020 at 11:55 pm

    Five examples of imbalanced classes might be: cancer detection in cells, detecting students with learning differences, calls to a call center which are about an unexpected topic, loan defaults, or detecting a rare disease from patient medical records.

  24. Nitish Khairnar February 26, 2020 at 6:23 pm

    Hi Jason,
sharing my examples for the task of Day 1 of Imbalanced Classification.

    1- detection of patients having high level of blood sugar
    2- lottery winning ratio
    3- call drop due to technical glitches
4- getting a right swipe on Tinder (severely imbalanced)
    5- chances of raining in the month of February (here in India it is very very rare)
    6- getting a cashback after scratching a gift card in Google Pay(folks using this app for payment can connect well with this example)

    Thank you

  25. Naresh Sharma March 4, 2020 at 7:25 pm

I’m working on predicting whether a person with certain demographic features is likely to become a customer based on a mail campaign. The dataset shows that only 1.3% of people become customers.

  26. Retsilisitsoe Raymond Moholisa March 7, 2020 at 1:42 am

Hi Jason, sharing my examples for Day 1 on Imbalanced Classification
    1.Detecting patients with rare genetic diseases in medical databases
    2.Predicting customer churn behavior in telemarketing
    3.Predicting multiview face recognition
    4.Predicting skin cancer from medical images
    5.Leopard/Cheetah Classification

  27. Saman Tamkeen March 27, 2020 at 5:51 pm

    Task for Lesson 01:

    I am trying to model a l2r algorithm. The measure of relevance is that, given a list of items, the items that were clicked are more relevant as compared to those that were seen and not clicked.

    In my example dataset, only 6.5% of items are relevant while remaining 93.5% are not

  28. milad May 1, 2020 at 2:01 pm

1 - How to handle imbalanced data for multi-label classification in Keras?
I’m dealing with a multi-label classification problem with an imbalanced dataset. I want to oversample with SMOTE. However, I don’t know how to achieve it since the label is like [0,1,0,0,1,0,1,1,0,0,0,0].

2 - How to set class_weight for a multi-label dataset in Keras?

  29. JG May 7, 2020 at 2:49 am

    Hi Jason,

In addition to these SMOTE oversampling variants (such as SMOTEENN, which also includes ENN undersampling) or just RandomUnderSampler, I recently learned that Keras also provides a tool as an argument to the .fit method, called class_weight, where you can specify, as a dictionary, the weighting to be applied to the loss function for each integer class label. See reference: https://keras.io/models/sequential/

Do you have any tutorial on this Keras technique? Is it a basic undersampling/oversampling technique, such as eliminating some data during training? What are the main differences between this Keras fit argument and the methods explained here (over- and under-sampling)?

    thanks in advance
    regards

  30. JG May 8, 2020 at 3:49 am

    Thank you very much Jason!

    You are not only a great ML/DL computer Engineer but also an excellent Professor that achieve incredible outreach performance and the best one value, at least for me, a great person!

    regards,
    JG

  31. John Sammut May 8, 2020 at 6:04 am

    Hi Jason, thanks for this course.

    Here’s the list that I thought of:

    1. road anomaly vs good road surface
    2. normal seismic activity sensor data vs volcano eruption seismic activity
    3. healthy lung x-ray image vs lung cancer x-ray image
    4. legitimate vs spam email
    5. normal ocean waves sensor data vs tsunami waves

  32. John Sammut May 8, 2020 at 8:22 am

    Lesson 5 – Comparing class distribution after the oversampling

    from collections import Counter
    from numpy import where
    from matplotlib import pyplot as plt
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01],
        flip_y=0, random_state=11)

    # summarize class distribution
    counter = Counter(y)
    print('Before oversampling:', counter)

    fig, ax = plt.subplots(1, 2, figsize=(15, 5), constrained_layout=True)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y == label)[0]
        ax[0].scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))

    # define oversample strategy
    oversample = SMOTE(sampling_strategy=0.5)

    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)

    # summarize class distribution
    counter = Counter(y_over)
    print('After oversampling:', counter)

    # scatter plot of examples by class label
    for label, amount in counter.items():
        row_ix = where(y_over == label)[0]
        ax[1].scatter(X_over[row_ix, 0], X_over[row_ix, 1], label=str(label))

    ax[0].legend()
    ax[0].set_title('Before oversampling', fontsize=14)
    ax[1].legend()
    ax[1].set_title('After oversampling', fontsize=14)
    plt.show()

  33. Skylar May 20, 2020 at 4:36 am

    Very nice course, thank you Jason! It is especially clear for Python newbie

  34. Kristina June 11, 2020 at 7:51 am

    But these oversampling and undersampling methods don’t work, if you have sklearn 0.20+…
    ModuleNotFoundError: No module named ‘sklearn.externals.six’

    • Jason Brownlee June 11, 2020 at 1:29 pm

      They work fine, but you must update to at least sklearn v0.23.1 and imblearn v0.6.2.

  35. Kondor June 13, 2020 at 6:43 am

    Examples of imbalanced problems:
    1. Any anomaly detection (predicting Oscar winner among all movies, picking the stock that will raise 1,000% in next three months, hurricane or tsunami prediction, etc.)
    2. Contraband detection at the border
    3. Automatic detection of defects in mass-produced products
    4. Picking future high-school valedictorians among first grade students

    In fact, imbalanced problems == anomaly detection, so 2-4 can be seen as more examples of 1

  36. Tony June 17, 2020 at 12:38 pm

    An example I am currently working on is land cover classification. Classes are generally imbalanced with 1:100 differences between the least and most common classes (i.e. severe imbalance) being commonplace.

  37. hana June 18, 2020 at 5:34 pm

    Fraud Detection.
    Claim Prediction
    Default Prediction.
    Churn Prediction.
    Spam Detection.
    Anomaly Detection.
    Outlier Detection.
    Intrusion Detection
    Conversion Prediction.

  38. Mo June 20, 2020 at 1:34 am

    Hi Jason,
    Thanks for your courses!
    I have a quick question. I am working on a classification problem where I had the imbalance data set and to resolve the issue, I simply collected more data and the imbalance is solved for the training set. I built a classifier and test it with a very good accuracy, precision, and recall. Now I want to evaluate it with a newly collected data (unseen) , but I have imbalance issue with this evaluation set. This set is used to test the model for deployment. It seems to me that this imbalance issue will be forever. What should I do to make sure my model is ready for deployment? Thanks

  39. Justin July 4, 2020 at 4:03 am

    Lesson 4: Undersampling
    I’m working on a project to predict tennis match upsets. my dataset has roughly 1:4 upsets to not upsets. After undersampling non-upsets, I saw improvement in my Logistic Regression model.

    Before resampling: 734 non-upsets, 279 upsets.
    [[139 1]
    [ 61 2]]
    precision recall f1-score support

    0 0.69 0.99 0.82 140
    1 0.67 0.03 0.06 63

    accuracy 0.69 203
    macro avg 0.68 0.51 0.44 203
    weighted avg 0.69 0.69 0.58 203

    ROC AUC: mean 0.668 (sd 0.051)

    After undersampling: 558 non-upsets, 279 upsets
    [[99 8]
    [39 22]]
    precision recall f1-score support

    0 0.72 0.93 0.81 107
    1 0.73 0.36 0.48 61

    accuracy 0.72 168
    macro avg 0.73 0.64 0.65 168
    weighted avg 0.72 0.72 0.69 168

    ROC AUC: mean 0.728 (0.084)

    That’s a big improvement in recall for the positive case! I also see a higher ROC AUC.

    Thanks Jason!

  40. Justin July 8, 2020 at 4:34 am

    Lesson 5: Oversampling
    Compared to my random undersampling, the oversampling method resulted in a higher ROC AUC and f1 score. (By random experimentation I found that the sweet spot was sampling_strategy=0.62 resulted in the highest for both metrics.)

    Undersampling results: ROC AUC = 0.728 (0.084) f1 = 0.48 for positive case
    Oversampling results: ROC AUC = 0.763 (0.056) f1 = 0.56 for positive case

    An interesting result, when I set sampling_strategy = 1, meaning it would balance exactly the two cases, the ROC AUC and f1 scores dropped drastically. I imagine this is because I’ve added so much randomness to the underrepresented side that it obscures the information in those cases.

    Another thing I need to learn: when evaluating f1 score, I’ve been looking At the score for the case I care about (the undersampled side). My goal is to be able to predict when that will happen with few false positives. Is there a better score for me to evaluate?

    Thanks Jason!

  41. Alexandre K July 8, 2020 at 1:11 pm

An example of imbalanced data is the occurrence of severe events while drilling/casing an oil well. My job is to predict casing problems given earlier drilling issues.

  42. Richik Majumder July 29, 2020 at 4:39 pm

    Hi sir. Been a follower of your tutorials and they have proven useful to me many-a-times. So first of all thanks for that.

    I had a question here, that is, what is your experience of using oversampling / undersampling / combinations like SMOTEENN and cost sensitive learning? What I mean is how do you choose one method over the other? I now have an intuition about how to decide the under vs over vs combined sampling. But these are seemingly one option altogether, that is, sampling. And cost sensitive learning seems a totally different method. So how to decide when and why to go for which method at different scenarios?

    (My mind has made 2 clusters. sampling and cost-sensitive learning!!)

  43. Safa Bouguezzi August 23, 2020 at 7:23 pm

    An example I am currently working on is Fake Job Posting classification
    Class 0 (Not fraudulent) : 17014
Class 1 (Fraudulent) : 866

  44. Saurabh Sawhney August 27, 2020 at 12:18 pm

    First up, many thanks for all the work you put in, Jason.

An example of imbalanced learning that I thought of immediately pertains to the diagnosis of rare diseases. That led me to thinking that, in fact, trying to detect anything that is rare, by definition, leads to imbalanced classification problems. So the list can be extended to finding rare things in a sea of commonality, for example, detecting fake currency notes, predicting a hole-in-one in golf, or classifying a popular OS version as buggy or bug-free.

  45. Kari September 5, 2020 at 8:49 am

    An example of an imbalance class would be whether or not an applicant is accepted by medical programs at colleges like the University of New Mexico, University of Washington, and Florida State University.

  46. Teni September 6, 2020 at 11:51 am

    Hi Jason,
    Thanks for always providing great tutorial. I tried to sign up for the free class but never received the email. Can you please help. Thanks

  47. Soniya September 14, 2020 at 7:27 pm

Hi Jason, I work for a Telco client and most of the problems I see seem to be cases of imbalanced classification:
    1.Telco Churn problem
    2.Port Out classification
    3.Email optout prediction
    4.Device AAL/Upgrade

As per my understanding, an imbalanced problem is one where there is a ratio of 1:100 or more between the majority and minority classes. When we give such skewed data to a model, the model is unable to learn the minority class properly.

However, my question is: what if my data is skewed at a 1:100 ratio, but I have a huge volume of data, and say the minority class has more than 500K samples? Do we still need to balance the data, or are this many minority-class records enough for the model to learn from?

  48. Kingshuk October 21, 2020 at 10:20 am

    Hi Jason,

    Few examples of imbalanced classification

    1. Airport security profiling travelers for terrorist threat. (TSA pre-check etc.)
    2. Unusual credit card transactions – fraud detection
    3. Weed detection from plant attributes.
    4. Datacenter malfunction conditions of servers.
    5. COVID-19 exposure detection.

    Kingshuk

  49. Shalini December 22, 2020 at 3:02 pm

    Hi Jason,
    Thanks so much for the email tutorials….I have a few questions:

I was doing modelling on a bank loans dataset. There is a class imbalance of 1:10 in the target class. I understand from your lessons that in this case the recall score is important because minimizing false positives is important (the bank is interested in attracting, through a targeted campaign, those who are likely to buy loans and I think won’t mind targeting even customers who are not likely to buy, but still a balance is required).

    So my first question is :
1. Should I think about only maximizing the recall score here and not use the F1 score, which is more of a balance between the two?

2. If I however think of balancing between recall and precision, I understand that I should choose the F1 score as well as the AUC for the precision-recall curve to evaluate and compare model performance with others. Is my understanding right?

3. I have a question regarding the F1 score vs. precision-recall AUC. Are these two scores equivalent, so I can use either of them for comparing and evaluating different models, or does the AUC-PR hold more information than the F1 score? If so, what is it, and how do I interpret it?

    Thanks
    Shalini

  50. Mallikharjuna Rao K January 13, 2021 at 10:18 pm

    C-Section prediction
    Email classification,
    Movie ratings,
    Tumor prediction/analysis,
    Diabetes analysis
    Agriculture Crop predictions

  51. Mallikharjuna Rao K January 18, 2021 at 2:32 am

    Hi Jason,

    In my observation
    the classification is imbalanced
990 ‘0’ points – Majority Class
10 ‘1’ points – Minority Class

  52. Chandana Kithalagama January 21, 2021 at 1:15 pm

Here are some problems that lead to class imbalance.
    – Identify defects of a product during the manufacturing process.
    – Detect cracks in mobile phone screens for claiming insurance.
– Detect rare diseases from a chest x-ray or MRI scan image.
    – Detecting failure conditions in a nuclear reactor.
    – Detect suspicious behaviour of a bank customer.

  53. Chandana Kithalagama January 21, 2021 at 4:37 pm

I ran the Lesson 2 code a couple of times and got this observation. This happens when the 1s and 0s are completely overlapping. I wish I could show an image here.

    Accuracy: 0.987
    Precision: 0.000
    Recall: 0.000
    F-measure: 0.000

  54. Lorenzo Pesce January 22, 2021 at 1:04 am

    Real world examples:
1 - Who, after contracting Covid, will develop Acute Respiratory Distress Syndrome (ARDS, ~1/20)
2 - Who, after being intubated for Covid, will likely die and is therefore a candidate for more desperate measures (~1/5)
3 - Which teenagers are more at risk of suicide given social media postings and behavior (percentage unknown, but smaller than 1 in 100)
4 - Which mammograms contain sufficient evidence of invasive cancer (~0.5%)
5 - Which subset of girls (or boys) is likely to make a person happy if they become her/his partner (estimated 10E-7).

    • Jason Brownlee January 22, 2021 at 7:22 am

      Well done!

      • Oladoja Ilobekemen Perpetual January 22, 2021 at 7:07 pm

        Five general examples of imbalance classification problems :

        .
        Spam Detection.
        Anomaly Detection.
        Outlier Detection.
        Intrusion Detection
        Conversion Prediction.

  55. Anis Ben Ben Aicha January 25, 2021 at 6:44 am

    I am a researcher and I am currently working in many fields involving imbalanced data such as:
    1- earlier detection of precancerous lesions
    2- speech spoofing detection for human/machine interfaces
    3- Urban sound analysis

  56. Jorrit Voigt January 25, 2021 at 8:47 pm

    Task 3:
    In case of overlap and weights=0.9, I had the following results:

    Accuracy: 0.898
    Precision: 0.491
    Recall: 0.540
    F-measure: 0.514
    ROC-AUC: 0.739
    PR-AUC: 0.538

    Here the ROC-AUC seems to overestimate the classifier. When is it advisable to give the precision recall AUC to avoid overestimation?

  57. Sourabh Agarwal January 28, 2021 at 10:03 pm

    Anomaly Detection
    Spam/Ham Detection
    Electricity theft
    Churn Rate
    Engine Rating Prediction (0/1)

  58. Gamze January 29, 2021 at 4:05 pm

    Dear Jason,

    I do thank you so much for sharing.

    I am trying to make predictions by using an ML model fitted on the imbalanced dataset. I just want to ask how can I exactly reproduce the training distribution in prediction distribution? This is very important for my study.

    Thank you in advance.
    Kind Regards

    • Jason Brownlee January 30, 2021 at 6:31 am

      We use a random sample or stratified random sample of the data for training and evaluating the model.

      Most models assume the samples are iid:
      https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables

      • Gamze January 30, 2021 at 7:36 am

        So, could not we reproduce exactly the distribution of classes for predictions?

        • Jason Brownlee January 30, 2021 at 8:03 am

          Sorry, I don’t follow – I think we are talking past each other. Perhaps you can elaborate on your question?

          • Gamze January 30, 2021 at 8:32 am

            I am so sorry for the misunderstanding.

            For example, I have a training dataset with two classes (35 % class 1 and 65% class 2).

            I will make a prediction with a fitted model on a prediction dataset. I must produce the same class distribution (35 % class 1 and 65% class 2) for this dataset.

            Training dataset- 100 samples – 35 % class 1 and 65% class 2
            Prediction dataset – 500 samples – 35 % class 1 and 65% class 2

            Is it possible?

          • Jason Brownlee January 30, 2021 at 12:34 pm

            Yes, this is called a stratified sample, you can learn more here:
            https://machinelearningmastery.com/cross-validation-for-imbalanced-classification/

  59. kolliboina samba April 12, 2021 at 8:35 pm

While using SMOTEENN, it was taking a very long time to resample.

    • Jason Brownlee April 13, 2021 at 6:07 am

      Perhaps try using less data until you see if it helps or not?

  60. Faten April 16, 2021 at 10:45 pm

    Thank you for this tutorial!
Can the code in Lesson 7 (cost-sensitive learning) be used in a sequential model, not only logistic regression?

  61. Begovega July 28, 2021 at 7:37 pm

    Class imbalance examples:
    ->Conversion rate for customers who receive an email and purchase an expensive product
    ->Any type of disease or disorder detection
->Customers spending more than 80% of the average expenditure (high-value customer detection)
    ->Customer Churn prediction

  62. Victor August 13, 2021 at 2:42 pm

    Hi Jason, brief and clear explanation of basic methods to tackle imbalance!

    How can we effectively use them in keras? Or there are some special tools?
Of course, we may over/under-sample the whole dataset before feeding it into the neural network, but this approach looks rough…
Should we put the rebalancing into some kind of generator that updates inputs before each epoch, or even at the batch level?

    • Adrian Tam August 14, 2021 at 2:36 am

No special tools. There might be some handy functions from scikit-learn to help you preprocess the data, but what is most important is to know the concept and apply it before you feed data into the network. Which tool you use is not so crucial.

  63. Ismael Miranda August 27, 2021 at 9:29 am

I need to predict the failure of a process in industry. It is a rare situation, but in aggregate it has a great impact.

After finishing the crash course, one doubt remains.

How do I use the under- and over-sampled data to train my model?

Should I create X_train, X_test, y_train, y_test from my new under- and over-sampled data, or should I just under- and over-sample the X_train and y_train from my original data?

    • Adrian Tam August 28, 2021 at 3:57 am

      I think creating from original base is simpler. Did you see any problem with this approach?

  64. Ismael Miranda August 28, 2021 at 5:07 am

    Actually, I’ve been trying to see it by myself.

    https://github.com/IsmaelMiranda11/cienciadedados/blob/main/classes%20desbalanceadas

  65. Neha October 21, 2021 at 8:19 pm

    Class imbalance can be seen in
    1. Sentiment analysis where we see more positive reviews
    2. Email is spam or not – Most emails are spam
    3. Service required for a product after warranty expiry – Most cases will be yes
    4. Symptoms after vaccination – Most cases will be yes
    5. Screen guard bought with mobile – Most cases yes.

    I have a question regarding True Positive (TP) cases. Let me take an example of the Titanic Kaggle problem where the challenge was to predict whether a person will survive or die. If Dead was 0 and survive was 1. In this case, TP is the one who survived and was reported to survive. If in the same scenario, dead is 1 and survived is 0. Then, in this case, TP is one who died and was reported dead.

    • Adrian Tam October 22, 2021 at 4:11 am

      You’re correct. Whether a label is positive or negative is a subjective design choice. But sometimes we prefer to call something positive because we want to focus on that (e.g., disease, we care about the infected and assume not infected is normal and pay less attention)

  66. Divine November 11, 2021 at 8:54 am

    Hello Jason,

    Thank you for the great course!

I have a question regarding using SMOTE to balance data with a rare event. My data have an event rate of about 7%. I used SMOTE to increase the event rate to about 40%. Now I want to use the model trained with SMOTE to predict the risk for a new patient. How do I adjust or correct the risk probability for the new patient to reflect the risk probability of the original data (7%, without SMOTE)?
The risk probability for the new data is 30% – 50% with SMOTE, and the feedback I get is that the risk shouldn’t be far from 7%.

    I hope my question makes sense.

    Thank you!

    • Adrian Tam November 14, 2021 at 1:52 pm

I am afraid you misunderstood something here. SMOTE is to help you generate more data so that you help your model training (since otherwise, blindly guessing the majority class automatically gives you an accuracy of 1 - 7% = 93%); but once your model is trained and you apply it to the original data before SMOTE, you should see approximately how it performs in practice. You don’t need to do anything else.

  67. Lily June 6, 2022 at 7:06 am

    Thanks a lot, Dr. Jason for this great short course, it answered many questions I had on the imbalanced classifications.
I need your advice, as I am working on a multi-class model which is severely imbalanced. The output labels are 1, 2, 3 and they are (class 1: 94%, class 2: 3% and class 3: 3%). As you might suspect, capturing classes 2 and 3 is more important than class 1. I am using multinomial logistic regression; I tried oversampling but the model performance is low, for example the F1-score is 45%. I tried class weights as a dictionary but I didn’t find the right combination, and my problem is that most articles treat imbalanced classification for binary output, not multiple classes. Do you have any advice, or an article/resource I can refer to?

    Many thanks

  68. Lily June 7, 2022 at 7:04 pm

    Thank you, James, the problem with the ensemble is that I can only use the logistic regression as a contributing model in the VotingClassifier because the multinomial logistic regression is the only model I could find which predicts multi-classes (not binomial only). do you have any advice?
    Thanks a lot

  69. Fikret July 27, 2022 at 11:39 pm

    Hi Jason,
    Some examples of imbalanced classification:
    – predicting power outages from power quality analyser data
    – conditional monitoring for detecting equipment failures
    – medical diagnosis image classification
    – fraud detection
    – detecting anomalies

    • James Carmichael July 28, 2022 at 5:40 am

      Absolutely Fikret! Keep up the great work!
