The post Undersampling Algorithms for Imbalanced Classification appeared first on Machine Learning Mastery.

Resampling methods are designed to change the composition of a training dataset for an imbalanced classification task.

Most of the attention in resampling for imbalanced classification goes to oversampling the minority class. Nevertheless, a suite of techniques has been developed for undersampling the majority class that can be used in conjunction with effective oversampling methods.

There are many different types of undersampling techniques, although most can be grouped into those that select examples to keep in the transformed dataset, those that select examples to delete, and hybrids that combine both types of methods.

In this tutorial, you will discover undersampling methods for imbalanced classification.

After completing this tutorial, you will know:

- How to use the Near-Miss and Condensed Nearest Neighbor Rule methods that select examples to keep from the majority class.
- How to use Tomek Links and the Edited Nearest Neighbors Rule methods that select examples to delete from the majority class.
- How to use One-Sided Selection and the Neighborhood Cleaning Rule that combine methods for choosing examples to keep and delete from the majority class.


Let’s get started.

This tutorial is divided into five parts; they are:

- Undersampling for Imbalanced Classification
- Imbalanced-Learn Library
- Methods that Select Examples to Keep
  - Near Miss Undersampling
  - Condensed Nearest Neighbor Rule for Undersampling
- Methods that Select Examples to Delete
  - Tomek Links for Undersampling
  - Edited Nearest Neighbors Rule for Undersampling
- Combinations of Keep and Delete Methods
  - One-Sided Selection for Undersampling
  - Neighborhood Cleaning Rule for Undersampling

Undersampling refers to a group of techniques designed to balance the class distribution for a classification dataset that has a skewed class distribution.

An imbalanced class distribution will have one or more classes with few examples (the minority classes) and one or more classes with many examples (the majority classes). It is best understood in the context of a binary (two-class) classification problem where class 0 is the majority class and class 1 is the minority class.

Undersampling techniques remove examples from the training dataset that belong to the majority class in order to better balance the class distribution, such as reducing the skew from a 1:100 to a 1:10, 1:2, or even a 1:1 class distribution. This is different from oversampling, which involves adding examples to the minority class in an effort to reduce the skew in the class distribution.
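To make the ratio arithmetic concrete, the short sketch below computes how many majority examples would be kept and deleted to reach each target distribution. The helper name `majority_to_keep` and the counts (100 minority and 10,000 majority examples, an exact 1:100 split) are illustrative assumptions.

```python
# Hypothetical helper: majority count that yields a 1:ratio distribution.
def majority_to_keep(n_minority, ratio):
    """Return the number of majority examples for a 1:ratio minority-to-majority split."""
    return n_minority * ratio

# Assumed counts giving an exact 1:100 starting distribution.
n_minority, n_majority = 100, 10000

for ratio in (10, 2, 1):
    keep = majority_to_keep(n_minority, ratio)
    print(f"1:{ratio} keeps {keep} majority examples and deletes {n_majority - keep}")
```

Even the mildest target here (1:10) discards 90 percent of the majority class, which is why the choice of which examples to delete matters.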

… undersampling, that consists of reducing the data by eliminating examples belonging to the majority class with the objective of equalizing the number of examples of each class …

— Page 82, Learning from Imbalanced Data Sets, 2018.

Undersampling methods can be used directly on a training dataset that can then, in turn, be used to fit a machine learning model. Typically, undersampling methods are used in conjunction with an oversampling technique for the minority class, and this combination often results in better performance than using oversampling or undersampling alone on the training dataset.

The simplest undersampling technique involves randomly selecting examples from the majority class and deleting them from the training dataset. This is referred to as random undersampling. Although simple and effective, a limitation of this technique is that examples are removed without any concern for how useful or important they might be in determining the decision boundary between the classes. This means it is possible, or even likely, that useful information will be deleted.
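As a rough illustration of random undersampling, the sketch below uses only the Python standard library. The function `random_undersample`, its parameters, and the toy data are assumptions made for this example; they are not the imbalanced-learn API.

```python
import random
from collections import Counter

def random_undersample(X, y, majority_label, n_keep, seed=1):
    """Keep n_keep randomly chosen majority examples and every other example."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(y) if label == majority_label]
    # sample without replacement; nothing about the examples' usefulness is considered
    keep = set(rng.sample(majority_idx, n_keep))
    selected = [i for i, label in enumerate(y) if label != majority_label or i in keep]
    return [X[i] for i in selected], [y[i] for i in selected]

# toy dataset (assumed): 990 majority and 10 minority examples
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
y = [0] * 990 + [1] * 10

X_res, y_res = random_undersample(X, y, majority_label=0, n_keep=10)
print(Counter(y_res))
```

The selection ignores where each example sits relative to the class boundary, which is exactly the limitation described above.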

The major drawback of random undersampling is that this method can discard potentially useful data that could be important for the induction process. The removal of data is a critical decision to be made, hence many the proposal of undersampling use heuristics in order to overcome the limitations of the non-heuristics decisions.

— Page 83, Learning from Imbalanced Data Sets, 2018.

An extension of this approach is to be more discerning regarding the examples from the majority class that are deleted. This typically involves heuristics or learning models that attempt to identify redundant examples for deletion or useful examples for non-deletion.

There are many undersampling techniques that use these types of heuristics. In the following sections, we will review some of the more common methods and develop an intuition for their operation on a synthetic imbalanced binary classification dataset.

We can define a synthetic binary classification dataset using the make_classification() function from the scikit-learn library. For example, we can create 10,000 examples with two input variables and a 1:100 distribution as follows:

    ...
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)

We can then create a scatter plot of the dataset via the scatter() Matplotlib function to understand the spatial relationship of the examples in each class and their imbalance.

    ...
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Tying this together, the complete example of creating an imbalanced classification dataset and plotting the examples is listed below.

    # Generate and plot a synthetic imbalanced classification dataset
    from collections import Counter
    from sklearn.datasets import make_classification
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first summarizes the class distribution, showing an approximate 1:100 class distribution with 9,900 examples in class 0 and 100 in class 1.

Counter({0: 9900, 1: 100})

Next, a scatter plot is created showing all of the examples in the dataset. We can see a large mass of examples for class 0 (blue) and a small number of examples for class 1 (orange). We can also see that the classes overlap with some examples from class 1 clearly within the part of the feature space that belongs to class 0.

This plot provides the starting point for developing the intuition for the effect that different undersampling techniques have on the majority class.

Next, we can begin to review popular undersampling methods made available via the imbalanced-learn Python library.

There are many different methods to choose from. We will divide them into methods that select what examples from the majority class to keep, methods that select examples to delete, and combinations of both approaches.


In these examples, we will use the implementations provided by the imbalanced-learn Python library, which can be installed via pip as follows:

sudo pip install imbalanced-learn

You can confirm that the installation was successful by printing the version of the installed library:

    # check version number
    import imblearn
    print(imblearn.__version__)

Running the example will print the version number of the installed library; for example:

0.5.0

In this section, we will take a closer look at two methods that choose which examples from the majority class to keep, the near-miss family of methods, and the popular condensed nearest neighbor rule.

Near Miss refers to a collection of undersampling methods that select examples based on the distance of majority class examples to minority class examples.

The approaches were proposed by Jianping Zhang and Inderjeet Mani in their 2003 paper titled “KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction.”

There are three versions of the technique, named NearMiss-1, NearMiss-2, and NearMiss-3.

**NearMiss-1** selects examples from the majority class that have the smallest average distance to the three closest examples from the minority class. **NearMiss-2** selects examples from the majority class that have the smallest average distance to the three furthest examples from the minority class. **NearMiss-3** involves selecting a given number of majority class examples for each example in the minority class that are closest.

Here, distance is determined in feature space using Euclidean distance or similar.

- **NearMiss-1**: Majority class examples with minimum average distance to three closest minority class examples.
- **NearMiss-2**: Majority class examples with minimum average distance to three furthest minority class examples.
- **NearMiss-3**: Majority class examples with minimum distance to each minority class example.
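The NearMiss-1 and NearMiss-2 selection rules can be sketched in plain Python as a ranking of majority examples by their average distance to minority examples. This is a minimal illustration under assumed names (`mean_distance_to_minority`, `near_miss`) and toy points; it does not cover NearMiss-3 and is not how the imbalanced-learn class is implemented.

```python
import math

def mean_distance_to_minority(point, minority, k, farthest=False):
    """Average distance from point to its k closest (or k farthest) minority examples."""
    distances = sorted(math.dist(point, m) for m in minority)
    chosen = distances[-k:] if farthest else distances[:k]
    return sum(chosen) / len(chosen)

def near_miss(majority, minority, n_keep, k=3, version=1):
    """Keep the n_keep majority examples with the smallest average distance.

    version=1 averages over the k closest minority examples,
    version=2 averages over the k farthest minority examples.
    """
    farthest = version == 2
    ranked = sorted(
        majority,
        key=lambda p: mean_distance_to_minority(p, minority, k, farthest),
    )
    return ranked[:n_keep]

# toy points (assumed for illustration)
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(0.5, 0.5), (5.0, 5.0), (1.0, 1.0), (9.0, 9.0)]

kept = near_miss(majority, minority, n_keep=2, k=3, version=1)
print(kept)
```

The two majority points nearest the minority cluster are retained, while the distant points at (5, 5) and (9, 9) are dropped.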

The NearMiss-3 seems desirable, given that it will only keep those majority class examples that are on the decision boundary.

We can implement the Near Miss methods using the NearMiss imbalanced-learn class.

The type of near-miss strategy used is defined by the “*version*” argument, which by default is set to 1 for NearMiss-1, but can be set to 2 or 3 for the other two methods.

    ...
    # define the undersampling method
    undersample = NearMiss(version=1)

By default, the technique will undersample the majority class to have the same number of examples as the minority class, although this can be changed by setting the *sampling_strategy* argument to a float giving the desired ratio of minority to majority examples after resampling (for example, 0.5 for a 1:2 distribution).

First, we can demonstrate NearMiss-1, which selects only those majority class examples that have a minimum average distance to their three closest minority class instances, with the number of neighbors defined by the *n_neighbors* argument.

We would expect clusters of majority class examples around the minority class examples that overlap.

The complete example is listed below.

    # Undersample imbalanced dataset with NearMiss-1
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = NearMiss(version=1, n_neighbors=3)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example undersamples the majority class and creates a scatter plot of the transformed dataset.

We can see that, as expected, only those examples in the majority class that are closest to the minority class examples in the overlapping area were retained.

Next, we can demonstrate the NearMiss-2 strategy, which is an inverse to NearMiss-1. It selects examples that are closest to the most distant examples from the minority class, defined by the *n_neighbors* argument.

This is not an intuitive strategy from the description alone.

The complete example is listed below.

    # Undersample imbalanced dataset with NearMiss-2
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = NearMiss(version=2, n_neighbors=3)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example, we can see that the NearMiss-2 selects examples that appear to be in the center of mass for the overlap between the two classes.

Finally, we can try NearMiss-3, which selects the closest examples from the majority class for each minority class example.

The *n_neighbors_ver3* argument determines the number of examples to select for each minority example, although the desired balancing ratio set via *sampling_strategy* will filter this so that the desired balance is achieved.

The complete example is listed below.

    # Undersample imbalanced dataset with NearMiss-3
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import NearMiss
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = NearMiss(version=3, n_neighbors_ver3=3)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

As expected, we can see that each example in the minority class that was in the region of overlap with the majority class has up to three neighbors from the majority class.

Condensed Nearest Neighbors, or CNN for short, is an undersampling technique that seeks a subset of a collection of samples that results in no loss in model performance, referred to as a minimal consistent set.

… the notion of a consistent subset of a sample set. This is a subset which, when used as a stored reference set for the NN rule, correctly classifies all of the remaining points in the sample set.

— The Condensed Nearest Neighbor Rule (Corresp.), 1968.

It is achieved by enumerating the examples in the dataset and adding them to the “*store*” only if they cannot be classified correctly by the current contents of the store. This approach was proposed to reduce the memory requirements for the k-Nearest Neighbors (KNN) algorithm by Peter Hart in the 1968 correspondence titled “The Condensed Nearest Neighbor Rule.”

When used for imbalanced classification, the store is comprised of all examples in the minority set and only examples from the majority set that cannot be classified correctly are added incrementally to the store.
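The store-building procedure can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the function names, the single random majority seed, the repeated passes until no example is added, and the toy Gaussian data are all choices made for this example rather than details of the imbalanced-learn implementation.

```python
import math
import random
from collections import Counter

def nearest_label(point, store):
    """Label of the stored example closest to point (1-nearest-neighbor rule)."""
    return min(store, key=lambda item: math.dist(point, item[0]))[1]

def condensed_nn(X, y, minority_label=1, seed=1):
    """Keep all minority examples plus the majority examples the store misclassifies."""
    rng = random.Random(seed)
    store = [(x, label) for x, label in zip(X, y) if label == minority_label]
    majority = [(x, label) for x, label in zip(X, y) if label != minority_label]
    rng.shuffle(majority)
    if majority:
        store.append(majority.pop())  # assumed: start from one random majority seed
    changed = True
    while changed:  # repeat passes until a full pass adds nothing to the store
        changed = False
        remaining = []
        for x, label in majority:
            if nearest_label(x, store) != label:
                store.append((x, label))
                changed = True
            else:
                remaining.append((x, label))
        majority = remaining
    return [x for x, _ in store], [label for _, label in store]

# toy dataset (assumed): 200 majority and 20 minority examples
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]
y = [0] * 200 + [1] * 20

X_res, y_res = condensed_nn(X, y)
print(Counter(y_res))
```

When the loop finishes, a 1-nearest-neighbor classifier that uses the store as its reference set labels every original example correctly, which is the consistent-subset property quoted above.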

We can implement the Condensed Nearest Neighbor for undersampling using the CondensedNearestNeighbour class from the imbalanced-learn library.

During the procedure, the KNN algorithm is used to classify points to determine if they are to be added to the store or not. The k value is set via the *n_neighbors* argument and defaults to 1.

    ...
    # define the undersampling method
    undersample = CondensedNearestNeighbour(n_neighbors=1)

It’s a relatively slow procedure, so small datasets and small *k* values are preferred.

The complete example of demonstrating the Condensed Nearest Neighbor rule for undersampling is listed below.

    # Undersample and plot imbalanced dataset with the Condensed Nearest Neighbor Rule
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import CondensedNearestNeighbour
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = CondensedNearestNeighbour(n_neighbors=1)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first reports the skewed distribution of the raw dataset, then the more balanced distribution for the transformed dataset.

We can see that the resulting distribution is about 1:2 minority to majority examples. This highlights that although the *sampling_strategy* argument seeks to balance the class distribution, the algorithm will continue to add misclassified examples to the store (transformed dataset). This is a desirable property.

    Counter({0: 9900, 1: 100})
    Counter({0: 188, 1: 100})

A scatter plot of the resulting dataset is created. We can see that the algorithm focuses on examples along the decision boundary between the two classes, specifically, those majority examples around the minority class examples.

In this section, we will take a closer look at methods that select examples from the majority class to delete, including the popular Tomek Links method and the Edited Nearest Neighbors rule.

A criticism of the Condensed Nearest Neighbor Rule is that examples are selected randomly, especially initially.

This has the effect of allowing redundant examples into the store and in allowing examples that are internal to the mass of the distribution, rather than on the class boundary, into the store.

The condensed nearest-neighbor (CNN) method chooses samples randomly. This results in a) retention of unnecessary samples and b) occasional retention of internal rather than boundary samples.

— Two modifications of CNN, 1976.

Two modifications to the CNN procedure were proposed by Ivan Tomek in his 1976 paper titled “Two modifications of CNN.” One of the modifications (Method2) is a rule that finds pairs of examples, one from each class; they together have the smallest Euclidean distance to each other in feature space.

This means that in a binary classification problem with classes 0 and 1, a pair would have an example from each class and would be closest neighbors across the dataset.

In words, instances a and b define a Tomek Link if: (i) instance a’s nearest neighbor is b, (ii) instance b’s nearest neighbor is a, and (iii) instances a and b belong to different classes.

— Page 46, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
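The three conditions in the definition translate almost directly into code. The sketch below is a brute-force illustration in plain Python; the function names and the six toy points are assumptions made for this example, and class 0 is treated as the majority class.

```python
import math

def nearest_index(i, X):
    """Index of the example closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))

def tomek_links(X, y):
    """Return index pairs (a, b) that are mutual nearest neighbors with different labels."""
    nn = [nearest_index(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]

# toy points (assumed): class 0 is the majority class
X = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0), (3.0, 3.2)]
y = [0, 1, 0, 0, 1, 1]

links = tomek_links(X, y)
print(links)

# undersampling: delete only the majority member of each link
majority_in_links = {i for pair in links for i in pair if y[i] == 0}
X_res = [x for i, x in enumerate(X) if i not in majority_in_links]
y_res = [label for i, label in enumerate(y) if i not in majority_in_links]
print(y_res)
```

Only the first two points form a link: the other mutual nearest-neighbor pairs share a class label, so they fail condition (iii).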

These cross-class pairs are now generally referred to as “*Tomek Links*” and are valuable as they define the class boundary.

Method 2 has another potentially important property: It finds pairs of boundary points which participate in the formation of the (piecewise-linear) boundary. […] Such methods could use these pairs to generate progressively simpler descriptions of acceptably accurate approximations of the original completely specified boundaries.

— Two modifications of CNN, 1976.

The procedure for finding Tomek Links can be used to locate all cross-class nearest neighbors. If the examples in the minority class are held constant, the procedure can be used to find all of those examples in the majority class that are closest to the minority class, then removed. These would be the ambiguous examples.

From this definition, we see that instances that are in Tomek Links are either boundary instances or noisy instances. This is due to the fact that only boundary instances and noisy instances will have nearest neighbors, which are from the opposite class.

— Page 46, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

We can implement the Tomek Links method for undersampling using the TomekLinks imbalanced-learn class.

    ...
    # define the undersampling method
    undersample = TomekLinks()

Because the procedure only removes so-named “*Tomek Links*”, we would not expect the resulting transformed dataset to be balanced, only less ambiguous along the class boundary.

The complete example of demonstrating Tomek Links for undersampling is listed below.

    # Undersample and plot imbalanced dataset with Tomek Links
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import TomekLinks
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = TomekLinks()
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first summarizes the class distribution for the raw dataset, then the transformed dataset.

We can see that only 26 examples from the majority class were removed.

    Counter({0: 9900, 1: 100})
    Counter({0: 9874, 1: 100})

The scatter plot of the transformed dataset does not make the minor editing to the majority class obvious.

This highlights that although finding the ambiguous examples on the class boundary is useful, alone, it is not a great undersampling technique. In practice, the Tomek Links procedure is often combined with other methods, such as the Condensed Nearest Neighbor Rule.

The choice to combine Tomek Links and CNN is natural, as Tomek Links can be said to remove borderline and noisy instances, while CNN removes redundant instances.

— Page 46, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

Another rule for finding ambiguous and noisy examples in a dataset is called Edited Nearest Neighbors, or sometimes ENN for short.

This rule involves using *k=3* nearest neighbors to locate those examples in a dataset that are misclassified and that are then removed before a k=1 classification rule is applied. This approach of resampling and classification was proposed by Dennis Wilson in his 1972 paper titled “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data.”

The modified three-nearest neighbor rule which uses the three-nearest neighbor rule to edit the preclassified samples and then uses a single-nearest neighbor rule to make decisions is a particularly attractive rule.

— Asymptotic Properties of Nearest Neighbor Rules Using Edited Data, 1972.

When used as an undersampling procedure, the rule can be applied to each example in the majority class, allowing those examples that are misclassified as belonging to the minority class to be removed, and those correctly classified to remain.

It is also applied to each example in the minority class where those examples that are misclassified have their nearest neighbors from the majority class deleted.

… for each instance a in the dataset, its three nearest neighbors are computed. If a is a majority class instance and is misclassified by its three nearest neighbors, then a is removed from the dataset. Alternatively, if a is a minority class instance and is misclassified by its three nearest neighbors, then the majority class instances among a’s neighbors are removed.

— Page 46, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
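The rule as quoted can be sketched in a few lines of plain Python. This follows the description above rather than the internals of any library; the function names and toy points are assumptions made for illustration, with class 0 as the majority class.

```python
import math
from collections import Counter

def knn_indices(i, X, k):
    """Indices of the k examples closest to X[i], excluding X[i] itself."""
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return others[:k]

def edited_nn(X, y, majority_label=0, k=3):
    """Apply the editing rule: remove ambiguous majority examples."""
    remove = set()
    for i in range(len(X)):
        neighbors = knn_indices(i, X, k)
        predicted = Counter(y[j] for j in neighbors).most_common(1)[0][0]
        if predicted == y[i]:
            continue  # correctly classified by its neighbors, nothing to edit
        if y[i] == majority_label:
            remove.add(i)  # misclassified majority example is deleted
        else:
            # misclassified minority example: delete its majority neighbors
            remove.update(j for j in neighbors if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in remove]
    return [X[i] for i in keep], [y[i] for i in keep]

# toy points (assumed): one majority example sits inside the minority cluster
X = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.1, 0.1),
     (5.0, 5.0), (5.2, 5.0), (5.0, 5.2), (5.2, 5.2)]
y = [1, 1, 1, 0, 0, 0, 0, 0]

X_res, y_res = edited_nn(X, y)
print(Counter(y_res))
```

The majority point at (0.1, 0.1) is surrounded by minority examples, so its three nearest neighbors vote for class 1 and it is removed; the well-separated majority cluster is untouched.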

The Edited Nearest Neighbors rule can be implemented using the EditedNearestNeighbours imbalanced-learn class.

The *n_neighbors* argument controls the number of neighbors to use in the editing rule, which defaults to three, as in the paper.

    ...
    # define the undersampling method
    undersample = EditedNearestNeighbours(n_neighbors=3)

Like Tomek Links, the procedure only removes noisy and ambiguous points along the class boundary. As such, we would not expect the resulting transformed dataset to be balanced.

The complete example of demonstrating the ENN rule for undersampling is listed below.

    # Undersample and plot imbalanced dataset with the Edited Nearest Neighbor rule
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import EditedNearestNeighbours
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = EditedNearestNeighbours(n_neighbors=3)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first summarizes the class distribution for the raw dataset, then the transformed dataset.

We can see that only 94 examples from the majority class were removed.

    Counter({0: 9900, 1: 100})
    Counter({0: 9806, 1: 100})

Given the small amount of undersampling performed, the change to the mass of majority examples is not obvious from the plot.

Also, like Tomek Links, the Edited Nearest Neighbor Rule gives best results when combined with another undersampling method.

Ivan Tomek, developer of Tomek Links, explored extensions of the Edited Nearest Neighbor Rule in his 1976 paper titled “An Experiment with the Edited Nearest-Neighbor Rule.”

Among his experiments was a repeated ENN method that kept editing the dataset with the ENN rule until no further examples were removed, referred to as “*unlimited editing*.”

… unlimited repetition of Wilson’s editing (in fact, editing is always stopped after a finite number of steps because after a certain number of repetitions the design set becomes immune to further elimination)

— An Experiment with the Edited Nearest-Neighbor Rule, 1976.

He also describes a method referred to as “*all k-NN*” that removes all examples from the dataset that were classified incorrectly.

Both of these additional editing procedures are also available via the imbalanced-learn library via the RepeatedEditedNearestNeighbours and AllKNN classes.
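The idea of repeating the editing step until the dataset stops changing can be sketched as below. This is a simplified, assumption-laden illustration: it edits only misclassified majority examples (class 0), the function names `enn_pass` and `repeated_enn` are hypothetical, and it is not a description of how the imbalanced-learn classes work internally.

```python
import math
from collections import Counter

def enn_pass(X, y, majority_label=0, k=3):
    """One editing pass: drop majority examples misclassified by their k neighbors."""
    keep = []
    for i in range(len(X)):
        order = sorted((j for j in range(len(X)) if j != i),
                       key=lambda j: math.dist(X[i], X[j]))
        vote = Counter(y[j] for j in order[:k]).most_common(1)[0][0]
        if y[i] != majority_label or vote == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def repeated_enn(X, y, max_iter=100, **kwargs):
    """Repeat editing until a pass removes nothing; return data and passes run."""
    for passes in range(1, max_iter + 1):
        X_new, y_new = enn_pass(X, y, **kwargs)
        if len(y_new) == len(y):
            return X_new, y_new, passes
        X, y = X_new, y_new
    return X, y, max_iter

# toy points on a line (assumed): three majority examples trail off a minority cluster
X = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (3.4, 0.0), (4.3, 0.0), (5.0, 0.0),
     (20.0, 0.0), (21.0, 0.0), (22.0, 0.0), (23.0, 0.0)]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

X_once, y_once = enn_pass(X, y)
X_rep, y_rep, passes = repeated_enn(X, y)
print(len(y_once), len(y_rep), passes)
```

On this toy data a single pass removes only the majority point nearest the minority cluster. Its removal exposes the next two majority points, which are deleted in the second pass, and a third pass confirms that nothing else changes.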

In this section, we will take a closer look at techniques that combine the techniques we have already looked at to both keep and delete examples from the majority class, such as One-Sided Selection and the Neighborhood Cleaning Rule.

One-Sided Selection, or OSS for short, is an undersampling technique that combines Tomek Links and the Condensed Nearest Neighbor (CNN) Rule.

Specifically, Tomek Links are ambiguous points on the class boundary and are identified and removed in the majority class. The CNN method is then used to remove redundant examples from the majority class that are far from the decision boundary.

OSS is an undersampling method resulting from the application of Tomek links followed by the application of US-CNN. Tomek links are used as an undersampling method and removes noisy and borderline majority class examples. […] US-CNN aims to remove examples from the majority class that are distant from the decision border.

— Page 84, Learning from Imbalanced Data Sets, 2018.

This combination of methods was proposed by Miroslav Kubat and Stan Matwin in their 1997 paper titled “Addressing The Curse Of Imbalanced Training Sets: One-sided Selection.”

The CNN procedure occurs in one step and involves first adding all minority class examples to the store and some number of majority class examples (e.g. 1), then classifying all remaining majority class examples with KNN (*k=1*) and adding those that are misclassified to the store.
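The one-step CNN stage described above can be sketched in plain Python. Only this stage is shown; the Tomek Links stage is omitted. The function names, the `n_seeds` parameter, and the toy data are assumptions made for illustration and do not mirror the imbalanced-learn implementation.

```python
import math
import random
from collections import Counter

def nearest_label(point, store):
    """Label of the stored example closest to point (KNN with k=1)."""
    return min(store, key=lambda item: math.dist(point, item[0]))[1]

def one_step_cnn(X, y, minority_label=1, n_seeds=1, seed=1):
    """Build the store once, then add every majority example it misclassifies."""
    rng = random.Random(seed)
    data = list(zip(X, y))
    minority = [d for d in data if d[1] == minority_label]
    majority = [d for d in data if d[1] != minority_label]
    rng.shuffle(majority)
    # the store is fixed up front: all minority examples plus a few majority seeds
    store = minority + majority[:n_seeds]
    candidates = majority[n_seeds:]
    # single pass: candidates are classified against the fixed store only
    added = [d for d in candidates if nearest_label(d[0], store) != d[1]]
    kept = store + added
    return [x for x, _ in kept], [label for _, label in kept]

# toy dataset (assumed): 200 majority and 20 minority examples
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(20)]
y = [0] * 200 + [1] * 20

X_res, y_res = one_step_cnn(X, y, n_seeds=5)
print(Counter(y_res))
```

Because the store does not grow while candidates are being classified, the number of seeds has a direct effect on how many majority examples are judged redundant.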

We can implement the OSS undersampling strategy via the OneSidedSelection imbalanced-learn class.

The number of seed examples can be set with *n_seeds_S* and defaults to 1 and the *k* for KNN can be set via the *n_neighbors* argument and defaults to 1.

Given that the CNN procedure occurs in one block, it is more useful to have a larger seed sample of the majority class in order to effectively remove redundant examples. In this case, we will use a value of 200.

    ...
    # define the undersampling method
    undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)

We might expect a large number of redundant examples from the majority class to be removed from the interior of the distribution (e.g. away from the class boundary).

The complete example of applying OSS on the binary classification problem is listed below.

    # Undersample and plot imbalanced dataset with One-Sided Selection
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import OneSidedSelection
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # define the undersampling method
    undersample = OneSidedSelection(n_neighbors=1, n_seeds_S=200)
    # transform the dataset
    X, y = undersample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first reports the class distribution in the raw dataset, then the transformed dataset.

We can see that a large number of examples from the majority class were removed, consisting of both redundant examples (removed via CNN) and ambiguous examples (removed via Tomek Links). The ratio for this dataset is now around 1:10, down from 1:100.

    Counter({0: 9900, 1: 100})
    Counter({0: 940, 1: 100})

A scatter plot of the transformed dataset is created showing that most of the remaining majority class examples lie around the class boundary and the overlapping examples from the minority class.

It might be interesting to explore larger seed samples from the majority class and different values of *k* used in the one-step CNN procedure.

The Neighborhood Cleaning Rule, or NCR for short, is an undersampling technique that combines both the Condensed Nearest Neighbor (CNN) Rule to remove redundant examples and the Edited Nearest Neighbors (ENN) Rule to remove noisy or ambiguous examples.

Like One-Sided Selection (OSS), the CNN method is applied in a one-step manner, then the examples that are misclassified according to a KNN classifier are removed, as per the ENN rule. Unlike OSS, fewer of the redundant examples are removed and more attention is placed on “*cleaning*” those examples that are retained.

The reason for this is to focus less on improving the balance of the class distribution and more on the quality (unambiguity) of the examples that are retained in the majority class.

… the quality of classification results does not necessarily depend on the size of the class. Therefore, we should consider, besides the class distribution, other characteristics of data, such as noise, that may hamper classification.

— Improving Identification of Difficult Small Classes by Balancing Class Distribution, 2001.

This approach was proposed by Jorma Laurikkala in the 2001 paper titled “Improving Identification of Difficult Small Classes by Balancing Class Distribution.”

The approach involves first selecting all examples from the minority class. Then all of the ambiguous examples in the majority class are identified using the ENN rule and removed. Finally, a one-step version of CNN is used where those remaining examples in the majority class that are misclassified against the store are removed, but only if the number of examples in the majority class is larger than half the size of the minority class.
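To make the cleaning stage concrete, the snippet below is a minimal from-scratch sketch of the ENN-style step only: a majority class example is dropped when the vote of its *k* nearest neighbors disagrees with its label. It assumes Euclidean distance and a hypothetical toy dataset, it does not include the one-step CNN stage, and it is not the imbalanced-learn implementation.

```python
# sketch of the ENN-style cleaning step applied to the majority class only
from collections import Counter

def k_nearest_labels(X, y, i, k):
    """Labels of the k nearest neighbors of X[i] (squared Euclidean), excluding X[i] itself."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(X[i], X[j])), j)
        for j in range(len(X)) if j != i
    )
    return [y[j] for _, j in dists[:k]]

def enn_clean_majority(X, y, majority=0, k=3):
    """Remove majority examples that are misclassified by a vote of their k neighbors."""
    keep = []
    for i in range(len(X)):
        if y[i] == majority:
            vote = Counter(k_nearest_labels(X, y, i, k)).most_common(1)[0][0]
            if vote != majority:
                # ambiguous majority example: do not keep it
                continue
        # all minority examples and unambiguous majority examples are kept
        keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# hypothetical toy data: the last majority point sits inside the minority cluster
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (5.4, 5.4)]
y = [0, 0, 0, 0, 1, 1, 1, 0]
X_clean, y_clean = enn_clean_majority(X, y)
print(Counter(y), Counter(y_clean))
```

Only the majority example surrounded by minority examples is removed, which mirrors the modest reduction we would expect from a method that focuses on cleaning.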

This technique can be implemented using the NeighbourhoodCleaningRule imbalanced-learn class. The number of neighbors used in the ENN and CNN steps can be specified via the *n_neighbors* argument that defaults to three. The *threshold_cleaning* argument controls whether or not the CNN step is applied to a given class, which might be useful if there are multiple minority classes with similar sizes. In this example, it is kept at 0.5.

The complete example of applying NCR on the binary classification problem is listed below.

Given the focus on data cleaning over removing redundant examples, we would expect only a modest reduction in the number of examples in the majority class.

```python
# Undersample and plot imbalanced dataset with the neighborhood cleaning rule
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# define the undersampling method
undersample = NeighbourhoodCleaningRule(n_neighbors=3, threshold_cleaning=0.5)
# transform the dataset
X, y = undersample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first reports the class distribution in the raw dataset, then the transformed dataset.

We can see that only 114 examples from the majority class were removed.

```
Counter({0: 9900, 1: 100})
Counter({0: 9786, 1: 100})
```

Given the limited and focused amount of undersampling performed, the change to the mass of majority examples is not obvious from the scatter plot that is created.

This section provides more resources on the topic if you are looking to go deeper.

- kNN Approach To Unbalanced Data Distributions: A Case Study Involving Information Extraction, 2003.
- The Condensed Nearest Neighbor Rule (Corresp.), 1968.
- Two modifications of CNN, 1976.
- Addressing The Curse Of Imbalanced Training Sets: One-sided Selection, 1997.
- Asymptotic Properties of Nearest Neighbor Rules Using Edited Data, 1972.
- An Experiment with the Edited Nearest-Neighbor Rule, 1976.
- Improving Identification of Difficult Small Classes by Balancing Class Distribution, 2001.

- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

- Under-sampling, Imbalanced-Learn User Guide.
- imblearn.under_sampling.NearMiss API.
- imblearn.under_sampling.CondensedNearestNeighbour API.
- imblearn.under_sampling.TomekLinks API.
- imblearn.under_sampling.OneSidedSelection API.
- imblearn.under_sampling.EditedNearestNeighbours API.
- imblearn.under_sampling.NeighbourhoodCleaningRule API.

In this tutorial, you discovered undersampling methods for imbalanced classification.

Specifically, you learned:

- How to use the Near-Miss and Condensed Nearest Neighbor Rule methods that select examples to keep from the majority class.
- How to use Tomek Links and the Edited Nearest Neighbors Rule methods that select examples to delete from the majority class.
- How to use One-Sided Selection and the Neighborhood Cleaning Rule that combine methods for choosing examples to keep and delete from the majority class.



The post SMOTE Oversampling for Imbalanced Classification with Python appeared first on Machine Learning Mastery.

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore, and in turn have poor performance on, the minority class, although typically it is performance on the minority class that is most important.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the **Synthetic Minority Oversampling Technique**, or SMOTE for short.

In this tutorial, you will discover the SMOTE for oversampling imbalanced classification datasets.

After completing this tutorial, you will know:

- How the SMOTE synthesizes new examples for the minority class.
- How to correctly fit and evaluate machine learning models on SMOTE-transformed training datasets.
- How to use extensions of the SMOTE that generate synthetic examples along the class decision boundary.


Let’s get started.

This tutorial is divided into five parts; they are:

- Synthetic Minority Oversampling Technique
- Imbalanced-Learn Library
- SMOTE for Balancing Data
- SMOTE for Classification
- SMOTE With Selective Synthetic Sample Generation
    - Borderline-SMOTE
    - Borderline-SMOTE SVM
    - Adaptive Synthetic Sampling (ADASYN)

A problem with imbalanced classification is that there are too few examples of the minority class for a model to effectively learn the decision boundary.

One way to solve this problem is to oversample the examples in the minority class. This can be achieved by simply duplicating examples from the minority class in the training dataset prior to fitting a model. This can balance the class distribution but does not provide any additional information to the model.

An improvement on duplicating examples from the minority class is to synthesize new examples from the minority class. This is a type of data augmentation for tabular data and can be very effective.

Perhaps the most widely used approach to synthesizing new examples is called the **Synthetic Minority Oversampling TEchnique**, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique titled “SMOTE: Synthetic Minority Over-sampling Technique.”

SMOTE works by selecting examples that are close in the feature space, drawing a line between the examples in the feature space and drawing a new sample at a point along that line.

Specifically, a random example from the minority class is first chosen. Then *k* of the nearest neighbors for that example are found (typically *k=5*). A randomly selected neighbor is chosen and a synthetic example is created at a randomly selected point between the two examples in feature space.

… SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

— Page 47, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
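The procedure described above can be sketched from scratch in a few lines. This is a minimal illustration under simple assumptions (Euclidean distance, a hypothetical toy minority class, and a helper named `smote_sample` invented for this sketch), not the imbalanced-learn implementation.

```python
# minimal from-scratch sketch of one SMOTE step
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smote_sample(minority, k, rng):
    """Create one synthetic point from a list of minority class points."""
    # choose a random minority example a
    a = rng.choice(minority)
    # find its k nearest minority neighbors, excluding a itself
    others = [p for p in minority if p is not a]
    neighbors = sorted(others, key=lambda p: sq_dist(a, p))[:k]
    # choose one of those neighbors b at random
    b = rng.choice(neighbors)
    # pick a random point on the line segment between a and b
    gap = rng.random()
    return tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b))

# hypothetical toy minority class in two dimensions
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (1.5, 1.5), (3.0, 3.0)]
rng = random.Random(1)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Because each synthetic point is a convex combination of two existing minority points, it always falls within the region already spanned by the minority class.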

This procedure can be used to create as many synthetic examples for the minority class as are required. The paper suggests first using random undersampling to trim the number of examples in the majority class, then using SMOTE to oversample the minority class to balance the class distribution.

The combination of SMOTE and under-sampling performs better than plain under-sampling.

— SMOTE: Synthetic Minority Over-sampling Technique, 2002.

The approach is effective because new synthetic examples from the minority class are created that are plausible, that is, are relatively close in feature space to existing examples from the minority class.

Our method of synthetic over-sampling works to cause the classifier to build larger decision regions that contain nearby minority class points.

— SMOTE: Synthetic Minority Over-sampling Technique, 2002.

A general downside of the approach is that synthetic examples are created without considering the majority class, possibly resulting in ambiguous examples if there is a strong overlap for the classes.

Now that we are familiar with the technique, let’s look at a worked example for an imbalanced classification problem.

In these examples, we will use the implementations provided by the imbalanced-learn Python library, which can be installed via pip as follows:

sudo pip install imbalanced-learn

You can confirm that the installation was successful by printing the version of the installed library:

```python
# check version number
import imblearn
print(imblearn.__version__)
```

Running the example will print the version number of the installed library; for example:

0.5.0


In this section, we will develop an intuition for the SMOTE by applying it to an imbalanced binary classification problem.

First, we can use the make_classification() scikit-learn function to create a synthetic binary classification dataset with 10,000 examples and a 1:100 class distribution.

```python
...
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
```

We can use the Counter object to summarize the number of examples in each class to confirm the dataset was created correctly.

```python
...
# summarize class distribution
counter = Counter(y)
print(counter)
```

Finally, we can create a scatter plot of the dataset and color the examples for each class a different color to clearly see the spatial nature of the class imbalance.

```python
...
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Tying this all together, the complete example of generating and plotting a synthetic binary classification problem is listed below.

```python
# Generate and plot a synthetic imbalanced classification dataset
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first summarizes the class distribution, confirming the 1:100 ratio, in this case with 9,900 examples in the majority class and 100 in the minority class.

Counter({0: 9900, 1: 100})

A scatter plot of the dataset is created showing the large mass of points that belong to the majority class (blue) and a small number of points spread out for the minority class (orange). We can see some measure of overlap between the two classes.

Next, we can oversample the minority class using SMOTE and plot the transformed dataset.

We can use the SMOTE implementation provided by the imbalanced-learn Python library in the SMOTE class.

The SMOTE class acts like a data transform object from scikit-learn in that it must be defined and configured, fit on a dataset, then applied to create a new transformed version of the dataset.

For example, we can define a SMOTE instance with default parameters that will balance the minority class and then fit and apply it in one step to create a transformed version of our dataset.

```python
...
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
```

Once transformed, we can summarize the class distribution of the new transformed dataset, which we would expect to now be balanced through the creation of many new synthetic examples in the minority class.

```python
...
# summarize the new class distribution
counter = Counter(y)
print(counter)
```

A scatter plot of the transformed dataset can also be created and we would expect to see many more examples for the minority class on lines between the original examples in the minority class.

Tying this together, the complete example of applying SMOTE to the synthetic dataset and then summarizing and plotting the transformed result is listed below.

```python
# Oversample and plot imbalanced dataset with SMOTE
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first creates the dataset and summarizes the class distribution, showing the 1:100 ratio.

Then the dataset is transformed using the SMOTE and the new class distribution is summarized, showing a balanced distribution now with 9,900 examples in the minority class.

```
Counter({0: 9900, 1: 100})
Counter({0: 9900, 1: 9900})
```

Finally, a scatter plot of the transformed dataset is created.

It shows many more examples in the minority class created along the lines between the original examples in the minority class.

The original paper on SMOTE suggested combining SMOTE with random undersampling of the majority class.

The imbalanced-learn library supports random undersampling via the RandomUnderSampler class.

We can update the example to first oversample the minority class to have 10 percent the number of examples of the majority class (e.g. about 1,000), then use random undersampling to reduce the majority class until the minority class has 50 percent the number of examples of the majority class, leaving the majority class with about twice as many examples as the minority class (e.g. about 2,000).

To implement this, we can specify the desired ratios as arguments to the SMOTE and *RandomUnderSampler* classes; for example:

```python
...
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
```

We can then chain these two transforms together into a Pipeline.

The Pipeline can then be applied to a dataset, performing each transformation in turn and returning a final dataset with the accumulation of the transform applied to it, in this case oversampling followed by undersampling.

```python
...
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
```

The pipeline can then be fit and applied to our dataset just like a single transform:

```python
...
# transform the dataset
X, y = pipeline.fit_resample(X, y)
```

We can then summarize and plot the resulting dataset.

We would expect some SMOTE oversampling of the minority class, although not as much as before where the dataset was balanced. We also expect fewer examples in the majority class via random undersampling.

Tying this all together, the complete example is listed below.

```python
# Oversample with SMOTE and random undersample for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# define pipeline
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('o', over), ('u', under)]
pipeline = Pipeline(steps=steps)
# transform the dataset
X, y = pipeline.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first creates the dataset and summarizes the class distribution.

Next, the dataset is transformed, first by oversampling the minority class, then undersampling the majority class. The final class distribution after this sequence of transforms matches our expectations with a 1:2 ratio or about 2,000 examples in the majority class and about 1,000 examples in the minority class.

```
Counter({0: 9900, 1: 100})
Counter({0: 1980, 1: 990})
```
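The final counts follow from the two ratios. The sketch below works through the arithmetic, assuming that a float *sampling_strategy* is read as the desired ratio of minority to majority examples after resampling; the helper name `expected_counts` is hypothetical.

```python
# work through the expected class counts for the two-step pipeline
def expected_counts(n_majority, over_ratio, under_ratio):
    """Return (majority, minority) counts after oversampling then undersampling."""
    # oversampling grows the minority class to over_ratio * majority
    n_minority_after = int(over_ratio * n_majority)
    # undersampling shrinks the majority class until minority / majority = under_ratio
    n_majority_after = int(n_minority_after / under_ratio)
    return n_majority_after, n_minority_after

# 9,900 majority examples, sampling strategies of 0.1 and 0.5
print(expected_counts(9900, 0.1, 0.5))
```

This gives 990 minority examples after SMOTE and 1,980 majority examples after undersampling, in line with the reported output.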

Finally, a scatter plot of the transformed dataset is created, showing the oversampled minority class and the undersampled majority class.

Now that we are familiar with transforming imbalanced datasets, let’s look at using SMOTE when fitting and evaluating classification models.

In this section, we will look at how we can use SMOTE as a data preparation method when fitting and evaluating machine learning algorithms in scikit-learn.

First, we use our binary classification dataset from the previous section then fit and evaluate a decision tree algorithm.

The algorithm is defined with any required hyperparameters (we will use the defaults), then we will use repeated stratified k-fold cross-validation to evaluate the model. We will use three repeats of 10-fold cross-validation, meaning that 10-fold cross-validation is applied three times fitting and evaluating 30 models on the dataset.

The dataset is stratified, meaning that each fold of the cross-validation split will have the same class distribution as the original dataset, in this case, a 1:100 ratio. We will evaluate the model using the ROC area under curve (AUC) metric. This can be optimistic for severely imbalanced datasets but will still show a relative change with better performing models.
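To make the evaluation procedure concrete, the snippet below sketches repeated stratified splitting from scratch using only the standard library. The helper `stratified_folds` is hypothetical and deals each class out round-robin; scikit-learn's own splitting logic may differ in detail.

```python
# sketch of three repeats of stratified 10-fold splitting on a 1:100 dataset
import random
from collections import Counter

def stratified_folds(y, n_splits, rng):
    """Assign every index to one of n_splits folds, dealing each class out round-robin."""
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    return folds

# labels with the same class counts as the synthetic dataset
y = [0] * 9900 + [1] * 100
rng = random.Random(1)
splits = []
for repeat in range(3):
    for test_idx in stratified_folds(y, 10, rng):
        # record the class distribution of each held-out fold
        splits.append(Counter(y[i] for i in test_idx))

print(len(splits))
print(splits[0])
```

There are 30 held-out folds in total, and each one contains 990 majority and 10 minority examples, preserving the class ratio of the full dataset.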

```python
...
# define model
model = DecisionTreeClassifier()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
```

Once fit, we can calculate and report the mean of the scores across the folds and repeats.

```python
...
print('Mean ROC AUC: %.3f' % mean(scores))
```
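For intuition about the individual scores being averaged, ROC AUC can be read as the probability that a randomly chosen minority (positive) example receives a higher score than a randomly chosen majority (negative) example, with ties counting as half. The snippet below is a small from-scratch sketch of that definition on hypothetical scores; in practice the scikit-learn scorer used above does this for us.

```python
# sketch of ROC AUC as a pairwise ranking probability
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly, ties counted as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical labels and predicted scores for six examples
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(y_true, scores)
print('ROC AUC: %.3f' % auc)
```

Here seven of the eight positive-negative pairs are ranked correctly, giving a score of 0.875.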

We would not expect a decision tree fit on the raw imbalanced dataset to perform very well.

Tying this together, the complete example is listed below.

```python
# decision tree evaluated on imbalanced dataset
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define model
model = DecisionTreeClassifier()
# evaluate model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the model and reports the mean ROC AUC.

Your results will vary given the stochastic nature of the learning algorithm and the evaluation procedure. Try running the example a few times.

In this case, we can see that a ROC AUC of about 0.76 is reported.

Mean ROC AUC: 0.761

Now, we can try the same model and the same evaluation method, although use a SMOTE transformed version of the dataset.

The correct application of oversampling during k-fold cross-validation is to apply the method to the training dataset only, then evaluate the model on the stratified but non-transformed test set.
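The idea can be sketched for a single fold without any libraries. In the snippet below, random duplication stands in for SMOTE purely to keep the example short, and the helper `random_oversample` and the simple every-tenth-example split are assumptions made for illustration.

```python
# sketch: oversample the training fold only and leave the test fold untouched
import random
from collections import Counter

def random_oversample(indices, y, rng):
    """Duplicate minority indices at random until all classes match the largest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(y[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choice(idx) for _ in range(target - len(idx)))
    return balanced

# labels with a 1:99 class distribution
y = [0] * 990 + [1] * 10
# hold out every tenth example as one stratified test fold
test_idx = [i for i in range(len(y)) if i % 10 == 0]
train_idx = [i for i in range(len(y)) if i % 10 != 0]
# oversample the training indices only
rng = random.Random(1)
train_bal = random_oversample(train_idx, y, rng)
print('train:', Counter(y[i] for i in train_bal))
print('test:', Counter(y[i] for i in test_idx))
```

The training fold becomes balanced while the test fold keeps the original class distribution, so the score reflects performance on data that looks like the real problem.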

This can be achieved by defining a Pipeline that first transforms the training dataset with SMOTE then fits the model.

```python
...
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
```

This pipeline can then be evaluated using repeated k-fold cross-validation.

Tying this together, the complete example of evaluating a decision tree with SMOTE oversampling on the training dataset is listed below.

```python
# decision tree evaluated on imbalanced dataset with SMOTE oversampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
steps = [('over', SMOTE()), ('model', DecisionTreeClassifier())]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the model and reports the mean ROC AUC score across the multiple folds and repeats.

Your results will vary given the stochastic nature of the learning algorithm and the evaluation procedure. Try running the example a few times.

In this case, we can see a modest improvement in performance from a ROC AUC of about 0.76 to about 0.81.

Mean ROC AUC: 0.809

As mentioned in the paper, it is believed that SMOTE performs better when combined with undersampling of the majority class, such as random undersampling.

We can achieve this by simply adding a *RandomUnderSampler* step to the Pipeline.

As in the previous section, we will first oversample the minority class with SMOTE to about a 1:10 ratio, then undersample the majority class to achieve about a 1:2 ratio.

```python
...
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
```

Tying this together, the complete example is listed below.

```python
# decision tree on imbalanced dataset with SMOTE oversampling and random undersampling
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# define pipeline
model = DecisionTreeClassifier()
over = SMOTE(sampling_strategy=0.1)
under = RandomUnderSampler(sampling_strategy=0.5)
steps = [('over', over), ('under', under), ('model', model)]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the model with the pipeline of SMOTE oversampling and random undersampling on the training dataset.

Your results will vary given the stochastic nature of the learning algorithm and the evaluation procedure. Try running the example a few times.

In this case, we can see that the reported ROC AUC shows an additional lift to about 0.83.

Mean ROC AUC: 0.834

You could explore testing different ratios of the minority class and majority class (e.g. changing the *sampling_strategy* argument) to see if a further lift in performance is possible.

Another area to explore would be to test different values of the k-nearest neighbors selected in the SMOTE procedure when each new synthetic example is created. The default is *k=5*, although larger or smaller values will influence the types of examples created, and in turn, may impact the performance of the model.

For example, we could grid search a range of values of *k*, such as values from 1 to 7, and evaluate the pipeline for each value.

```python
...
# values to evaluate
k_values = [1, 2, 3, 4, 5, 6, 7]
for k in k_values:
    # define pipeline
    ...
```

The complete example is listed below.

```python
# grid search k value for SMOTE oversampling for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# values to evaluate
k_values = [1, 2, 3, 4, 5, 6, 7]
for k in k_values:
    # define pipeline
    model = DecisionTreeClassifier()
    over = SMOTE(sampling_strategy=0.1, k_neighbors=k)
    under = RandomUnderSampler(sampling_strategy=0.5)
    steps = [('over', over), ('under', under), ('model', model)]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('> k=%d, Mean ROC AUC: %.3f' % (k, score))
```

Running the example will perform SMOTE oversampling with different k values for the KNN used in the procedure, followed by random undersampling and fitting a decision tree on the resulting training dataset.

The mean ROC AUC is reported for each configuration.

In this case, the results suggest that *k=4* might be good with a ROC AUC of about 0.84, and *k=7* might also be good with a ROC AUC of about 0.85.

This highlights that both the amount of oversampling and undersampling performed (sampling_strategy argument) and the number of examples selected from which a partner is chosen to create a synthetic example (*k_neighbors*) may be important parameters to select and tune for your dataset.

```
> k=1, Mean ROC AUC: 0.827
> k=2, Mean ROC AUC: 0.823
> k=3, Mean ROC AUC: 0.834
> k=4, Mean ROC AUC: 0.840
> k=5, Mean ROC AUC: 0.839
> k=6, Mean ROC AUC: 0.839
> k=7, Mean ROC AUC: 0.853
```

Now that we are familiar with how to use SMOTE when fitting and evaluating classification models, let’s look at some extensions of the SMOTE procedure.

We can be selective about the examples in the minority class that are oversampled using SMOTE.

In this section, we will review some extensions to SMOTE that are more selective regarding the examples from the minority class that provide the basis for generating new synthetic examples.

A popular extension to SMOTE involves selecting those instances of the minority class that are misclassified, such as with a k-nearest neighbor classification model.

We can then oversample just those difficult instances, providing more resolution only where it may be required.

The examples on the borderline and the ones nearby […] are more apt to be misclassified than the ones far from the borderline, and thus more important for classification.

— Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005.

These misclassified examples are likely ambiguous and lie in a region near the edge or border of the decision boundary where class membership may overlap. As such, this modification to SMOTE is called Borderline-SMOTE and was proposed by Hui Han, et al. in their 2005 paper titled “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning.”

The authors also describe a version of the method that additionally generates synthetic examples between each borderline minority example and its nearest neighbor from the majority class. This is referred to as Borderline-SMOTE2, whereas generating synthetic examples from the borderline cases and their minority class neighbors only is referred to as Borderline-SMOTE1.

Borderline-SMOTE2 not only generates synthetic examples from each example in DANGER and its positive nearest neighbors in P, but also does that from its nearest negative neighbor in N.

— Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005.
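The selection of borderline examples can be sketched from scratch. The snippet below assumes the neighbor-count rule as we understand it from the paper: a minority example is placed in DANGER when at least half, but not all, of its *m* nearest neighbors belong to the majority class. The toy dataset and the helper name `danger_set` are hypothetical, and this is not the imbalanced-learn implementation.

```python
# sketch of identifying the borderline (DANGER) minority examples
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def danger_set(X, y, minority=1, m=5):
    """Return indices of minority examples with m/2 <= majority neighbors < m."""
    danger = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != minority:
            continue
        # m nearest neighbors of the example across both classes
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: sq_dist(xi, X[j]))[:m]
        n_major = sum(1 for j in nearest if y[j] != minority)
        # all majority neighbors suggests noise, few majority neighbors suggests safe
        if m / 2 <= n_major < m:
            danger.append(i)
    return danger

# hypothetical toy data: majority on the left, minority on the right
X = [(0, 0), (1, 0), (2, 0), (3, 0),
     (3.6, 0), (5, 0), (5.5, 0), (6, 0), (6.5, 0), (0.5, 0.2)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(danger_set(X, y, minority=1, m=3))
```

Only the minority example at the edge of the majority region (index 4) is selected. The minority examples deep inside their own cluster are treated as safe, and the isolated minority example surrounded entirely by majority examples is treated as noise.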

We can implement Borderline-SMOTE1 using the BorderlineSMOTE class from imbalanced-learn.

We can demonstrate the technique on the synthetic binary classification problem used in the previous sections.

Instead of generating new synthetic examples for the minority class blindly, we would expect the Borderline-SMOTE method to only create synthetic examples along the decision boundary between the two classes.

The complete example of using Borderline-SMOTE to oversample binary classification datasets is listed below.

```python
# borderline-SMOTE for imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
from matplotlib import pyplot
from numpy import where
# define dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
# summarize class distribution
counter = Counter(y)
print(counter)
# transform the dataset
oversample = BorderlineSMOTE()
X, y = oversample.fit_resample(X, y)
# summarize the new class distribution
counter = Counter(y)
print(counter)
# scatter plot of examples by class label
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
```

Running the example first creates the dataset and summarizes the initial class distribution, showing a 1:100 relationship.

The Borderline-SMOTE is applied to balance the class distribution, which is confirmed with the printed class summary.

```
Counter({0: 9900, 1: 100})
Counter({0: 9900, 1: 9900})
```

Finally, a scatter plot of the transformed dataset is created. The plot clearly shows the effect of the selective approach to oversampling. Examples along the decision boundary of the minority class are oversampled intensively (orange).

The plot shows that those examples far from the decision boundary are not oversampled. This includes both examples that are easier to classify (those orange points toward the top left of the plot) and those that are overwhelmingly difficult to classify given the strong class overlap (those orange points toward the bottom right of the plot).

Hien Nguyen, et al. suggest using an alternative to Borderline-SMOTE where an SVM algorithm is used instead of a KNN to identify misclassified examples on the decision boundary.

Their approach is summarized in the 2009 paper titled “Borderline Over-sampling For Imbalanced Data Classification.” An SVM is used to locate the decision boundary defined by the support vectors and examples in the minority class that close to the support vectors become the focus for generating synthetic examples.

… the borderline area is approximated by the support vectors obtained after training a standard SVMs classifier on the original training set. New instances will be randomly created along the lines joining each minority class support vector with a number of its nearest neighbors using the interpolation

— Borderline Over-sampling For Imbalanced Data Classification, 2009.

In addition to using an SVM, the technique attempts to select regions where there are fewer examples of the minority class and tries to extrapolate towards the class boundary.

If majority class instances count for less than a half of its nearest neighbors, new instances will be created with extrapolation to expand minority class area toward the majority class.

— Borderline Over-sampling For Imbalanced Data Classification, 2009.

This variation can be implemented via the SVMSMOTE class from the imbalanced-learn library.

The example below demonstrates this alternative approach to Borderline SMOTE on the same imbalanced dataset.

    # borderline-SMOTE with SVM for imbalanced dataset
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SVMSMOTE
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # transform the dataset
    oversample = SVMSMOTE()
    X, y = oversample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first summarizes the raw class distribution, then the balanced class distribution after applying Borderline-SMOTE with an SVM model.

    Counter({0: 9900, 1: 100})
    Counter({0: 9900, 1: 9900})

A scatter plot of the dataset is created showing the directed oversampling along the decision boundary with the majority class.

We can also see that unlike Borderline-SMOTE, more examples are synthesized away from the region of class overlap, such as toward the top left of the plot.

Another approach involves generating synthetic samples inversely proportional to the density of the examples in the minority class.

That is, generate more synthetic examples in regions of the feature space where the density of minority examples is low, and fewer or none where the density is high.

This modification to SMOTE is referred to as the Adaptive Synthetic Sampling Method, or ADASYN, and was proposed by Haibo He, et al. in their 2008 paper titled “ADASYN: Adaptive Synthetic Sampling Approach For Imbalanced Learning.”

ADASYN is based on the idea of adaptively generating minority data samples according to their distributions: more synthetic data is generated for minority class samples that are harder to learn compared to those minority samples that are easier to learn.

— ADASYN: Adaptive synthetic sampling approach for imbalanced learning, 2008.

Unlike Borderline-SMOTE, a discriminative model is not created. Instead, examples in the minority class are weighted according to their density, then those examples with the lowest density are the focus for the SMOTE synthetic example generation process.

The key idea of ADASYN algorithm is to use a density distribution as a criterion to automatically decide the number of synthetic samples that need to be generated for each minority data example.

— ADASYN: Adaptive synthetic sampling approach for imbalanced learning, 2008.
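
The allocation step described in the quote can be sketched in a few lines of NumPy. This is not the imbalanced-learn implementation; it is a minimal sketch that follows the description in the ADASYN paper under some simplifying assumptions: a binary problem, brute-force Euclidean distances, and a hypothetical helper named adasyn_allocation(). For each minority example, it measures the fraction of majority examples among the k nearest neighbors, normalizes these fractions so they sum to one, and uses them to split the total number of synthetic examples.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=5, beta=1.0):
    """Return (minority row indices, synthetic examples to create per row)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    n_min = minority_idx.size
    n_maj = y.size - n_min
    # total number of synthetic examples; beta=1.0 aims for a balanced dataset
    total = (n_maj - n_min) * beta
    # brute-force Euclidean distance from every minority row to every row
    diffs = X[minority_idx][:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # a row is not its own neighbor
    dists[np.arange(n_min), minority_idx] = np.inf
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # r_i: fraction of majority examples among the k nearest neighbors
    r = (y[neighbors] != minority_label).mean(axis=1)
    if r.sum() == 0:
        # no minority example has a majority neighbor, so nothing is allocated
        return minority_idx, np.zeros(n_min, dtype=int)
    # normalize to a distribution and split the total budget accordingly
    counts = np.rint(r / r.sum() * total).astype(int)
    return minority_idx, counts

# tiny demonstration: one minority point (row 7) sits among the majority class
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [0.2, 0.8], [0.8, 0.2],
              [0.4, 0.4], [10, 10], [10, 11], [11, 10], [11, 11]])
y = np.array([0] * 7 + [1] * 5)
idx, counts = adasyn_allocation(X, y, k=3)
print(dict(zip(idx.tolist(), counts.tolist())))
```

In this toy dataset, the whole budget goes to the single minority point surrounded by majority examples, and the tight minority cluster receives nothing. Because the counts are rounded, they may not sum exactly to the requested total on real data.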

We can implement this procedure using the ADASYN class in the imbalanced-learn library.

The example below demonstrates this alternative approach to oversampling on the imbalanced binary classification dataset.

    # Oversample and plot imbalanced dataset with ADASYN
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import ADASYN
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=1)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # transform the dataset
    oversample = ADASYN()
    X, y = oversample.fit_resample(X, y)
    # summarize the new class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

Running the example first creates the dataset and summarizes the initial class distribution, then the updated class distribution after oversampling was performed.

    Counter({0: 9900, 1: 100})
    Counter({0: 9900, 1: 9899})

A scatter plot of the transformed dataset is created. Like Borderline-SMOTE, we can see that synthetic sample generation is focused around the decision boundary as this region has the lowest density.

Unlike Borderline-SMOTE, we can see that the examples that have the most class overlap have the most focus. On problems where these low density examples might be outliers, the ADASYN approach may put too much attention on these areas of the feature space, which may result in worse model performance.

It may help to remove outliers prior to applying the oversampling procedure, and this might be a helpful heuristic to use more generally.

This section provides more resources on the topic if you are looking to go deeper.

- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

- SMOTE: Synthetic Minority Over-sampling Technique, 2002.
- Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning, 2005.
- Borderline Over-sampling For Imbalanced Data Classification, 2009.
- ADASYN: Adaptive Synthetic Sampling Approach For Imbalanced Learning, 2008.

- imblearn.over_sampling.SMOTE API.
- imblearn.over_sampling.SMOTENC API.
- imblearn.over_sampling.BorderlineSMOTE API.
- imblearn.over_sampling.SVMSMOTE API.
- imblearn.over_sampling.ADASYN API.

In this tutorial, you discovered the SMOTE for oversampling imbalanced classification datasets.

Specifically, you learned:

- How the SMOTE synthesizes new examples for the minority class.
- How to correctly fit and evaluate machine learning models on SMOTE-transformed training datasets.
- How to use extensions of the SMOTE that generate synthetic examples along the class decision boundary.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post SMOTE Oversampling for Imbalanced Classification with Python appeared first on Machine Learning Mastery.

The post Imbalanced Classification With Python (7-Day Mini-Course) appeared first on Machine Learning Mastery.

Get on top of imbalanced classification in 7 days.

Classification predictive modeling is the task of assigning a label to an example.

Imbalanced classification refers to those classification tasks where the distribution of examples across the classes is not equal.

Practical imbalanced classification requires the use of a suite of specialized techniques, data preparation techniques, learning algorithms, and performance metrics.

In this crash course, you will discover how you can get started and confidently work through an imbalanced classification project with Python in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers that may know some applied machine learning. Maybe you know how to work through a predictive modeling problem end-to-end, or at least most of the main steps, with popular tools.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You may know some basic scikit-learn for modeling.

You do NOT need to be:

- A math wiz!
- A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can navigate an imbalanced classification project.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (*recommended*) or complete all of the lessons in one day (*hardcore*). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with imbalanced classification in Python:

- **Lesson 01**: Challenge of Imbalanced Classification
- **Lesson 02**: Intuition for Imbalanced Data
- **Lesson 03**: Evaluate Imbalanced Classification Models
- **Lesson 04**: Undersampling the Majority Class
- **Lesson 05**: Oversampling the Minority Class
- **Lesson 06**: Combine Data Undersampling and Oversampling
- **Lesson 07**: Cost-Sensitive Algorithms

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help on and about the algorithms and the best-of-breed tools in Python. (*Hint*: I have all of the answers directly on this blog; use the search box.)

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

**Note**: This is just a crash course. For a lot more detail and fleshed-out tutorials, see my book on the topic titled “Imbalanced Classification with Python.”

In this lesson, you will discover the challenge of imbalanced classification problems.

Imbalanced classification problems pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

This results in models that have poor predictive performance, specifically for the minority class. This is a problem because typically, the minority class is more important and therefore the problem is more sensitive to classification errors for the minority class than the majority class.

- **Majority Class**: More than half of the examples belong to this class, often the negative or normal case.
- **Minority Class**: Less than half of the examples belong to this class, often the positive or abnormal case.

A classification problem may be a little skewed, such as if there is a slight imbalance. Alternately, the classification problem may have a severe imbalance where there might be hundreds or thousands of examples in one class and tens of examples in another class for a given training dataset.

- **Slight Imbalance**. Where the distribution of examples is uneven by a small amount in the training dataset (e.g. 4:6).
- **Severe Imbalance**. Where the distribution of examples is uneven by a large amount in the training dataset (e.g. 1:100 or more).

Many of the classification predictive modeling problems that we are interested in solving in practice are imbalanced.

As such, it is surprising that imbalanced classification does not get more attention than it does.

For this lesson, you must list five general examples of problems that inherently have a class imbalance.

One example might be fraud detection, another might be intrusion detection.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to develop an intuition for skewed class distributions.

In this lesson, you will discover how to develop a practical intuition for imbalanced classification datasets.

A challenge for beginners working with imbalanced classification problems is what a specific skewed class distribution means. For example, what is the difference and implication for a 1:10 vs. a 1:100 class ratio?

The make_classification() scikit-learn function can be used to define a synthetic dataset with a desired class imbalance. The “*weights*” argument specifies the ratio of examples in the negative class, e.g. [0.99, 0.01] means that 99 percent of the examples will belong to the majority class, and the remaining 1 percent will belong to the minority class.

    ...
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0)

Once defined, we can summarize the class distribution using a Counter object to get an idea of exactly how many examples belong to each class.

    ...
    # summarize class distribution
    counter = Counter(y)
    print(counter)

We can also create a scatter plot of the dataset because there are only two input variables. The dots can then be colored by each class. This plot provides a visual intuition for what exactly a 99 percent vs. 1 percent majority/minority class imbalance looks like in practice.

The complete example of creating and summarizing an imbalanced classification dataset is listed below.

    # plot imbalanced classification problem
    from collections import Counter
    from sklearn.datasets import make_classification
    from matplotlib import pyplot
    from numpy import where
    # define dataset
    X, y = make_classification(n_samples=1000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0)
    # summarize class distribution
    counter = Counter(y)
    print(counter)
    # scatter plot of examples by class label
    for label, _ in counter.items():
        row_ix = where(y == label)[0]
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the example and review the plot.

For bonus points, you can test different class ratios and review the results.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to evaluate models for imbalanced classification.

In this lesson, you will discover how to evaluate models on imbalanced classification problems.

Prediction accuracy is the most common metric for classification tasks, although it is inappropriate and potentially dangerously misleading when used on imbalanced classification tasks.

This is because if 98 percent of the data belongs to the negative class, you can achieve 98 percent accuracy on average by simply predicting the negative class all the time, achieving a score that naively looks good, but in practice has no skill.
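
The effect is easy to reproduce by hand. The short sketch below uses made-up labels (98 negative examples and 2 positive examples) and a stand-in "model" that always predicts the negative class; the variable names are chosen here for illustration only.

```python
# 100 examples: 98 negative (0) and 2 positive (1)
y_true = [0] * 98 + [1] * 2
# a "model" that ignores its input and always predicts the negative class
y_pred = [0] * len(y_true)

# accuracy: fraction of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# fraction of the positive examples that the model actually found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
positives = sum(t == 1 for t in y_true)
positives_found = true_positives / positives

print('Accuracy: %.2f' % accuracy)
print('Fraction of positive examples found: %.2f' % positives_found)
```

The accuracy is 0.98 even though not a single positive example is detected.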

Instead, alternate performance metrics must be adopted.

Popular alternatives are the precision and recall scores that allow the performance of the model to be considered by focusing on the minority class, called the positive class.

Precision calculates the ratio of the number of correctly predicted positive examples divided by the total number of positive examples that were predicted. Maximizing the precision will minimize the false positives.

- Precision = TruePositives / (TruePositives + FalsePositives)

Recall calculates the ratio of the number of correctly predicted positive examples divided by the total number of positive examples that could have been predicted. Maximizing recall will minimize the false negatives.

- Recall = TruePositives / (TruePositives + FalseNegatives)

The performance of a model can be summarized by a single score that combines the precision and the recall using their harmonic mean, called the F-Measure. A high F-Measure requires both the precision and the recall to be high.

- F-measure = (2 * Precision * Recall) / (Precision + Recall)
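
The three formulas can be checked with a small worked example. The helper below, precision_recall_fmeasure(), is a hypothetical name and the counts are invented for illustration; returning 0.0 when a ratio is undefined is an assumption made here to avoid dividing by zero.

```python
def precision_recall_fmeasure(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion-matrix counts."""
    # precision: correct positive predictions / all positive predictions
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    # recall: correct positive predictions / all actual positive examples
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall
    fmeasure = (2 * precision * recall) / (precision + recall)
    return precision, recall, fmeasure

# hypothetical counts: 30 true positives, 10 false positives, 20 false negatives
p, r, f = precision_recall_fmeasure(tp=30, fp=10, fn=20)
print('Precision: %.3f, Recall: %.3f, F-measure: %.3f' % (p, r, f))
```

With these counts the precision is 30/40 = 0.750, the recall is 30/50 = 0.600, and the F-measure is 0.9/1.35, or about 0.667.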

The example below fits a logistic regression model on an imbalanced classification problem and calculates the accuracy, which can then be compared to the precision, recall, and F-measure.

    # evaluate imbalanced classification model with different metrics
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.metrics import precision_score
    from sklearn.metrics import recall_score
    from sklearn.metrics import f1_score
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, stratify=y)
    # define model
    model = LogisticRegression(solver='liblinear')
    # fit model
    model.fit(trainX, trainy)
    # predict on test set
    yhat = model.predict(testX)
    # evaluate predictions
    print('Accuracy: %.3f' % accuracy_score(testy, yhat))
    print('Precision: %.3f' % precision_score(testy, yhat))
    print('Recall: %.3f' % recall_score(testy, yhat))
    print('F-measure: %.3f' % f1_score(testy, yhat))

For this lesson, you must run the example and compare the classification accuracy to the other metrics, such as precision, recall, and F-measure.

For bonus points, try other metrics such as Fbeta-measure and ROC AUC scores.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to undersample the majority class.

In this lesson, you will discover how to undersample the majority class in the training dataset.

A simple approach to using standard machine learning algorithms on an imbalanced dataset is to change the training dataset to have a more balanced class distribution.

This can be achieved by deleting examples from the majority class, referred to as “*undersampling*.” A possible downside is that examples from the majority class that are helpful during modeling may be deleted.

The imbalanced-learn library provides implementations of many undersampling algorithms. This library can be installed easily using pip; for example:

pip install imbalanced-learn

A fast and reliable approach is to randomly delete examples from the majority class to reduce the imbalance to a ratio that is less severe or even so that the classes are even.

The example below creates a synthetic imbalanced classification dataset, then uses the RandomUnderSampler class to change the class distribution from 1:100 minority to majority classes to the less severe 1:2.

    # example of undersampling the majority class
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define undersample strategy
    undersample = RandomUnderSampler(sampling_strategy=0.5)
    # fit and apply the transform
    X_under, y_under = undersample.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y_under))
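
To make the 1:2 target concrete, the arithmetic can be worked through directly. The sketch below assumes, as the imbalanced-learn documentation describes for binary problems, that a float sampling_strategy is the desired ratio of minority to majority examples after resampling. The helper name majority_size_after_undersampling() is hypothetical and is not part of the library.

```python
def majority_size_after_undersampling(n_minority, ratio):
    """Majority examples kept so that n_minority / n_majority equals ratio."""
    if not 0 < ratio <= 1:
        raise ValueError('ratio must be in the range (0, 1]')
    return int(n_minority / ratio)

# class sizes for a 10,000 example dataset with a 1:100 distribution
n_minority, n_majority = 100, 9900
kept = majority_size_after_undersampling(n_minority, ratio=0.5)
print('Majority before: %d, after: %d' % (n_majority, kept))
print('Deleted: %d' % (n_majority - kept))
```

With 100 minority examples and a ratio of 0.5, 200 majority examples are kept and 9,700 are deleted, which shows how much data a modest-looking ratio can discard.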

For this lesson, you must run the example and note the change in the class distribution before and after undersampling the majority class.

For bonus points, try other undersampling ratios or even try other undersampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to oversample the minority class.

In this lesson, you will discover how to oversample the minority class in the training dataset.

An alternative to deleting examples from the majority class is to add new examples from the minority class.

This can be achieved by simply duplicating examples in the minority class, but these examples do not add any new information. Instead, new examples from the minority can be synthesized using existing examples in the training dataset. These new examples will be “*close*” to existing examples in the feature space, but different in small but random ways.

The SMOTE algorithm is a popular approach for oversampling the minority class. This technique can be used to reduce the imbalance or to make the class distribution even.
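
The core idea of synthesizing a new example "close" to existing ones can be sketched with NumPy alone. This is a simplified illustration in the spirit of SMOTE, not the imbalanced-learn implementation: the helper synthesize_minority() is a hypothetical name, distances are computed by brute force, and each new point is placed at a random position on the line between a minority example and one of its k nearest minority neighbors.

```python
import numpy as np

def synthesize_minority(X_min, n_new, k=5, seed=None):
    """Create n_new points by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = X_min.shape[0]
    # cannot have more neighbors than there are other minority examples
    k = min(k, n - 1)
    # pairwise Euclidean distances between minority examples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # an example is not its own neighbor
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # pick a random base example and one of its k nearest neighbors
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # move a random fraction of the way from the base toward the neighbor
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# four minority examples on the corners of a unit square
X_min = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
X_new = synthesize_minority(X_min, n_new=10, k=2, seed=1)
print(X_new.shape)
```

Each synthetic row lies on a line segment between two real minority examples, so it is new but never falls outside the region the minority class already occupies.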

The example below demonstrates using the SMOTE class provided by the imbalanced-learn library on a synthetic dataset. The initial class distribution is 1:100 and the minority class is oversampled to a 1:2 distribution.

    # example of oversampling the minority class
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define oversample strategy
    oversample = SMOTE(sampling_strategy=0.5)
    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y_over))

For this lesson, you must run the example and note the change in the class distribution before and after oversampling the minority class.

For bonus points, try other oversampling ratios, or even try other oversampling techniques provided by the imbalanced-learn library.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to combine undersampling and oversampling techniques.

In this lesson, you will discover how to combine data undersampling and oversampling on a training dataset.

Data undersampling will delete examples from the majority class, whereas data oversampling will add examples to the minority class. These two approaches can be combined and used on a single training dataset.

Given that there are so many different data sampling techniques to choose from, it can be confusing as to which methods to combine. Thankfully, there are common combinations that have been shown to work well in practice; some examples include:

- Random Undersampling with SMOTE oversampling.
- Tomek Links Undersampling with SMOTE oversampling.
- Edited Nearest Neighbors Undersampling with SMOTE oversampling.

These combinations can be applied manually to a given training dataset by first applying one sampling algorithm, then another. Thankfully, the imbalanced-learn library provides implementations of common combined data sampling techniques.

The example below demonstrates how to use the SMOTEENN class, which combines SMOTE oversampling of the minority class with Edited Nearest Neighbors undersampling of the majority class.

    # example of both undersampling and oversampling
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.combine import SMOTEENN
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99, 0.01], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define sampling strategy
    sample = SMOTEENN(sampling_strategy=0.5)
    # fit and apply the transform
    X_over, y_over = sample.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y_over))

For this lesson, you must run the example and note the change in the class distribution before and after the data sampling.

For bonus points, try other combined data sampling techniques or even try manually applying oversampling followed by undersampling on the dataset.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

In this lesson, you will discover how to use cost-sensitive algorithms for imbalanced classification.

Most machine learning algorithms assume that all misclassification errors made by a model are equal. This is often not the case for imbalanced classification problems, where missing a positive or minority class case is worse than incorrectly classifying an example from the negative or majority class.

Cost-sensitive learning is a subfield of machine learning that takes the costs of prediction errors (and potentially other costs) into account when training a machine learning model. Many machine learning algorithms can be updated to be cost-sensitive, where the model is penalized for misclassification errors from one class more than the other, such as the minority class.

The scikit-learn library provides this capability for a range of algorithms via the *class_weight* argument specified when defining the model. A weighting can be specified that is inversely proportional to the class distribution.

If the class distribution was 0.99 to 0.01 for the majority and minority classes, then the *class_weight* argument could be defined as a dictionary that defines a penalty of 0.01 for errors made for the majority class and a penalty of 0.99 for errors made with the minority class, e.g. {0:0.01, 1:0.99}.

This is a useful heuristic and can be configured automatically by setting the *class_weight* argument to the string ‘*balanced*‘.
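
The scikit-learn documentation gives the ‘*balanced*‘ heuristic as n_samples / (n_classes * np.bincount(y)). The sketch below reproduces that formula with NumPy so the resulting weights can be inspected; the helper name balanced_class_weights() is hypothetical and is used here only for illustration.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    # n_samples / (n_classes * number of examples in each class)
    weights = y.size / (classes.size * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# labels with a 0.99 to 0.01 class distribution
y = np.array([0] * 9900 + [1] * 100)
weights = balanced_class_weights(y)
print(weights)
```

The minority class receives a weight of 50 and the majority class a weight of about 0.505. The absolute values differ from the {0:0.01, 1:0.99} dictionary, but the ratio between the two weights is the same 1:99.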

The example below demonstrates how to define and fit a cost-sensitive logistic regression model on an imbalanced classification dataset.

    # example of cost sensitive logistic regression for imbalanced classification
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score
    # generate dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, stratify=y)
    # define model
    model = LogisticRegression(solver='liblinear', class_weight='balanced')
    # fit model
    model.fit(trainX, trainy)
    # predict on test set
    yhat = model.predict(testX)
    # evaluate predictions
    print('F-Measure: %.3f' % f1_score(testy, yhat))

For this lesson, you must run the example and review the performance of the cost-sensitive model.

For bonus points, compare the performance to the cost-insensitive version of logistic regression.

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson of the mini-course.

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The challenge of imbalanced classification is the lack of examples for the minority class and the difference in importance of classification errors across the classes.
- How to develop a spatial intuition for imbalanced classification datasets that might inform data preparation and algorithm selection.
- The failure of classification accuracy and how alternate metrics like precision, recall, and the F-measure can better summarize model performance on imbalanced datasets.
- How to delete examples from the majority class in the training dataset, referred to as data undersampling.
- How to synthesize new examples in the minority class in the training dataset, referred to as data oversampling.
- How to combine data oversampling and undersampling techniques on the training dataset, and common combinations that result in good performance.
- How to use cost-sensitive modified versions of machine learning algorithms to improve performance on imbalanced classification datasets.

Take the next step and check out my book on Imbalanced Classification with Python.

**How did you do with the mini-course?**

Did you enjoy this crash course?

**Do you have any questions?** Were there any sticking points?

Let me know. Leave a comment below.

The post Imbalanced Classification With Python (7-Day Mini-Course) appeared first on Machine Learning Mastery.

The post Random Oversampling and Undersampling for Imbalanced Classification appeared first on Machine Learning Mastery.

Imbalanced datasets are those where there is a severe skew in the class distribution, such as 1:100 or 1:1000 examples in the minority class to the majority class.

This bias in the training dataset can influence many machine learning algorithms, leading some to ignore the minority class entirely. This is a problem as it is typically the minority class on which predictions are most important.

One approach to addressing the problem of class imbalance is to randomly resample the training dataset. The two main approaches to randomly resampling an imbalanced dataset are to delete examples from the majority class, called undersampling, and to duplicate examples from the minority class, called oversampling.

In this tutorial, you will discover random oversampling and undersampling for imbalanced classification.

After completing this tutorial, you will know:

- Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
- Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
- Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

This tutorial is divided into five parts; they are:

- Random Resampling Imbalanced Datasets
- Imbalanced-Learn Library
- Random Oversampling Imbalanced Datasets
- Random Undersampling Imbalanced Datasets
- Combining Random Oversampling and Undersampling

Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.

This is a simple and effective strategy for imbalanced classification problems.

Applying re-sampling strategies to obtain a more balanced data distribution is an effective solution to the imbalance problem

— A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

The simplest strategy is to choose examples for the transformed dataset randomly, called random resampling.

There are two main approaches to random resampling for imbalanced classification; they are oversampling and undersampling.

- **Random Oversampling**: Randomly duplicate examples in the minority class.
- **Random Undersampling**: Randomly delete examples in the majority class.

Random oversampling involves randomly selecting examples from the minority class, with replacement, and adding them to the training dataset. Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset.

In the random under-sampling, the majority class instances are discarded at random until a more balanced distribution is reached.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

Both approaches can be repeated until the desired class distribution is achieved in the training dataset, such as an equal split across the classes.
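
Both procedures are simple enough to sketch with NumPy before turning to a library. The two helpers below, random_oversample() and random_undersample(), are hypothetical names written for illustration and are not the imbalanced-learn API. Oversampling draws row indices with replacement; undersampling draws the rows to keep without replacement.

```python
import numpy as np

def random_oversample(X, y, label, n_target, seed=None):
    """Duplicate random rows of class `label` until it has n_target rows."""
    rng = np.random.default_rng(seed)
    rows = np.flatnonzero(y == label)
    n_extra = max(n_target - rows.size, 0)
    # sample with replacement, so the same row can be duplicated many times
    extra = rng.choice(rows, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(y.size), extra])
    return X[keep], y[keep]

def random_undersample(X, y, label, n_target, seed=None):
    """Keep a random subset of n_target rows of class `label`; drop the rest."""
    rng = np.random.default_rng(seed)
    rows = np.flatnonzero(y == label)
    # sample without replacement, so every kept row is distinct
    kept = rng.choice(rows, size=min(n_target, rows.size), replace=False)
    keep = np.sort(np.concatenate([np.flatnonzero(y != label), kept]))
    return X[keep], y[keep]

# 90 majority (0) and 10 minority (1) examples; X holds the row number
y = np.array([0] * 90 + [1] * 10)
X = np.arange(y.size, dtype=float).reshape(-1, 1)

X_over, y_over = random_oversample(X, y, label=1, n_target=90, seed=1)
X_under, y_under = random_undersample(X, y, label=0, n_target=10, seed=1)
print('Oversampled: ', np.bincount(y_over))
print('Undersampled:', np.bincount(y_under))
```

Both calls produce an equal split, but in different ways: the oversampled dataset has 180 rows with at most 10 distinct minority rows, while the undersampled dataset has only 20 rows.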

They are referred to as “*naive resampling*” methods because they assume nothing about the data and no heuristics are used. This makes them simple to implement and fast to execute, which is desirable for very large and complex datasets.

Both techniques can be used for two-class (binary) classification problems and multi-class classification problems with one or more majority or minority classes.

Importantly, the change to the class distribution is only applied to the training dataset. The intent is to influence the fit of the models. The resampling is not applied to the test or holdout dataset used to evaluate the performance of a model.
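The idea of resampling only the training data can be sketched in plain Python. This is a minimal illustration, not the imbalanced-learn implementation: the function name `split_then_oversample`, the 80/20 split, and the toy label list are all assumptions made for the example, and the split is a simple shuffle rather than a stratified one.

```python
import random
from collections import Counter

def split_then_oversample(labels, test_fraction=0.2, seed=1):
    """Hold out a test set first, then balance only the training indices."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # count classes in the training portion only
    counts = Counter(labels[i] for i in train_idx)
    minority = min(counts, key=counts.get)
    majority = max(counts, key=counts.get)
    minority_idx = [i for i in train_idx if labels[i] == minority]
    # duplicate minority training rows (with replacement) until balanced
    n_extra = counts[majority] - counts[minority]
    train_idx = train_idx + rng.choices(minority_idx, k=n_extra)
    return train_idx, test_idx

# hypothetical toy labels with a 1:99 class distribution
labels = [0] * 990 + [1] * 10
train_idx, test_idx = split_then_oversample(labels)
# training labels are balanced, test labels keep the original skew
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Because the test indices are set aside before any duplication happens, no copy of a test example can end up in the training data.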

Generally, these naive methods can be effective, although that depends on the specifics of the dataset and models involved.

Let’s take a closer look at each method and how to use them in practice.

In these examples, we will use the implementations provided by the imbalanced-learn Python library, which can be installed via pip as follows:

sudo pip install imbalanced-learn

You can confirm that the installation was successful by printing the version of the installed library:

    # check version number
    import imblearn
    print(imblearn.__version__)

Running the example will print the version number of the installed library; for example:

0.5.0


Random oversampling involves randomly duplicating examples from the minority class and adding them to the training dataset.

Examples from the training dataset are selected randomly with replacement. This means that examples from the minority class can be chosen and added to the new “*more balanced*” training dataset multiple times; they are selected from the original training dataset, added to the new training dataset, and then returned or “*replaced*” in the original dataset, allowing them to be selected again.
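To make "selected randomly with replacement" concrete, here is a minimal pure-Python sketch for a two-class problem. The function name `random_oversample` and the toy data are assumptions for illustration; it is not how the imbalanced-learn class is written internally.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=1):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_extra = max(counts.values()) - counts[minority]
    minority_rows = [i for i, label in enumerate(y) if label == minority]
    # choices() draws with replacement, so one row can be picked many times
    picked = rng.choices(minority_rows, k=n_extra)
    X_over = list(X) + [X[i] for i in picked]
    y_over = list(y) + [y[i] for i in picked]
    return X_over, y_over

# hypothetical toy dataset: 1,000 majority rows and 100 minority rows
X = [[float(i)] for i in range(1100)]
y = [0] * 1000 + [1] * 100
X_over, y_over = random_oversample(X, y)
print(Counter(y_over))
```

Every added row is an exact copy of an existing minority row, which is why the technique adds no new information to the dataset.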

This technique can be effective for those machine learning algorithms that are affected by a skewed distribution and where multiple duplicate examples for a given class can influence the fit of the model. This might include algorithms that iteratively learn coefficients, like artificial neural networks that use stochastic gradient descent. It can also affect models that seek good splits of the data, such as support vector machines and decision trees.

It might be useful to tune the target class distribution. In some cases, seeking a balanced distribution for a severely imbalanced dataset can cause affected algorithms to overfit the minority class, leading to increased generalization error. The effect can be better performance on the training dataset, but worse performance on the holdout or test dataset.

… the random oversampling may increase the likelihood of occurring overfitting, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate, but actually cover one replicated example.

— Page 83, Learning from Imbalanced Data Sets, 2018.

As such, to gain insight into the impact of the method, it is a good idea to monitor the performance on both train and test datasets after oversampling and compare the results to the same algorithm on the original dataset.

The increase in the number of examples for the minority class, especially if the class skew was severe, can also result in a marked increase in the computational cost when fitting the model, especially considering the model is seeing the same examples in the training dataset again and again.

… in random over-sampling, a random set of copies of minority class examples is added to the data. This may increase the likelihood of overfitting, specially for higher over-sampling rates. Moreover, it may decrease the classifier performance and increase the computational effort.

— A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

Random oversampling can be implemented using the RandomOverSampler class.

The class can be defined and takes a *sampling_strategy* argument that can be set to “*minority*” to automatically balance the minority class with majority class or classes.

For example:

    ...
    # define oversampling strategy
    oversample = RandomOverSampler(sampling_strategy='minority')

This means that if the majority class had 1,000 examples and the minority class had 100, this strategy would oversample the minority class so that it has 1,000 examples.

A floating point value can be specified to indicate the ratio of minority class to majority class examples in the transformed dataset. For example:

    ...
    # define oversampling strategy
    oversample = RandomOverSampler(sampling_strategy=0.5)

This would ensure that the minority class was oversampled to have half the number of examples as the majority class, for binary classification problems. This means that if the majority class had 1,000 examples and the minority class had 100, the transformed dataset would have 500 examples of the minority class.
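The arithmetic behind a floating point ratio can be checked with a few lines of Python. The helper name `oversampled_minority_count` is hypothetical, and the exact rounding used inside imbalanced-learn may differ from the simple truncation assumed here.

```python
def oversampled_minority_count(n_majority, n_minority, sampling_strategy):
    """Minority class size after oversampling to a minority/majority ratio."""
    target = int(n_majority * sampling_strategy)
    if target < n_minority:
        # oversampling can only add examples, never remove them
        raise ValueError('ratio would require removing minority examples')
    return target

# 1,000 majority and 100 minority examples, as in the text
print(oversampled_minority_count(1000, 100, 0.5))  # 500
print(oversampled_minority_count(1000, 100, 1.0))  # 1000
```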

The class is like a scikit-learn transform object in that it is fit on a dataset, then used to generate a new or transformed dataset. Unlike the scikit-learn transforms, it will change the number of examples in the dataset, not just the values (like a scaler) or number of features (like a projection).

For example, it can be fit and applied in one step by calling the *fit_resample()* function:

    ...
    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)

We can demonstrate this on a simple synthetic binary classification problem with a 1:100 class imbalance.

    ...
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)

The complete example of defining the dataset and performing random oversampling to balance the class distribution is listed below.

    # example of random oversampling to balance the class distribution
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define oversampling strategy
    oversample = RandomOverSampler(sampling_strategy='minority')
    # fit and apply the transform
    X_over, y_over = oversample.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y_over))

Running the example first creates the dataset, then summarizes the class distribution. We can see that there are nearly 10K examples in the majority class and 100 examples in the minority class.

Then the random oversample transform is defined to balance the minority class, then fit and applied to the dataset. The class distribution for the transformed dataset is reported showing that now the minority class has the same number of examples as the majority class.

    Counter({0: 9900, 1: 100})
    Counter({0: 9900, 1: 9900})

This transform can be used as part of a *Pipeline* to ensure that it is only applied to the training dataset as part of each split in a k-fold cross validation.

A traditional scikit-learn Pipeline cannot be used; instead, a Pipeline from the imbalanced-learn library can be used. For example:

    ...
    # pipeline
    steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)

The example below provides a complete example of evaluating a decision tree on an imbalanced dataset with a 1:100 class distribution.

The model is evaluated using repeated 10-fold cross-validation with three repeats, and the oversampling is performed on the training dataset within each fold separately, ensuring that there is no data leakage as might occur if the oversampling was performed prior to the cross-validation.

    # example of evaluating a decision tree with random oversampling
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import RandomOverSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # define pipeline
    steps = [('over', RandomOverSampler()), ('model', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('F1 Score: %.3f' % score)

Running the example evaluates the decision tree model on the imbalanced dataset with oversampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test oversampling with your dataset and learning algorithm, rather than optimally solve the synthetic dataset.

The default oversampling strategy is used, which balances the minority classes with the majority class. The F1 score averaged across each fold and each repeat is reported.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.990

Now that we are familiar with oversampling, let’s take a look at undersampling.

Random undersampling involves randomly selecting examples from the majority class to delete from the training dataset.

This has the effect of reducing the number of examples in the majority class in the transformed version of the training dataset. This process can be repeated until the desired class distribution is achieved, such as an equal number of examples for each class.

This approach may be more suitable for those datasets where there is a class imbalance but still a sufficient number of examples in the minority class, such that a useful model can be fit.

A limitation of undersampling is that examples from the majority class are deleted that may be useful, important, or perhaps critical to fitting a robust decision boundary. Given that examples are deleted randomly, there is no way to detect or preserve “*good*” or more information-rich examples from the majority class.

… in random under-sampling (potentially), vast quantities of data are discarded. […] This can be highly problematic, as the loss of such data can make the decision boundary between minority and majority instances harder to learn, resulting in a loss in classification performance.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013
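The random deletion described above can be sketched in pure Python for a two-class problem. The function name `random_undersample` and the toy data are assumptions for illustration only; this is not the imbalanced-learn implementation.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=1):
    """Keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    n_keep = min(counts.values())
    majority_rows = [i for i, label in enumerate(y) if label == majority]
    # sample() draws without replacement, so no majority row is kept twice
    kept = set(rng.sample(majority_rows, n_keep))
    keep = [i for i, label in enumerate(y) if label != majority or i in kept]
    return [X[i] for i in keep], [y[i] for i in keep]

# hypothetical toy dataset: 1,000 majority rows and 100 minority rows
X = [[float(i)] for i in range(1100)]
y = [0] * 1000 + [1] * 100
X_under, y_under = random_undersample(X, y)
print(Counter(y_under))
```

Note that the 900 discarded majority rows are chosen blindly, which is exactly the limitation raised above: nothing prevents an informative example from being dropped.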

The random undersampling technique can be implemented using the RandomUnderSampler imbalanced-learn class.

The class can be used just like the *RandomOverSampler* class in the previous section, except the strategies impact the majority class instead of the minority class. For example, setting the *sampling_strategy* argument to “*majority*” will undersample the majority class determined by the class with the largest number of examples.

    ...
    # define undersample strategy
    undersample = RandomUnderSampler(sampling_strategy='majority')

For example, a dataset with 1,000 examples in the majority class and 100 examples in the minority class will be undersampled such that both classes would have 100 examples in the transformed training dataset.

We can also set the *sampling_strategy* argument to a floating point value that specifies the desired ratio after resampling, specifically the number of examples in the minority class divided by the number of examples in the majority class. For example, if we set *sampling_strategy* to 0.5 on an imbalanced dataset with 1,000 examples in the majority class and 100 examples in the minority class, then there would be 200 examples for the majority class in the transformed dataset (or 100/200 = 0.5).

    ...
    # define undersample strategy
    undersample = RandomUnderSampler(sampling_strategy=0.5)

This might be preferred to ensure that the resulting dataset is both large enough to fit a reasonable model, and that not too much useful information from the majority class is discarded.
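The number of majority examples that survive for a given ratio can be worked out directly. The helper name `undersampled_majority_count` is hypothetical, and the truncation used here is an assumption; the library's own rounding may differ.

```python
def undersampled_majority_count(n_majority, n_minority, sampling_strategy):
    """Majority class size after undersampling to a minority/majority ratio."""
    target = int(n_minority / sampling_strategy)
    if target > n_majority:
        # undersampling can only remove examples, never add them
        raise ValueError('ratio would require adding majority examples')
    return target

# 1,000 majority and 100 minority examples, as in the text
print(undersampled_majority_count(1000, 100, 0.5))  # 200
print(undersampled_majority_count(1000, 100, 1.0))  # 100
```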

In random under-sampling, one might attempt to create a balanced class distribution by selecting 90 majority class instances at random to be removed. The resulting dataset will then consist of 20 instances: 10 (randomly remaining) majority class instances and (the original) 10 minority class instances.

— Page 45, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013

The transform can then be fit and applied to a dataset in one step by calling the *fit_resample()* function and passing the untransformed dataset as arguments.

    ...
    # fit and apply the transform
    X_under, y_under = undersample.fit_resample(X, y)

We can demonstrate this on a dataset with a 1:100 class imbalance.

The complete example is listed below.

    # example of random undersampling to balance the class distribution
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define undersample strategy
    undersample = RandomUnderSampler(sampling_strategy='majority')
    # fit and apply the transform
    X_under, y_under = undersample.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y_under))

Running the example first creates the dataset and reports the imbalanced class distribution.

The transform is fit and applied on the dataset and the new class distribution is reported. We can see that the majority class is undersampled to have the same number of examples as the minority class.

Judgment and empirical results will have to be used as to whether a training dataset with just 200 examples would be sufficient to train a model.

    Counter({0: 9900, 1: 100})
    Counter({0: 100, 1: 100})

This undersampling transform can also be used in a Pipeline, like the oversampling transform from the previous section.

This allows the transform to be applied to the training dataset only using evaluation schemes such as k-fold cross-validation, avoiding any data leakage in the evaluation of a model.

    ...
    # define pipeline
    steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)

We can define an example of fitting a decision tree on an imbalanced classification dataset with the undersampling transform applied to the training dataset on each split of a repeated 10-fold cross-validation.

The complete example is listed below.

    # example of evaluating a decision tree with random undersampling
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.pipeline import Pipeline
    from imblearn.under_sampling import RandomUnderSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # define pipeline
    steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('F1 Score: %.3f' % score)

Running the example evaluates the decision tree model on the imbalanced dataset with undersampling.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

The default undersampling strategy is used, which balances the majority classes with the minority class. The F1 score averaged across each fold and each repeat is reported.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.889

Interesting results may be achieved by combining both random oversampling and undersampling.

For example, a modest amount of oversampling can be applied to the minority class to improve the bias towards these examples, whilst also applying a modest amount of undersampling to the majority class to reduce the bias on that class.

This can result in improved overall performance compared to performing one or the other techniques in isolation.

For example, if we had a dataset with a 1:100 class distribution, we might first apply oversampling to increase the ratio to 1:10 by duplicating examples from the minority class, then apply undersampling to further improve the ratio to 1:2 by deleting examples from the majority class.

This could be implemented using imbalanced-learn by using a *RandomOverSampler* with *sampling_strategy* set to 0.1 (10%), then using a *RandomUnderSampler* with a *sampling_strategy* set to 0.5 (50%). For example:

    ...
    # define oversampling strategy
    over = RandomOverSampler(sampling_strategy=0.1)
    # fit and apply the transform
    X, y = over.fit_resample(X, y)
    # define undersampling strategy
    under = RandomUnderSampler(sampling_strategy=0.5)
    # fit and apply the transform
    X, y = under.fit_resample(X, y)

We can demonstrate this on a synthetic dataset with a 1:100 class distribution. The complete example is listed below:

    # example of combining random oversampling and undersampling for imbalanced data
    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # summarize class distribution
    print(Counter(y))
    # define oversampling strategy
    over = RandomOverSampler(sampling_strategy=0.1)
    # fit and apply the transform
    X, y = over.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y))
    # define undersampling strategy
    under = RandomUnderSampler(sampling_strategy=0.5)
    # fit and apply the transform
    X, y = under.fit_resample(X, y)
    # summarize class distribution
    print(Counter(y))

Running the example first creates the synthetic dataset and summarizes the class distribution, showing an approximate 1:100 class distribution.

Then oversampling is applied, increasing the distribution from about 1:100 to about 1:10. Finally, undersampling is applied, further improving the class distribution from 1:10 to about 1:2.

    Counter({0: 9900, 1: 100})
    Counter({0: 9900, 1: 990})
    Counter({0: 1980, 1: 990})
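The reported counts follow from the two ratios, which can be checked by hand with a short sketch. The helper name `combined_counts` is hypothetical and the integer truncation is an assumption rather than a statement about the library's internals.

```python
def combined_counts(n_majority, n_minority, over_ratio, under_ratio):
    """Class sizes after oversampling the minority, then undersampling the majority."""
    # step 1: grow the minority class to over_ratio * majority
    grown_minority = int(n_majority * over_ratio)
    if grown_minority < n_minority:
        raise ValueError('over_ratio is below the existing class ratio')
    # step 2: shrink the majority class to minority / under_ratio
    shrunk_majority = int(grown_minority / under_ratio)
    if shrunk_majority > n_majority:
        raise ValueError('under_ratio is below the oversampled class ratio')
    return shrunk_majority, grown_minority

# 9,900 majority and 100 minority examples, ratios 0.1 then 0.5
print(combined_counts(9900, 100, 0.1, 0.5))  # (1980, 990)
```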

We might also want to apply this same hybrid approach when evaluating a model using k-fold cross-validation.

This can be achieved by using a Pipeline with a sequence of transforms and ending with the model that is being evaluated; for example:

    ...
    # define pipeline
    over = RandomOverSampler(sampling_strategy=0.1)
    under = RandomUnderSampler(sampling_strategy=0.5)
    steps = [('o', over), ('u', under), ('m', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)

We can demonstrate this with a decision tree model on the same synthetic dataset.

The complete example is listed below.

    # example of evaluating a model with random oversampling and undersampling
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.tree import DecisionTreeClassifier
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    # define dataset
    X, y = make_classification(n_samples=10000, weights=[0.99], flip_y=0)
    # define pipeline
    over = RandomOverSampler(sampling_strategy=0.1)
    under = RandomUnderSampler(sampling_strategy=0.5)
    steps = [('o', over), ('u', under), ('m', DecisionTreeClassifier())]
    pipeline = Pipeline(steps=steps)
    # evaluate pipeline
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='f1_micro', cv=cv, n_jobs=-1)
    score = mean(scores)
    print('F1 Score: %.3f' % score)

Running the example evaluates a decision tree model using repeated k-fold cross-validation where the training dataset is transformed, first using oversampling, then undersampling, for each split and repeat performed. The F1 score averaged across each fold and each repeat is reported.

The chosen model and resampling configuration are arbitrary, designed to provide a template that you can use to test combined oversampling and undersampling with your dataset and learning algorithm rather than optimally solve the synthetic dataset.

Your specific results may differ given the stochastic nature of the dataset and the resampling strategy.

F1 Score: 0.985

This section provides more resources on the topic if you are looking to go deeper.

- Chapter 5 Data Level Preprocessing Methods, Learning from Imbalanced Data Sets, 2018.
- Chapter 3 Imbalanced Datasets: From Sampling to Classifiers, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

- A Study Of The Behavior Of Several Methods For Balancing Machine Learning Training Data, 2004.
- A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

- Imbalanced-Learn Documentation.
- imbalanced-learn, GitHub.
- imblearn.over_sampling.RandomOverSampler API.
- imblearn.pipeline.Pipeline API.
- imblearn.under_sampling.RandomUnderSampler API.

In this tutorial, you discovered random oversampling and undersampling for imbalanced classification.

Specifically, you learned:

- Random resampling provides a naive technique for rebalancing the class distribution for an imbalanced dataset.
- Random oversampling duplicates examples from the minority class in the training dataset and can result in overfitting for some models.
- Random undersampling deletes examples from the majority class and can result in losing information invaluable to a model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Random Oversampling and Undersampling for Imbalanced Classification appeared first on Machine Learning Mastery.

The post What Is the Naive Classifier for Each Imbalanced Classification Metric? appeared first on Machine Learning Mastery.

A common mistake made by beginners is to apply machine learning algorithms to a problem without establishing a performance baseline.

A performance baseline provides a minimum score above which a model is considered to have skill on the dataset. It also provides a point of relative improvement for all models evaluated on the dataset. A baseline can be established using a naive classifier, such as predicting one class label for all examples in the test dataset.

Another common mistake made by beginners is using classification accuracy as a performance metric on problems that have an imbalanced class distribution. This can result in high accuracy scores even when the majority class is predicted for all cases. Instead, an alternate performance metric must be chosen among a suite of classification measures.

The challenge is that the baseline in performance is dependent upon the choice of performance metric. As such, deep knowledge of each performance metric may be required in order to select an appropriate naive classifier to establish a performance baseline.

In this tutorial, you will discover which naive classifier to use for each imbalanced classification performance metric.

After completing this tutorial, you will know:

- The metrics to consider when evaluating machine learning models for imbalanced classification problems.
- The naive classification strategies that can be used to calculate a baseline in model performance.
- The naive classifier to use for each metric, including the rationale and a worked example demonstrating the result.

Let’s get started.

This tutorial is divided into four parts; they are:

- Metrics for Imbalanced Classification
- Naive Classification Models
- Naive Classifiers for Classification Metrics
- Naive Classifier for Accuracy
- Naive Classifier for G-Mean
- Naive Classifier for F-Measure
- Naive Classifier for ROC AUC
- Naive Classifier for Precision-Recall AUC
- Naive Classifier for Brier Score

- Summary of the Mappings

There are many metrics to choose from for imbalanced classification.

Choosing a metric might be the most important step of the project, as choosing the wrong metric can result in optimizing and choosing a model that solves a problem that is different from the problem that you actually want to solve.

As such, out of the tens or hundreds of metrics available, there are perhaps a handful that are most commonly used for imbalanced classification. They are as follows:

Metrics for evaluating predicted class labels:

- Accuracy.
- G-Mean.
- F1-Measure.
- F0.5-Measure.
- F2-Measure.

Metrics for evaluating predicted probabilities:

- ROC Area Under Curve (ROC AUC).
- Precision Recall Area Under Curve (PR AUC).
- Brier Score.

For more on how to calculate each metric, see the tutorial:

A naive classifier is a classification algorithm with no logic that provides a baseline of performance on a classification dataset.

It is important to establish a baseline in performance for a classification dataset. It provides a line in the sand by which all other algorithms can be compared. An algorithm that achieves a score below a naive classification model has no skill on the dataset, whereas an algorithm that achieves a score above that of a naive classification model has some skill on the dataset.

There are perhaps five different naive classification methods that can be used to establish a baseline of performance on a dataset.

Explained in the context of an imbalanced two-class (binary) classification problem, the naive classification methods are as follows:

- **Uniformly Random Guess**: Predict 0 or 1 with equal probability.
- **Prior Random Guess**: Predict 0 or 1 proportional to the prior probability in the dataset.
- **Majority Class**: Predict 0.
- **Minority Class**: Predict 1.
- **Class Prior**: Predict the prior probability for each class.

These can be implemented using the DummyClassifier class from the scikit-learn library.

This class provides the strategy argument that allows different naive classifier techniques to be used. Examples include:

- **Uniformly Random Guess**: Set the “*strategy*” argument to “*uniform*”.
- **Prior Random Guess**: Set the “*strategy*” argument to “*stratified*”.
- **Majority Class**: Set the “*strategy*” argument to “*most_frequent*”.
- **Minority Class**: Set the “*strategy*” argument to “*constant*” and set the “*constant*” argument to 1.
- **Class Prior**: Set the “*strategy*” argument to “*prior*”.
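To show what each strategy actually predicts, the five naive methods can be sketched in plain Python. This is an illustration only, not the scikit-learn implementation: the function names `naive_predict` and `class_prior` and the short strategy labels are assumptions made for the example.

```python
import random
from collections import Counter

def naive_predict(strategy, y_train, n, seed=1):
    """Return n predicted labels using one of four label-based naive strategies."""
    rng = random.Random(seed)
    counts = Counter(y_train)
    classes = sorted(counts)
    if strategy == 'uniform':
        # every class is equally likely
        return [rng.choice(classes) for _ in range(n)]
    if strategy == 'stratified':
        # classes are drawn in proportion to their training frequency
        weights = [counts[c] for c in classes]
        return rng.choices(classes, weights=weights, k=n)
    if strategy == 'majority':
        return [max(classes, key=lambda c: counts[c])] * n
    if strategy == 'minority':
        return [min(classes, key=lambda c: counts[c])] * n
    raise ValueError('unknown strategy: %s' % strategy)

def class_prior(y_train):
    """Return the prior probability of each class (the 'Class Prior' strategy)."""
    counts = Counter(y_train)
    return {c: counts[c] / len(y_train) for c in sorted(counts)}

# training labels with a 99:1 class distribution
y_train = [0] * 9900 + [1] * 100
for name in ['uniform', 'stratified', 'majority', 'minority']:
    print(name, Counter(naive_predict(name, y_train, 1000)))
print('prior', class_prior(y_train))
```

The first four strategies return crisp class labels, whereas the class prior returns the same probabilities for every example, which matters for the probability-based metrics.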

For more on naive classifiers, see the tutorial:


We have established that there are many different metrics to choose from for an imbalanced classification problem.

We have also established that it is critical to determine a baseline in performance for a new classification problem using a naive classifier.

The challenge is, each classification metric requires the careful choice of a specific naive classification strategy that achieves the appropriate “*no skill*” performance. This can and should be selected using knowledge of each metric and can be confirmed by careful experimentation.

In this section, we will rationalize the selection of the appropriate naive classifier for each imbalanced classification metric, then confirm the selection with an empirical result on a synthetic binary classification dataset.

The synthetic dataset has 10,000 examples, 99 percent of which belong to the majority class (negative case or class label 0) and 1 percent of which belong to the minority class (positive case or class label 1).

Each naive classifier strategy is evaluated using stratified 10-fold cross-validation with three repeats, and performance is summarized using the mean and standard deviation across these runs.

The mapping from metrics to naive classifier can be used on your next imbalanced classification project, and the empirical results confirm the rationale and help to establish the intuition for each mapping.

Let’s dive in.

Classification accuracy is the total number of correct predictions divided by the total number of predictions made.

The appropriate naive classifier for classification accuracy is to predict the majority class in all cases. This will maximize the true negatives and minimize the false positives.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem. We would expect that predicting the majority class would result in a classification accuracy of approximately 99 percent on this dataset.

The complete example is listed below.

    # compare naive classifiers with classification accuracy metric
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from matplotlib import pyplot

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the classification accuracy for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods. Try running the example a few times.

In this case, we can see that the majority strategy achieves the best classification accuracy of 99 percent, as we expected. We can also see that the prior strategy achieves the same result, as it predicts a probability of 0.01 (1 percent) for the positive class in all cases, which is mapped to the majority class label 0.

    >Uniform 0.501 (0.015)
    >Stratified 0.980 (0.003)
    >Majority 0.990 (0.000)
    >Minority 0.010 (0.000)
    >Prior 0.990 (0.000)

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.
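The majority and minority results can be verified by hand from the confusion matrix counts. The sketch below assumes the 9,900 negative and 100 positive examples described for the synthetic dataset; the function name `accuracy` is just a local helper.

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

n_neg, n_pos = 9900, 100
# predict the majority class (0) for every example:
# all negatives are true negatives, all positives are false negatives
print(accuracy(tp=0, tn=n_neg, fp=0, fn=n_pos))  # 0.99
# predict the minority class (1) for every example:
# all positives are true positives, all negatives are false positives
print(accuracy(tp=n_pos, tn=0, fp=n_neg, fn=0))  # 0.01
```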

The geometric mean, or G-Mean, is the geometric mean of the sensitivity and specificity scores.

Sensitivity summarizes how well the positive class was predicted, and specificity summarizes how well the negative class was predicted.

Performing perfectly well on the majority or minority class will come at the cost of a worst-case performance on the other class, which will result in a zero G-Mean score.

Therefore, the most appropriate naive classification strategy is to predict each class with an equal probability, which will give each class an opportunity for a correct prediction.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem. We would expect that predicting a uniformly random class label would result in a G-Mean of approximately 0.5 on this dataset.

The complete example is listed below.

    # compare naive classifiers with g-mean metric
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from imblearn.metrics import geometric_mean_score
    from sklearn.metrics import make_scorer
    from matplotlib import pyplot

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # define the model evaluation metric
        metric = make_scorer(geometric_mean_score)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0,
        n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the G-mean for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods. Try running the example a few times.

In this case, we can see that, as expected, the uniformly random naive classifier resulted in a G-Mean of about 0.5, while all other strategies resulted in a G-Mean of 0 or close to 0.

    >Uniform 0.507 (0.074)
    >Stratified 0.021 (0.079)
    >Majority 0.000 (0.000)
    >Minority 0.000 (0.000)
    >Prior 0.000 (0.000)
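The pattern in these scores follows from the definition of the G-Mean as the square root of sensitivity multiplied by specificity. The short sketch below is not part of the original example; it simply plugs in the long-run rates that each strategy would be expected to achieve. The helper name `g_mean` and the dictionary names are illustrative.

```python
# expected g-mean for simple naive strategies, using long-run rates
from math import sqrt

# g-mean is the geometric mean of sensitivity and specificity
def g_mean(sensitivity, specificity):
    return sqrt(sensitivity * specificity)

# (sensitivity, specificity) each strategy is expected to achieve on average
expected_rates = {
    'Uniform': (0.5, 0.5),   # half of each class is labeled correctly
    'Majority': (0.0, 1.0),  # the positive class is never predicted
    'Minority': (1.0, 0.0),  # the negative class is never predicted
}
expected_g_mean = {name: g_mean(*rates) for name, rates in expected_rates.items()}
for name, score in expected_g_mean.items():
    print('>%s %.3f' % (name, score))
```

Any strategy that gets one class entirely wrong has a sensitivity or specificity of zero, which forces the product, and therefore the G-Mean, to zero.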

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.

The F-measure (also called the F1-score) is calculated as the harmonic mean between the precision and the recall.

Precision summarizes the fraction of examples assigned the positive class that belong to the positive class and recall summarizes how well the positive class was predicted out of all positive predictions that could have been made.

Predicting the minority class in all cases gives perfect recall, while precision falls to the base rate of the minority class, which keeps the F-measure above zero.

Therefore, the naive strategy for the F-measure is to predict the minority class in all cases.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem.

The F-measure when predicting only the minority class for this dataset is not obvious at first. Recall will be perfect, or 1.0. The precision will be equivalent to the prior for the minority class, that is 1 percent or 0.01. Therefore, the F-measure is the harmonic mean between 1.0 and 0.01, which is about 0.02.
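The arithmetic above can be checked directly. This is a minimal sketch that assumes the 1:99 class distribution used throughout this tutorial; the variable names are illustrative.

```python
# f-measure when always predicting the minority class on a 1:99 dataset
precision = 0.01  # fraction of positive predictions that are correct
recall = 1.0      # every positive example is predicted as positive
# harmonic mean of precision and recall
f1 = (2 * precision * recall) / (precision + recall)
print('F1: %.4f' % f1)
```

The harmonic mean is pulled strongly toward the smaller of the two values, which is why the result sits near twice the precision rather than near the average of 0.01 and 1.0.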

The complete example is listed below.

    # compare naive classifiers with f1-measure
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from matplotlib import pyplot

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='f1', cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the F-measure for each naive classifier strategy.

Your results may vary slightly given the stochastic nature of some of the methods. Try running the example a few times.

You may get a warning when evaluating the naive classifier that only predicts the majority class, as there are no positive cases predicted. You will see a warning as follows:

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.

In this case, we can see that predicting the minority class results in the expected F-measure of about 0.02. We can also see that we approximate this score when using the uniform and stratified strategies.

    >Uniform 0.020 (0.007)
    >Stratified 0.020 (0.040)
    >Majority 0.000 (0.000)
    >Minority 0.020 (0.000)
    >Prior 0.000 (0.000)

Box and whisker plots for each naive classifier are also created, allowing the distribution of scores to be compared visually.

This same naive classifier strategy of predicting the minority class is also appropriate when using the F0.5 and F2 measures.
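As a rough check, the standard F-beta formula can be evaluated for the always-predict-minority strategy. The sketch below assumes the same 1 percent positive base rate; the function name `fbeta` is illustrative and is not the scikit-learn function.

```python
# f-beta for the always-predict-minority strategy on a 1:99 dataset
def fbeta(precision, recall, beta):
    beta2 = beta ** 2
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

precision, recall = 0.01, 1.0
f_scores = {beta: fbeta(precision, recall, beta) for beta in (0.5, 1.0, 2.0)}
for beta, score in f_scores.items():
    print('F%.1f: %.4f' % (beta, score))
```

In each case the score is small but above zero, whereas a strategy that never predicts the positive class has no true positives and is scored as zero.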

The ROC Curve is a plot of the false positive rate versus the true positive rate for a range of different probability thresholds.

The ROC area under curve is an approximation of the integral or area under the ROC curve and summarizes how well an algorithm performs across the range of probability thresholds.

A no-skill model has a ROC AUC of 0.5 and can be achieved by predicting class labels randomly but in proportion to their base rate (e.g. no discrimination power). This would be the stratified method.

Predicting a constant value, like the majority class or minority class will result in an invalid ROC Curve (e.g. a point) and in turn an invalid ROC AUC score. Scores for models that predict a constant value should be ignored.
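To see why a model with no ability to rank examples lands on 0.5, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example, with ties counted as one half. The sketch below is a simplified pairwise implementation of that reading, not the scikit-learn implementation; the function name `pairwise_roc_auc` and the toy scores are assumptions made for illustration.

```python
# roc auc as the fraction of positive/negative pairs ranked correctly
def pairwise_roc_auc(y_true, y_score):
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

y_true = [0] * 99 + [1]
constant_scores = [0.0] * len(y_true)   # same score for every example
ranked_scores = [0.1] * 99 + [0.9]      # positive example scored highest
auc_constant = pairwise_roc_auc(y_true, constant_scores)
auc_ranked = pairwise_roc_auc(y_true, ranked_scores)
print('constant=%.3f, ranked=%.3f' % (auc_constant, auc_ranked))
```

When every example receives the same score, every pair is a tie, so the value is exactly 0.5 regardless of which constant is predicted.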

The complete example is listed below.

    # compare naive classifiers with roc auc
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from matplotlib import pyplot

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the ROC AUC for each naive classifier strategy.

In this case, we can see that as expected, predicting a stratified random label results in the worst-case ROC AUC of 0.5.

    >Uniform 0.500 (0.000)
    >Stratified 0.506 (0.020)
    >Majority 0.500 (0.000)
    >Minority 0.500 (0.000)
    >Prior 0.500 (0.000)

The Precision-Recall Curve (or PR Curve) is a plot of the recall versus the precision for a range of different probability thresholds.

The Precision-Recall area under curve is an approximation of the integral or area under the Precision-Recall curve and summarizes how well an algorithm performs across the range of probability thresholds.

A no-skill model has a PR AUC that matches the base rate of the positive class, e.g. 0.01. This can be achieved by predicting class labels randomly but in proportion to their base rate (e.g. no discrimination power). This would be the stratified method.

Predicting a constant value, like the majority class or minority class will result in an invalid PR Curve (e.g. a point) and in turn an invalid PR AUC score. Scores for models that predict a constant value should be ignored.

The complete example is listed below.

    # compare naive classifiers with precision-recall auc metric
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import auc
    from sklearn.metrics import make_scorer
    from matplotlib import pyplot

    # calculate precision-recall area under curve
    def pr_auc(y_true, probas_pred):
        # calculate precision-recall curve
        p, r, _ = precision_recall_curve(y_true, probas_pred)
        # calculate area under curve
        return auc(r, p)

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # define the model evaluation metric
        # note: recent scikit-learn releases replace needs_proba=True with response_method='predict_proba'
        metric = make_scorer(pr_auc, needs_proba=True)
        # evaluate model
        scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the PR AUC score for each naive classifier strategy.

In this case, we can see that as expected, predicting a stratified random class label results in the worst-case PR AUC of close to 0.01.

    >Uniform 0.505 (0.000)
    >Stratified 0.015 (0.037)
    >Majority 0.505 (0.000)
    >Minority 0.505 (0.000)
    >Prior 0.505 (0.000)
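The value of 0.505 reported for the strategies that give every example the same probability has a plausible explanation. With only one distinct score, the precision-recall curve is assumed here to collapse to two points: recall 1.0 with precision equal to the positive base rate, and the conventional end point of recall 0.0 with precision 1.0. The sketch below computes the trapezoidal area between those two points; the helper name `trapezoid_area` is illustrative.

```python
# trapezoidal area under a two-point precision-recall curve
def trapezoid_area(x, y):
    area = 0.0
    for i in range(1, len(x)):
        # width of the segment times the average height
        area += abs(x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return area

base_rate = 0.01               # fraction of positive examples
recall = [1.0, 0.0]
precision = [base_rate, 1.0]
degenerate_pr_auc = trapezoid_area(recall, precision)
print('%.3f' % degenerate_pr_auc)
```

Under that assumption the area is the average of the base rate and 1.0, which is an artifact of the degenerate curve rather than a sign of skill.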

Brier score calculates the mean squared error between the expected probabilities and the predicted probabilities.

The appropriate naive classifier for Brier score is to predict the class priors for each example in the test set. For a binary classification problem that involves predicting a Binomial distribution, this would be the prior for class 0 and the prior for class 1.

We can demonstrate this with a worked example comparing each naive classifier strategy on a binary classification problem.

The model would predict the probabilities [0.99, 0.01] in all cases. We would expect that this will result in mean squared error close to the prior for the minority class, e.g. 0.01 on this dataset. This is because the Binomial probability for most examples is 0.0 with only 1 percent having 1.0 which results in a maximum error for 1 percent of cases, or a Brier score of 0.01.
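The expected Brier score for a strategy that predicts one fixed probability for the positive class can be worked out by hand. The sketch below assumes a positive prior of 0.01 and assumes that the uniform, majority, minority and prior strategies assign the positive class a probability of 0.5, 0.0, 1.0 and 0.01 respectively; the function name `expected_brier` is illustrative.

```python
# expected brier score for strategies that predict one fixed probability
prior = 0.01  # fraction of examples in the positive class

# mean squared error between a fixed predicted probability and the 0/1 labels
def expected_brier(p_hat, prior):
    error_on_positives = (1.0 - p_hat) ** 2
    error_on_negatives = (0.0 - p_hat) ** 2
    return prior * error_on_positives + (1.0 - prior) * error_on_negatives

strategies = [('Uniform', 0.5), ('Majority', 0.0), ('Minority', 1.0), ('Prior', prior)]
expected = {name: expected_brier(p_hat, prior) for name, p_hat in strategies}
for name, score in expected.items():
    print('>%s %.4f' % (name, score))
```

Predicting the prior gives a slightly lower error than predicting 0.0 for every example, although the two are the same when rounded to two decimal places.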

The complete example is listed below.

    # compare naive classifiers with brier score metric
    from numpy import mean
    from numpy import std
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import RepeatedStratifiedKFold
    from sklearn.dummy import DummyClassifier
    from matplotlib import pyplot

    # evaluate a model
    def evaluate_model(X, y, model):
        # define evaluation procedure
        cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
        # evaluate model
        # note: the scorer is named 'neg_brier_score' in recent scikit-learn releases (formerly 'brier_score_loss')
        scores = cross_val_score(model, X, y, scoring='neg_brier_score', cv=cv, n_jobs=-1)
        return scores

    # define models to test
    def get_models():
        models, names = list(), list()
        # Uniformly Random Guess
        models.append(DummyClassifier(strategy='uniform'))
        names.append('Uniform')
        # Prior Random Guess
        models.append(DummyClassifier(strategy='stratified'))
        names.append('Stratified')
        # Majority Class: Predict 0
        models.append(DummyClassifier(strategy='most_frequent'))
        names.append('Majority')
        # Minority Class: Predict 1
        models.append(DummyClassifier(strategy='constant', constant=1))
        names.append('Minority')
        # Class Prior
        models.append(DummyClassifier(strategy='prior'))
        names.append('Prior')
        return models, names

    # define dataset
    X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
    # define models
    models, names = get_models()
    results = list()
    # evaluate each model
    for i in range(len(models)):
        # evaluate the model and store results
        scores = evaluate_model(X, y, models[i])
        results.append(scores)
        # summarize and store
        print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
    # plot the results
    pyplot.boxplot(results, labels=names, showmeans=True)
    pyplot.show()

Running the example reports the Brier score for each naive classifier strategy.

A lower Brier score is better, with 0.0 representing a perfect score.

As such, scikit-learn negates the score so that larger values are better, hence the negative mean Brier scores for each naive classifier. The sign can, therefore, be ignored.

As expected, we can see that predicting the prior probability results in the best score. We can also see that predicting the majority class also results in the same best Brier score.

    >Uniform -0.250 (0.000)
    >Stratified -0.020 (0.003)
    >Majority -0.010 (0.000)
    >Minority -0.990 (0.000)
    >Prior -0.010 (0.000)

We can summarize the mapping of imbalanced classification metrics to naive classification methods.

This provides a look-up table that you can consult on your next imbalanced classification project.

- **Accuracy**: Predict the majority class (class 0).
- **G-Mean**: Predict a uniformly random class.
- **F1-Measure**: Predict the minority class (class 1).
- **F0.5-Measure**: Predict the minority class (class 1).
- **F2-Measure**: Predict the minority class (class 1).
- **ROC AUC**: Predict a stratified random class.
- **PR AUC**: Predict a stratified random class.
- **Brier Score**: Predict the class priors.
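If it helps, the look-up table can also be kept as a small dictionary of keyword arguments for the DummyClassifier class used in the worked examples. This is only a sketch: the dictionary name, the metric keys and the helper function are hypothetical and are not part of scikit-learn.

```python
# look-up table of naive baselines, written as DummyClassifier arguments
# (the dictionary name and metric keys are illustrative, not a library API)
NAIVE_BASELINES = {
    'accuracy': {'strategy': 'most_frequent'},
    'g-mean': {'strategy': 'uniform'},
    'f1': {'strategy': 'constant', 'constant': 1},
    'f0.5': {'strategy': 'constant', 'constant': 1},
    'f2': {'strategy': 'constant', 'constant': 1},
    'roc_auc': {'strategy': 'stratified'},
    'pr_auc': {'strategy': 'stratified'},
    'brier': {'strategy': 'prior'},
}

# return a copy of the arguments for the requested metric
def baseline_kwargs(metric):
    return dict(NAIVE_BASELINES[metric.lower()])

print(baseline_kwargs('F1'))
```

A baseline model could then be created with something like `DummyClassifier(**baseline_kwargs('f1'))`, assuming class 1 is the minority class as in the examples above.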

This section provides more resources on the topic if you are looking to go deeper.

- Tour of Evaluation Metrics for Imbalanced Classification
- How to Develop and Evaluate Naive Classifier Strategies Using Probability

In this tutorial, you discovered which naive classifier to use for each imbalanced classification performance metric.

Specifically, you learned:

- The metrics to consider when evaluating machine learning models for imbalanced classification problems.
- The naive classification strategies that can be used to calculate a baseline in model performance.
- The naive classifier to use for each metric, including the rationale and a worked example demonstrating the result.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post What Is the Naive Classifier for Each Imbalanced Classification Metric? appeared first on Machine Learning Mastery.

The post How to Fix k-Fold Cross-Validation for Imbalanced Classification appeared first on Machine Learning Mastery.

Model evaluation involves using the available dataset to fit a model and estimate its performance when making predictions on unseen examples.

It is a challenging problem as both the training dataset used to fit the model and the test set used to evaluate it must be sufficiently large and representative of the underlying problem so that the resulting estimate of model performance is not too optimistic or pessimistic.

The two most common approaches used for model evaluation are the train/test split and the k-fold cross-validation procedure. Both approaches can be very effective in general, although they can result in misleading results and potentially fail when used on classification problems with a severe class imbalance.

In this tutorial, you will discover how to evaluate classifier models on imbalanced datasets.

After completing this tutorial, you will know:

- The challenge of evaluating classifiers on datasets using train/test splits and cross-validation.
- How a naive application of k-fold cross-validation and train-test splits will fail when evaluating classifiers on imbalanced datasets.
- How modified k-fold cross-validation and train-test splits can be used to preserve the class distribution in the dataset.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Evaluating Classifiers
- Failure of k-Fold Cross-Validation
- Fix Cross-Validation for Imbalanced Classification

Evaluating a classification model is challenging because we won’t know how good a model is until it is used.

Instead, we must estimate the performance of a model using available data where we already have the target or outcome.

Model evaluation involves more than just evaluating a model; it includes testing different data preparation schemes, different learning algorithms, and different hyperparameters for well-performing learning algorithms.

- Model = Data Preparation + Learning Algorithm + Hyperparameters

Ideally, the model construction procedure (data preparation, learning algorithm, and hyperparameters) with the best score (with your chosen metric) can be selected and used.

The simplest model evaluation procedure is to split a dataset into two parts and use one part for training a model and the second part for testing the model. As such, the parts of the dataset are named for their function, train set and test set respectively.

This is effective if your collected dataset is very large and representative of the problem. The number of examples required will differ from problem to problem, but may be thousands, hundreds of thousands, or millions of examples to be sufficient.

A split of 50/50 for train and test would be ideal, although more skewed splits are common, such as 67/33 or 80/20 for train and test sets.

We rarely have enough data to get an unbiased estimate of performance using a train/test split evaluation of a model. Instead, we often have a much smaller dataset than would be preferred, and resampling strategies must be used on this dataset.

The most used model evaluation scheme for classifiers is the 10-fold cross-validation procedure.

The k-fold cross-validation procedure involves splitting the training dataset into *k* folds. The first *k-1* folds are used to train a model, and the holdout *k*th fold is used as the test set. This process is repeated and each of the folds is given an opportunity to be used as the holdout test set. A total of *k* models are fit and evaluated, and the performance of the model is calculated as the mean of these runs.

The procedure has been shown to give a less optimistic estimate of model performance on small training datasets than a single train/test split. A value of *k=10* has been shown to be effective across a wide range of dataset sizes and model types.
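The procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the scikit-learn implementation; the function name `k_fold_indices` and the toy sizes are assumptions.

```python
# minimal sketch of the k-fold procedure using row indices only
from random import Random

def k_fold_indices(n_examples, k, seed=1):
    indices = list(range(n_examples))
    Random(seed).shuffle(indices)
    # deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        # fold i is held out, the remaining k-1 folds are used for training
        test_ix = folds[i]
        train_ix = [ix for j, fold in enumerate(folds) if j != i for ix in fold]
        yield train_ix, test_ix

splits = list(k_fold_indices(20, 5))
for train_ix, test_ix in splits:
    print('train=%d, test=%d' % (len(train_ix), len(test_ix)))
```

Each row appears in exactly one test fold, and a model would be fit and scored once per split before averaging the k scores.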

Sadly, the k-fold cross-validation is not appropriate for evaluating imbalanced classifiers.

A 10-fold cross-validation, in particular, the most commonly used error-estimation method in machine learning, can easily break down in the case of class imbalances, even if the skew is less extreme than the one previously considered.

— Page 188, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The reason is that the data is split into *k*-folds with a uniform probability distribution.

This might work fine for data with a balanced class distribution, but when the distribution is severely skewed, it is likely that one or more folds will have few or no examples from the minority class. This means that some or perhaps many of the model evaluations will be misleading, as the model need only predict the majority class correctly.

We can make this concrete with an example.

First, we can define a dataset with a 1:100 minority to majority class distribution.

This can be achieved using the make_classification() function for creating a synthetic dataset, specifying the number of examples (1,000), the number of classes (2), and the weighting of each class (99% and 1%).

    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)

The example below generates the synthetic binary classification dataset and summarizes the class distribution.

    # create a binary classification dataset
    from numpy import unique
    from sklearn.datasets import make_classification
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
    # summarize dataset
    classes = unique(y)
    total = len(y)
    for c in classes:
        n_examples = len(y[y==c])
        percent = n_examples / total * 100
        print('> Class=%d : %d/%d (%.1f%%)' % (c, n_examples, total, percent))

Running the example creates the dataset and summarizes the number of examples in each class.

By setting the *random_state* argument, it ensures that we get the same randomly generated examples each time the code is run.

    > Class=0 : 990/1000 (99.0%)
    > Class=1 : 10/1000 (1.0%)

A total of 10 examples in the minority class is not many. If we used 10 folds, we would get one example in each fold in the ideal case, which is not enough to train a model. For demonstration purposes, we will use 5 folds.

In the ideal case, we would have 10/5 or two minority class examples in each fold, meaning 4*2 (8) minority examples in each training dataset and 1*2 (2) in each test dataset.

First, we will use the KFold class to randomly split the dataset into 5-folds and check the composition of each train and test set. The complete example is listed below.

    # example of k-fold cross-validation with an imbalanced dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import KFold
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
    kfold = KFold(n_splits=5, shuffle=True, random_state=1)
    # enumerate the splits and summarize the distributions
    for train_ix, test_ix in kfold.split(X):
        # select rows
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        # summarize train and test composition
        train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
        test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
        print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

Running the example creates the same dataset and enumerates each split of the data, showing the class distribution for both the train and test sets.

We can see that in this case, there are some splits that have the expected 8/2 split for train and test sets, and others that are much worse, such as 6/4 (optimistic) and 10/0 (pessimistic).

Evaluating a model on these splits of the data would not give a reliable estimate of performance.

    >Train: 0=791, 1=9, Test: 0=199, 1=1
    >Train: 0=793, 1=7, Test: 0=197, 1=3
    >Train: 0=794, 1=6, Test: 0=196, 1=4
    >Train: 0=790, 1=10, Test: 0=200, 1=0
    >Train: 0=792, 1=8, Test: 0=198, 1=2
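To get a rough sense of how often a test fold with no minority examples can occur, we can count the ways of drawing a fold of 200 rows from the 990 majority rows alone, relative to all ways of drawing 200 rows from the full 1,000. This is a back-of-the-envelope sketch for a single fold chosen uniformly at random; the variable names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
# chance that one randomly drawn fold contains no minority examples
from math import comb

n_total, n_minority, fold_size = 1000, 10, 200
# ways to fill the fold from majority rows only, over all possible folds
p_empty = comb(n_total - n_minority, fold_size) / comb(n_total, fold_size)
print('P(fold has no minority examples) = %.3f' % p_empty)
```

The probability is far from negligible, which is consistent with seeing a split like the fourth one above when folds are drawn without regard to the class label.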

We can demonstrate a similar issue exists if we use a simple train/test split of the dataset, although the issue is less severe.

We can use the train_test_split() function to create a 50/50 split of the dataset and, on average, we would expect five examples from the minority class to appear in each dataset if we performed this split many times.

The complete example is listed below.

    # example of train/test split with an imbalanced dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
    # split into train/test sets at random (no stratification)
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
    # summarize
    train_0, train_1 = len(trainy[trainy==0]), len(trainy[trainy==1])
    test_0, test_1 = len(testy[testy==0]), len(testy[testy==1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

Running the example creates the same dataset as before and splits it into a random train and test split.

In this case, we can see only three examples of the minority class are present in the training set, with seven in the test set.

Evaluating models on this split would give them too few minority examples to learn from and more than expected to be evaluated on, and would likely give poor performance. You can imagine how the situation could be worse with an even more severe random split.

>Train: 0=497, 1=3, Test: 0=493, 1=7

The solution is to not split the data randomly when using k-fold cross-validation or a train-test split.

Specifically, we can split a dataset randomly, although in such a way that maintains the same class distribution in each subset. This is called stratification or stratified sampling and the target variable (*y*), the class, is used to control the sampling process.

For example, we can use a version of k-fold cross-validation that preserves the imbalanced class distribution in each fold. It is called stratified k-fold cross-validation and will enforce the class distribution in each split of the data to match the distribution in the complete training dataset.

… it is common, in the case of class imbalances in particular, to use stratified 10-fold cross-validation, which ensures that the proportion of positive to negative examples found in the original distribution is respected in all the folds.

— Page 205, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
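The idea behind stratified splitting can be sketched without any libraries: group the row indices by class, shuffle within each class, then deal each class out across the folds in turn. This is a simplified illustration and not how scikit-learn implements it; the function name `stratified_folds` is an assumption.

```python
# minimal sketch of stratified fold assignment
from random import Random

def stratified_folds(y, k, seed=1):
    rng = Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(set(y)):
        # collect and shuffle the rows that belong to this class
        members = [i for i, value in enumerate(y) if value == label]
        rng.shuffle(members)
        # deal the rows of this class across the folds in turn
        for position, ix in enumerate(members):
            folds[position % k].append(ix)
    return folds

y = [0] * 990 + [1] * 10
folds = stratified_folds(y, 5)
for fold in folds:
    n_pos = sum(y[ix] for ix in fold)
    print('>Test: 0=%d, 1=%d' % (len(fold) - n_pos, n_pos))
```

Because each class is dealt out separately, every fold receives the same share of minority examples whenever the class counts divide evenly by k.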

We can make this concrete with an example.

We can stratify the splits using the StratifiedKFold class that supports stratified k-fold cross-validation as its name suggests.

Below is the same dataset and the same example with the stratified version of cross-validation.

    # example of stratified k-fold cross-validation with an imbalanced dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
    kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    # enumerate the splits and summarize the distributions
    for train_ix, test_ix in kfold.split(X, y):
        # select rows
        train_X, test_X = X[train_ix], X[test_ix]
        train_y, test_y = y[train_ix], y[test_ix]
        # summarize train and test composition
        train_0, train_1 = len(train_y[train_y==0]), len(train_y[train_y==1])
        test_0, test_1 = len(test_y[test_y==0]), len(test_y[test_y==1])
        print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

Running the example generates the dataset as before and summarizes the class distribution for the train and test sets for each split.

In this case, we can see that each split matches what we expected in the ideal case.

Each of the examples in the minority class is given one opportunity to be used in a test set, and each train and test set for each split of the data has the same class distribution.

    >Train: 0=792, 1=8, Test: 0=198, 1=2
    >Train: 0=792, 1=8, Test: 0=198, 1=2
    >Train: 0=792, 1=8, Test: 0=198, 1=2
    >Train: 0=792, 1=8, Test: 0=198, 1=2
    >Train: 0=792, 1=8, Test: 0=198, 1=2

This example highlights the need to first select a value of *k* for *k*-fold cross-validation to ensure that there are a sufficient number of examples in the train and test sets to fit and evaluate a model (two examples from the minority class in the test set is probably too few for a test set).

It also highlights the requirement to use stratified *k*-fold cross-validation with imbalanced datasets to preserve the class distribution in the train and test sets for each evaluation of a given model.

We can also use a stratified version of a train/test split.

This can be achieved by setting the “*stratify*” argument on the call to *train_test_split()* and setting it to the “*y*” variable containing the target variable from the dataset. From this, the function will determine the desired class distribution and ensure that the train and test sets both have this distribution.

We can demonstrate this with a worked example, listed below.

    # example of stratified train/test split with an imbalanced dataset
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # summarize
    train_0, train_1 = len(trainy[trainy==0]), len(trainy[trainy==1])
    test_0, test_1 = len(testy[testy==0]), len(testy[testy==1])
    print('>Train: 0=%d, 1=%d, Test: 0=%d, 1=%d' % (train_0, train_1, test_0, test_1))

Running the example creates a random split of the dataset into training and test sets, ensuring that the class distribution is preserved, in this case leaving five examples in each dataset.

>Train: 0=495, 1=5, Test: 0=495, 1=5

This section provides more resources on the topic if you are looking to go deeper.

- sklearn.model_selection.KFold API.
- sklearn.model_selection.StratifiedKFold API.
- sklearn.model_selection.train_test_split API.

In this tutorial, you discovered how to evaluate classifier models on imbalanced datasets.

Specifically, you learned:

- The challenge of evaluating classifiers on datasets using train/test splits and cross-validation.
- How a naive application of k-fold cross-validation and train-test splits will fail when evaluating classifiers on imbalanced datasets.
- How modified k-fold cross-validation and train-test splits can be used to preserve the class distribution in the dataset.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Probability Metrics for Imbalanced Classification appeared first on Machine Learning Mastery.

Classification predictive modeling involves predicting a class label for examples, although some problems require the prediction of a probability of class membership.

For these problems, the crisp class labels are not required, and instead, the likelihood of each example belonging to each class is required and later interpreted. As such, small relative probabilities can carry a lot of meaning and specialized metrics are required to quantify the predicted probabilities.

In this tutorial, you will discover metrics for evaluating probabilistic predictions for imbalanced classification.

After completing this tutorial, you will know:

- Probability predictions are required for some classification predictive modeling problems.
- Log loss quantifies the average difference between predicted and expected probability distributions.
- Brier score quantifies the average difference between predicted and expected probabilities.

Let’s get started.

This tutorial is divided into three parts; they are:

- Probability Metrics
- Log Loss for Imbalanced Classification
- Brier Score for Imbalanced Classification

Classification predictive modeling involves predicting a class label for an example.

On some problems, a crisp class label is not required, and instead a probability of class membership is preferred. The probability summarizes the likelihood (or uncertainty) of an example belonging to each class label. Probabilities are more nuanced and can be interpreted by a human operator or a system in decision making.

Probability metrics are those specifically designed to quantify the skill of a classifier model using the predicted probabilities instead of crisp class labels. They are typically scores that provide a single value that can be used to compare different models based on how well the predicted probabilities match the expected class probabilities.

In practice, a dataset will not have target probabilities. Instead, it will have class labels.

For example, a two-class (binary) classification problem will have the class labels 0 for the negative case and 1 for the positive case. When an example has the class label 0, then the probability of the class labels 0 and 1 will be 1 and 0 respectively. When an example has the class label 1, then the probability of class labels 0 and 1 will be 0 and 1 respectively.

- **Example with Class=0**: P(class=0) = 1, P(class=1) = 0
- **Example with Class=1**: P(class=0) = 0, P(class=1) = 1

We can see how this would scale to three classes or more; for example:

- **Example with Class=0**: P(class=0) = 1, P(class=1) = 0, P(class=2) = 0
- **Example with Class=1**: P(class=0) = 0, P(class=1) = 1, P(class=2) = 0
- **Example with Class=2**: P(class=0) = 0, P(class=1) = 0, P(class=2) = 1

In the case of binary classification problems, this representation can be simplified to just focus on the positive class.

That is, we only require the probability of an example belonging to class 1 to represent the probabilities for binary classification (the so-called Bernoulli distribution); for example:

- **Example with Class=0**: P(class=1) = 0
- **Example with Class=1**: P(class=1) = 1
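As a small illustration of the representations above, the sketch below converts class labels into expected probability distributions. The helper name `to_distribution` is hypothetical and is not part of any library.

```python
# convert integer class labels into expected probability distributions
def to_distribution(label, n_classes):
    # hypothetical helper: probability 1.0 for the known class, 0.0 elsewhere
    return [1.0 if c == label else 0.0 for c in range(n_classes)]

# three-class case: one probability per class
print(to_distribution(0, 3))
print(to_distribution(2, 3))

# binary case simplified to P(class=1) only
labels = [0, 1, 1, 0]
positive_probs = [float(label) for label in labels]
print(positive_probs)
```

In the binary case, the class label itself doubles as the expected probability for the positive class.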

Probability metrics will summarize how well the predicted distribution of class membership matches the known class probability distribution.

This focus on predicted probabilities may mean that the crisp class labels predicted by a model are ignored. This focus may mean that a model that predicts probabilities may appear to have terrible performance when evaluated according to its crisp class labels, such as using accuracy or a similar score. This is because although the predicted probabilities may show skill, they must be interpreted with an appropriate threshold prior to being converted into crisp class labels.
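To make the role of the threshold concrete, the sketch below uses a handful of made-up probabilities for the positive class. The values, the helper names `to_labels` and `accuracy`, and the two thresholds are assumptions chosen only for illustration.

```python
# hypothetical expected labels and predicted probabilities for class 1
labels = [0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.02, 0.30, 0.35, 0.45]

def to_labels(probs, threshold):
    # map each probability to a crisp class label
    return [1 if p >= threshold else 0 for p in probs]

def accuracy(expected, predicted):
    # fraction of crisp labels that match the expected labels
    correct = sum(1 for e, p in zip(expected, predicted) if e == p)
    return correct / len(expected)

for threshold in [0.5, 0.33]:
    predicted = to_labels(probs, threshold)
    print('threshold=%.2f labels=%s accuracy=%.3f' % (threshold, predicted, accuracy(labels, predicted)))
```

The probabilities rank the two positive examples above all negative examples, yet the default threshold of 0.5 assigns every example to class 0. A lower threshold recovers the positive examples without changing the probabilities at all.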

Additionally, the focus on predicted probabilities may also require that the probabilities predicted by some nonlinear models be calibrated prior to being used or evaluated. Some models will learn calibrated probabilities as part of the training process (e.g. logistic regression), but many will not and will require calibration (e.g. support vector machines, decision trees, and neural networks).

A given probability metric is typically calculated for each example, then averaged across all examples in the training dataset.

There are two popular metrics for evaluating predicted probabilities; they are:

- Log Loss
- Brier Score

Let’s take a closer look at each in turn.

Logarithmic loss or log loss for short is a loss function known for training the logistic regression classification algorithm.

The log loss function calculates the negative log likelihood for probability predictions made by a binary classification model. Most notably, this is logistic regression, but this function can be used by other models, such as neural networks, and is known by other names, such as cross-entropy.

Generally, the log loss can be calculated using the expected probability of each class (*y_c*) and the natural logarithm of the predicted probability of each class (*yhat_c*); for example:

- LogLoss = -(y_0 * log(yhat_0) + y_1 * log(yhat_1))

The best possible log loss is 0.0, and values grow toward positive infinity for progressively worse predictions.

If you are just predicting the probability for the positive class, then the log loss function can be calculated for one binary classification prediction (*yhat*) compared to the expected probability (*y*) as follows:

- LogLoss = -((1 – y) * log(1 – yhat) + y * log(yhat))

For example, if the expected probability was 1.0 and the model predicted 0.8, the log loss would be:

- LogLoss = -((1 – y) * log(1 – yhat) + y * log(yhat))
- LogLoss = -((1 – 1.0) * log(1 – 0.8) + 1.0 * log(0.8))
- LogLoss = -(-0.0 + -0.223)
- LogLoss = 0.223
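The worked calculation above can be checked with a few lines of Python. This is a minimal sketch using the standard library; the function name `log_loss_single` is hypothetical, and the predicted probability must lie strictly between 0 and 1 for the logarithm to be defined.

```python
from math import log

def log_loss_single(y, yhat):
    # log loss for one binary prediction, using the natural logarithm
    return -((1 - y) * log(1 - yhat) + y * log(yhat))

# expected probability 1.0 with predictions of varying confidence
for yhat in [0.99, 0.8, 0.1]:
    print('yhat=%.2f LogLoss=%.3f' % (yhat, log_loss_single(1.0, yhat)))
```

The loss shrinks as the prediction approaches the expected value and grows quickly as the prediction moves away from it.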

This calculation can be scaled up for multiple classes by adding additional terms; for example:

- LogLoss = -( sum c in C y_c * log(yhat_c))
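A minimal sketch of the multi-class form is shown below. The function name `cross_entropy` and the three-class probabilities are assumptions for illustration only.

```python
from math import log

def cross_entropy(y, yhat):
    # negative sum over classes of expected probability times log of predicted probability
    return -sum(y_c * log(yhat_c) for y_c, yhat_c in zip(y, yhat))

# hypothetical three-class example where the known class is class 1
expected = [0.0, 1.0, 0.0]
predicted = [0.1, 0.7, 0.2]
print('LogLoss=%.3f' % cross_entropy(expected, predicted))

# the two-class case reproduces the binary calculation
print('LogLoss=%.3f' % cross_entropy([0.0, 1.0], [0.2, 0.8]))
```

Because the expected distribution is 1 for the known class and 0 elsewhere, only the predicted probability of the known class contributes to the score.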

This generalization is also known as cross-entropy and calculates the number of bits (if log base-2 is used) or nats (if log base-e is used) by which two probability distributions differ.

Specifically, it builds upon the idea of entropy from information theory and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

… the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q …

— Page 57, Machine Learning: A Probabilistic Perspective, 2012.

The intuition for this definition comes from considering a target or underlying probability distribution *P* and an approximation of the target distribution *Q*. The cross-entropy of *Q* from *P* is then the average number of bits needed to represent an event drawn from *P* when using a code based on *Q* instead of *P*.

We will stick with log loss for now, as it is the term most commonly used when using this calculation as an evaluation metric for classifier models.

When calculating the log loss for a set of predictions compared to a set of expected probabilities in a test dataset, the average of the log loss across all samples is calculated and reported; for example:

- AverageLogLoss = 1/N * sum i in N -((1 – y_i) * log(1 – yhat_i) + y_i * log(yhat_i))

The average log loss for a set of predictions on a training dataset is often simply referred to as the log loss.
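The averaging step can be sketched in plain Python as follows. The function name `average_log_loss` and the four example predictions are assumptions for illustration.

```python
from math import log

def average_log_loss(y_true, y_prob):
    # mean of the per-example binary log loss
    total = 0.0
    for y, yhat in zip(y_true, y_prob):
        total += -((1 - y) * log(1 - yhat) + y * log(yhat))
    return total / len(y_true)

# hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 0, 1]
y_prob = [0.1, 0.2, 0.1, 0.8]
print('AverageLogLoss=%.3f' % average_log_loss(y_true, y_prob))
```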

We can demonstrate calculating log loss with a worked example.

First, let’s define a synthetic binary classification dataset. We will use the make_classification() function to create 1,000 examples, with 99%/1% split for the two classes. The complete example of creating and summarizing the dataset is listed below.

    # create an imbalanced dataset
    from numpy import unique
    from sklearn.datasets import make_classification
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
    # summarize dataset
    classes = unique(y)
    total = len(y)
    for c in classes:
        n_examples = len(y[y==c])
        percent = n_examples / total * 100
        print('> Class=%d : %d/%d (%.1f%%)' % (c, n_examples, total, percent))

Running the example creates the dataset and reports the distribution of examples in each class.

    > Class=0 : 990/1000 (99.0%)
    > Class=1 : 10/1000 (1.0%)

Next, we will develop an intuition for naive predictions of probabilities.

A naive prediction strategy would be to predict certainty for the majority class, or P(class=0) = 1. An alternative strategy would be to predict the minority class, or P(class=1) = 1.

Log loss can be calculated using the log_loss() scikit-learn function. It takes the probability for each class as input and returns the average log loss. Specifically, each example must have a prediction with one probability per class, meaning a prediction for one example for a binary classification problem must have a probability for class 0 and class 1.

Therefore, predicting certain probabilities for class 0 for all examples would be implemented as follows:

    ...
    # no skill prediction 0
    probabilities = [[1, 0] for _ in range(len(testy))]
    avg_logloss = log_loss(testy, probabilities)
    print('P(class0=1): Log Loss=%.3f' % (avg_logloss))

We can do the same thing for P(class1)=1.

These two strategies are expected to perform terribly.

A better naive strategy would be to predict the class distribution for each example. For example, because our dataset has a 99%/1% class distribution for the majority and minority classes, this distribution can be “*predicted*” for each example to give a baseline for probability predictions.

    ...
    # baseline probabilities
    probabilities = [[0.99, 0.01] for _ in range(len(testy))]
    avg_logloss = log_loss(testy, probabilities)
    print('Baseline: Log Loss=%.3f' % (avg_logloss))

Finally, we can also calculate the log loss for perfectly predicted probabilities by taking the target values for the test set as predictions.

    ...
    # perfect probabilities
    avg_logloss = log_loss(testy, testy)
    print('Perfect: Log Loss=%.3f' % (avg_logloss))

Tying this all together, the complete example is listed below.

    # log loss for naive probability predictions.
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import log_loss
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # no skill prediction 0
    probabilities = [[1, 0] for _ in range(len(testy))]
    avg_logloss = log_loss(testy, probabilities)
    print('P(class0=1): Log Loss=%.3f' % (avg_logloss))
    # no skill prediction 1
    probabilities = [[0, 1] for _ in range(len(testy))]
    avg_logloss = log_loss(testy, probabilities)
    print('P(class1=1): Log Loss=%.3f' % (avg_logloss))
    # baseline probabilities
    probabilities = [[0.99, 0.01] for _ in range(len(testy))]
    avg_logloss = log_loss(testy, probabilities)
    print('Baseline: Log Loss=%.3f' % (avg_logloss))
    # perfect probabilities
    avg_logloss = log_loss(testy, testy)
    print('Perfect: Log Loss=%.3f' % (avg_logloss))

Running the example reports the log loss for each naive strategy.

As expected, predicting certainty for each class label is punished with large log loss scores, and being certain of the minority class for every example results in a much larger score.

We can see that predicting the distribution of examples in the dataset as the baseline results in a better score than either of the other naive measures. This baseline represents the no skill classifier and log loss scores below this strategy represent a model that has some skill.

Finally, we can see that a log loss for perfectly predicted probabilities is 0.0, indicating no difference between actual and predicted probability distributions.

    P(class0=1): Log Loss=0.345
    P(class1=1): Log Loss=34.193
    Baseline: Log Loss=0.056
    Perfect: Log Loss=0.000
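The reported scores can be reproduced by hand, which helps to show where they come from. The sketch below rests on two assumptions that are not stated in the listing: that the stratified test set contains 495 examples of class 0 and 5 of class 1 (half of the 990/10 dataset), and that probabilities of 0 and 1 are clipped to 1e-15 and 1 - 1e-15 before taking the logarithm. Library versions that clip differently would report different values for the two certain strategies.

```python
from math import log

# assumed composition of the stratified test set: 495 negatives, 5 positives
n_neg, n_pos = 495, 5
# assumed clipping value that keeps the logarithm finite for 0.0 and 1.0
eps = 1e-15

def clip(p):
    # keep a probability inside [eps, 1 - eps]
    return min(max(p, eps), 1.0 - eps)

def constant_log_loss(p_neg, p_pos):
    # average log loss when [p_neg, p_pos] is predicted for every example
    loss_neg = -log(clip(p_neg))  # loss on each class 0 example
    loss_pos = -log(clip(p_pos))  # loss on each class 1 example
    return (n_neg * loss_neg + n_pos * loss_pos) / (n_neg + n_pos)

print('P(class0=1): Log Loss=%.3f' % constant_log_loss(1.0, 0.0))
print('P(class1=1): Log Loss=%.3f' % constant_log_loss(0.0, 1.0))
print('Baseline: Log Loss=%.3f' % constant_log_loss(0.99, 0.01))
```

Under these assumptions, being certain of class 0 is only penalized on the 5 positive examples, whereas being certain of class 1 is penalized on all 495 negative examples, which explains the gap between the two scores.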

Now that we are familiar with log loss, let’s take a look at the Brier score.

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts and is designed for binary classification problems. It is focused on evaluating the probabilities for the positive class. Nevertheless, it can be adapted for problems with multiple classes.

As such, it is an appropriate probabilistic metric for imbalanced classification problems.

The evaluation of probabilistic scores is generally performed by means of the Brier Score. The basic idea is to compute the mean squared error (MSE) between predicted probability scores and the true class indicator, where the positive class is coded as 1, and negative class 0.

— Page 57, Learning from Imbalanced Data Sets, 2018.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

The Brier score can be calculated for positive predicted probabilities (*yhat*) compared to the expected probabilities (*y*) as follows:

- BrierScore = 1/N * Sum i to N (yhat_i – y_i)^2

For example, if a predicted positive class probability is 0.8 and the expected probability is 1.0, then the Brier score is calculated as:

- BrierScore = (yhat_i – y_i)^2
- BrierScore = (0.8 – 1.0)^2
- BrierScore = 0.04
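The same calculation can be sketched in plain Python. The function name `brier_score` and the second set of example values are assumptions for illustration.

```python
def brier_score(y_true, y_prob):
    # mean squared error between predicted probabilities and expected values
    return sum((yhat - y) ** 2 for y, yhat in zip(y_true, y_prob)) / len(y_true)

# the single prediction from the worked example
print('BrierScore=%.2f' % brier_score([1.0], [0.8]))

# hypothetical set of four predictions for the positive class
print('BrierScore=%.3f' % brier_score([0, 0, 1, 1], [0.1, 0.3, 0.8, 0.6]))
```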

We can demonstrate calculating Brier score with a worked example using the same dataset and naive predictive models as were used in the previous section.

The Brier score can be calculated using the brier_score_loss() scikit-learn function. It takes the probabilities for the positive class only, and returns an average score.

As in the previous section, we can evaluate naive strategies of predicting certainty for each class label. In this case, as the score only considers the probability for the positive class, this will involve predicting 0.0 for every example in the case of P(class=1)=0 and 1.0 for every example in the case of P(class=1)=1. For example:

    ...
    # no skill prediction 0
    probabilities = [0.0 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('P(class1=0): Brier Score=%.4f' % (avg_brier))
    # no skill prediction 1
    probabilities = [1.0 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('P(class1=1): Brier Score=%.4f' % (avg_brier))

We can also test the no skill classifier that predicts the ratio of positive examples in the dataset, which in this case is 1 percent or 0.01.

    ...
    # baseline probabilities
    probabilities = [0.01 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('Baseline: Brier Score=%.4f' % (avg_brier))

Finally, we can also confirm the Brier score for perfectly predicted probabilities.

    ...
    # perfect probabilities
    avg_brier = brier_score_loss(testy, testy)
    print('Perfect: Brier Score=%.4f' % (avg_brier))

Tying this together, the complete example is listed below.

    # brier score for naive probability predictions.
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import brier_score_loss
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # no skill prediction 0
    probabilities = [0.0 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('P(class1=0): Brier Score=%.4f' % (avg_brier))
    # no skill prediction 1
    probabilities = [1.0 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('P(class1=1): Brier Score=%.4f' % (avg_brier))
    # baseline probabilities
    probabilities = [0.01 for _ in range(len(testy))]
    avg_brier = brier_score_loss(testy, probabilities)
    print('Baseline: Brier Score=%.4f' % (avg_brier))
    # perfect probabilities
    avg_brier = brier_score_loss(testy, testy)
    print('Perfect: Brier Score=%.4f' % (avg_brier))

Running the example, we can see the scores for the naive models and the baseline no skill classifier.

As we might expect, we can see that predicting a 0.0 for all examples results in a low score, as the mean squared error between all 0.0 predictions and mostly 0 classes in the test set results in a small value. Conversely, the error between 1.0 predictions and mostly 0 class values results in a larger error score.

Importantly, we can see that the default no skill classifier results in a lower score than predicting all 0.0 values. Again, this represents the baseline score, below which models will demonstrate skill.

    P(class1=0): Brier Score=0.0100
    P(class1=1): Brier Score=0.9900
    Baseline: Brier Score=0.0099
    Perfect: Brier Score=0.0000
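To see why predicting the class ratio scores slightly better than predicting 0.0, we can evaluate a range of constant predictions by hand. The sketch below assumes the stratified test set contains 495 examples of class 0 and 5 of class 1; the function name `constant_brier` and the list of candidate probabilities are hypothetical.

```python
# assumed composition of the stratified test set: 495 negatives, 5 positives
n_neg, n_pos = 495, 5

def constant_brier(p):
    # Brier score when probability p is predicted for every example
    return (n_neg * (p - 0.0) ** 2 + n_pos * (p - 1.0) ** 2) / (n_neg + n_pos)

candidates = [0.0, 0.005, 0.01, 0.02, 0.05, 1.0]
for p in candidates:
    print('p=%.3f: Brier Score=%.4f' % (p, constant_brier(p)))

# the candidate with the lowest score
best = min(candidates, key=constant_brier)
print('Lowest score at p=%.3f' % best)
```

Among constant predictions, the mean squared error is minimized by the mean of the targets, which here is the fraction of positive examples.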

The Brier scores can become very small, and the focus will be on digits well to the right of the decimal point. For example, the difference in the above example between the Baseline and Perfect scores is just 0.0099.

A common practice is to transform the score using a reference score, such as the no skill classifier. This is called a Brier Skill Score, or BSS, and is calculated as follows:

- BrierSkillScore = 1 – (BrierScore / BrierScore_ref)

We can see that if the reference score was evaluated, it would result in a BSS of 0.0. This represents a no skill prediction. Values below this will be negative and represent worse than no skill. Values above 0.0 represent skillful predictions with a perfect prediction value of 1.0.
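The interpretation of the score can be illustrated with a few numbers. In the sketch below, the reference score of 0.0099 is taken from the previous section, while the scores labeled as hypothetical are made up for illustration and the function name `bss_from_scores` is not part of any library.

```python
def bss_from_scores(brier, brier_ref):
    # Brier skill score computed directly from two Brier scores
    return 1.0 - (brier / brier_ref)

# reference score from the no skill classifier in the previous section
brier_ref = 0.0099
scores = [
    ('reference itself', 0.0099),
    ('hypothetical skillful model', 0.0050),
    ('hypothetical poor model', 0.0200),
    ('perfect', 0.0),
]
for name, brier in scores:
    print('%s: BSS=%.4f' % (name, bss_from_scores(brier, brier_ref)))
```

Rescaling in this way makes small differences in the Brier score easier to read, because every score is expressed relative to the no skill reference.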

We can demonstrate this by developing a function to calculate the Brier skill score listed below.

    # calculate the brier skill score
    def brier_skill_score(y, yhat, brier_ref):
        # calculate the brier score
        bs = brier_score_loss(y, yhat)
        # calculate skill score
        return 1.0 - (bs / brier_ref)

We can then calculate the BSS for each of the naive forecasts, as well as for a perfect prediction.

The complete example is listed below.

    # brier skill score for naive probability predictions.
    from numpy import mean
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import brier_score_loss

    # calculate the brier skill score
    def brier_skill_score(y, yhat, brier_ref):
        # calculate the brier score
        bs = brier_score_loss(y, yhat)
        # calculate skill score
        return 1.0 - (bs / brier_ref)

    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99], flip_y=0, random_state=1)
    # split into train/test sets with same class ratio
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
    # calculate reference
    probabilities = [0.01 for _ in range(len(testy))]
    brier_ref = brier_score_loss(testy, probabilities)
    print('Reference: Brier Score=%.4f' % (brier_ref))
    # no skill prediction 0
    probabilities = [0.0 for _ in range(len(testy))]
    bss = brier_skill_score(testy, probabilities, brier_ref)
    print('P(class1=0): BSS=%.4f' % (bss))
    # no skill prediction 1
    probabilities = [1.0 for _ in range(len(testy))]
    bss = brier_skill_score(testy, probabilities, brier_ref)
    print('P(class1=1): BSS=%.4f' % (bss))
    # baseline probabilities
    probabilities = [0.01 for _ in range(len(testy))]
    bss = brier_skill_score(testy, probabilities, brier_ref)
    print('Baseline: BSS=%.4f' % (bss))
    # perfect probabilities
    bss = brier_skill_score(testy, testy, brier_ref)
    print('Perfect: BSS=%.4f' % (bss))

Running the example first calculates the reference Brier score used in the BSS calculation.

We can then see that predicting certainty scores for each class results in a negative BSS score, indicating that they are worse than no skill. Finally, we can see that evaluating the reference forecast itself results in 0.0, indicating no skill, and that evaluating the true values as predictions results in a perfect score of 1.0.

As such, the Brier Skill Score is a best practice for evaluating probability predictions and is widely used where probabilistic classification predictions are evaluated routinely, such as in weather forecasts (e.g. rain or not).

    Reference: Brier Score=0.0099
    P(class1=0): BSS=-0.0101
    P(class1=1): BSS=-99.0000
    Baseline: BSS=0.0000
    Perfect: BSS=1.0000

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Probability Scoring Methods in Python
- A Gentle Introduction to Cross-Entropy for Machine Learning
- A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation

- Chapter 8 Assessment Metrics For Imbalanced Learning, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Chapter 3 Performance Measures, Learning from Imbalanced Data Sets, 2018.

- sklearn.datasets.make_classification API.
- sklearn.metrics.log_loss API.
- sklearn.metrics.brier_score_loss API.

- Brier score, Wikipedia.
- Cross entropy, Wikipedia.
- Joint Working Group on Forecast Verification Research

In this tutorial, you discovered metrics for evaluating probabilistic predictions for imbalanced classification.

Specifically, you learned:

- Probability predictions are required for some classification predictive modeling problems.
- Log loss quantifies the average difference between predicted and expected probability distributions.
- Brier score quantifies the average difference between predicted and expected probabilities.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Tour of Evaluation Metrics for Imbalanced Classification appeared first on Machine Learning Mastery.

A classifier is only as good as the metric used to evaluate it.

If you choose the wrong metric to evaluate your models, you are likely to choose a poor model, or in the worst case, be misled about the expected performance of your model.

Choosing an appropriate metric is challenging generally in applied machine learning, but is particularly difficult for imbalanced classification problems. Firstly, most of the standard metrics that are widely used assume a balanced class distribution. Secondly, typically not all classes, and therefore not all prediction errors, are equal for imbalanced classification.

In this tutorial, you will discover metrics that you can use for imbalanced classification.

After completing this tutorial, you will know:

- About the challenge of choosing metrics for classification, and how it is particularly difficult when there is a skewed class distribution.
- How there are three main types of metrics for evaluating classifier models, referred to as rank, threshold, and probability.
- How to choose a metric for imbalanced classification if you don’t know where to start.

Let’s get started.

This tutorial is divided into three parts; they are:

- Challenge of Evaluation Metrics
- Taxonomy of Classifier Evaluation Metrics
- How to Choose an Evaluation Metric

An evaluation metric quantifies the performance of a predictive model.

This typically involves training a model on a dataset, using the model to make predictions on a holdout dataset not used during training, then comparing the predictions to the expected values in the holdout dataset.

For classification problems, metrics involve comparing the expected class label to the predicted class label or interpreting the predicted probabilities for the class labels for the problem.

Selecting a model, together with the data preparation methods, is a search problem that is guided by the evaluation metric. Experiments are performed with different models and the outcome of each experiment is quantified with a metric.

Evaluation measures play a crucial role in both assessing the classification performance and guiding the classifier modeling.

— Classification Of Imbalanced Data: A Review, 2009.

There are standard metrics that are widely used for evaluating classification predictive models, such as classification accuracy or classification error.

Standard metrics work well on most problems, which is why they are widely adopted. But all metrics make assumptions about the problem or about what is important in the problem. Therefore an evaluation metric must be chosen that best captures what you or your project stakeholders believe is important about the model or predictions, which makes choosing model evaluation metrics challenging.

This challenge is made even more difficult when there is a skew in the class distribution. The reason for this is that many of the standard metrics become unreliable or even misleading when classes are imbalanced, or severely imbalanced, such as 1:100 or 1:1000 ratio between a minority and majority class.

In the case of class imbalances, the problem is even more acute because the default, relatively robust procedures used for unskewed data can break down miserably when the data is skewed.

— Page 187, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

For example, reporting classification accuracy for a severely imbalanced classification problem could be dangerously misleading. This is the case if project stakeholders use the results to draw conclusions or plan new projects.

In fact, the use of common metrics in imbalanced domains can lead to sub-optimal classification models and might produce misleading conclusions since these measures are insensitive to skewed domains.

— A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

Importantly, different evaluation metrics are often required when working with imbalanced classification.

Unlike standard evaluation metrics that treat all classes as equally important, imbalanced classification problems typically rate classification errors with the minority class as more important than those with the majority class. As such, performance metrics may be needed that focus on the minority class, which is made challenging because it is the minority class where we lack the observations required to train an effective model.

The main problem of imbalanced data sets lies on the fact that they are often associated with a user preference bias towards the performance on cases that are poorly represented in the available data sample.

— A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

Now that we are familiar with the challenge of choosing a model evaluation metric, let’s look at some examples of different metrics from which we might choose.

There are tens of metrics to choose from when evaluating classifier models, and perhaps hundreds, if you consider all of the pet versions of metrics proposed by academics.

In order to get a handle on the metrics that you could choose from, we will use a taxonomy proposed by Cesar Ferri, et al. in their 2008 paper titled “An Experimental Comparison Of Performance Measures For Classification.” It was also adopted in the 2013 book titled “Imbalanced Learning” and I think proves useful.

We can divide evaluation metrics into three useful groups; they are:

- Threshold Metrics
- Ranking Metrics
- Probability Metrics

This division is useful because the top metrics used by practitioners for classifiers generally, and specifically imbalanced classification, fit into the taxonomy neatly.

Several machine learning researchers have identified three families of evaluation metrics used in the context of classification. These are the threshold metrics (e.g., accuracy and F-measure), the ranking methods and metrics (e.g., receiver operating characteristics (ROC) analysis and AUC), and the probabilistic metrics (e.g., root-mean-squared error).

— Page 189, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

Let’s take a closer look at each group in turn.

Threshold metrics are those that quantify the classification prediction errors.

That is, they are designed to summarize the fraction, ratio, or rate of when a predicted class does not match the expected class in a holdout dataset.

Metrics based on a threshold and a qualitative understanding of error […] These measures are used when we want a model to minimise the number of errors.

— An Experimental Comparison Of Performance Measures For Classification, 2008.

Perhaps the most widely used threshold metric is classification accuracy.

**Accuracy** = Correct Predictions / Total Predictions

The complement of classification accuracy is called classification error.

**Error** = Incorrect Predictions / Total Predictions

Although widely used, classification accuracy is almost universally inappropriate for imbalanced classification. The reason is, a high accuracy (or low error) is achievable by a no skill model that only predicts the majority class.
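A small sketch makes the problem concrete. The test set below is hypothetical, with a 1:100 class ratio, and the model simply predicts the majority class for every example.

```python
# hypothetical test set with a 1:100 class ratio
expected = [0] * 1000 + [1] * 10
# no skill model: always predict the majority class
predicted = [0] * len(expected)

# classification accuracy
correct = sum(1 for e, p in zip(expected, predicted) if e == p)
accuracy = correct / len(expected)

# fraction of minority class examples that were found
found = sum(1 for e, p in zip(expected, predicted) if e == 1 and p == 1)
minority_recall = found / sum(expected)

print('Accuracy: %.3f' % accuracy)
print('Minority class examples found: %.3f' % minority_recall)
```

The model scores about 99 percent accuracy while failing to identify a single example from the minority class.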

For more on the failure of classification accuracy, see the tutorial:

For imbalanced classification problems, the majority class is typically referred to as the negative outcome (e.g. “*no change*” or “*negative test result*”), and the minority class is typically referred to as the positive outcome (e.g. “*change*” or “*positive test result*”).

- **Majority Class**: Negative outcome, class 0.
- **Minority Class**: Positive outcome, class 1.

Most threshold metrics can be best understood by the terms used in a confusion matrix for a binary (two-class) classification problem. This does not mean that the metrics are limited for use on binary classification; it is just an easy way to quickly understand what is being measured.

The confusion matrix provides more insight into not only the performance of a predictive model but also which classes are being predicted correctly, which incorrectly, and what type of errors are being made. In this type of confusion matrix, each cell in the table has a specific and well-understood name, summarized as follows:

| | Positive Prediction | Negative Prediction |
|---|---|---|
| Positive Class | True Positive (TP) | False Negative (FN) |
| Negative Class | False Positive (FP) | True Negative (TN) |
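The four cells of the confusion matrix can be counted directly from expected and predicted labels. This is a minimal sketch; the function name `confusion_counts` and the ten example labels are assumptions for illustration.

```python
def confusion_counts(expected, predicted):
    # count each cell of the binary confusion matrix (class 1 is positive)
    tp = fn = fp = tn = 0
    for e, p in zip(expected, predicted):
        if e == 1 and p == 1:
            tp += 1
        elif e == 1 and p == 0:
            fn += 1
        elif e == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# hypothetical expected and predicted labels
expected = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(expected, predicted)
print('TP=%d FN=%d FP=%d TN=%d' % (tp, fn, fp, tn))
```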

There are two groups of metrics that may be useful for imbalanced classification because they focus on one class; they are sensitivity-specificity and precision-recall.

Sensitivity refers to the true positive rate and summarizes how well the positive class was predicted.

**Sensitivity** = TruePositive / (TruePositive + FalseNegative)

Specificity is the complement to sensitivity, or the true negative rate, and summarizes how well the negative class was predicted.

**Specificity** = TrueNegative / (FalsePositive + TrueNegative)

For imbalanced classification, the sensitivity might be more interesting than the specificity.

Sensitivity and Specificity can be combined into a single score that balances both concerns, called the geometric mean or G-Mean.

**G-Mean** = sqrt(Sensitivity * Specificity)
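These three calculations can be sketched from confusion matrix counts as follows. The counts are hypothetical and chosen to resemble an imbalanced problem with 10 positive and 990 negative examples.

```python
from math import sqrt

# hypothetical confusion matrix counts
tp, fn, fp, tn = 8, 2, 50, 940

# true positive rate
sensitivity = tp / (tp + fn)
# true negative rate
specificity = tn / (fp + tn)
# geometric mean of the two rates
g_mean = sqrt(sensitivity * specificity)

print('Sensitivity: %.3f' % sensitivity)
print('Specificity: %.3f' % specificity)
print('G-Mean: %.3f' % g_mean)
```

Because the two rates are multiplied, a model that ignores one class entirely has a G-Mean of zero regardless of how well it predicts the other class.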

Precision summarizes the fraction of examples assigned the positive class that belong to the positive class.

**Precision** = TruePositive / (TruePositive + FalsePositive)

Recall summarizes how well the positive class was predicted and is the same calculation as sensitivity.

**Recall** = TruePositive / (TruePositive + FalseNegative)

Precision and recall can be combined into a single score that seeks to balance both concerns, called the F-score or the F-measure.

**F-Measure** = (2 * Precision * Recall) / (Precision + Recall)

The F-Measure is a popular metric for imbalanced classification.

The Fbeta-measure is an abstraction of the F-measure where the balance of precision and recall in the calculation of the harmonic mean is controlled by a coefficient called *beta*.

**Fbeta-Measure** = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)
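The precision, recall, and F-measure calculations can be sketched from confusion matrix counts as follows. The counts and the function name `fbeta` are assumptions for illustration.

```python
# hypothetical confusion matrix counts
tp, fn, fp = 8, 2, 50

precision = tp / (tp + fp)
recall = tp / (tp + fn)

def fbeta(precision, recall, beta):
    # weighted harmonic mean of precision and recall
    b2 = beta ** 2
    return ((1 + b2) * precision * recall) / (b2 * precision + recall)

print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
# beta=1.0 gives the F-measure; other values shift the balance
for beta in [0.5, 1.0, 2.0]:
    print('F%.1f-Measure: %.3f' % (beta, fbeta(precision, recall, beta)))
```

With a beta of 1.0 the calculation reduces to the F-measure. Values below 1.0 give more weight to precision, and values above 1.0 give more weight to recall.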

For more on precision, recall and F-measure for imbalanced classification, see the tutorial:

These are probably the most popular metrics to consider, although many others do exist. To give you a taste, these include Kappa, Macro-Average Accuracy, Mean-Class-Weighted Accuracy, Optimized Precision, Adjusted Geometric Mean, Balanced Accuracy, and more.

Threshold metrics are easy to calculate and easy to understand.

One limitation of these metrics is that they assume that the class distribution observed in the training dataset will match the distribution in the test set and in real data when the model is used to make predictions. This is often the case, but when it is not the case, the performance can be quite misleading.

An important disadvantage of all the threshold metrics discussed in the previous section is that they assume full knowledge of the conditions under which the classifier will be deployed. In particular, they assume that the class imbalance present in the training set is the one that will be encountered throughout the operating life of the classifier

— Page 196, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

Ranking metrics don’t make any assumptions about class distributions.

Rank metrics are more concerned with evaluating classifiers based on how effective they are at separating classes.

Metrics based on how well the model ranks the examples […] These are important for many applications […] where classifiers are used to select the best n instances of a set of data or when good class separation is crucial.

— An Experimental Comparison Of Performance Measures For Classification, 2008.

These metrics require that a classifier predicts a score or a probability of class membership.

From this score, different thresholds can be applied to test the effectiveness of classifiers. Those models that maintain a good score across a range of thresholds will have good class separation and will be ranked higher.

… consider a classifier that gives a numeric score for an instance to be classified in the positive class. Therefore, instead of a simple positive or negative prediction, the score introduces a level of granularity

– Page 53, Learning from Imbalanced Data Sets, 2018.

The most commonly used ranking metric is the ROC Curve or ROC Analysis.

ROC is an acronym that means Receiver Operating Characteristic and summarizes a field of study for analyzing binary classifiers based on their ability to discriminate classes.

A ROC curve is a diagnostic plot for summarizing the behavior of a model by calculating the false positive rate and true positive rate for a set of predictions by the model under different thresholds.

The true positive rate is the recall or sensitivity.

**TruePositiveRate**= TruePositive / (TruePositive + FalseNegative)

The false positive rate is calculated as:

**FalsePositiveRate**= FalsePositive / (FalsePositive + TrueNegative)

Each threshold is a point on the plot and the points are connected to form a curve. A classifier that has no skill (e.g. predicts the majority class under all thresholds) will be represented by a diagonal line from the bottom left to the top right.

Any points below this line have worse than no skill. A perfect model will be a point in the top left of the plot.
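To make the idea of "each threshold is a point" concrete, here is a minimal sketch that sweeps a few thresholds over a hypothetical set of scores and reports the resulting false positive rate and true positive rate. The labels, scores, thresholds, and the helper `roc_point` are all assumptions for illustration.

```python
# hypothetical labels (1 = positive) and predicted scores for the positive class
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.50, 0.60, 0.70, 0.90]

def roc_point(y_true, scores, threshold):
    """Return (FalsePositiveRate, TruePositiveRate) at one threshold."""
    pairs = list(zip(y_true, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

# sweep thresholds from strict to lenient; each one is a point on the curve
thresholds = [1.0, 0.8, 0.65, 0.55, 0.4, 0.0]
points = [roc_point(y_true, scores, t) for t in thresholds]
for t, (fpr, tpr) in zip(thresholds, points):
    print('threshold=%.2f FPR=%.3f TPR=%.3f' % (t, fpr, tpr))
```

The strictest threshold gives the bottom-left corner (0,0), the most lenient gives the top-right corner (1,1), and the points in between trace the curve.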

The ROC Curve is a helpful diagnostic for one model.

The area under the ROC curve can be calculated and provides a single score to summarize the plot that can be used to compare models.

A no skill classifier will have a score of 0.5, whereas a perfect classifier will have a score of 1.0.

**ROC AUC**= ROC Area Under Curve

Although generally effective, the ROC Curve and ROC AUC can be optimistic under a severe class imbalance, especially when the number of examples in the minority class is small.

An alternative to the ROC Curve is the precision-recall curve, which can be used in a similar way, although it focuses on the performance of the classifier on the minority class.

Again, different thresholds are used on a set of predictions by a model, and in this case, the precision and recall are calculated. The points form a curve and classifiers that perform better under a range of different thresholds will be ranked higher.

A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5. A perfect classifier is represented by a point in the top right.

Like the ROC Curve, the Precision-Recall Curve is a helpful diagnostic tool for evaluating a single classifier but challenging for comparing classifiers.

And like the ROC AUC, we can calculate the area under the curve as a score and use that score to compare classifiers. In this case, the focus on the minority class makes the Precision-Recall AUC more useful for imbalanced classification problems.

**PR AUC**= Precision-Recall Area Under Curve

There are other ranking metrics that are less widely used, such as modifications to the ROC Curve for imbalanced classification and cost curves.

For more on ROC curves and precision-recall curves for imbalanced classification, see the tutorial:

Probabilistic metrics are designed specifically to quantify the uncertainty in a classifier’s predictions.

These are useful for problems where we are less interested in incorrect vs. correct class predictions and more interested in the uncertainty the model has in predictions and penalizing those predictions that are wrong but highly confident.

Metrics based on a probabilistic understanding of error, i.e. measuring the deviation from the true probability […] These measures are especially useful when we want an assessment of the reliability of the classifiers, not only measuring when they fail but whether they have selected the wrong class with a high or low probability.

— An Experimental Comparison Of Performance Measures For Classification, 2008.

Evaluating a model based on the predicted probabilities requires that the probabilities are calibrated.

Some classifiers are trained using a probabilistic framework, such as maximum likelihood estimation, meaning that their probabilities are already calibrated. An example would be logistic regression.

Many nonlinear classifiers are not trained under a probabilistic framework and therefore require their probabilities to be calibrated against a dataset prior to being evaluated via a probabilistic metric. Examples might include support vector machines and k-nearest neighbors.

Perhaps the most common metric for evaluating predicted probabilities is log loss for binary classification (also called the negative log likelihood), known more generally as cross-entropy.

For a binary classification dataset where the expected values are y and the predicted values are yhat, this can be calculated as follows:

**LogLoss**= -((1 - y) * log(1 - yhat) + y * log(yhat))

The score can be generalized to multiple classes by simply adding the terms; for example:

**LogLoss**= -( sum c in C y_c * log(yhat_c))

The score summarizes the average difference between two probability distributions. A perfect classifier has a log loss of 0.0, with worse values being positive up to infinity.
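As a rough illustration of how log loss penalizes confident mistakes, the sketch below averages the binary formula above over a handful of hypothetical predictions. The helper name `log_loss_binary` and the example values are assumptions, and the probabilities are assumed to lie strictly between 0 and 1 so that the logarithm is defined. The scikit-learn function log_loss() provides a library implementation.

```python
from math import log

def log_loss_binary(y_true, y_prob):
    """Average binary log loss; y_prob is the predicted probability of class 1."""
    total = 0.0
    for y, yhat in zip(y_true, y_prob):
        total += -((1 - y) * log(1 - yhat) + y * log(yhat))
    return total / len(y_true)

# hypothetical test set with a single positive example
y_true = [0, 0, 0, 1]
confident_right = [0.1, 0.1, 0.1, 0.9]
confident_wrong = [0.1, 0.1, 0.1, 0.1]  # misses the positive with high confidence

loss_right = log_loss_binary(y_true, confident_right)
loss_wrong = log_loss_binary(y_true, confident_wrong)
print('Log loss (right): %.3f' % loss_right)
print('Log loss (wrong): %.3f' % loss_wrong)
```

Only one prediction differs between the two cases, yet the average loss rises several-fold because the single error was made with high confidence.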

Another popular score for predicted probabilities is the Brier score.

The benefit of the Brier score is that it is focused on the positive class, which for imbalanced classification is the minority class. This makes it preferable to log loss, which is focused on the entire probability distribution.

The Brier score is calculated as the mean squared error between the expected probabilities for the positive class (e.g. 1.0) and the predicted probabilities. Recall that the mean squared error is the average of the squared differences between the values.

**BrierScore**= 1/N * Sum i to N (yhat_i - y_i)^2

A perfect classifier has a Brier score of 0.0. Although typically described in terms of binary classification tasks, the Brier score can also be calculated for multiclass classification problems.

The differences in Brier score for different classifiers can be very small. In order to address this problem, the score can be scaled against a reference score, such as the score from a no skill classifier (e.g. predicting the probability distribution of the positive class in the training dataset).

Using the reference score, a Brier Skill Score, or BSS, can be calculated where 0.0 represents no skill, worse than no skill results are negative, and the perfect skill is represented by a value of 1.0.

**BrierSkillScore**= 1 - (BrierScore / BrierScore_ref)
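The two formulas can be worked through on a small mock example. In this sketch the labels and probabilities are invented, and the reference forecast is assumed to be the positive-class base rate of the same labels (in practice it would come from the training dataset). The scikit-learn function brier_score_loss() computes the Brier score itself.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((yhat - y) ** 2 for y, yhat in zip(y_true, y_prob)) / len(y_true)

# hypothetical outcomes (20 percent positive) and model probabilities
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
model_probs = [0.1, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.7, 0.6]

# assumed no skill reference: always predict the positive-class base rate
base_rate = sum(y_true) / len(y_true)
ref_probs = [base_rate] * len(y_true)

bs_model = brier_score(y_true, model_probs)
bs_ref = brier_score(y_true, ref_probs)
bss = 1.0 - (bs_model / bs_ref)
print('Brier=%.3f Reference=%.3f BSS=%.3f' % (bs_model, bs_ref, bss))
```

A positive Brier Skill Score here indicates the mock model improves on the base-rate forecast, and the rescaling makes that gap easier to read than the raw difference between two small Brier scores.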

Although popular for balanced classification problems, probability scoring methods are less widely used for classification problems with a skewed class distribution.

For more on probabilistic metrics for imbalanced classification, see the tutorial:

There is an enormous number of model evaluation metrics to choose from.

Given that choosing an evaluation metric is so important and there are tens or perhaps hundreds of metrics to choose from, what are you supposed to do?

The correct evaluation of learned models is one of the most important issues in pattern recognition.

— An Experimental Comparison Of Performance Measures For Classification, 2008.

Perhaps the best approach is to talk to project stakeholders and figure out what is important about a model or set of predictions. Then select a few metrics that seem to capture what is important, then test the metric with different scenarios.

A scenario might be a mock set of predictions for a test dataset with a skewed class distribution that matches your problem domain. You can test what happens to the metric if a model predicts all the majority class, all the minority class, does well, does poorly, and so on. A few small tests can rapidly help you get a feeling for how the metric might perform.
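A small scenario test of this kind might look like the sketch below. The 990-to-10 mock test set, the `evaluate` helper, and the choice to score an undefined ratio as 0.0 are all assumptions for illustration; the G-Mean is taken as the square root of sensitivity times specificity.

```python
from math import sqrt

def evaluate(y_true, y_pred):
    """Return accuracy, F-Measure and G-Mean for binary labels (1 = minority)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    # assumption: an undefined ratio (zero denominator) is scored as 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + sensitivity
    return {
        'accuracy': (tp + tn) / len(pairs),
        'f1': (2 * precision * sensitivity / denom) if denom else 0.0,
        'gmean': sqrt(sensitivity * specificity),
    }

# mock test set: 990 majority (0) and 10 minority (1) examples
y_true = [0] * 990 + [1] * 10
scenarios = {
    'all majority': [0] * 1000,
    'all minority': [1] * 1000,
    'perfect': list(y_true),
}
results = {name: evaluate(y_true, y_pred) for name, y_pred in scenarios.items()}
for name, scores in results.items():
    print('%-12s accuracy=%.3f f1=%.3f gmean=%.3f'
          % (name, scores['accuracy'], scores['f1'], scores['gmean']))
```

In the all-majority scenario, accuracy looks excellent while the F-Measure and G-Mean both collapse to zero, which is exactly the kind of behavior such a test is meant to expose.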

Another approach might be to perform a literature review and discover what metrics are most commonly used by other practitioners or academics working on the same general type of problem. This can often be insightful, but be warned that some fields of study may fall into groupthink and adopt a metric that might be excellent for comparing large numbers of models at scale, but terrible for model selection in practice.

Still have no idea?

Here are some first-order suggestions:

- **Are you predicting probabilities?**
    - **Do you need class labels?**
        - **Is the positive class more important?**
            - Use Precision-Recall AUC
        - **Are both classes important?**
            - Use ROC AUC
    - **Do you need probabilities?**
        - Use Brier Score and Brier Skill Score
- **Are you predicting class labels?**
    - **Is the positive class more important?**
        - **Are False Negatives and False Positives Equally Important?**
            - Use F1-Measure
        - **Are False Negatives More Important?**
            - Use F2-Measure
        - **Are False Positives More Important?**
            - Use F0.5-Measure
    - **Are both classes important?**
        - **Do you have < 80%-90% Examples for the Majority Class?**
            - Use Accuracy
        - **Do you have > 80%-90% Examples for the Majority Class?**
            - Use G-Mean

These suggestions take the important case into account where we might use models that predict probabilities, but require crisp class labels. This is an important class of problems that allow the operator or implementor to choose the threshold to trade off misclassification errors. In this scenario, error metrics are required that consider all reasonable thresholds, hence the use of the area under curve metrics.
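The suggestions can also be written down as a small decision function. This is only a sketch: `suggest_metric` and its parameter names are hypothetical, and the 80%-90% majority-class cut-off is passed in as a yes/no flag rather than fixed to a specific number.

```python
def suggest_metric(predict_probabilities, need_labels=True,
                   positive_more_important=True, costlier_error='equal',
                   majority_over_80_90_percent=False):
    """Return the first-order metric suggestion for the given answers."""
    if predict_probabilities:
        if not need_labels:
            return 'Brier Score and Brier Skill Score'
        if positive_more_important:
            return 'Precision-Recall AUC'
        return 'ROC AUC'
    # predicting crisp class labels
    if positive_more_important:
        choices = {
            'equal': 'F1-Measure',
            'false_negative': 'F2-Measure',
            'false_positive': 'F0.5-Measure',
        }
        return choices[costlier_error]
    if majority_over_80_90_percent:
        return 'G-Mean'
    return 'Accuracy'

# example: class labels, positive class matters most, false negatives cost more
print(suggest_metric(False, costlier_error='false_negative'))
```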

We can transform these suggestions into a helpful template.

This section provides more resources on the topic if you are looking to go deeper.

- An Experimental Comparison Of Performance Measures For Classification, 2008.
- Classification Of Imbalanced Data: A Review, 2009.
- A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

- Chapter 8 Assessment Metrics For Imbalanced Learning, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Chapter 3 Performance Measures, Learning from Imbalanced Data Sets, 2018.

- Precision and recall, Wikipedia.
- Sensitivity and specificity, Wikipedia.
- Receiver operating characteristic, Wikipedia.
- Cross entropy, Wikipedia.
- Brier score, Wikipedia.

In this tutorial, you discovered metrics that you can use for imbalanced classification.

Specifically, you learned:

- About the challenge of choosing metrics for classification, and how it is particularly difficult when there is a skewed class distribution.
- How there are three main types of metrics for evaluating classifier models, referred to as rank, threshold, and probability.
- How to choose a metric for imbalanced classification if you don’t know where to start.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Tour of Evaluation Metrics for Imbalanced Classification appeared first on Machine Learning Mastery.

The post ROC Curves and Precision-Recall Curves for Imbalanced Classification appeared first on Machine Learning Mastery.

Most imbalanced classification problems involve two classes: a negative case with the majority of examples and a positive case with a minority of examples.

Two diagnostic tools that help in the interpretation of binary (two-class) classification predictive models are ROC Curves and Precision-Recall curves.

Plots from the curves can be created and used to understand the trade-off in performance for different threshold values when interpreting probabilistic predictions. Each plot can also be summarized with an area under the curve score that can be used to directly compare classification models.

In this tutorial, you will discover ROC Curves and Precision-Recall Curves for imbalanced classification.

After completing this tutorial, you will know:

- ROC Curves and Precision-Recall Curves provide a diagnostic tool for binary classification models.
- ROC AUC and Precision-Recall AUC provide scores that summarize the curves and can be used to compare classifiers.
- ROC Curves and ROC AUC can be optimistic on severely imbalanced classification problems with few samples of the minority class.

Let’s get started.

This tutorial is divided into four parts; they are:

- Review of the Confusion Matrix
- ROC Curves and ROC AUC
- Precision-Recall Curves and AUC
- ROC and Precision-Recall Curves With a Severe Imbalance

Before we dive into ROC Curves and PR Curves, it is important to review the confusion matrix.

For imbalanced classification problems, the majority class is typically referred to as the negative outcome (e.g. “*no change*” or “*negative test result*”), and the minority class is typically referred to as the positive outcome (e.g. “*change*” or “*positive test result*”).

The confusion matrix provides more insight into not only the performance of a predictive model, but also which classes are being predicted correctly, which incorrectly, and what type of errors are being made.

The simplest confusion matrix is for a two-class classification problem, with negative (class 0) and positive (class 1) classes.

In this type of confusion matrix, each cell in the table has a specific and well-understood name, summarized as follows:

| | Positive Prediction | Negative Prediction |
|---|---|---|
| Positive Class | True Positive (TP) | False Negative (FN) |
| Negative Class | False Positive (FP) | True Negative (TN) |

The metrics that make up the ROC curve and the precision-recall curve are defined in terms of the cells in the confusion matrix.
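As a quick illustration, the four cells can be counted directly from a list of true labels and predicted labels. The labels below and the helper `confusion_counts` are made up for this sketch.

```python
# hypothetical labels: 1 = positive (minority), 0 = negative (majority)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

def confusion_counts(y_true, y_pred):
    """Count the four cells of a two-class confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fn, fp, tn

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print('               | Positive Prediction | Negative Prediction')
print('Positive Class | TP=%d | FN=%d' % (tp, fn))
print('Negative Class | FP=%d | TN=%d' % (fp, tn))
```

Note that the scikit-learn confusion_matrix() function orders rows and columns by sorted label, so for labels 0 and 1 its output is laid out as [[TN, FP], [FN, TP]] rather than in the order used in the table above.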

Now that we have brushed up on the confusion matrix, let’s take a closer look at the ROC curve.

An ROC curve (or receiver operating characteristic curve) is a plot that summarizes the performance of a binary classification model on the positive class.

The x-axis indicates the False Positive Rate and the y-axis indicates the True Positive Rate.

**ROC Curve**: Plot of False Positive Rate (x) vs. True Positive Rate (y).

The true positive rate is a fraction calculated as the total number of true positive predictions divided by the sum of the true positives and the false negatives (e.g. all examples in the positive class). The true positive rate is referred to as the sensitivity or the recall.

**TruePositiveRate**= TruePositives / (TruePositives + False Negatives)

The false positive rate is calculated as the total number of false positive predictions divided by the sum of the false positives and true negatives (e.g. all examples in the negative class).

**FalsePositiveRate**= FalsePositives / (FalsePositives + TrueNegatives)

We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).

Ideally, we want the fraction of correct positive class predictions to be 1 (top of the plot) and the fraction of incorrect negative class predictions to be 0 (left of the plot). This highlights that the best possible classifier that achieves perfect skill is the top-left of the plot (coordinate 0,1).

**Perfect Skill**: A point in the top left of the plot.

The threshold is the cut-off point in probability between the positive and negative classes, which by default for most classifiers is set at 0.5, halfway between each outcome (0 and 1).

A trade-off exists between the TruePositiveRate and FalsePositiveRate, such that changing the threshold of classification will change the balance of predictions towards improving the TruePositiveRate at the expense of FalsePositiveRate, or the reverse case.

By evaluating the true positive rate and false positive rate for different threshold values, a curve can be constructed that stretches from the bottom left to top right and bows toward the top left. This curve is called the ROC curve.

A classifier that has no discriminative power between positive and negative classes will form a diagonal line between a False Positive Rate of 0 and a True Positive Rate of 0 (coordinate (0,0) or predict all negative class) to a False Positive Rate of 1 and a True Positive Rate of 1 (coordinate (1,1) or predict all positive class). Models represented by points below this line have worse than no skill.

The curve provides a convenient diagnostic tool to investigate one classifier with different threshold values and the effect on the TruePositiveRate and FalsePositiveRate. One might choose a threshold in order to bias the predictive behavior of a classification model.

It is a popular diagnostic tool for classifiers on balanced and imbalanced binary prediction problems alike because it is not biased to the majority or minority class.

ROC analysis does not have any bias toward models that perform well on the minority class at the expense of the majority class, a property that is quite attractive when dealing with imbalanced data.

— Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

We can plot a ROC curve for a model in Python using the roc_curve() scikit-learn function.

The function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. The function returns the false positive rates for each threshold, true positive rates for each threshold and thresholds.

    ...
    # calculate roc curve
    fpr, tpr, thresholds = roc_curve(testy, pos_probs)

Most scikit-learn models can predict probabilities by calling the *predict_proba()* function.

This will return the probabilities for each class, for each sample in a test set, e.g. two numbers for each of the two classes in a binary classification problem. The probabilities for the positive class can be retrieved as the second column in this array of probabilities.

    ...
    # predict probabilities
    yhat = model.predict_proba(testX)
    # retrieve just the probabilities for the positive class
    pos_probs = yhat[:, 1]

We can demonstrate this on a synthetic dataset and plot the ROC curve for a no skill classifier and a Logistic Regression model.

The make_classification() function can be used to create synthetic classification problems. In this case, we will create 1,000 examples for a binary classification problem (about 500 examples per class). We will then split the dataset into a train and test sets of equal size in order to fit and evaluate the model.

    ...
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)

A Logistic Regression model is a good model for demonstration because the predicted probabilities are well-calibrated, as opposed to other machine learning models that are not developed around a probabilistic model, in which case their probabilities may need to be calibrated first (e.g. an SVM).

    ...
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)

The complete example is listed below.

    # example of a roc curve for a predictive model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_curve
    from matplotlib import pyplot
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # retrieve just the probabilities for the positive class
    pos_probs = yhat[:, 1]
    # plot no skill roc curve
    pyplot.plot([0, 1], [0, 1], linestyle='--', label='No Skill')
    # calculate roc curve for model
    fpr, tpr, _ = roc_curve(testy, pos_probs)
    # plot model roc curve
    pyplot.plot(fpr, tpr, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    # show the legend
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example creates the synthetic dataset, splits into train and test sets, then fits a Logistic Regression model on the training dataset and uses it to make a prediction on the test set.

The ROC Curve for the Logistic Regression model is shown (orange with dots). A no skill classifier is shown as a diagonal line (blue with dashes).

Now that we have seen the ROC Curve, let’s take a closer look at the ROC area under curve score.

Although the ROC Curve is a helpful diagnostic tool, it can be challenging to compare two or more classifiers based on their curves.

Instead, the area under the curve can be calculated to give a single score for a classifier model across all threshold values. This is called the ROC area under curve or ROC AUC or sometimes ROCAUC.

The score is a value between 0.0 and 1.0, with 1.0 for a perfect classifier.

AUCROC can be interpreted as the probability that the scores given by a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.

— Page 54, Learning from Imbalanced Data Sets, 2018.
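This ranking interpretation can be checked by brute force on a small example: compare every positive score with every negative score and count how often the positive is ranked higher. The labels, scores, and the helper `roc_auc_by_ranking` are assumptions for illustration, and ties are counted as half a win, which is the usual convention.

```python
def roc_auc_by_ranking(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# hypothetical labels and scores for the positive class
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.80]

auc_value = roc_auc_by_ranking(y_true, scores)
print('ROC AUC by pairwise ranking: %.3f' % auc_value)
```

Eight of the nine positive-negative pairs are ordered correctly here, so the score is 8/9. The pairwise loop is quadratic in the number of examples, so it is only suitable for small demonstrations.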

This single score can be used to compare binary classifier models directly. As such, this score might be the most commonly used for comparing classification models for imbalanced problems.

The most common metric involves receiver operation characteristics (ROC) analysis, and the area under the ROC curve (AUC).

— Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The AUC for the ROC can be calculated in scikit-learn using the roc_auc_score() function.

Like the *roc_curve()* function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the positive class.

    ...
    # calculate roc auc
    roc_auc = roc_auc_score(testy, pos_probs)

We can demonstrate this on the same synthetic dataset with a Logistic Regression model.

The complete example is listed below.

    # example of a roc auc for a predictive model
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
    # no skill model, stratified random class predictions
    model = DummyClassifier(strategy='stratified')
    model.fit(trainX, trainy)
    yhat = model.predict_proba(testX)
    pos_probs = yhat[:, 1]
    # calculate roc auc
    roc_auc = roc_auc_score(testy, pos_probs)
    print('No Skill ROC AUC %.3f' % roc_auc)
    # skilled model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    yhat = model.predict_proba(testX)
    pos_probs = yhat[:, 1]
    # calculate roc auc
    roc_auc = roc_auc_score(testy, pos_probs)
    print('Logistic ROC AUC %.3f' % roc_auc)

Running the example creates and splits the synthetic dataset, fits the model, and uses the fit model to predict probabilities on the test dataset.

In this case, we can see that the ROC AUC for the Logistic Regression model on the synthetic dataset is about 0.903, which is much better than a no skill classifier with a score of about 0.5.

    No Skill ROC AUC 0.509
    Logistic ROC AUC 0.903

Although widely used, the ROC AUC is not without problems.

For imbalanced classification with a severe skew and few examples of the minority class, the ROC AUC can be misleading. This is because a small number of correct or incorrect predictions can result in a large change in the ROC Curve or ROC AUC score.

Although ROC graphs are widely used to evaluate classifiers under presence of class imbalance, it has a drawback: under class rarity, that is, when the problem of class imbalance is associated to the presence of a low sample size of minority instances, as the estimates can be unreliable.

— Page 55, Learning from Imbalanced Data Sets, 2018.
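The sensitivity to a few minority examples can be sketched with a deliberately extreme, hypothetical case: every positive is ranked above every negative except one, which is scored below all negatives. The helper names and the score values are assumptions; the class counts mirror a 493-to-7 split and a balanced 250-to-250 split.

```python
def roc_auc_by_ranking(y_true, scores):
    """ROC AUC as the fraction of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            wins += 1.0 if p > n else (0.5 if p == n else 0.0)
    return wins / (len(pos) * len(neg))

def auc_with_one_bad_positive(n_pos, n_neg):
    """All positives outrank all negatives, except one scored below them all."""
    neg_scores = [0.1 + 0.5 * i / n_neg for i in range(n_neg)]  # in [0.1, 0.6)
    pos_scores = [0.9] * n_pos
    pos_scores[0] = 0.05  # the single badly ranked positive
    y_true = [0] * n_neg + [1] * n_pos
    return roc_auc_by_ranking(y_true, neg_scores + pos_scores)

auc_imbalanced = auc_with_one_bad_positive(n_pos=7, n_neg=493)
auc_balanced = auc_with_one_bad_positive(n_pos=250, n_neg=250)
print('7 positives:   ROC AUC %.3f' % auc_imbalanced)
print('250 positives: ROC AUC %.3f' % auc_balanced)
```

In this constructed case a single badly ranked positive costs 1/7 of the score when there are only 7 positives, but only 1/250 when there are 250, which is why estimates based on a handful of minority examples can swing widely.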

A common alternative is the precision-recall curve and area under curve.

Precision is a metric that quantifies the number of correct positive predictions made.

It is calculated as the number of true positives divided by the total number of true positives and false positives.

**Precision**= TruePositives / (TruePositives + FalsePositives)

The result is a value between 0.0 for no precision and 1.0 for full or perfect precision.

Recall is a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made.

It is calculated as the number of true positives divided by the total number of true positives and false negatives (e.g. it is the true positive rate).

**Recall**= TruePositives / (TruePositives + FalseNegatives)

The result is a value between 0.0 for no recall and 1.0 for full or perfect recall.

Both the precision and the recall are focused on the positive class (the minority class) and are unconcerned with the true negatives (majority class).

… precision and recall make it possible to assess the performance of a classifier on the minority class.

— Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
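A small worked example shows what "unconcerned with the true negatives" means in practice. The counts below are invented; only the number of true negatives differs between the two cases, and the false positive rate is included for contrast because it is the x-axis of the ROC curve.

```python
def summarize(tp, fp, fn, tn):
    """Return (precision, recall, false positive rate) from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return precision, recall, fpr

# hypothetical counts: identical except for the number of true negatives
few_negatives = summarize(tp=8, fp=4, fn=2, tn=86)
many_negatives = summarize(tp=8, fp=4, fn=2, tn=9986)

for name, (precision, recall, fpr) in [('few', few_negatives), ('many', many_negatives)]:
    print('%-4s precision=%.3f recall=%.3f fpr=%.4f' % (name, precision, recall, fpr))
```

Precision and recall are identical in both cases, while the false positive rate shrinks toward zero as true negatives are added.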

A precision-recall curve (or PR Curve) is a plot of the precision (y-axis) and the recall (x-axis) for different probability thresholds.

**PR Curve**: Plot of Recall (x) vs Precision (y).

A model with perfect skill is depicted as a point at a coordinate of (1,1). A skillful model is represented by a curve that bows towards a coordinate of (1,1). A no-skill classifier will be a horizontal line on the plot with a precision that is proportional to the number of positive examples in the dataset. For a balanced dataset this will be 0.5.

The focus of the PR curve on the minority class makes it an effective diagnostic for imbalanced binary classification models.

Precision-recall curves (PR curves) are recommended for highly skewed domains where ROC curves may provide an excessively optimistic view of the performance.

— A Survey of Predictive Modelling under Imbalanced Distributions, 2015.

A precision-recall curve can be calculated in scikit-learn using the precision_recall_curve() function that takes the class labels and predicted probabilities for the minority class and returns the precision, recall, and thresholds.

    ...
    # calculate precision-recall curve
    precision, recall, _ = precision_recall_curve(testy, pos_probs)

We can demonstrate this on a synthetic dataset for a predictive model.

The complete example is listed below.

    # example of a precision-recall curve for a predictive model
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_curve
    from matplotlib import pyplot
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    # predict probabilities
    yhat = model.predict_proba(testX)
    # retrieve just the probabilities for the positive class
    pos_probs = yhat[:, 1]
    # calculate the no skill line as the proportion of the positive class
    no_skill = len(y[y==1]) / len(y)
    # plot the no skill precision-recall curve
    pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    # calculate model precision-recall curve
    precision, recall, _ = precision_recall_curve(testy, pos_probs)
    # plot the model precision-recall curve
    pyplot.plot(recall, precision, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')
    # show the legend
    pyplot.legend()
    # show the plot
    pyplot.show()

Running the example creates the synthetic dataset, splits into train and test sets, then fits a Logistic Regression model on the training dataset and uses it to make a prediction on the test set.

The Precision-Recall Curve for the Logistic Regression model is shown (orange with dots). A random or baseline classifier is shown as a horizontal line (blue with dashes).

Now that we have seen the Precision-Recall Curve, let’s take a closer look at the Precision-Recall area under curve score.

The Precision-Recall AUC is just like the ROC AUC, in that it summarizes the curve with a range of threshold values as a single score.

The score can then be used as a point of comparison between different models on a binary classification problem where a score of 1.0 represents a model with perfect skill.

The Precision-Recall AUC score can be calculated using the auc() function in scikit-learn, taking the precision and recall values as arguments.

    ...
    # calculate the precision-recall auc
    auc_score = auc(recall, precision)

Again, we can demonstrate calculating the Precision-Recall AUC for a Logistic Regression on a synthetic dataset.

The complete example is listed below.

    # example of a precision-recall auc for a predictive model
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import auc
    # generate 2 class dataset
    X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
    # split into train/test sets
    trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
    # no skill model, stratified random class predictions
    model = DummyClassifier(strategy='stratified')
    model.fit(trainX, trainy)
    yhat = model.predict_proba(testX)
    pos_probs = yhat[:, 1]
    # calculate the precision-recall auc
    precision, recall, _ = precision_recall_curve(testy, pos_probs)
    auc_score = auc(recall, precision)
    print('No Skill PR AUC: %.3f' % auc_score)
    # fit a model
    model = LogisticRegression(solver='lbfgs')
    model.fit(trainX, trainy)
    yhat = model.predict_proba(testX)
    pos_probs = yhat[:, 1]
    # calculate the precision-recall auc
    precision, recall, _ = precision_recall_curve(testy, pos_probs)
    auc_score = auc(recall, precision)
    print('Logistic PR AUC: %.3f' % auc_score)

Running the example creates and splits the synthetic dataset, fits the model, and uses the fit model to predict probabilities on the test dataset.

In this case, we can see that the Precision-Recall AUC for the Logistic Regression model on the synthetic dataset is about 0.898, which is much better than the no skill classifier, which achieves a score of 0.632 in this case.

    No Skill PR AUC: 0.632
    Logistic PR AUC: 0.898

In this section, we will explore the case of using the ROC Curves and Precision-Recall curves with a binary classification problem that has a severe class imbalance.

Firstly, we can use the *make_classification()* function to create 1,000 examples for a classification problem with about a 1:100 minority to majority class ratio. This can be achieved by setting the “*weights*” argument and specifying the weighting of generated instances from each class.

We will use a 99 percent and 1 percent weighting with 1,000 total examples, meaning there would be about 990 for class 0 and about 10 for class 1.

```python
...
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
```

We can then split the dataset into training and test sets and ensure that both have the same general class ratio by setting the “*stratify*” argument on the call to the *train_test_split()* function and setting it to the array of target variables.

```python
...
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
```

Tying this together, the complete example of preparing the imbalanced dataset is listed below.

```python
# create an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# summarize dataset
print('Dataset: Class0=%d, Class1=%d' % (len(y[y==0]), len(y[y==1])))
print('Train: Class0=%d, Class1=%d' % (len(trainy[trainy==0]), len(trainy[trainy==1])))
print('Test: Class0=%d, Class1=%d' % (len(testy[testy==0]), len(testy[testy==1])))
```

Running the example first summarizes the class ratio of the whole dataset, then the ratio for each of the train and test sets, confirming the split of the dataset holds the same ratio.

```
Dataset: Class0=985, Class1=15
Train: Class0=492, Class1=8
Test: Class0=493, Class1=7
```

Next, we can develop a Logistic Regression model on the dataset and evaluate the performance of the model using a ROC Curve and ROC AUC score, and compare the results to a no skill classifier, as we did in a prior section.

The complete example is listed below.

```python
# roc curve and roc auc on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

# plot no skill and model roc curves
def plot_roc_curve(test_y, naive_probs, model_probs):
    # plot naive skill roc curve
    fpr, tpr, _ = roc_curve(test_y, naive_probs)
    pyplot.plot(fpr, tpr, linestyle='--', label='No Skill')
    # plot model roc curve
    fpr, tpr, _ = roc_curve(test_y, model_probs)
    pyplot.plot(fpr, tpr, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('False Positive Rate')
    pyplot.ylabel('True Positive Rate')
    # show the legend
    pyplot.legend()
    # show the plot
    pyplot.show()

# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# no skill model, stratified random class predictions
model = DummyClassifier(strategy='stratified')
model.fit(trainX, trainy)
yhat = model.predict_proba(testX)
naive_probs = yhat[:, 1]
# calculate roc auc
roc_auc = roc_auc_score(testy, naive_probs)
print('No Skill ROC AUC %.3f' % roc_auc)
# skilled model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
yhat = model.predict_proba(testX)
model_probs = yhat[:, 1]
# calculate roc auc
roc_auc = roc_auc_score(testy, model_probs)
print('Logistic ROC AUC %.3f' % roc_auc)
# plot roc curves
plot_roc_curve(testy, naive_probs, model_probs)
```

Running the example creates the imbalanced binary classification dataset as before.

Then a logistic regression model is fit on the training dataset and evaluated on the test dataset. A no skill classifier is evaluated alongside for reference.

The ROC AUC scores for both classifiers are reported, showing the no skill classifier achieving the lowest score of approximately 0.5 as expected. The results for the logistic regression model suggest it has some skill with a score of about 0.869.

```
No Skill ROC AUC 0.490
Logistic ROC AUC 0.869
```

A ROC curve is also created for the model and the no skill classifier. It shows performance that is not excellent, but is clearly skillful compared to the diagonal no skill line.

Next, we can perform an analysis of the same model fit and evaluated on the same data using the precision-recall curve and AUC score.

The complete example is listed below.

```python
# pr curve and pr auc on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from matplotlib import pyplot

# plot no skill and model precision-recall curves
def plot_pr_curve(test_y, model_probs):
    # calculate the no skill line as the proportion of the positive class
    no_skill = len(test_y[test_y==1]) / len(test_y)
    # plot the no skill precision-recall curve
    pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
    # plot model precision-recall curve (use the function argument, not the global)
    precision, recall, _ = precision_recall_curve(test_y, model_probs)
    pyplot.plot(recall, precision, marker='.', label='Logistic')
    # axis labels
    pyplot.xlabel('Recall')
    pyplot.ylabel('Precision')
    # show the legend
    pyplot.legend()
    # show the plot
    pyplot.show()

# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# no skill model, stratified random class predictions
model = DummyClassifier(strategy='stratified')
model.fit(trainX, trainy)
yhat = model.predict_proba(testX)
naive_probs = yhat[:, 1]
# calculate the precision-recall auc
precision, recall, _ = precision_recall_curve(testy, naive_probs)
auc_score = auc(recall, precision)
print('No Skill PR AUC: %.3f' % auc_score)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
yhat = model.predict_proba(testX)
model_probs = yhat[:, 1]
# calculate the precision-recall auc
precision, recall, _ = precision_recall_curve(testy, model_probs)
auc_score = auc(recall, precision)
print('Logistic PR AUC: %.3f' % auc_score)
# plot precision-recall curves
plot_pr_curve(testy, model_probs)
```

As before, running the example creates the imbalanced binary classification dataset.

In this case we can see that the Logistic Regression model achieves a PR AUC of about 0.228 and a no skill model achieves a PR AUC of about 0.007.

```
No Skill PR AUC: 0.007
Logistic PR AUC: 0.228
```

A plot of the precision-recall curve is also created.

We can see the horizontal line of the no skill classifier as expected and, in this case, the zig-zag line of the logistic regression curve sitting close to the no skill line.

To explain why the ROC and PR curves tell a different story, recall that the PR curve focuses on the minority class, whereas the ROC curve covers both classes.

If we use a threshold of 0.5 and use the logistic regression model to make a prediction for all examples in the test set, we see that it predicts class 0, the majority class, in all cases. This can be confirmed by using the fit model to predict crisp class labels, which will use the default threshold of 0.5. The distribution of predicted class labels can then be summarized.

```python
...
# predict class labels
yhat = model.predict(testX)
# summarize the distribution of class labels
print(Counter(yhat))
```

We can then create a histogram of the predicted probabilities of the positive class to confirm that the mass of predicted probabilities is below 0.5, and is therefore mapped to class 0.

```python
...
# create a histogram of the predicted probabilities
pyplot.hist(pos_probs, bins=100)
pyplot.show()
```

Tying this together, the complete example is listed below.

```python
# summarize the distribution of predicted probabilities
from collections import Counter
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
# split into train/test sets with same class ratio
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
yhat = model.predict_proba(testX)
# retrieve just the probabilities for the positive class
pos_probs = yhat[:, 1]
# predict class labels
yhat = model.predict(testX)
# summarize the distribution of class labels
print(Counter(yhat))
# create a histogram of the predicted probabilities
pyplot.hist(pos_probs, bins=100)
pyplot.show()
```

Running the example first summarizes the distribution of predicted class labels. As we expected, the majority class (class 0) is predicted for all examples in the test set.

Counter({0: 500})

A histogram plot of the predicted probabilities for class 1 is also created, showing the center of mass (most predicted probabilities) is less than 0.5 and in fact is generally close to zero.

This means that, unless the probability threshold is carefully chosen, any skillful nuance in the predictions made by the model will be lost. Selecting thresholds used to interpret predicted probabilities as crisp class labels is an important topic.
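
As a rough illustration of what choosing a threshold might look like, the sketch below scores every candidate threshold returned by *precision_recall_curve()* with the F-measure and keeps the best one. The helper name *best_f1_threshold()* is hypothetical and not part of the tutorial, and the threshold is tuned on the test set only to keep the example short; in practice a separate validation set would be a safer choice.

```python
# sketch: choose a probability threshold that maximizes the F-measure
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# hypothetical helper (an assumption, not from the tutorial)
def best_f1_threshold(y_true, pos_probs):
    precision, recall, thresholds = precision_recall_curve(y_true, pos_probs)
    # the curve has one more point than there are thresholds, drop the last
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    # guard against 0/0 where both precision and recall are zero
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    ix = int(np.argmax(f1))
    return thresholds[ix], f1[ix]

# same imbalanced dataset and model as above
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], random_state=1)
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
pos_probs = model.predict_proba(testX)[:, 1]
# find the threshold and convert probabilities to crisp labels
threshold, f1 = best_f1_threshold(testy, pos_probs)
print('Threshold=%.3f, F-Measure=%.3f' % (threshold, f1))
yhat = (pos_probs >= threshold).astype(int)
```

Any threshold found this way will depend on the metric being optimized; the F-measure is used here only as one reasonable option.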

This section provides more resources on the topic if you are looking to go deeper.

- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Learning from Imbalanced Data Sets, 2018.

- sklearn.datasets.make_classification API.
- sklearn.metrics.roc_curve API.
- sklearn.metrics.roc_auc_score API.
- sklearn.metrics.precision_recall_curve API.
- sklearn.metrics.auc API.

In this tutorial, you discovered ROC Curves and Precision-Recall Curves for imbalanced classification.

Specifically, you learned:

- ROC Curves and Precision-Recall Curves provide a diagnostic tool for binary classification models.
- ROC AUC and Precision-Recall AUC provide scores that summarize the curves and can be used to compare classifiers.
- ROC Curves and ROC AUC can be optimistic on severely imbalanced classification problems with few samples of the minority class.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post ROC Curves and Precision-Recall Curves for Imbalanced Classification appeared first on Machine Learning Mastery.

The post How to Calculate Precision, Recall, and F-Measure for Imbalanced Classification appeared first on Machine Learning Mastery.

Classification accuracy is the total number of correct predictions divided by the total number of predictions made for a dataset.

As a performance measure, accuracy is inappropriate for imbalanced classification problems.

The main reason is that the large number of examples from the majority class (or classes) will swamp the number of examples in the minority class, meaning that even unskillful models can achieve accuracy scores of 90 percent, or 99 percent, depending on how severe the class imbalance happens to be.
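
To make this concrete, here is a minimal sketch in plain Python. It assumes a hypothetical dataset with 100 minority and 10,000 majority examples, and a model that always predicts the majority class; both are illustrative assumptions.

```python
# sketch: an unskillful model scores high accuracy on a 1:100 dataset
# hypothetical labels: 100 minority (1) and 10,000 majority (0) examples
y_true = [1] * 100 + [0] * 10000
# a model that always predicts the majority class
y_pred = [0] * len(y_true)

# accuracy is the fraction of predictions that match the true label
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
# count how many minority examples were actually found
minority_found = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print('Accuracy: %.3f' % accuracy)
print('Minority examples found: %d of 100' % minority_found)
```

The accuracy is about 0.990, even though the model does not identify a single minority class example.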

An alternative to using classification accuracy is to use precision and recall metrics.

In this tutorial, you will discover how to calculate and develop an intuition for precision and recall for imbalanced classification.

After completing this tutorial, you will know:

- Precision quantifies the number of positive class predictions that actually belong to the positive class.
- Recall quantifies the number of positive class predictions made out of all positive examples in the dataset.
- F-Measure provides a single score that balances both the concerns of precision and recall in one number.

Let’s get started.

**Update Jan/2020**: Improved language about the objective of precision and recall. Fixed typos about what precision and recall seek to minimize (thanks for the comments!).

This tutorial is divided into five parts; they are:

- Confusion Matrix for Imbalanced Classification
- Precision for Imbalanced Classification
- Recall for Imbalanced Classification
- Precision vs. Recall for Imbalanced Classification
- F-Measure for Imbalanced Classification

Before we dive into precision and recall, it is important to review the confusion matrix.

For imbalanced classification problems, the majority class is typically referred to as the negative outcome (e.g. “*no change*” or “*negative test result*”), and the minority class is typically referred to as the positive outcome (e.g. “*change*” or “*positive test result*”).

The confusion matrix provides more insight into not only the performance of a predictive model, but also which classes are being predicted correctly, which incorrectly, and what type of errors are being made.

The simplest confusion matrix is for a two-class classification problem, with negative (class 0) and positive (class 1) classes.

In this type of confusion matrix, each cell in the table has a specific and well-understood name, summarized as follows:

| | Positive Prediction | Negative Prediction |
|---|---|---|
| Positive Class | True Positive (TP) | False Negative (FN) |
| Negative Class | False Positive (FP) | True Negative (TN) |

The precision and recall metrics are defined in terms of the cells in the confusion matrix, specifically terms like true positives and false negatives.
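
As a sketch, the four cells can be read off with the scikit-learn *confusion_matrix()* function. The label counts below are hypothetical and chosen only for illustration. By default the function orders rows and columns by sorted label (class 0 first), so *labels=[1, 0]* is passed here to match the layout of the table above.

```python
# sketch: extract TP, FN, FP and TN from a confusion matrix
from sklearn.metrics import confusion_matrix

# hypothetical data: 100 positive and 10,000 negative examples
y_true = [1] * 100 + [0] * 10000
# predictions: 10 positives missed, 90 found, 30 negatives flagged as positive
y_pred = [0] * 10 + [1] * 90 + [1] * 30 + [0] * 9970

# rows are actual classes, columns are predicted classes, positive class first
matrix = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = matrix.ravel()
print(matrix)
print('TP=%d, FN=%d, FP=%d, TN=%d' % (tp, fn, fp, tn))
```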

Now that we have brushed up on the confusion matrix, let’s take a closer look at the precision metric.

Precision is a metric that quantifies the number of correct positive predictions made.

Precision, therefore, calculates the accuracy for the minority class.

It is calculated as the ratio of correctly predicted positive examples divided by the total number of positive examples that were predicted.

Precision evaluates the fraction of correct classified instances among the ones classified as positive …

— Page 52, Learning from Imbalanced Data Sets, 2018.

In an imbalanced classification problem with two classes, precision is calculated as the number of true positives divided by the total number of true positives and false positives.

- Precision = TruePositives / (TruePositives + FalsePositives)

The result is a value between 0.0 for no precision and 1.0 for full or perfect precision.

Let’s make this calculation concrete with some examples.

Consider a dataset with a 1:100 minority to majority ratio, with 100 minority examples and 10,000 majority class examples.

A model makes predictions and predicts 120 examples as belonging to the minority class, 90 of which are correct, and 30 of which are incorrect.

The precision for this model is calculated as:

- Precision = TruePositives / (TruePositives + FalsePositives)
- Precision = 90 / (90 + 30)
- Precision = 90 / 120
- Precision = 0.75

The result is a precision of 0.75, which is a reasonable value but not outstanding.

You can see that precision is simply the ratio of correct positive predictions out of all positive predictions made, or the accuracy of minority class predictions.

Consider the same dataset, where a model predicts 50 examples belonging to the minority class, 45 of which are true positives and five of which are false positives. We can calculate the precision for this model as follows:

- Precision = TruePositives / (TruePositives + FalsePositives)
- Precision = 45 / (45 + 5)
- Precision = 45 / 50
- Precision = 0.90

In this case, although the model predicted far fewer examples as belonging to the minority class, the ratio of correct positive examples is much better.

This highlights that although precision is useful, it does not tell the whole story. It does not comment on how many real positive class examples were predicted as belonging to the negative class, so-called false negatives.
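
A small sketch makes the gap visible for the two models above. It uses plain Python with hypothetical helper functions, and also computes recall (true positives divided by true positives plus false negatives), a metric covered later in this tutorial, to count what each model missed.

```python
# sketch: precision alone hides the false negatives of each model
def precision(tp, fp):
    # fraction of positive predictions that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of actual positive examples that were found
    return tp / (tp + fn)

# the dataset has 100 examples in the positive (minority) class
positives = 100
# (true positives, false positives) for the two models described above
models = {'Model 1': (90, 30), 'Model 2': (45, 5)}

for name, (tp, fp) in models.items():
    # positives that were not predicted as positive are false negatives
    fn = positives - tp
    print('%s: Precision=%.2f, Recall=%.2f, FalseNegatives=%d' % (name, precision(tp, fp), recall(tp, fn), fn))
```

The second model has the higher precision (0.90 versus 0.75), but it misses 55 of the 100 positive examples, whereas the first model misses only 10.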

Precision is not limited to binary classification problems.

In an imbalanced classification problem with more than two classes, precision is calculated as the sum of true positives across all classes divided by the sum of true positives and false positives across all classes.

- Precision = Sum c in C TruePositives_c / Sum c in C (TruePositives_c + FalsePositives_c)

For example, we may have an imbalanced multiclass classification problem where the majority class is the negative class, but there are two positive minority classes: class 1 and class 2. Precision can quantify the ratio of correct predictions across both positive classes.

Consider a dataset with a 1:1:100 minority to majority class ratio; that is, a 1:1 ratio between the two positive classes and a 1:100 ratio of each minority class to the majority class, with 100 examples in each minority class and 10,000 examples in the majority class.

A model makes predictions and predicts 70 examples for the first minority class, where 50 are correct and 20 are incorrect. It predicts 150 for the second class with 99 correct and 51 incorrect. Precision can be calculated for this model as follows:

- Precision = (TruePositives_1 + TruePositives_2) / ((TruePositives_1 + TruePositives_2) + (FalsePositives_1 + FalsePositives_2) )
- Precision = (50 + 99) / ((50 + 99) + (20 + 51))
- Precision = 149 / (149 + 71)
- Precision = 149 / 220
- Precision = 0.677

We can see that the precision metric calculation scales as we increase the number of minority classes.

The precision score can be calculated using the precision_score() scikit-learn function.

For example, we can use this function to calculate precision for the scenarios in the previous section.

First, the case where there are 100 positive to 10,000 negative examples, and a model predicts 90 true positives and 30 false positives. The complete example is listed below.

```python
# calculates precision for 1:100 dataset with 90 tp and 30 fp
from sklearn.metrics import precision_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
pred_neg = [1 for _ in range(30)] + [0 for _ in range(9970)]
y_pred = pred_pos + pred_neg
# calculate precision
precision = precision_score(y_true, y_pred, average='binary')
print('Precision: %.3f' % precision)
```

Running the example calculates the precision, matching our manual calculation.

Precision: 0.750

Next, we can use the same function to calculate precision for the multiclass problem with 1:1:100, with 100 examples in each minority class and 10,000 in the majority class. A model predicts 50 true positives and 20 false positives for class 1 and 99 true positives and 51 false positives for class 2.

When using the *precision_score()* function for multiclass classification, it is important to specify the minority classes via the “*labels*” argument and to set the “*average*” argument to ‘*micro*’ to ensure the calculation is performed as we expect.

The complete example is listed below.

```python
# calculates precision for 1:1:100 dataset with 50tp,20fp, 99tp,51fp
from sklearn.metrics import precision_score
# define actual
act_pos1 = [1 for _ in range(100)]
act_pos2 = [2 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos1 + act_pos2 + act_neg
# define predictions
pred_pos1 = [0 for _ in range(50)] + [1 for _ in range(50)]
pred_pos2 = [0 for _ in range(1)] + [2 for _ in range(99)]
pred_neg = [1 for _ in range(20)] + [2 for _ in range(51)] + [0 for _ in range(9929)]
y_pred = pred_pos1 + pred_pos2 + pred_neg
# calculate precision across both minority classes
precision = precision_score(y_true, y_pred, labels=[1,2], average='micro')
print('Precision: %.3f' % precision)
```

Again, running the example calculates the precision for the multiclass example matching our manual calculation.

Precision: 0.677
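
The micro average pools the counts from both minority classes into one score. If a per-class breakdown is also wanted, one option (a sketch, not part of the original example) is to pass *average=None*, which returns one precision value per entry in “*labels*”.

```python
# sketch: per-class precision for the same 1:1:100 scenario
from sklearn.metrics import precision_score

# actual labels: 100 in each minority class, 10,000 in the majority class
y_true = [1] * 100 + [2] * 100 + [0] * 10000
# predictions: 50tp and 20fp for class 1, 99tp and 51fp for class 2
y_pred = [0] * 50 + [1] * 50 + [0] * 1 + [2] * 99 + [1] * 20 + [2] * 51 + [0] * 9929

# average=None returns an array with one score per requested label
per_class = precision_score(y_true, y_pred, labels=[1, 2], average=None)
print('Class 1 Precision: %.3f' % per_class[0])
print('Class 2 Precision: %.3f' % per_class[1])
```

Here class 1 works out to 50 / 70, or about 0.714, and class 2 to 99 / 150, or 0.660, which the pooled score of 0.677 summarizes.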

Recall is a metric that quantifies the number of correct positive predictions made out of all positive predictions that could have been made.

Unlike precision, which only comments on the correct positive predictions out of all positive predictions, recall provides an indication of missed positive predictions.

In this way, recall provides some notion of the coverage of the positive class.

For imbalanced learning, recall is typically used to measure the coverage of the minority class.

— Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

In an imbalanced classification problem with two classes, recall is calculated as the number of true positives divided by the total number of true positives and false negatives.

- Recall = TruePositives / (TruePositives + FalseNegatives)

The result is a value between 0.0 for no recall and 1.0 for full or perfect recall.

Let’s make this calculation concrete with some examples.

As in the previous section, consider a dataset with 1:100 minority to majority ratio, with 100 minority examples and 10,000 majority class examples.

A model correctly predicts 90 of the positive class examples and misses the other 10, predicting them as negative (false negatives). We can calculate the recall for this model as follows:

- Recall = TruePositives / (TruePositives + FalseNegatives)
- Recall = 90 / (90 + 10)
- Recall = 90 / 100
- Recall = 0.9

This model has a good recall.

Recall is not limited to binary classification problems.

In an imbalanced classification problem with more than two classes, recall is calculated as the sum of true positives across all classes divided by the sum of true positives and false negatives across all classes.

- Recall = Sum c in C TruePositives_c / Sum c in C (TruePositives_c + FalseNegatives_c)

As in the previous section, consider a dataset with a 1:1:100 minority to majority class ratio; that is, a 1:1 ratio between the two positive classes and a 1:100 ratio of each minority class to the majority class, with 100 examples in each minority class and 10,000 examples in the majority class.

A model predicts 77 examples correctly and 23 incorrectly for class 1, and 95 correctly and five incorrectly for class 2. We can calculate recall for this model as follows:

- Recall = (TruePositives_1 + TruePositives_2) / ((TruePositives_1 + TruePositives_2) + (FalseNegatives_1 + FalseNegatives_2))
- Recall = (77 + 95) / ((77 + 95) + (23 + 5))
- Recall = 172 / (172 + 28)
- Recall = 172 / 200
- Recall = 0.86

The recall score can be calculated using the recall_score() scikit-learn function.

For example, we can use this function to calculate recall for the scenarios above.

First, we can consider the case of a 1:100 imbalance with 100 and 10,000 examples respectively, and a model predicts 90 true positives and 10 false negatives.

The complete example is listed below.

```python
# calculates recall for 1:100 dataset with 90 tp and 10 fn
from sklearn.metrics import recall_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
pred_pos = [0 for _ in range(10)] + [1 for _ in range(90)]
pred_neg = [0 for _ in range(10000)]
y_pred = pred_pos + pred_neg
# calculate recall
recall = recall_score(y_true, y_pred, average='binary')
print('Recall: %.3f' % recall)
```

Running the example, we can see that the score matches the manual calculation above.

Recall: 0.900

We can also use the *recall_score()* for imbalanced multiclass classification problems.

In this case, the dataset has a 1:1:100 imbalance, with 100 in each minority class and 10,000 in the majority class. A model predicts 77 true positives and 23 false negatives for class 1 and 95 true positives and five false negatives for class 2.

The complete example is listed below.

```python
# calculates recall for 1:1:100 dataset with 77tp,23fn and 95tp,5fn
from sklearn.metrics import recall_score
# define actual
act_pos1 = [1 for _ in range(100)]
act_pos2 = [2 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos1 + act_pos2 + act_neg
# define predictions
pred_pos1 = [0 for _ in range(23)] + [1 for _ in range(77)]
pred_pos2 = [0 for _ in range(5)] + [2 for _ in range(95)]
pred_neg = [0 for _ in range(10000)]
y_pred = pred_pos1 + pred_pos2 + pred_neg
# calculate recall across both minority classes
recall = recall_score(y_true, y_pred, labels=[1,2], average='micro')
print('Recall: %.3f' % recall)
```

Again, running the example calculates the recall for the multiclass example matching our manual calculation.

Recall: 0.860

You may decide to use precision or recall on your imbalanced classification problem.

Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives.

- **Precision**: Appropriate when minimizing false positives is the focus.
- **Recall**: Appropriate when minimizing false negatives is the focus.

Sometimes, we want excellent predictions of the positive class. We want high precision and high recall.

This can be challenging, as increases in recall often come at the expense of decreases in precision.

In imbalanced datasets, the goal is to improve recall without hurting precision. These goals, however, are often conflicting, since in order to increase the TP for the minority class, the number of FP is also often increased, resulting in reduced precision.

— Page 55, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
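
The trade-off can be sketched with a handful of predicted probabilities. The labels and probabilities below are hypothetical values made up for illustration; lowering the threshold labels more examples as positive, which finds more true positives but also admits more false positives.

```python
# sketch: precision and recall at different probability thresholds
from sklearn.metrics import precision_score, recall_score

# hypothetical labels and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.35, 0.40, 0.45, 0.60, 0.65, 0.90]

results = []
for threshold in [0.7, 0.5, 0.3, 0.1]:
    # map probabilities to crisp class labels using the threshold
    y_pred = [1 if p >= threshold else 0 for p in probs]
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    results.append((threshold, precision, recall))
    print('Threshold=%.1f, Precision=%.3f, Recall=%.3f' % (threshold, precision, recall))
```

On this toy data, recall climbs from 0.250 to 1.000 as the threshold drops from 0.7 to 0.1, while precision falls from 1.000 to 0.444.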

Nevertheless, instead of picking one measure or the other, we can choose a new metric that combines both precision and recall into one score.

Classification accuracy is widely used because it is one single measure used to summarize model performance.

F-Measure provides a way to combine both precision and recall into a single measure that captures both properties.

Alone, neither precision nor recall tells the whole story. We can have excellent precision with terrible recall, or alternately, terrible precision with excellent recall. F-measure provides a way to express both concerns with a single score.

Once precision and recall have been calculated for a binary or multiclass classification problem, the two scores can be combined into the calculation of the F-Measure.

The traditional F measure is calculated as follows:

- F-Measure = (2 * Precision * Recall) / (Precision + Recall)

This is the harmonic mean of the two fractions. This is sometimes called the F-Score or the F1-Score and might be the most common metric used on imbalanced classification problems.

… the F1-measure, which weights precision and recall equally, is the variant most often used when learning from imbalanced data.

— Page 27, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
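
To see why the harmonic mean is used rather than a simple average, the sketch below compares the two on a few hypothetical precision and recall pairs. The *f_measure()* helper is an illustrative function written for this example, not a library call.

```python
# sketch: harmonic mean (F-measure) versus arithmetic mean
def f_measure(precision, recall):
    # define the score as 0.0 when both inputs are zero to avoid 0/0
    if precision + recall == 0:
        return 0.0
    return (2 * precision * recall) / (precision + recall)

# hypothetical (precision, recall) pairs
cases = [(1.0, 1.0), (0.5, 0.5), (1.0, 0.01)]
for precision, recall in cases:
    mean = (precision + recall) / 2
    print('P=%.2f, R=%.2f, Mean=%.3f, F-Measure=%.3f' % (precision, recall, mean, f_measure(precision, recall)))
```

For perfect precision with a recall of 0.01, the arithmetic mean is a misleading 0.505, whereas the F-measure is about 0.020; the harmonic mean is pulled toward the lower of the two scores.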

Like precision and recall, the worst F-Measure score is 0.0 and the best or perfect F-Measure score is 1.0.

For example, a perfect precision and recall score would result in a perfect F-Measure score:

- F-Measure = (2 * Precision * Recall) / (Precision + Recall)
- F-Measure = (2 * 1.0 * 1.0) / (1.0 + 1.0)
- F-Measure = (2 * 1.0) / 2.0
- F-Measure = 1.0

Let’s make this calculation concrete with a worked example.

Consider a binary classification dataset with 1:100 minority to majority ratio, with 100 minority examples and 10,000 majority class examples.

Consider a model that predicts 150 examples for the positive class: 95 are correct (true positives) and 55 are incorrect (false positives). Because there are 100 positive examples in total, five were missed (false negatives).

We can calculate the precision as follows:

- Precision = TruePositives / (TruePositives + FalsePositives)
- Precision = 95 / (95 + 55)
- Precision = 0.633

We can calculate the recall as follows:

- Recall = TruePositives / (TruePositives + FalseNegatives)
- Recall = 95 / (95 + 5)
- Recall = 0.95

This shows that the model has poor precision, but excellent recall.

Finally, we can calculate the F-Measure as follows:

- F-Measure = (2 * Precision * Recall) / (Precision + Recall)
- F-Measure = (2 * 0.633 * 0.95) / (0.633 + 0.95)
- F-Measure = (2 * 0.601) / 1.583
- F-Measure = 1.202 / 1.583
- F-Measure = 0.759

We can see that the good recall levels out the poor precision, giving an okay or reasonable F-measure score.

The F-measure score can be calculated using the f1_score() scikit-learn function.

For example, we can use this function to calculate F-Measure for the scenario above.

This is the case of a 1:100 imbalance with 100 and 10,000 examples respectively, and a model predicts 95 true positives, five false negatives, and 55 false positives.

The complete example is listed below.

```python
# calculates f1 for 1:100 dataset with 95tp, 5fn, 55fp
from sklearn.metrics import f1_score
# define actual
act_pos = [1 for _ in range(100)]
act_neg = [0 for _ in range(10000)]
y_true = act_pos + act_neg
# define predictions
pred_pos = [0 for _ in range(5)] + [1 for _ in range(95)]
pred_neg = [1 for _ in range(55)] + [0 for _ in range(9945)]
y_pred = pred_pos + pred_neg
# calculate f-measure
score = f1_score(y_true, y_pred, average='binary')
print('F-Measure: %.3f' % score)
```

Running the example computes the F-Measure, matching our manual calculation, within some minor rounding errors.

F-Measure: 0.760
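
If all three scores are needed at once, one option is the scikit-learn *precision_recall_fscore_support()* function. The sketch below applies it to the same scenario; with *average='binary'* it returns the scores for the positive class, and the fourth returned value (the support) is None in that mode and is ignored here.

```python
# sketch: precision, recall and f-measure in a single call
from sklearn.metrics import precision_recall_fscore_support

# 1:100 dataset with 95tp, 5fn, 55fp
y_true = [1] * 100 + [0] * 10000
y_pred = [0] * 5 + [1] * 95 + [1] * 55 + [0] * 9945

# average='binary' reports scores for the positive class (label 1)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
print('Precision: %.3f' % precision)
print('Recall: %.3f' % recall)
print('F-Measure: %.3f' % f1)
```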

This section provides more resources on the topic if you are looking to go deeper.

- How to Calculate Precision, Recall, F1, and More for Deep Learning Models
- How to Use ROC Curves and Precision-Recall Curves for Classification in Python

- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Learning from Imbalanced Data Sets, 2018.

- sklearn.metrics.precision_score API.
- sklearn.metrics.recall_score API.
- sklearn.metrics.f1_score API.

In this tutorial, you discovered how to calculate and develop an intuition for precision and recall for imbalanced classification.

Specifically, you learned:

- Precision quantifies the number of positive class predictions that actually belong to the positive class.
- Recall quantifies the number of positive class predictions made out of all positive examples in the dataset.
- F-Measure provides a single score that balances both the concerns of precision and recall in one number.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.
