The post Multi-Class Imbalanced Classification appeared first on Machine Learning Mastery.

Most imbalanced classification examples focus on binary classification tasks, yet many of the tools and techniques for imbalanced classification also directly support multi-class classification problems.

In this tutorial, you will discover how to use the tools of imbalanced classification with a multi-class dataset.

After completing this tutorial, you will know:

- About the glass identification standard imbalanced multi-class prediction problem.
- How to use SMOTE oversampling for imbalanced multi-class classification.
- How to use cost-sensitive learning for imbalanced multi-class classification.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Jan/2021**: Updated links for API documentation.

This tutorial is divided into three parts; they are:

- Glass Multi-Class Classification Dataset
- SMOTE Oversampling for Multi-Class Classification
- Cost-Sensitive Learning for Multi-Class Classification

In this tutorial, we will focus on the standard imbalanced multi-class classification problem referred to as “**Glass Identification**” or simply “*glass*.”

The dataset describes the chemical properties of glass and involves classifying samples of glass as one of six classes using their chemical properties. The dataset was credited to Vina Spiehler in 1987.

Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:

- RI: Refractive Index
- Na: Sodium
- Mg: Magnesium
- Al: Aluminum
- Si: Silicon
- K: Potassium
- Ca: Calcium
- Ba: Barium
- Fe: Iron

The chemical compositions are measured as the weight percent in corresponding oxide.

There are seven types of glass listed; they are:

- Class 1: building windows (float processed)
- Class 2: building windows (non-float processed)
- Class 3: vehicle windows (float processed)
- Class 4: vehicle windows (non-float processed)
- Class 5: containers
- Class 6: tableware
- Class 7: headlamps

Float glass refers to the process used to make the glass.

There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.

- Class 1: 70 examples
- Class 2: 76 examples
- Class 3: 17 examples
- Class 4: 0 examples
- Class 5: 13 examples
- Class 6: 9 examples
- Class 7: 29 examples

Although there are minority classes, all classes are equally important in this prediction problem.

The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). There are 163 examples of window glass and 51 examples of non-window glass.

- Window Glass: 163 examples
- Non-Window Glass: 51 examples

Another division of the observations, for window glass only, is between float processed and non-float processed glass. This division is more balanced.

- Float Glass: 87 examples
- Non-Float Glass: 76 examples

You can learn more about the dataset here:

No need to download the dataset; we will download it automatically as part of the worked examples.

Below is a sample of the first few rows of the data.

1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
...

We can see that all inputs are numeric and the target variable in the final column is the integer encoded class label.

You can learn more about how to work through this dataset as part of a project in the tutorial:

Now that we are familiar with the glass multi-class classification dataset, let’s explore how we can use standard imbalanced classification tools with it.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Oversampling refers to copying or synthesizing new examples of the minority classes so that the number of examples in the minority class better resembles or matches the number of examples in the majority classes.

Perhaps the most widely used approach to synthesizing new examples is called the Synthetic Minority Oversampling TEchnique, or SMOTE for short. This technique was described by Nitesh Chawla, et al. in their 2002 paper named for the technique titled “SMOTE: Synthetic Minority Over-sampling Technique.”

You can learn more about SMOTE in the tutorial:

The imbalanced-learn library provides an implementation of SMOTE that we can use that is compatible with the popular scikit-learn library.

First, the library must be installed. We can install it using pip as follows:

sudo pip install imbalanced-learn

We can confirm that the installation was successful by printing the version of the installed library:

# check version number
import imblearn
print(imblearn.__version__)

Running the example will print the version number of the installed library; for example:

0.6.2

Before we apply SMOTE, let’s first load the dataset and confirm the number of examples in each class.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# summarize distribution
counter = Counter(y)
for k,v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()

Running the example first downloads the dataset and splits it into input and output elements.

The number of rows in each class is then reported, confirming that some classes, such as 0 and 1, have many more examples (70 or more) than other classes, such as 3 and 4 (fewer than 15).

Class=0, n=70 (32.710%)
Class=1, n=76 (35.514%)
Class=2, n=17 (7.944%)
Class=3, n=13 (6.075%)
Class=4, n=9 (4.206%)
Class=5, n=29 (13.551%)

A bar chart is created providing a visualization of the class breakdown of the dataset.

This gives a clearer idea that classes 0 and 1 have many more examples than classes 2, 3, 4 and 5.

Next, we can apply SMOTE to oversample the dataset.

By default, SMOTE will oversample all classes to have the same number of examples as the class with the most examples.

In this case, class 1 has the most examples with 76, therefore, SMOTE will oversample all classes to have 76 examples.

The complete example of oversampling the glass dataset with SMOTE is listed below.

# example of oversampling a multi-class classification dataset
from pandas import read_csv
from imblearn.over_sampling import SMOTE
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# summarize distribution
counter = Counter(y)
for k,v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()

Running the example first loads the dataset and applies SMOTE to it.

The distribution of examples in each class is then reported, confirming that each class now has 76 examples, as we expected.

Class=0, n=76 (16.667%)
Class=1, n=76 (16.667%)
Class=2, n=76 (16.667%)
Class=3, n=76 (16.667%)
Class=4, n=76 (16.667%)
Class=5, n=76 (16.667%)

A bar chart of the class distribution is also created, providing a strong visual indication that all classes now have the same number of examples.

Instead of using the default strategy of SMOTE to oversample all classes to the number of examples in the majority class, we could instead specify the number of examples to oversample in each class.

For example, we could oversample to 100 examples in classes 0 and 1 and 200 examples in remaining classes. This can be achieved by creating a dictionary that maps class labels to the number of desired examples in each class, then specifying this via the “*sampling_strategy*” argument to the SMOTE class.

...
# transform the dataset
strategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}
oversample = SMOTE(sampling_strategy=strategy)
X, y = oversample.fit_resample(X, y)

Tying this together, the complete example of using a custom oversampling strategy for SMOTE is listed below.

# example of oversampling a multi-class classification dataset with a custom strategy
from pandas import read_csv
from imblearn.over_sampling import SMOTE
from collections import Counter
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
# define the dataset location
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the csv file as a data frame
df = read_csv(url, header=None)
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# label encode the target variable
y = LabelEncoder().fit_transform(y)
# transform the dataset
strategy = {0:100, 1:100, 2:200, 3:200, 4:200, 5:200}
oversample = SMOTE(sampling_strategy=strategy)
X, y = oversample.fit_resample(X, y)
# summarize distribution
counter = Counter(y)
for k,v in counter.items():
    per = v / len(y) * 100
    print('Class=%d, n=%d (%.3f%%)' % (k, v, per))
# plot the distribution
pyplot.bar(counter.keys(), counter.values())
pyplot.show()

Running the example creates the desired sampling and summarizes the effect on the dataset, confirming the intended result.

Class=0, n=100 (10.000%)
Class=1, n=100 (10.000%)
Class=2, n=200 (20.000%)
Class=3, n=200 (20.000%)
Class=4, n=200 (20.000%)
Class=5, n=200 (20.000%)

Note: you may see warnings that can be safely ignored for the purposes of this example, such as:

UserWarning: After over-sampling, the number of samples (200) in class 5 will be larger than the number of samples in the majority class (class #1 -> 76)

A bar chart of the class distribution is also created confirming the specified class distribution after data sampling.

**Note**: when using data sampling like SMOTE, it must only be applied to the training dataset, not the entire dataset. I recommend using a Pipeline to ensure that the SMOTE method is correctly used when evaluating models and making predictions with models.

You can see an example of the correct usage of SMOTE in a Pipeline in this tutorial:

Most machine learning algorithms assume that all classes have an equal number of examples.

This is not the case in multi-class imbalanced classification. Algorithms can be modified to change the way learning is performed to bias towards those classes that have fewer examples in the training dataset. This is generally called cost-sensitive learning.

For more on cost-sensitive learning, see the tutorial:

The RandomForestClassifier class in scikit-learn supports cost-sensitive learning via the “*class_weight*” argument.

By default, the random forest class assigns equal weight to each class.

We can evaluate the classification accuracy of the default random forest class weighting on the glass imbalanced multi-class classification dataset.

The complete example is listed below.

# baseline model and test harness for the glass identification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the reference model
model = RandomForestClassifier(n_estimators=1000)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the default random forest algorithm with 1,000 trees on the glass dataset using repeated stratified k-fold cross-validation.

The mean and standard deviation classification accuracy are reported at the end of the run.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the default model achieved a classification accuracy of about 79.6 percent.

Mean Accuracy: 0.796 (0.047)

We can set the “*class_weight*” argument to the value “*balanced*”, which will automatically calculate a class weighting that ensures each class receives an equal weighting during the training of the model.

...
# define the model
model = RandomForestClassifier(n_estimators=1000, class_weight='balanced')

Tying this together, the complete example is listed below.

# cost sensitive random forest with default class weights
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
model = RandomForestClassifier(n_estimators=1000, class_weight='balanced')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model with balanced class weighting achieved a lift in classification accuracy over the cost-insensitive version of the algorithm, with 80.2 percent classification accuracy vs. 79.6 percent.

Mean Accuracy: 0.802 (0.044)

The “*class_weight*” argument takes a dictionary of class labels mapped to a class weighting value.

We can use this to specify a custom weighting, such as a default weighting of 1.0 for classes 0 and 1 that have many examples, and a double class weighting of 2.0 for the other classes.

...
# define the model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights)

Tying this together, the complete example of using a custom class weighting for cost-sensitive learning on the glass multi-class imbalanced classification problem is listed below.

# cost sensitive random forest with custom class weightings
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example reports the mean and standard deviation classification accuracy of the cost-sensitive version of random forest on the glass dataset with custom weights.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that we achieved a further lift in accuracy from about 80.2 percent with balanced class weighting to 80.8 percent with a more biased class weighting.

Mean Accuracy: 0.808 (0.059)
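The 2.0 weighting above was chosen by hand; a natural next step is to compare a few candidate weightings with the same test harness. The sketch below does this on a small synthetic imbalanced dataset; both the dataset and the candidate dictionaries are illustrative assumptions, not part of this tutorial.

```python
# sketch: comparing candidate class weightings with the same test harness
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# a synthetic imbalanced dataset stands in for glass.csv
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
    weights=[0.7, 0.2, 0.1], random_state=1)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)

# the candidate weightings are illustrative guesses
candidates = [None, 'balanced', {0: 1.0, 1: 2.0, 2: 4.0}]
results = {}
for weights in candidates:
    model = RandomForestClassifier(n_estimators=100, class_weight=weights,
        random_state=1)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    results[str(weights)] = mean(scores)
    print('class_weight=%s: %.3f' % (weights, mean(scores)))
```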

This section provides more resources on the topic if you are looking to go deeper.

- Imbalanced Multiclass Classification with the Glass Identification Dataset
- SMOTE for Imbalanced Classification with Python
- Cost-Sensitive Logistic Regression for Imbalanced Classification
- Cost-Sensitive Learning for Imbalanced Classification

In this tutorial, you discovered how to use the tools of imbalanced classification with a multi-class dataset.

Specifically, you learned:

- About the glass identification standard imbalanced multi-class prediction problem.
- How to use SMOTE oversampling for imbalanced multi-class classification.
- How to use cost-sensitive learning for imbalanced multi-class classification.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Imbalanced Multiclass Classification with the E.coli Dataset appeared first on Machine Learning Mastery.

Multiclass classification problems are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. They are made more challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes, with very few examples of the other classes.

Problems of this type are referred to as imbalanced multiclass classification problems and they require both the careful design of an evaluation metric and test harness and choice of machine learning models. The E.coli protein localization sites dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass E.coli dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Jan/2021**: Updated links for API documentation.

This tutorial is divided into five parts; they are:

- E.coli Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
- Evaluate Machine Learning Algorithms
- Evaluate Data Oversampling

- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*E.coli*” dataset, also referred to as the “*protein localization sites*” dataset.

The dataset describes the problem of classifying E.coli proteins into their cell localization sites using their amino acid sequences. That is, predicting how a protein will bind to a cell based on the chemical composition of the protein before it is folded.

The dataset is credited to Kenta Nakai and was developed into its current form by Paul Horton and Kenta Nakai in their 1996 paper titled “A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins.” In it, they achieved a classification accuracy of 81 percent.

336 E.coli proteins were classified into 8 classes with an accuracy of 81% …

— A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins, 1996.

The dataset is comprised of 336 examples of E.coli proteins, and each example is described using seven input variables calculated from the protein’s amino acid sequence.

Ignoring the sequence name, the input features are described as follows:

- **mcg**: McGeoch’s method for signal sequence recognition.
- **gvh**: von Heijne’s method for signal sequence recognition.
- **lip**: von Heijne’s Signal Peptidase II consensus sequence score.
- **chg**: Presence of charge on N-terminus of predicted lipoproteins.
- **aac**: score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins.
- **alm1**: score of the ALOM membrane-spanning region prediction program.
- **alm2**: score of ALOM program after excluding putative cleavable signal regions from the sequence.

There are eight classes described as follows:

- **cp**: cytoplasm
- **im**: inner membrane without signal sequence
- **pp**: periplasm
- **imU**: inner membrane, non-cleavable signal sequence
- **om**: outer membrane
- **omL**: outer membrane lipoprotein
- **imL**: inner membrane lipoprotein
- **imS**: inner membrane, cleavable signal sequence

The distribution of examples across the classes is not equal and in some cases severely imbalanced.

For example, the “*cp*” class has 143 examples, whereas the “*imL*” and “*imS*” classes have just two examples each.

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

First, download the dataset and save it in your current working directory with the name “*ecoli.csv*“.

Note that this version of the dataset has the first column (sequence name) removed as it does not contain generalizable information for modeling.

Review the contents of the file.

The first few lines of the file should look as follows:

0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
0.56,0.40,0.48,0.50,0.49,0.37,0.46,cp
0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
...

We can see that the input variables all appear numeric, and the class labels are string values that will need to be label encoded prior to modeling.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location of the file and the fact that there is no header line.

...
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

Next, we can calculate a five-number summary for each input variable.

...
# describe the dataset
set_option('display.precision', 3)
print(dataframe.describe())

Finally, we can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from pandas import set_option
from collections import Counter
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# describe the dataset
set_option('display.precision', 3)
print(dataframe.describe())
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns: 336 rows, with 7 input variables and 1 target variable.

Reviewing the summary of each variable, it appears that the variables have been centered, that is, shifted to have a mean of 0.5. It also appears that the variables have been normalized, meaning all values are in the range between about 0 and 1; at least no variables have values outside this range.

The class distribution is then summarized, confirming the severe skew in the observations for each class. We can see that the “*cp*” class is dominant with about 42 percent of the examples and minority classes such as “*imS*“, “*imL*“, and “*omL*” have about 1 percent or less of the dataset.

There may not be sufficient data to generalize from these minority classes. One approach might be to simply remove the examples with these classes.
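A minimal sketch of that removal, using toy labels in place of the real target column (the *min_count* threshold and the toy data are illustrative assumptions):

```python
from collections import Counter

# toy labels standing in for the E.coli target column (illustrative)
y = ['cp'] * 10 + ['im'] * 6 + ['imL'] * 2 + ['imS'] * 2
X = [[float(i)] for i in range(len(y))]

# keep only the rows whose class has at least min_count examples
min_count = 5
counts = Counter(y)
keep = [i for i, label in enumerate(y) if counts[label] >= min_count]
X, y = [X[i] for i in keep], [y[i] for i in keep]
print(sorted(Counter(y).items()))  # [('cp', 10), ('im', 6)]
```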

(336, 8)

            0        1        2        3        4        5        6
count  336.000  336.000  336.000  336.000  336.000  336.000  336.000
mean     0.500    0.500    0.495    0.501    0.500    0.500    0.500
std      0.195    0.148    0.088    0.027    0.122    0.216    0.209
min      0.000    0.160    0.480    0.500    0.000    0.030    0.000
25%      0.340    0.400    0.480    0.500    0.420    0.330    0.350
50%      0.500    0.470    0.480    0.500    0.495    0.455    0.430
75%      0.662    0.570    0.480    0.500    0.570    0.710    0.710
max      0.890    1.000    1.000    1.000    0.880    1.000    0.990

Class=cp, Count=143, Percentage=42.560%
Class=im, Count=77, Percentage=22.917%
Class=imS, Count=2, Percentage=0.595%
Class=imL, Count=2, Percentage=0.595%
Class=imU, Count=35, Percentage=10.417%
Class=om, Count=20, Percentage=5.952%
Class=omL, Count=5, Percentage=1.488%
Class=pp, Count=52, Percentage=15.476%

We can also take a look at the distribution of the input variables by creating a histogram for each.

The complete example of creating histograms of all input variables is listed below.

# create histograms of all variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'ecoli.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# create a histogram plot of each variable
df.hist(bins=25)
# show the plot
pyplot.show()

We can see that variables such as 0, 5, and 6 may have a multi-modal distribution. The variables 2 and 3 may have a binary distribution and variables 1 and 4 may have a Gaussian-like distribution.

Depending on the choice of model, the dataset may benefit from standardization, normalization, and perhaps a power transform.
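For example, normalization and a power transform can be chained in a scikit-learn Pipeline. The skewed synthetic column below stands in for one of the inputs, and the ordering of the steps is one reasonable choice rather than a prescription:

```python
from numpy.random import default_rng
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

# a skewed synthetic column stands in for one of the input variables
X = default_rng(1).exponential(size=(336, 1))

# chain normalization and a power transform; this ordering is one
# reasonable choice, not prescribed by the tutorial
transform = Pipeline(steps=[('scale', MinMaxScaler()), ('power', PowerTransformer())])
X_trans = transform.fit_transform(X)

# PowerTransformer standardizes its output by default (zero mean, unit variance)
print('mean=%.3f, std=%.3f' % (X_trans.mean(), X_trans.std()))
```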

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use *k=5*, meaning each fold will contain about 336/5 or about 67 examples.

Stratified means that each fold will aim to contain the same mixture of examples by class as the entire training dataset. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 5 * 3, or 15, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
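As a quick confirmation of that arithmetic, the class reports the total number of fit/evaluate cycles directly:

```python
from sklearn.model_selection import RepeatedStratifiedKFold

# 5 folds x 3 repeats means 15 fit/evaluate cycles per model
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
print(cv.get_n_splits())  # 15
```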

All classes are equally important. As such, in this case, we will use classification accuracy to evaluate models.

First, we can define a function to load the dataset, split the columns into input and output variables, and use a label encoder to ensure class labels are numbered sequentially.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

We can define a function to evaluate a candidate model using stratified repeated 5-fold cross-validation, then return a list of scores calculated on the model for each fold and repeat.

The *evaluate_model()* function below implements this.

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

We can then call the *load_dataset()* function to load and confirm the E.coli dataset.

...
# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))

In this case, we will evaluate the baseline strategy of predicting the majority class in all cases.

This can be implemented automatically using the DummyClassifier class and setting the “*strategy*” argument to “*most_frequent*”, which will predict the most common class (e.g. class ‘*cp*‘) in the training dataset. As such, we would expect this model to achieve a classification accuracy of about 42 percent, given this is the distribution of the most common class in the training dataset.

...
# define the reference model
model = DummyClassifier(strategy='most_frequent')

We can then evaluate the model by calling our *evaluate_model()* function and report the mean and standard deviation of the results.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this all together, the complete example of evaluating the baseline model on the E.coli dataset using classification accuracy is listed below.

# baseline model and test harness for the ecoli dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and correctly reports the number of cases as 336, along with the distribution of class labels as we expect.

The *DummyClassifier* with our chosen strategy is then evaluated using repeated stratified k-fold cross-validation, and the mean and standard deviation of the classification accuracy are reported as about 42.6 percent.

(336, 7) (336,) Counter({0: 143, 1: 77, 7: 52, 4: 35, 5: 20, 6: 5, 3: 2, 2: 2})
Mean Accuracy: 0.426 (0.006)

Warnings are reported during the evaluation of the model; for example:

Warning: The least populated class in y has only 2 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.

This is because some of the classes do not have a sufficient number of examples for the 5-fold cross-validation, e.g. classes “*imS*” and “*imL*“.
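Before changing the loader, we can list programmatically which classes fall below the fold count; a minimal sketch using the class counts reported above (the encoded labels 2 and 3 correspond to 'imS' and 'imL'):

```python
# sketch: find classes with fewer members than n_splits
from collections import Counter

n_splits = 5
# class counts as reported for the E.coli dataset above
counter = Counter({0: 143, 1: 77, 7: 52, 4: 35, 5: 20, 6: 5, 3: 2, 2: 2})
too_small = sorted(label for label, count in counter.items() if count < n_splits)
print(too_small)  # [2, 3]
```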

In this case, we will remove these examples from the dataset. This can be achieved by updating the *load_dataset()* function to remove the rows for these classes, e.g. four rows in total.

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	df = read_csv(full_path, header=None)
	# remove rows for the minority classes
	df = df[df[7] != 'imS']
	df = df[df[7] != 'imL']
	# retrieve numpy array
	data = df.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

We can then re-run the example to establish a baseline in classification accuracy.

The complete example is listed below.

# baseline model and test harness for the ecoli dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	df = read_csv(full_path, header=None)
	# remove rows for the minority classes
	df = df[df[7] != 'imS']
	df = df[df[7] != 'imL']
	# retrieve numpy array
	data = df.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example confirms that the number of examples was reduced by four, from 336 to 332.

We can also see that the number of classes was reduced from eight to six (class 0 through to class 5).

The baseline in performance was established at 43.1 percent. This score provides a baseline on this dataset by which all other classification algorithms can be compared. Achieving a score above about 43.1 percent indicates that a model has skill on this dataset, and a score at or below this value indicates that the model does not have skill on this dataset.

(332, 7) (332,) Counter({0: 143, 1: 77, 5: 52, 2: 35, 3: 20, 4: 5})
Mean Accuracy: 0.431 (0.005)

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better classification accuracy using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the E.coli dataset:

- Linear Discriminant Analysis (LDA)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

# define models to test
def get_models():
	models, names = list(), list()
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# SVM
	models.append(LinearSVC())
	names.append('SVM')
	# Bagging
	models.append(BaggingClassifier(n_estimators=1000))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=1000))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=1000))
	names.append('ET')
	return models, names

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize performance
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the E.coli dataset is listed below.

# spot check machine learning algorithms on the ecoli dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	df = read_csv(full_path, header=None)
	# remove rows for the minority classes
	df = df[df[7] != 'imS']
	df = df[df[7] != 'imL']
	# retrieve numpy array
	data = df.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define models to test
def get_models():
	models, names = list(), list()
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# SVM
	models.append(LinearSVC())
	names.append('SVM')
	# Bagging
	models.append(BaggingClassifier(n_estimators=1000))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=1000))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=1000))
	names.append('ET')
	return models, names

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize performance
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving an accuracy above the default of 43.1 percent.

The results suggest that most algorithms do well on this dataset and that perhaps the ensembles of decision trees perform the best with Extra Trees achieving 88 percent accuracy and Random Forest achieving 89.5 percent accuracy.

>LDA 0.886 (0.027)
>SVM 0.883 (0.027)
>BAG 0.851 (0.037)
>RF 0.895 (0.032)
>ET 0.880 (0.030)

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

We can see that the distributions of scores for the ensembles of decision trees clustered together separate from the other algorithms tested. In most cases, the mean and median are close on the plot, suggesting a somewhat symmetrical distribution of scores that may indicate the models are stable.

With so many classes and so few examples in many of the classes, the dataset may benefit from oversampling.

We can test whether applying the SMOTE algorithm to all classes except the majority class (*cp*) results in a lift in performance.

Generally, SMOTE does not appear to help ensembles of decision trees, so we will change the set of algorithms tested to the following:

- Multinomial Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- Support Vector Machine (SVM)
- k-Nearest Neighbors (KNN)
- Gaussian Process (GP)

The updated version of the *get_models()* function to define these models is listed below.

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='lbfgs', multi_class='multinomial'))
	names.append('LR')
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# SVM
	models.append(LinearSVC())
	names.append('SVM')
	# KNN
	models.append(KNeighborsClassifier(n_neighbors=3))
	names.append('KNN')
	# GP
	models.append(GaussianProcessClassifier())
	names.append('GP')
	return models, names

We can use the SMOTE implementation from the imbalanced-learn library, and a Pipeline from the same library to first apply SMOTE to the training dataset, then fit a given model as part of the cross-validation procedure.

SMOTE will synthesize new examples using k-nearest neighbors in the training dataset, where by default, *k* is set to 5.

This is too large for some of the classes in our dataset. Therefore, we will try a *k* value of 2.

...
# create pipeline
steps = [('o', SMOTE(k_neighbors=2)), ('m', models[i])]
pipeline = Pipeline(steps=steps)
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)

Tying this together, the complete example of using SMOTE oversampling on the E.coli dataset is listed below.

# spot check smote with machine learning algorithms on the ecoli dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	df = read_csv(full_path, header=None)
	# remove rows for the minority classes
	df = df[df[7] != 'imS']
	df = df[df[7] != 'imL']
	# retrieve numpy array
	data = df.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='lbfgs', multi_class='multinomial'))
	names.append('LR')
	# LDA
	models.append(LinearDiscriminantAnalysis())
	names.append('LDA')
	# SVM
	models.append(LinearSVC())
	names.append('SVM')
	# KNN
	models.append(KNeighborsClassifier(n_neighbors=3))
	names.append('KNN')
	# GP
	models.append(GaussianProcessClassifier())
	names.append('GP')
	return models, names

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# create pipeline
	steps = [('o', SMOTE(k_neighbors=2)), ('m', models[i])]
	pipeline = Pipeline(steps=steps)
	# evaluate the model and store results
	scores = evaluate_model(X, y, pipeline)
	results.append(scores)
	# summarize performance
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that LDA with SMOTE resulted in a small drop from 88.6 percent to about 87.9 percent, whereas SVM with SMOTE saw a small increase from about 88.3 percent to about 88.8 percent.

SVM also appears to be the best-performing method when using SMOTE in this case, although it does not achieve an improvement as compared to random forest in the previous section.

>LR 0.875 (0.024)
>LDA 0.879 (0.029)
>SVM 0.888 (0.025)
>KNN 0.835 (0.040)
>GP 0.876 (0.023)

Box and whisker plots of classification accuracy scores are created for each algorithm.

We can see that LDA has a number of performance outliers with high 90-percent values, which is quite interesting. It might suggest that LDA could perform better if focused on the abundant classes.

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the Random Forest model as our final model that achieved a classification accuracy of about 89.5 percent.

First, we can define the model.

...
# define model to evaluate
model = RandomForestClassifier(n_estimators=1000)

Once defined, we can fit it on the entire training dataset.

...
# fit the model
model.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the encoded class label for each example.

We can then use the label encoder to inverse transform to get the string class label.

For example:

...
# define a row of data
row = [...]
# predict the class label
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.

# fit a model and make predictions on the ecoli dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	df = read_csv(full_path, header=None)
	# remove rows for the minority classes
	df = df[df[7] != 'imS']
	df = df[df[7] != 'imL']
	# retrieve numpy array
	data = df.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	le = LabelEncoder()
	y = le.fit_transform(y)
	return X, y, le

# define the location of the dataset
full_path = 'ecoli.csv'
# load the dataset
X, y, le = load_dataset(full_path)
# define model to evaluate
model = RandomForestClassifier(n_estimators=1000)
# fit the model
model.fit(X, y)
# known class "cp"
row = [0.49,0.29,0.48,0.50,0.56,0.24,0.35]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected cp)' % (label))
# known class "im"
row = [0.06,0.61,0.48,0.50,0.49,0.92,0.37]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected im)' % (label))
# known class "imU"
row = [0.72,0.42,0.48,0.50,0.65,0.77,0.79]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected imU)' % (label))
# known class "om"
row = [0.78,0.68,0.48,0.50,0.83,0.40,0.29]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected om)' % (label))
# known class "omL"
row = [0.77,0.57,1.00,0.50,0.37,0.54,0.0]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected omL)' % (label))
# known class "pp"
row = [0.74,0.49,0.48,0.50,0.42,0.54,0.36]
yhat = model.predict([row])
label = le.inverse_transform(yhat)[0]
print('>Predicted=%s (expected pp)' % (label))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for one example taken from each of the six classes.

We can see that the correct class label is predicted for each of the chosen examples. Nevertheless, on average, we expect that 1 in 10 predictions will be wrong and these errors may not be equally distributed across the classes.

>Predicted=cp (expected cp)
>Predicted=im (expected im)
>Predicted=imU (expected imU)
>Predicted=om (expected om)
>Predicted=omL (expected omL)
>Predicted=pp (expected pp)
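The caution about unevenly distributed errors can be checked with a confusion matrix over cross-validated predictions. A minimal sketch on a synthetic imbalanced dataset (standing in for the E.coli data, which is not loaded here):

```python
# sketch: per-class accuracy from cross-validated predictions (synthetic data)
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# imbalanced synthetic dataset standing in for the real data
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           weights=[0.7, 0.2, 0.1], random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=1)
yhat = cross_val_predict(model, X, y, cv=5)
cm = confusion_matrix(y, yhat)
# per-class accuracy is the diagonal divided by the row totals
per_class = cm.diagonal() / cm.sum(axis=1)
for label, acc in enumerate(per_class):
    print('Class %d accuracy: %.3f' % (label, acc))
```

Typically the minority classes show lower per-class accuracy than the overall score suggests.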

This section provides more resources on the topic if you are looking to go deeper.

- Expert System For Predicting Protein Localization Sites In Gram‐negative Bacteria, 1991.
- A Knowledge Base For Predicting Protein Localization Sites In Eukaryotic Cells, 1992.
- A Probabilistic Classification System For Predicting The Cellular Localization Sites Of Proteins, 1996.

- pandas.read_csv API.
- sklearn.dummy.DummyClassifier API.
- imblearn.over_sampling.SMOTE API.
- imblearn.pipeline.Pipeline API.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced multiclass E.coli dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Multiclass Classification with the E.coli Dataset appeared first on Machine Learning Mastery.

The post Imbalanced Multiclass Classification with the Glass Identification Dataset appeared first on Machine Learning Mastery.

These are challenging predictive modeling problems because a sufficiently representative number of examples of each class is required for a model to learn the problem. It is made challenging when the number of examples in each class is imbalanced, or skewed toward one or a few of the classes with very few examples of other classes.

Problems of this type are referred to as imbalanced multiclass classification problems and they require both the careful design of an evaluation metric and test harness and choice of machine learning models. The glass identification dataset is a standard dataset for exploring the challenge of imbalanced multiclass classification.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jun/2020**: Added an example that achieves better performance.

This tutorial is divided into five parts; they are:

- Glass Identification Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
  - Evaluate Machine Learning Algorithms
  - Improved Models (new)
- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*Glass Identification*” dataset, or simply “*glass*.”

The dataset describes the chemical properties of glass and involves classifying samples of glass using their chemical properties as one of six classes. The dataset was credited to Vina Spiehler in 1987.

Ignoring the sample identification number, there are nine input variables that summarize the properties of the glass dataset; they are:

- **RI**: refractive index
- **Na**: Sodium
- **Mg**: Magnesium
- **Al**: Aluminum
- **Si**: Silicon
- **K**: Potassium
- **Ca**: Calcium
- **Ba**: Barium
- **Fe**: Iron

The chemical compositions are measured as the weight percent in corresponding oxide.

There are seven types of glass listed; they are:

- **Class 1**: building windows (float processed)
- **Class 2**: building windows (non-float processed)
- **Class 3**: vehicle windows (float processed)
- **Class 4**: vehicle windows (non-float processed)
- **Class 5**: containers
- **Class 6**: tableware
- **Class 7**: headlamps

Float glass refers to the process used to make the glass.

There are 214 observations in the dataset and the number of observations in each class is imbalanced. Note that there are no examples for class 4 (non-float processed vehicle windows) in the dataset.

- **Class 1**: 70 examples
- **Class 2**: 76 examples
- **Class 3**: 17 examples
- **Class 4**: 0 examples
- **Class 5**: 13 examples
- **Class 6**: 9 examples
- **Class 7**: 29 examples

Although there are minority classes, all classes are equally important in this prediction problem.

The dataset can be divided into window glass (classes 1-4) and non-window glass (classes 5-7). There are 163 examples of window glass and 51 examples of non-window glass.

- **Window Glass**: 163 examples
- **Non-Window Glass**: 51 examples

Another division of the observations would be between float processed glass and non-float processed glass, in the case of window glass only. This division is more balanced.

- **Float Glass**: 87 examples
- **Non-Float Glass**: 76 examples

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

First, download the dataset and save it in your current working directory with the name “*glass.csv*“.

Note that this version of the dataset has the first column (the row number) removed, as it does not contain generalizable information for modeling.

Review the contents of the file.

The first few lines of the file should look as follows:

1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00,1
1.51761,13.89,3.60,1.36,72.73,0.48,7.83,0.00,0.00,1
1.51618,13.53,3.55,1.54,72.99,0.39,7.78,0.00,0.00,1
1.51766,13.21,3.69,1.29,72.61,0.57,8.22,0.00,0.00,1
1.51742,13.27,3.62,1.24,73.08,0.55,8.07,0.00,0.00,1
...

We can see that the input variables are numeric and that the class label is an integer in the final column.

All of the chemical input variables share the same units, although the first variable, the refractive index, does not. As such, data scaling may be required for some modeling algorithms.
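If scaling is needed, it can be applied safely inside a modeling pipeline so the scaler is fit only on the training portion of each cross-validation fold. A minimal sketch (the two-feature rows and the LinearSVC model here are illustrative, not from the tutorial):

```python
# sketch: standardize inputs inside a pipeline (illustrative data and model)
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# tiny made-up rows standing in for (refractive index, sodium)
X = np.array([[1.52, 13.6], [1.51, 12.9], [1.53, 14.1], [1.50, 12.0]])
y = np.array([0, 1, 0, 1])

pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),  # zero mean, unit variance per column
    ('model', LinearSVC())         # a scale-sensitive model
])
pipeline.fit(X, y)
print(pipeline.predict(X))
```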

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location of the dataset and the fact that there is no header line.

...
# define the dataset location
filename = 'glass.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the *DataFrame*.

...
# summarize the shape of the dataset
print(dataframe.shape)

We can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'glass.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns: 214 rows, with 9 input variables and 1 target variable.

The class distribution is then summarized, confirming the severe skew in the observations for each class.

(214, 10)
Class=1, Count=70, Percentage=32.710%
Class=2, Count=76, Percentage=35.514%
Class=3, Count=17, Percentage=7.944%
Class=5, Count=13, Percentage=6.075%
Class=6, Count=9, Percentage=4.206%
Class=7, Count=29, Percentage=13.551%

We can also take a look at the distribution of the input variables by creating a histogram for each.

The complete example of creating histograms of all variables is listed below.

# create histograms of all variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'glass.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# create a histogram plot of each variable
df.hist()
# show the plot
pyplot.show()

We can see that some of the variables have a Gaussian-like distribution and others appear to have an exponential or even a bimodal distribution.

Depending on the choice of algorithm, the data may benefit from standardization of some variables and perhaps a power transform.
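One way to act on this is a power transform; a minimal sketch using PowerTransformer (Yeo-Johnson by default) on a synthetic right-skewed column, not the glass data itself:

```python
# sketch: a power transform makes a skewed variable more Gaussian (synthetic data)
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(1)
data = rng.exponential(size=(1000, 1))  # strongly right-skewed column
pt = PowerTransformer()  # yeo-johnson by default; also standardizes the output
transformed = pt.fit_transform(data)
print('mean=%.3f std=%.3f' % (transformed.mean(), transformed.std()))
```

Because the transformer standardizes its output by default, the result has approximately zero mean and unit variance.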

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=5, meaning each fold will contain about 214/5, or about 42 examples.

Stratified means that each fold will aim to contain the same mixture of examples by class as the entire training dataset. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 5 * 3 or 15 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
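A small sketch confirms the arithmetic by counting the train/test splits the class generates on placeholder data:

```python
# sketch: RepeatedStratifiedKFold yields n_splits * n_repeats evaluations
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.zeros((100, 2))             # placeholder features
y = np.array([0] * 70 + [1] * 30)  # imbalanced placeholder labels
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
n_evaluations = sum(1 for _ in cv.split(X, y))
print(n_evaluations)  # 15
```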

All classes are equally important. There are minority classes that are only represented with 4 percent or 6 percent of the data, yet no class has more than about 35 percent dominance of the dataset.

As such, in this case, we will use classification accuracy to evaluate models.

First, we can define a function to load the dataset, split the columns into input and output variables, and use a label encoder to ensure the class labels are numbered sequentially from 0 to 5.
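The renumbering is worth seeing concretely: with class 4 absent, LabelEncoder maps the remaining glass labels 1, 2, 3, 5, 6, 7 onto 0 through 5. A minimal sketch:

```python
# sketch: LabelEncoder renumbers the glass class labels to 0..5
from sklearn.preprocessing import LabelEncoder

labels = [1, 2, 3, 5, 6, 7, 1, 7]  # example labels; class 4 has no examples
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(list(encoded))           # [0, 1, 2, 3, 4, 5, 0, 5]
print(list(encoder.classes_))  # [1, 2, 3, 5, 6, 7]
```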

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	# label encode the target variable
	y = LabelEncoder().fit_transform(y)
	return X, y

We can define a function to evaluate a candidate model using stratified repeated 5-fold cross-validation then return a list of scores calculated on the model for each fold and repeat. The *evaluate_model()* function below implements this.

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
	return scores

We can then call the *load_dataset()* function to load and confirm the glass identification dataset.

...
# define the location of the dataset
full_path = 'glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))

In this case, we will evaluate the baseline strategy of predicting the majority class in all cases.

This can be implemented automatically using the DummyClassifier class and setting the “*strategy*” to “*most_frequent*”, which will predict the most common class (e.g. class 2) in the training dataset.

As such, we would expect this model to achieve a classification accuracy of about 35 percent given this is the distribution of the most common class in the training dataset.

...
# define the reference model
model = DummyClassifier(strategy='most_frequent')

We can then evaluate the model by calling our *evaluate_model()* function and report the mean and standard deviation of the results.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this all together, the complete example of evaluating the baseline model on the glass identification dataset using classification accuracy is listed below.

# baseline model and test harness for the glass identification dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have classes 0 to 5
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads the dataset and correctly reports the number of cases as 214, along with the distribution of class labels, as we expect.

The *DummyClassifier* with our chosen strategy is then evaluated using repeated stratified k-fold cross-validation, and the mean and standard deviation of the classification accuracy are reported; the mean accuracy is about 35.5 percent.

This score provides a baseline on this dataset by which all other classification algorithms can be compared. Achieving a score above about 35.5 percent indicates that a model has skill on this dataset, and a score at or below this value indicates that the model does not have skill on this dataset.

(214, 9) (214,) Counter({1: 76, 0: 70, 5: 29, 2: 17, 3: 13, 4: 9})
Mean Accuracy: 0.355 (0.011)
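The 35.5 percent figure follows directly from the class counts above: the most common encoded class (class 1) accounts for 76 of the 214 examples, and always predicting it yields exactly that accuracy. A quick check:

```python
from collections import Counter

# class counts for the encoded glass labels, as printed above
counts = Counter({1: 76, 0: 70, 5: 29, 2: 17, 3: 13, 4: 9})
total = sum(counts.values())
# accuracy of always predicting the majority class: 76 / 214
print('majority class accuracy: %.3f' % (max(counts.values()) / total))
```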

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better classification accuracy using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s evaluate a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention and what doesn’t.

We will evaluate the following machine learning models on the glass dataset:

- Support Vector Machine (SVM)
- k-Nearest Neighbors (KNN)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

# define models to test
def get_models():
    models, names = list(), list()
    # SVM
    models.append(SVC(gamma='auto'))
    names.append('SVM')
    # KNN
    models.append(KNeighborsClassifier())
    names.append('KNN')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # ET
    models.append(ExtraTreesClassifier(n_estimators=1000))
    names.append('ET')
    return models, names

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the glass identification dataset is listed below.

# spot check machine learning algorithms on the glass identification dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have classes 0 to 5
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # SVM
    models.append(SVC(gamma='auto'))
    names.append('SVM')
    # KNN
    models.append(KNeighborsClassifier())
    names.append('KNN')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # ET
    models.append(ExtraTreesClassifier(n_estimators=1000))
    names.append('ET')
    return models, names

# define the location of the dataset
full_path = 'glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving an accuracy above the default of 35.5 percent.

The results suggest that ensembles of decision trees perform well on this dataset, with perhaps random forest performing the best overall achieving a classification accuracy of approximately 79.6 percent.

>SVM 0.669 (0.057)
>KNN 0.647 (0.055)
>BAG 0.767 (0.070)
>RF 0.796 (0.062)
>ET 0.776 (0.057)

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

We can see that the distributions of scores for the ensembles of decision trees clustered together separate from the other algorithms tested. In most cases, the mean and median are close on the plot, suggesting a somewhat symmetrical distribution of scores that may indicate the models are stable.

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

This section lists models discovered to have even better performance than those listed above, added after the tutorial was published.

A cost-sensitive version of random forest with custom class weightings was found to achieve better performance.

# cost sensitive random forest with custom class weightings
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
weights = {0:1.0, 1:1.0, 2:2.0, 3:2.0, 4:2.0, 5:2.0}
model = RandomForestClassifier(n_estimators=1000, class_weight=weights)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the algorithms and reports the mean and standard deviation accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, the model achieves an accuracy of about 80.8 percent.

Mean Accuracy: 0.808 (0.059)

**Can you do better?** Let me know in the comments below and I will add your model here if I can reproduce the result using the same test harness.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the Random Forest model as our final model that achieved a classification accuracy of about 79 percent.

First, we can define the model.

...
# define model to evaluate
model = RandomForestClassifier(n_estimators=1000)

Once defined, we can fit it on the entire training dataset.

...
# fit the model
model.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function.

This will return the class label for each example.

For example:

...
# define a row of data
row = [...]
# predict the class label
yhat = model.predict([row])

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.

# fit a model and make predictions on the glass identification dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have classes 0 to 5
    y = LabelEncoder().fit_transform(y)
    return X, y

# define the location of the dataset
full_path = 'glass.csv'
# load the dataset
X, y = load_dataset(full_path)
# define model to evaluate
model = RandomForestClassifier(n_estimators=1000)
# fit the model
model.fit(X, y)
# known class 0 (class=1 in the dataset)
row = [1.52101,13.64,4.49,1.10,71.78,0.06,8.75,0.00,0.00]
print('>Predicted=%d (expected 0)' % (model.predict([row])))
# known class 1 (class=2 in the dataset)
row = [1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12]
print('>Predicted=%d (expected 1)' % (model.predict([row])))
# known class 2 (class=3 in the dataset)
row = [1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00]
print('>Predicted=%d (expected 2)' % (model.predict([row])))
# known class 3 (class=5 in the dataset)
row = [1.51915,12.73,1.85,1.86,72.69,0.60,10.09,0.00,0.00]
print('>Predicted=%d (expected 3)' % (model.predict([row])))
# known class 4 (class=6 in the dataset)
row = [1.51115,17.38,0.00,0.34,75.41,0.00,6.65,0.00,0.00]
print('>Predicted=%d (expected 4)' % (model.predict([row])))
# known class 5 (class=7 in the dataset)
row = [1.51556,13.87,0.00,2.54,73.23,0.14,9.41,0.81,0.01]
print('>Predicted=%d (expected 5)' % (model.predict([row])))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for one example taken from each of the six classes.

We can see that the correct class label is predicted for each of the chosen examples. Nevertheless, on average, we expect that 1 in 5 predictions will be wrong and these errors may not be equally distributed across the classes.

>Predicted=0 (expected 0)
>Predicted=1 (expected 1)
>Predicted=2 (expected 2)
>Predicted=3 (expected 3)
>Predicted=4 (expected 4)
>Predicted=5 (expected 5)

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API.
- sklearn.dummy.DummyClassifier API.
- sklearn.ensemble.RandomForestClassifier API.

- Glass Identification Dataset, UCI Machine Learning Repository.
- Glass Identification Dataset.
- Glass Identification Dataset Description.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced multiclass glass identification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the class labels for specific examples.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Multiclass Classification with the Glass Identification Dataset appeared first on Machine Learning Mastery.

The post Imbalanced Classification with the Fraudulent Credit Card Transactions Dataset appeared first on Machine Learning Mastery.

Identifying fraudulent credit card transactions is a common type of imbalanced binary classification where the focus is on the positive (fraud) class.

As such, metrics like precision and recall can be used to summarize model performance in terms of class labels and precision-recall curves can be used to summarize model performance across a range of probability thresholds when mapping predicted probabilities to class labels.

This gives the operator of the model control over how predictions are made in terms of biasing toward false positive or false negative type errors made by the model.
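As a small illustration of that control, consider some made-up predicted probabilities (not the output of any model in this tutorial) mapped to labels at two different thresholds:

```python
from sklearn.metrics import precision_score, recall_score

# hypothetical predicted fraud probabilities for six transactions
probs = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20]
y_true = [0, 0, 1, 1, 1, 0]
for threshold in [0.3, 0.6]:
    # map probabilities to class labels at this threshold
    yhat = [1 if p >= threshold else 0 for p in probs]
    print('threshold=%.1f precision=%.2f recall=%.2f' %
          (threshold, precision_score(y_true, yhat), recall_score(y_true, yhat)))
```

Here the lower threshold catches every fraud at the cost of a false positive, while the higher threshold avoids false positives but misses a fraud: exactly the trade-off the operator can tune.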

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced credit card fraud dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the probability of fraud for specific cases.

Let’s get started.

This tutorial is divided into five parts; they are:

- Credit Card Fraud Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “Credit Card Fraud Detection” dataset.

The data represents credit card transactions that occurred over two days in September 2013 by European cardholders.

The dataset is credited to the Machine Learning Group at the Free University of Brussels (Université Libre de Bruxelles) and a suite of publications by Andrea Dal Pozzolo, et al.

All details of the cardholders have been anonymized via a principal component analysis (PCA) transform. Instead, a total of 28 principal components of these anonymized features is provided. In addition, the time in seconds between transactions is provided, as is the purchase amount (presumably in Euros).

Each record is classified as normal (class “0”) or fraudulent (class “1” ) and the transactions are heavily skewed towards normal. Specifically, there are 492 fraudulent credit card transactions out of a total of 284,807 transactions, which is a total of about 0.172% of all transactions.

It contains a subset of online transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, where the positive class (frauds) account for 0.172% of all transactions …

— Calibrating Probability with Undersampling for Unbalanced Classification, 2015.

Some publications use the ROC area under curve metric, although the website for the dataset recommends using the precision-recall area under curve metric, given the severe class imbalance.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC).

— Credit Card Fraud Detection, Kaggle.

Next, let’s take a closer look at the data.

First, download and unzip the dataset and save it in your current working directory with the name “*creditcard.csv*“.

Review the contents of the file.

The first few lines of the file should look as follows:

0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,"0"
0,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.00898309914322813,0.0147241691924927,2.69,"0"
1,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66,"0"
1,-0.966271711572087,-0.185226008082898,1.79299333957872,-0.863291275036453,-0.0103088796030823,1.24720316752486,0.23760893977178,0.377435874652262,-1.38702406270197,-0.0549519224713749,-0.226487263835401,0.178228225877303,0.507756869957169,-0.28792374549456,-0.631418117709045,-1.0596472454325,-0.684092786345479,1.96577500349538,-1.2326219700892,-0.208037781160366,-0.108300452035545,0.00527359678253453,-0.190320518742841,-1.17557533186321,0.647376034602038,-0.221928844458407,0.0627228487293033,0.0614576285006353,123.5,"0"
2,-1.15823309349523,0.877736754848451,1.548717846511,0.403033933955121,-0.407193377311653,0.0959214624684256,0.592940745385545,-0.270532677192282,0.817739308235294,0.753074431976354,-0.822842877946363,0.53819555014995,1.3458515932154,-1.11966983471731,0.175121130008994,-0.451449182813529,-0.237033239362776,-0.0381947870352842,0.803486924960175,0.408542360392758,-0.00943069713232919,0.79827849458971,-0.137458079619063,0.141266983824769,-0.206009587619756,0.502292224181569,0.219422229513348,0.215153147499206,69.99,"0"
...

Note that this version of the dataset has the header line removed. If you download the dataset from Kaggle, you must remove the header line first.

We can see that the first column is the time, which is an integer, and the second-to-last column is the purchase amount. The final column contains the class label. We can see that the PCA-transformed features are both positive and negative and contain a lot of floating point precision.

The time column is unlikely to be useful and probably can be removed. The difference in scale between the PCA variables and the purchase amount suggests that data scaling should be used for those algorithms that are sensitive to the scale of input variables.
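Dropping the time column is a one-liner on the loaded DataFrame; a sketch on a tiny stand-in frame (the values below are illustrative, not real rows from the dataset):

```python
from pandas import DataFrame

# tiny stand-in for the credit card data: time, two PCA-style features, amount, class
df = DataFrame([[0, -1.36, 2.54, 149.62, 0],
                [0, 1.19, 0.17, 2.69, 0],
                [1, -1.36, 1.77, 378.66, 0]])
# drop the time column (column index 0), leaving the features and the class label
df = df.drop(0, axis=1)
print(df.shape)
```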

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the names of the columns, as there is no header line.

...
# define the dataset location
filename = 'creditcard.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

We can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter

# define the dataset location
filename = 'creditcard.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns: 284,807 rows, with 30 input variables and 1 target variable.

The class distribution is then summarized, confirming the severe skew in the class distribution, with about 99.827 percent of transactions marked as normal and about 0.173 percent marked as fraudulent. This generally matches the description of the dataset in the paper.

(284807, 31)
Class=0, Count=284315, Percentage=99.827%
Class=1, Count=492, Percentage=0.173%

We can also take a look at the distribution of the input variables by creating a histogram for each.

Because of the large number of variables, the plots can look cluttered. Therefore we will disable the axis labels so that we can focus on the histograms. We will also increase the number of bins used in each histogram to help better see the data distribution.

The complete example of creating histograms of all input variables is listed below.

# create histograms of input variables
from pandas import read_csv
from matplotlib import pyplot

# define the dataset location
filename = 'creditcard.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# drop the target variable
df = df.drop(30, axis=1)
# create a histogram plot of each numeric variable
ax = df.hist(bins=100)
# disable axis labels to avoid the clutter
for axis in ax.flatten():
    axis.set_xticklabels([])
    axis.set_yticklabels([])
# show the plot
pyplot.show()

We can see that the distribution of most of the PCA components is Gaussian, and many may be centered around zero, suggesting that the variables were standardized as part of the PCA transform.

The amount variable might be interesting; its values barely register on the histogram.

This suggests that the distribution of the amount values may be skewed. We can create a 5-number summary of this variable to get a better idea of the transaction sizes.

The complete example is listed below.

# summarize the amount variable
from pandas import read_csv

# define the dataset location
filename = 'creditcard.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# summarize the amount variable
print(df[29].describe())

Running the example, we can see that most amounts are small, with a mean of about 88 and the middle 50 percent of observations between 5 and 77.

The largest value is about 25,691, which is pulling the distribution up and might be an outlier (e.g. someone purchased a car on their credit card).

count    284807.000000
mean         88.349619
std         250.120109
min           0.000000
25%           5.600000
50%          22.000000
75%          77.165000
max       25691.160000
Name: 29, dtype: float64

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 284807/10 or 28,480 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is, about 99.8 percent normal and 0.2 percent fraudulent transactions respectively. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use 3 repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
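A quick sanity check of the procedure (on a small synthetic imbalanced sample, not the credit card data) confirms the 30 train/test splits and the stratified class mixture in each held-out fold:

```python
from numpy import zeros, array
from sklearn.model_selection import RepeatedStratifiedKFold

# synthetic 90/10 imbalanced labels, standing in for the real 99.8/0.2 split
X = zeros((100, 2))
y = array([0] * 90 + [1] * 10)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# 10 folds x 3 repeats = 30 model evaluations in total
print(cv.get_n_splits(X, y))
# each held-out fold of 10 examples preserves the class mixture: 9 negatives, 1 positive
train_ix, test_ix = next(cv.split(X, y))
print(list(y[test_ix]).count(1))
```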

We will use the recommended metric of area under precision-recall curve or PR AUC.

This requires that a given algorithm first predict a probability or probability-like measure. The predicted probabilities are then evaluated using precision and recall at a range of different thresholds for mapping probability to class labels, and the area under the curve of these thresholds is reported as the performance of the model.

This metric focuses on the positive class, which is desirable for such a severe class imbalance. It also allows the operator of a final model to choose a threshold for mapping probabilities to class labels (fraud or non-fraud transactions) that best balances the precision and recall of the final model.

We can define a function to load the dataset and split the columns into input and output variables. The *load_dataset()* function below implements this.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    return X, y

We can then define a function that will calculate the precision-recall area under curve for a given set of predictions.

This involves first calculating the precision-recall curve for the predictions via the precision_recall_curve() function. The recall and precision values for each threshold can then be provided as arguments to the auc() function to calculate the area under the curve. The *pr_auc()* function below implements this.

# calculate precision-recall area under curve
def pr_auc(y_true, probas_pred):
    # calculate precision-recall curve
    p, r, _ = precision_recall_curve(y_true, probas_pred)
    # calculate area under curve
    return auc(r, p)

We can then define a function that will evaluate a given model on the dataset and return a list of PR AUC scores for each fold and repeat.

The *evaluate_model()* function below implements this, taking the dataset and model as arguments and returning the list of scores. The make_scorer() function is used to define the precision-recall AUC metric and indicates that a model must predict probabilities in order to be evaluated.

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(pr_auc, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the positive class (class 1) for all examples will provide a baseline performance when using the precision-recall area under curve metric.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “*strategy*” argument to ‘*constant*‘ and setting the “*constant*” argument to ‘1’ to predict the positive class.

...
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)

Once the model is evaluated, we can report the mean and standard deviation of the PR AUC scores directly.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean PR AUC: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.

# test harness and baseline model evaluation for the credit dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    return X, y

# calculate precision-recall area under curve
def pr_auc(y_true, probas_pred):
    # calculate precision-recall curve
    p, r, _ = precision_recall_curve(y_true, probas_pred)
    # calculate area under curve
    return auc(r, p)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(pr_auc, needs_proba=True)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'creditcard.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean PR AUC: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded and that we have 30 input variables.

Next, the average of the PR AUC scores is reported.

In this case, we can see that the baseline algorithm achieves a mean PR AUC of about 0.501.

This score provides a lower limit on model skill; any model that achieves an average PR AUC above about 0.5 has skill, whereas models that achieve a score below this value do not have skill on this dataset.

```
(284807, 30) (284807,) Counter({0.0: 284315, 1.0: 492})
Mean PR AUC: 0.501 (0.000)
```

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better PR AUC performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention and what doesn’t.

We will evaluate the following machine learning models on the credit card fraud dataset:

- Decision Tree (CART)
- k-Nearest Neighbors (KNN)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 100. We will also standardize the input variables prior to providing them as input to the KNN algorithm.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

```python
# define models to test
def get_models():
	models, names = list(), list()
	# CART
	models.append(DecisionTreeClassifier())
	names.append('CART')
	# KNN
	steps = [('s',StandardScaler()),('m',KNeighborsClassifier())]
	models.append(Pipeline(steps=steps))
	names.append('KNN')
	# Bagging
	models.append(BaggingClassifier(n_estimators=100))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=100))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=100))
	names.append('ET')
	return models, names
```

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

```python
...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize performance
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
```

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

```python
...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Tying this all together, the complete example of an evaluation of a suite of machine learning algorithms on the credit card fraud dataset is listed below.

```python
# spot check machine learning algorithms on the credit card fraud dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.metrics import make_scorer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

# calculate precision-recall area under curve
def pr_auc(y_true, probas_pred):
	# calculate precision-recall curve
	p, r, _ = precision_recall_curve(y_true, probas_pred)
	# calculate area under curve
	return auc(r, p)

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(pr_auc, needs_proba=True)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define models to test
def get_models():
	models, names = list(), list()
	# CART
	models.append(DecisionTreeClassifier())
	names.append('CART')
	# KNN
	steps = [('s',StandardScaler()),('m',KNeighborsClassifier())]
	models.append(Pipeline(steps=steps))
	names.append('KNN')
	# Bagging
	models.append(BaggingClassifier(n_estimators=100))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=100))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=100))
	names.append('ET')
	return models, names

# define the location of the dataset
full_path = 'creditcard.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize performance
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Running the example evaluates each algorithm in turn and reports the mean and standard deviation PR AUC.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving a PR AUC above the default of 0.5. The results suggest that the ensembles of decision tree algorithms all do well on this dataset, although the KNN with standardization of the dataset seems to perform the best on average.

```
>CART 0.771 (0.049)
>KNN 0.867 (0.033)
>BAG 0.846 (0.045)
>RF 0.855 (0.046)
>ET 0.864 (0.040)
```

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

We can see that the distributions of scores for the KNN and ensembles of decision trees are tight and means seem to coincide with medians, suggesting the distributions may be symmetrical and are probably Gaussian and that the scores are probably quite stable.

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the KNN model as our final model that achieved a PR AUC of about 0.867. Fitting the final model involves defining a Pipeline to scale the numerical variables prior to fitting the model.

The Pipeline can then be used to make predictions on new data directly and will automatically scale new data using the same operations as performed on the training dataset.

First, we can define the model as a pipeline.

```python
...
# define model to evaluate
model = KNeighborsClassifier()
# scale, then fit model
pipeline = Pipeline(steps=[('s',StandardScaler()), ('m',model)])
```

Once defined, we can fit it on the entire training dataset.

```python
...
# fit the model
pipeline.fit(X, y)
```

Once fit, we can use it to make predictions for new data by calling the *predict_proba()* function. This will return the probability for each class.

We can retrieve the predicted probability for the positive class, which an operator of the model might use to interpret the prediction.

For example:

```python
...
# define a row of data
row = [...]
yhat = pipeline.predict_proba([row])
# get the probability for the positive class
result = yhat[0][1]
```

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.

```python
# fit a model and make predictions for the credit card fraud dataset
from pandas import read_csv
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

# define the location of the dataset
full_path = 'creditcard.csv'
# load the dataset
X, y = load_dataset(full_path)
# define model to evaluate
model = KNeighborsClassifier()
# scale, then fit model
pipeline = Pipeline(steps=[('s',StandardScaler()), ('m',model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some normal cases (known class 0)
print('Normal cases:')
data = [[0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62],
	[0,1.19185711131486,0.26615071205963,0.16648011335321,0.448154078460911,0.0600176492822243,-0.0823608088155687,-0.0788029833323113,0.0851016549148104,-0.255425128109186,-0.166974414004614,1.61272666105479,1.06523531137287,0.48909501589608,-0.143772296441519,0.635558093258208,0.463917041022171,-0.114804663102346,-0.183361270123994,-0.145783041325259,-0.0690831352230203,-0.225775248033138,-0.638671952771851,0.101288021253234,-0.339846475529127,0.167170404418143,0.125894532368176,-0.00898309914322813,0.0147241691924927,2.69],
	[1,-1.35835406159823,-1.34016307473609,1.77320934263119,0.379779593034328,-0.503198133318193,1.80049938079263,0.791460956450422,0.247675786588991,-1.51465432260583,0.207642865216696,0.624501459424895,0.066083685268831,0.717292731410831,-0.165945922763554,2.34586494901581,-2.89008319444231,1.10996937869599,-0.121359313195888,-2.26185709530414,0.524979725224404,0.247998153469754,0.771679401917229,0.909412262347719,-0.689280956490685,-0.327641833735251,-0.139096571514147,-0.0553527940384261,-0.0597518405929204,378.66]]
for row in data:
	# make prediction
	yhat = pipeline.predict_proba([row])
	# get the probability for the positive class
	result = yhat[0][1]
	# summarize
	print('>Predicted=%.3f (expected 0)' % (result))
# evaluate on some fraud cases (known class 1)
print('Fraud cases:')
data = [[406,-2.3122265423263,1.95199201064158,-1.60985073229769,3.9979055875468,-0.522187864667764,-1.42654531920595,-2.53738730624579,1.39165724829804,-2.77008927719433,-2.77227214465915,3.20203320709635,-2.89990738849473,-0.595221881324605,-4.28925378244217,0.389724120274487,-1.14074717980657,-2.83005567450437,-0.0168224681808257,0.416955705037907,0.126910559061474,0.517232370861764,-0.0350493686052974,-0.465211076182388,0.320198198514526,0.0445191674731724,0.177839798284401,0.261145002567677,-0.143275874698919,0],
	[7519,1.23423504613468,3.0197404207034,-4.30459688479665,4.73279513041887,3.62420083055386,-1.35774566315358,1.71344498787235,-0.496358487073991,-1.28285782036322,-2.44746925511151,2.10134386504854,-4.6096283906446,1.46437762476188,-6.07933719308005,-0.339237372732577,2.58185095378146,6.73938438478335,3.04249317830411,-2.72185312222835,0.00906083639534526,-0.37906830709218,-0.704181032215427,-0.656804756348389,-1.63265295692929,1.48890144838237,0.566797273468934,-0.0100162234965625,0.146792734916988,1],
	[7526,0.00843036489558254,4.13783683497998,-6.24069657194744,6.6757321631344,0.768307024571449,-3.35305954788994,-1.63173467271809,0.15461244822474,-2.79589246446281,-6.18789062970647,5.66439470857116,-9.85448482287037,-0.306166658250084,-10.6911962118171,-0.638498192673322,-2.04197379107768,-1.12905587703585,0.116452521226364,-1.93466573889727,0.488378221134715,0.36451420978479,-0.608057133838703,-0.539527941820093,0.128939982991813,1.48848121006868,0.50796267782385,0.735821636119662,0.513573740679437,1]]
for row in data:
	# make prediction
	yhat = pipeline.predict_proba([row])
	# get the probability for the positive class
	result = yhat[0][1]
	# summarize
	print('>Predicted=%.3f (expected 1)' % (result))
```

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label of normal cases chosen from the dataset file. We can see that all cases are correctly predicted.

Then some fraud cases are used as input to the model and the label is predicted. As we might have hoped, most of the examples are predicted correctly with the default threshold. This highlights the need for a user of the model to select an appropriate probability threshold.

Normal cases:

```
>Predicted=0.000 (expected 0)
>Predicted=0.000 (expected 0)
>Predicted=0.000 (expected 0)
Fraud cases:
>Predicted=1.000 (expected 1)
>Predicted=0.400 (expected 1)
>Predicted=1.000 (expected 1)
```

This section provides more resources on the topic if you are looking to go deeper.

- pandas.read_csv API.
- sklearn.metrics.precision_recall_curve API.
- sklearn.metrics.auc API.
- sklearn.metrics.make_scorer API.
- sklearn.dummy.DummyClassifier API.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced credit card fraud classification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict the probability of fraud for specific cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Classification with the Fraudulent Credit Card Transactions Dataset appeared first on Machine Learning Mastery.

The post Step-By-Step Framework for Imbalanced Classification Projects appeared first on Machine Learning Mastery.

It is a challenging problem in general, especially if little is known about the dataset, as there are tens, if not hundreds, of machine learning algorithms to choose from. The problem is made significantly more difficult if the distribution of examples across the classes is imbalanced. This requires the use of specialized methods to either change the dataset or change the learning algorithm to handle the skewed class distribution.

A common way to deal with the overwhelm on a new classification project is to use a favorite machine learning algorithm like Random Forest or SMOTE. Another common approach is to scour the research literature for descriptions of vaguely similar problems and attempt to re-implement the algorithms and configurations that are described.

These approaches can be effective, although they are hit-or-miss and time-consuming respectively. Instead, the shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms in order to discover what works well, then double down. This approach can also be used for imbalanced classification problems, tailored for the range of data sampling, cost-sensitive, and one-class classification algorithms that one may choose from.

In this tutorial, you will discover a systematic framework for working through an imbalanced classification dataset.

After completing this tutorial, you will know:

- The challenge of choosing an algorithm for imbalanced classification.
- A high-level framework for systematically working through an imbalanced classification project.
- Specific algorithm suggestions to try at each step of an imbalanced classification project.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into three parts; they are:

- What Algorithm To Use?
- Use a Systematic Framework
- Detailed Framework for Imbalanced Classification
- Select a Metric
- Spot Check Algorithms
- Spot Check Imbalanced Algorithms
- Hyperparameter Tuning

You’re handed or acquire an imbalanced classification dataset. *Now what?*

There are so many machine learning algorithms to choose from, let alone techniques specifically designed for imbalanced classification.

**Which algorithms do you use?** *How do you choose?*

This is the challenge faced at the beginning of each new imbalanced classification project. It is this challenge that makes applied machine learning both thrilling and terrifying.

There are perhaps two common ways to solve this problem:

- Use a favorite algorithm.
- Use what has worked previously.

One approach might be to select a favorite algorithm and start tuning the hyperparameters. This is a fast approach to a solution but is only effective if your favorite algorithm just happens to be the best solution for your specific dataset.

Another approach might be to review the literature and see what techniques have been used on datasets like yours. This can be effective if many people have studied and reported results on similar datasets.

In practice, this is rarely the case, and research publications are biased towards showing promise for a pet algorithm rather than presenting an honest comparison of methods. At best, literature can be used for ideas for techniques to try.

Instead, if little is known about the problem, then the shortest path to a “*good*” result is to systematically test a suite of different algorithms on your dataset.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Consider a balanced classification task.

You’re faced with the same challenge of selecting which algorithms to use to address your dataset.

There are many solutions to this problem, but perhaps the most robust is to systematically test a suite of algorithms and use empirical results to choose.

Biases like “*my favorite algorithm*” or “*what has worked in the past*” can feed ideas into the study, but can lead you astray if relied upon. Instead, you need to let the results from systematic empirical experiments tell you which algorithm is good or best for your imbalanced classification dataset.

Once you have a dataset, the process involves the three steps of (1) selecting a metric by which to evaluate candidate models, (2) testing a suite of algorithms, and (3) tuning the best performing models. This may not be the only approach; it is just the simplest reliable process to get you from “*I have a new dataset*” to “*I have good results*” very quickly.

This process can be summarized as follows:

- Select a Metric
- Spot Check Algorithms
- Hyperparameter Tuning

Spot-checking algorithms is a little more involved as many algorithms require specialized data preparation such as scaling, removal of outliers, and more. Also, evaluating candidate algorithms requires the careful design of a test harness, often involving the use of k-fold cross-validation to estimate the performance of a given model on unseen data.

We can use this simple process for imbalanced classification.

It is still important to spot check standard machine learning algorithms on imbalanced classification. Standard algorithms often do not perform well when the class distribution is imbalanced. Nevertheless, testing them first provides a baseline in performance by which more specialized models can be compared and must out-perform.

It is also still important to tune the hyperparameters of well-performing algorithms. This includes the hyperparameters of models specifically designed for imbalanced classification.

Therefore, we can use the same three-step procedure and insert an additional step to evaluate imbalanced classification algorithms.

We can summarize this process as follows:

- Select a Metric
- Spot Check Algorithms
- Spot Check Imbalanced Algorithms
- Hyperparameter Tuning

This provides a high-level systematic framework to work through an imbalanced classification problem.

Nevertheless, there are many imbalanced algorithms to choose from, let alone many different standard machine learning algorithms to choose from.

We require a similar low-level systematic framework for each step.

We can develop a similar low-level framework to systematically work through each step of an imbalanced classification project.

From selecting a metric to hyperparameter tuning.

Selecting a metric might be the most important step in the project.

The metric is the measuring stick by which all models are evaluated and compared. The choice of the wrong metric can mean choosing the wrong algorithm. That is, a model that solves a different problem from the problem you actually want solved.

The metric must capture those details about a model or its predictions that are most important to the project or project stakeholders.

This is hard, as there are many metrics to choose from and often project stakeholders are not sure what they want. There may also be multiple ways to frame the problem, and it may be beneficial to explore a few different framings and, in turn, different metrics to see what makes sense to stakeholders.

First, you must decide whether you want to predict probabilities or crisp class labels. Recall that for binary imbalanced classification tasks, the majority class is normal, called the “*negative class*“, and the minority class is the exception, called the “*positive class*“.

Probabilities capture the uncertainty of the prediction, whereas crisp class labels can be used immediately.

- **Probabilities**: Predict the probability of class membership for each example.
- **Class Labels**: Predict a crisp class label for each example.
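The distinction shows up directly in scikit-learn, where *predict()* returns crisp class labels and *predict_proba()* returns per-class probabilities. A minimal sketch on a synthetic imbalanced dataset (the 99/1 class split and logistic regression model are illustrative choices, not from the tutorial):

```python
# contrast crisp class labels with probability predictions
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic imbalanced binary dataset (about 99% negative class)
X, y = make_classification(n_samples=1000, n_features=4, weights=[0.99],
                           flip_y=0, random_state=1)
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# crisp class labels: one of {0, 1} per example, usable immediately
labels = model.predict(X[:3])
# probabilities: one column per class, capturing prediction uncertainty
probs = model.predict_proba(X[:3])
print(labels)
print(probs.shape)  # (3, 2)
```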

If probabilities are intended to be used directly, then a good metric might be the Brier Score and the Brier Skill score.
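As a sketch of how these two measures relate, the Brier Score is available as scikit-learn's *brier_score_loss()*, and the Brier Skill Score can be derived from it against a no-skill reference that predicts the positive-class prior (the toy labels and probabilities below are illustrative):

```python
# Brier Score and Brier Skill Score for probability predictions
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 20% positive class
y_prob = np.array([0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.3, 0.8, 0.7])

# Brier Score: mean squared error of the predicted probabilities (lower is better)
bs_model = brier_score_loss(y_true, y_prob)

# no-skill reference: always predict the positive-class prior (0.2 here)
y_ref = np.full(len(y_true), y_true.mean())
bs_ref = brier_score_loss(y_true, y_ref)

# Brier Skill Score: improvement over the reference (higher is better, 0 = no skill)
bss = 1.0 - (bs_model / bs_ref)
print('Brier Score=%.3f, Brier Skill Score=%.3f' % (bs_model, bss))
```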

Alternately, you may want to predict probabilities and allow the user to map them to crisp class labels themselves via a user-selected threshold. In this case, a measure can be chosen that summarizes the performance of the model across the range of possible thresholds.

If the positive class is the most important, then the precision-recall curve and area under curve (PR AUC) can be used. This will optimize both precision and recall across all thresholds.

Alternately, if both classes are equally important, the ROC Curve and area under curve (ROC AUC) can be used. This will maximize the true positive rate and minimize the false positive rate.

If class labels are required and both classes are equally important, a good default metric is classification accuracy. This only makes sense if the majority class makes up less than about 80 percent of the data. A majority class with greater than 80 or 90 percent skew will swamp the accuracy metric, and it will lose its meaning for comparing algorithms.

If the class distribution is severely skewed, then the G-mean metric can be used that will optimize the sensitivity and specificity metrics.
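The G-mean is not provided by scikit-learn directly (the imbalanced-learn library offers *geometric_mean_score()*), but it can be computed from a confusion matrix as the geometric mean of sensitivity and specificity; a minimal sketch with illustrative predictions:

```python
# G-mean: geometric mean of sensitivity (TPR) and specificity (TNR)
from math import sqrt
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # recall of the positive (minority) class
specificity = tn / (tn + fp)  # recall of the negative (majority) class
g_mean = sqrt(sensitivity * specificity)
print('G-mean: %.3f' % g_mean)
```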

If the positive class is more important, then variations of the F-Measure can be used that optimize the precision and recall. If false positives and false negatives are equally important, then F1 can be used. If false negatives are more costly, then the F2-Measure can be used; otherwise, if false positives are more costly, then the F0.5-Measure can be used.
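All three variations are available through scikit-learn's *fbeta_score()*, where the *beta* argument controls the trade-off: beta > 1 weights recall (false negatives) more heavily, beta < 1 weights precision (false positives). A short illustrative sketch:

```python
# F1, F2, and F0.5 via fbeta_score on illustrative predictions
from sklearn.metrics import fbeta_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

f1 = fbeta_score(y_true, y_pred, beta=1.0)   # false negatives and positives equally costly
f2 = fbeta_score(y_true, y_pred, beta=2.0)   # false negatives more costly
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # false positives more costly
print('F1=%.3f F2=%.3f F0.5=%.3f' % (f1, f2, f05))
```

Here recall (0.75) exceeds precision (0.6), so the recall-weighted F2 scores highest and the precision-weighted F0.5 lowest.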

These are just heuristics but provide a useful starting point if you feel lost choosing a metric for your imbalanced classification task.

We can summarize these heuristics into a framework as follows:

- **Are you predicting probabilities?**
  - **Do you need class labels?**
    - **Is the positive class more important?**
      - Use Precision-Recall AUC
    - **Are both classes important?**
      - Use ROC AUC
  - **Do you need probabilities?**
    - Use Brier Score and Brier Skill Score
- **Are you predicting class labels?**
  - **Is the positive class more important?**
    - **Are False Negatives and False Positives Equally Costly?**
      - Use F1-Measure
    - **Are False Negatives More Costly?**
      - Use F2-Measure
    - **Are False Positives More Costly?**
      - Use F0.5-Measure
  - **Are both classes important?**
    - **Do you have < 80%-90% Examples for the Majority Class?**
      - Use Accuracy
    - **Do you have > 80%-90% Examples for the Majority Class?**
      - Use G-Mean
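These heuristics could also be sketched as a small helper function; the function name and argument names below are hypothetical, chosen purely to make the branching explicit:

```python
# hypothetical helper encoding the metric-selection heuristics above
def suggest_metric(predict_probabilities, positive_class_more_important=False,
                   false_negatives_more_costly=False, false_positives_more_costly=False,
                   majority_class_fraction=0.5, need_crisp_labels=False):
    if predict_probabilities:
        if need_crisp_labels:
            # user maps probabilities to labels via a threshold
            return 'PR AUC' if positive_class_more_important else 'ROC AUC'
        # probabilities used directly
        return 'Brier Score / Brier Skill Score'
    # predicting crisp class labels
    if positive_class_more_important:
        if false_negatives_more_costly:
            return 'F2-Measure'
        if false_positives_more_costly:
            return 'F0.5-Measure'
        return 'F1-Measure'
    # both classes important: accuracy only meaningful below ~80% skew
    return 'Accuracy' if majority_class_fraction < 0.8 else 'G-Mean'

print(suggest_metric(predict_probabilities=False,
                     positive_class_more_important=True,
                     false_negatives_more_costly=True))  # F2-Measure
```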

We can also transform these decisions into a decision tree, as follows.

Once a metric has been chosen, you can start evaluating machine learning algorithms.

Spot checking machine learning algorithms means evaluating a suite of different types of algorithms with minimal hyperparameter tuning.

Specifically, it means giving each algorithm a good chance to learn about the problem, including performing any required data preparation expected by the algorithm and using best-practice configuration options or defaults.

The objective is to quickly test a range of standard machine learning algorithms and provide a baseline in performance to which techniques specialized for imbalanced classification must be compared and outperform in order to be considered skillful. The idea here is that there is little point in using fancy imbalanced algorithms if they cannot out-perform so-called unbalanced algorithms.

A robust test harness must be defined. This often involves k-fold cross-validation, with k=10 as a sensible default. Stratified cross-validation is often required to ensure that each fold has the same class distribution as the original dataset. And the cross-validation procedure is often repeated multiple times, such as 3, 10, or 30, in order to effectively capture a sample of model performance on the dataset, summarized with a mean and standard deviation of the scores.
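Such a harness might be sketched as follows; the synthetic dataset, decision tree model, and F1 metric are illustrative stand-ins for your own data and chosen metric:

```python
# repeated stratified k-fold test harness sketch (k=10, 3 repeats)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced dataset for illustration (about 95% majority class)
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)

# stratification keeps each fold's class distribution close to the original
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(), X, y, scoring='f1', cv=cv)

# 30 scores (10 folds x 3 repeats) summarized with mean and standard deviation
print('F1: %.3f (%.3f)' % (mean(scores), std(scores)))
```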

There are perhaps four levels of algorithms to spot check; they are:

- Naive Algorithms
- Linear Algorithms
- Nonlinear Algorithms
- Ensemble Algorithms

Firstly, a naive classification must be evaluated.

This provides a rock-bottom baseline in performance that any algorithm must overcome in order to have skill on the dataset.

Naive means that the algorithm has no logic other than an if-statement or predicting a constant value. The choice of naive algorithm is based on the choice of performance metric.

For example, a suitable naive algorithm for classification accuracy is to predict the majority class in all cases. A suitable naive algorithm for the Brier Score when evaluating probabilities is to predict the prior probability of each class in the training dataset.

A suggested mapping of performance metrics to naive algorithms is as follows:

- **Accuracy**: Predict the majority class (class 0).
- **G-Mean**: Predict a uniformly random class.
- **F-Measure**: Predict the minority class (class 1).
- **ROC AUC**: Predict a stratified random class.
- **PR AUC**: Predict a stratified random class.
- **Brier Score**: Predict majority class prior.

If you are unsure of the “*best*” naive algorithm for your metric, perhaps test a few and discover which results in the better performance that you can use as your rock-bottom baseline.

Some options include:

- Predict the majority class in all cases.
- Predict the minority class in all cases.
- Predict a uniform randomly selected class.
- Predict a randomly selected class selected with the prior probabilities of each class.
- Predict the class prior probabilities.
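Each of these options maps to a *strategy* of scikit-learn's *DummyClassifier*; a sketch on an illustrative synthetic dataset:

```python
# naive baselines via DummyClassifier, one strategy per option listed above
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier

# synthetic imbalanced dataset for illustration (about 90% class 0)
X, y = make_classification(n_samples=100, weights=[0.9], flip_y=0, random_state=1)

strategies = {
    'majority class': DummyClassifier(strategy='most_frequent'),
    'minority class': DummyClassifier(strategy='constant', constant=1),
    'uniform random': DummyClassifier(strategy='uniform'),
    'stratified random': DummyClassifier(strategy='stratified'),
    'class prior': DummyClassifier(strategy='prior'),
}
for name, model in strategies.items():
    model.fit(X, y)
    # each baseline ignores the inputs and predicts per its strategy
    print(name, model.predict(X[:1]))
```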

Linear algorithms are those that are often drawn from the field of statistics and make strong assumptions about the functional form of the problem.

We can refer to them as linear because the output is a linear combination of the inputs, or weighted inputs, although this definition is stretched. You might also refer to these algorithms as probabilistic algorithms as they are often fit under a probabilistic framework.

They are often fast to train and often perform very well. Examples of linear algorithms you should consider trying include:

- Logistic Regression
- Linear Discriminant Analysis
- Naive Bayes

Nonlinear algorithms are drawn from the field of machine learning and make few assumptions about the functional form of the problem.

We can refer to them as nonlinear because the output is often a nonlinear mapping of inputs to outputs.

They often require more data than linear algorithms and are slower to train. Examples of nonlinear algorithms you should consider trying include:

- Decision Tree
- k-Nearest Neighbors
- Artificial Neural Networks
- Support Vector Machine

Ensemble algorithms are also drawn from the field of machine learning and combine the predictions from two or more models.

There are many ensemble algorithms to choose from, but when spot-checking algorithms, it is a good idea to focus on ensembles of decision tree algorithms, given that they are known to perform so well in practice on a wide range of problems.

Examples of ensembles of decision tree algorithms you should consider trying include:

- Bagged Decision Trees
- Random Forest
- Extra Trees
- Stochastic Gradient Boosting

We can summarize these suggestions into a framework for testing machine learning algorithms on a dataset.

- Naive Algorithms
- Majority Class
- Minority Class
- Class Priors

- Linear Algorithms
- Logistic Regression
- Linear Discriminant Analysis
- Naive Bayes

- Nonlinear Algorithms
- Decision Tree
- k-Nearest Neighbors
- Artificial Neural Networks
- Support Vector Machine

- Ensemble Algorithms
- Bagged Decision Trees
- Random Forest
- Extra Trees
- Stochastic Gradient Boosting
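As a sketch, this framework might be captured as a dictionary of scikit-learn models to iterate over with your test harness; the configurations below are reasonable defaults, not prescriptions, and any naive strategy could stand in for the first level:

```python
# the four-level spot-check framework as a dict of scikit-learn models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import (BaggingClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)

def get_spot_check_models():
    return {
        # naive (rock-bottom baseline)
        'Majority': DummyClassifier(strategy='most_frequent'),
        # linear
        'LR': LogisticRegression(max_iter=1000),
        'LDA': LinearDiscriminantAnalysis(),
        'NB': GaussianNB(),
        # nonlinear
        'CART': DecisionTreeClassifier(),
        'KNN': KNeighborsClassifier(),
        'ANN': MLPClassifier(max_iter=500),
        'SVM': SVC(probability=True),
        # ensembles of decision trees
        'BAG': BaggingClassifier(n_estimators=100),
        'RF': RandomForestClassifier(n_estimators=100),
        'ET': ExtraTreesClassifier(n_estimators=100),
        'GBM': GradientBoostingClassifier(n_estimators=100),
    }

models = get_spot_check_models()
print(len(models), 'models to spot check')
```

Each model can then be passed in turn to the same evaluation procedure so that the scores are directly comparable.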

The order of the steps is probably not flexible. Think of the order of algorithms as increasing in complexity, and in turn, capability.

The order of algorithms within each step is flexible and the list of algorithms is not complete, and probably never could be given the vast number of algorithms available. Limiting the algorithms tested to a subset of the most common or most widely used is a good start. Using data-preparation recommendations and hyperparameter defaults is also a good start.

The figure below summarizes this step of the framework.

Spot-checking imbalanced algorithms is much like spot-checking machine learning algorithms.

The objective is to quickly test a large number of techniques in order to discover what shows promise so that you can focus more attention on it later during hyperparameter tuning.

The spot-checking performed in the previous section provides both naive and modestly skillful models against which all imbalanced techniques can be compared. This allows you to focus on the methods that truly show promise on the problem, rather than getting excited about results that appear effective only when compared to other imbalanced classification techniques (an easy trap to fall into).

There are perhaps four types of imbalanced classification techniques to spot check:

- Data Sampling Algorithms
- Cost-Sensitive Algorithms
- One-Class Algorithms
- Probability Tuning Algorithms

Data sampling algorithms change the composition of the training dataset to improve the performance of a standard machine learning algorithm on an imbalanced classification problem.

There are perhaps three main types of data sampling techniques; they are:

- Data Oversampling.
- Data Undersampling.
- Combined Oversampling and Undersampling.

Data oversampling involves duplicating examples of the minority class or synthesizing new minority-class examples from existing examples. Perhaps the most popular method is SMOTE, with variations such as Borderline SMOTE. Perhaps the most important hyperparameter to tune is the amount of oversampling to perform.

Examples of data oversampling methods include:

- Random Oversampling
- SMOTE
- Borderline SMOTE
- SVM SMOTE
- k-Means SMOTE
- ADASYN
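
SMOTE and its variants are provided by the separate imbalanced-learn library; as a dependency-light sketch, the snippet below implements only the simplest of these, random oversampling, using scikit-learn's resample utility on a synthetic dataset (the dataset and class weighting are illustrative):

```python
# Sketch: random oversampling of the minority class with scikit-learn alone.
from collections import Counter
from numpy import vstack, hstack
from sklearn.datasets import make_classification
from sklearn.utils import resample

# synthetic dataset with a 95:5 class skew (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print('Before:', Counter(y))
# duplicate minority-class rows (sampling with replacement) until balanced
X_maj, X_min = X[y == 0], X[y == 1]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_bal = vstack((X_maj, X_up))
y_bal = hstack(([0] * len(X_maj), [1] * len(X_up)))
print('After:', Counter(y_bal))
```

The amount of oversampling is controlled here by n_samples; it need not fully balance the classes.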

Undersampling involves deleting examples from the majority class, such as randomly or using an algorithm to carefully choose which examples to delete. Popular editing algorithms include the edited nearest neighbors and Tomek links.

Examples of data undersampling methods include:

- Random Undersampling
- Condensed Nearest Neighbor
- Tomek Links
- Edited Nearest Neighbors
- Neighborhood Cleaning Rule
- One-Sided Selection
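
Most of these editing methods are implemented in the imbalanced-learn library; the sketch below shows only the simplest case, random undersampling, using NumPy indexing on an illustrative synthetic dataset:

```python
# Sketch: random undersampling of the majority class with NumPy.
from collections import Counter
import numpy as np
from sklearn.datasets import make_classification

# synthetic dataset with a 95:5 class skew (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
print('Before:', Counter(y))
rng = np.random.default_rng(1)
maj_ix = np.flatnonzero(y == 0)
min_ix = np.flatnonzero(y == 1)
# keep a random subset of majority rows equal in size to the minority class
keep = rng.choice(maj_ix, size=len(min_ix), replace=False)
ix = np.concatenate((keep, min_ix))
X_bal, y_bal = X[ix], y[ix]
print('After:', Counter(y_bal))
```

Unlike this random approach, editing methods such as Tomek links choose which majority rows to delete based on their neighborhood.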

Almost any oversampling method can be combined with almost any undersampling technique. Therefore, it may be beneficial to test a suite of different combinations of oversampling and undersampling techniques.

Examples of popular combinations of over and undersampling include:

- SMOTE and Random Undersampling
- SMOTE and Tomek Links
- SMOTE and Edited Nearest Neighbors

Data sampling algorithms may perform differently depending on the choice of machine learning algorithm.

As such, it may be beneficial to test a suite of standard machine learning algorithms, such as all or a subset of those algorithms used when spot checking in the previous section.

Additionally, most data sampling algorithms make use of the k-nearest neighbor algorithm internally. This algorithm is very sensitive to the data types and scale of input variables. As such, it may be important to at least normalize input variables that have differing scales prior to testing the methods, and perhaps using specialized methods if some input variables are categorical instead of numerical.

Cost-sensitive algorithms are modified versions of machine learning algorithms designed to take the differing costs of misclassification into account when fitting the model on the training dataset.

These algorithms can be effective when used on imbalanced classification, where the cost of misclassification is configured to be inversely proportional to the distribution of examples in the training dataset.

There are many cost-sensitive algorithms to choose from, although it might be practical to test a range of cost-sensitive versions of linear, nonlinear, and ensemble algorithms.

Some examples of machine learning algorithms that can be configured using cost-sensitive training include:

- Logistic Regression
- Decision Trees
- Support Vector Machines
- Artificial Neural Networks
- Bagged Decision Trees
- Random Forest
- Stochastic Gradient Boosting
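
In scikit-learn, many of these algorithms expose cost-sensitive training through the class_weight argument, where 'balanced' weights each class inversely proportional to its frequency; a minimal sketch on an illustrative synthetic dataset:

```python
# Sketch: cost-sensitive learning via class_weight in scikit-learn.
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression

# synthetic dataset with a 99:1 class skew (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.99], flip_y=0, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# compare standard and cost-sensitive versions of the same algorithm
for weights in [None, 'balanced']:
    model = LogisticRegression(class_weight=weights, max_iter=1000)
    scores = cross_val_score(model, X, y, scoring='f1', cv=cv, n_jobs=-1)
    print('class_weight=%s: F1 %.3f' % (weights, mean(scores)))
```

A custom dictionary of per-class costs (e.g. class_weight={0: 1, 1: 100}) can also be tuned as a hyperparameter.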

Algorithms used for outlier detection and anomaly detection can be used for classification tasks.

Although unusual, when used in this way, they are often referred to as one-class classification algorithms.

In some cases, one-class classification algorithms can be very effective, such as when there is a severe class imbalance with very few examples of the positive class.

Examples of one-class classification algorithms to try include:

- One-Class Support Vector Machines
- Isolation Forests
- Minimum Covariance Determinant
- Local Outlier Factor
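
A minimal sketch of the one-class approach, using scikit-learn's IsolationForest fit on majority-class examples only; the dataset, contamination value, and metric are illustrative assumptions:

```python
# Sketch: an outlier-detection algorithm used as a one-class classifier.
from numpy import where
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# severely imbalanced synthetic dataset (illustrative)
X, y = make_classification(n_samples=2000, weights=[0.99], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
# fit on majority-class (inlier) examples only
model = IsolationForest(contamination=0.01, random_state=1)
model.fit(X_tr[y_tr == 0])
# predict() returns +1 for inliers and -1 for outliers; map -1 to the positive class
yhat = where(model.predict(X_te) == -1, 1, 0)
print('F1: %.3f' % f1_score(y_te, yhat))
```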

Predicted probabilities can be improved in two ways; they are:

- Calibrating Probabilities.
- Tuning the Classification Threshold.

Some algorithms are fit using a probabilistic framework and, in turn, have calibrated probabilities.

This means that when 100 examples are predicted to have the positive class label with a probability of 80 percent, then the algorithm will predict the correct class label 80 percent of the time.

Calibrated probabilities are required for a model to be considered skillful on a binary classification task when probabilities are either required as the output or used to evaluate the model (e.g. with ROC AUC or PR AUC).

Some examples of machine learning algorithms that predict calibrated probabilities are as follows:

- Logistic Regression
- Linear Discriminant Analysis
- Naive Bayes
- Artificial Neural Networks

Most nonlinear algorithms do not predict calibrated probabilities; therefore, calibration algorithms can be used to post-process the predicted probabilities in order to calibrate them.

Therefore, when probabilities are required directly or are used to evaluate a model, and nonlinear algorithms are being used, it is important to calibrate the predicted probabilities.

Some examples of probability calibration algorithms to try include:

- Platt Scaling
- Isotonic Regression
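
Both methods are available in scikit-learn through the CalibratedClassifierCV class, where method='sigmoid' performs Platt scaling and method='isotonic' performs isotonic regression; a sketch on an illustrative synthetic dataset:

```python
# Sketch: calibrating SVM probabilities with Platt scaling or isotonic regression.
from numpy import mean
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.svm import SVC

# synthetic imbalanced dataset (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.9], flip_y=0, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
for method in ['sigmoid', 'isotonic']:
    # wrap the base model so its probabilities are calibrated via internal CV
    model = CalibratedClassifierCV(SVC(gamma='scale'), method=method, cv=3)
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    print('%s: ROC AUC %.3f' % (method, mean(scores)))
```

Isotonic regression is more flexible but typically needs more data than Platt scaling to avoid overfitting.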

Some algorithms are designed to natively predict probabilities that later must be mapped to crisp class labels.

This is the case if class labels are required as output for the problem, or the model is evaluated using class labels.

Examples of probabilistic machine learning algorithms that predict a probability by default include:

- Logistic Regression
- Linear Discriminant Analysis
- Naive Bayes
- Artificial Neural Networks

Probabilities are mapped to class labels using a threshold probability value. All probabilities below the threshold are mapped to class 0, and all probabilities equal-to or above the threshold are mapped to class 1.

The default threshold is 0.5, although different thresholds can be used that will dramatically impact the class labels and, in turn, the performance of a machine learning model that natively predicts probabilities.

As such, if probabilistic algorithms are used that natively predict a probability and class labels are required as output or used to evaluate models, it is a good idea to try tuning the classification threshold.
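
A minimal sketch of threshold tuning, searching a grid of candidate thresholds for the one that maximizes the F-measure on a held-out set; the dataset, model, and metric are illustrative choices:

```python
# Sketch: tuning the classification threshold on predicted probabilities.
from numpy import arange, argmax
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic imbalanced dataset (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# probability of the positive class for each held-out example
probs = model.predict_proba(X_te)[:, 1]
# score each candidate threshold and keep the best
thresholds = arange(0.0, 1.0, 0.01)
scores = [f1_score(y_te, (probs >= t).astype(int)) for t in thresholds]
best = argmax(scores)
print('Best threshold=%.2f, F1=%.3f' % (thresholds[best], scores[best]))
```

By construction, the selected threshold performs at least as well as the default of 0.5 on the held-out data.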

We can summarize these suggestions into a framework for testing imbalanced machine learning algorithms on a dataset.

- Data Sampling Algorithms
- Data Oversampling
- Random Oversampling
- SMOTE
- Borderline SMOTE
- SVM SMOTE
- k-Means SMOTE
- ADASYN

- Data Undersampling
- Random Undersampling
- Condensed Nearest Neighbor
- Tomek Links
- Edited Nearest Neighbors
- Neighborhood Cleaning Rule
- One-Sided Selection

- Combined Oversampling and Undersampling
- SMOTE and Random Undersampling
- SMOTE and Tomek Links
- SMOTE and Edited Nearest Neighbors

- Cost-Sensitive Algorithms
- Logistic Regression
- Decision Trees
- Support Vector Machines
- Artificial Neural Networks
- Bagged Decision Trees
- Random Forest
- Stochastic Gradient Boosting

- One-Class Algorithms
- One-Class Support Vector Machines
- Isolation Forests
- Minimum Covariance Determinant
- Local Outlier Factor

- Probability Tuning Algorithms
- Calibrating Probabilities
- Platt Scaling
- Isotonic Regression

- Tuning the Classification Threshold

The order of the steps is flexible, as is the order of algorithms within each step, and the list of algorithms is not complete.

The structure is designed to get you thinking systematically about what algorithm to evaluate.

The figure below summarizes the framework.

After spot-checking machine learning algorithms and imbalanced algorithms, you will have some idea of what works and what does not on your specific dataset.

The simplest approach to hyperparameter tuning is to select the top five or 10 algorithms or algorithm combinations that performed well and tune the hyperparameters for each.

There are three popular hyperparameter tuning algorithms that you may choose from:

- Random Search
- Grid Search
- Bayesian Optimization

A good default is grid search if you know what hyperparameter values to try; otherwise, random search should be used. Bayesian optimization should be used if possible, but can be more challenging to set up and run.
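
A sketch of random search using scikit-learn's RandomizedSearchCV, here tuning a logistic regression; the search space, dataset, and scoring metric are illustrative:

```python
# Sketch: random search over hyperparameters with RandomizedSearchCV.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, RepeatedStratifiedKFold

# synthetic imbalanced dataset (illustrative)
X, y = make_classification(n_samples=1000, weights=[0.95], flip_y=0, random_state=1)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# sample the regularization strength log-uniformly; also try class weighting
space = {'C': loguniform(1e-4, 100), 'class_weight': [None, 'balanced']}
search = RandomizedSearchCV(LogisticRegression(max_iter=1000), space, n_iter=20,
                            scoring='roc_auc', cv=cv, random_state=1, n_jobs=-1)
result = search.fit(X, y)
print('Best score: %.3f' % result.best_score_)
print('Best params: %s' % result.best_params_)
```

Swapping RandomizedSearchCV for GridSearchCV (with an explicit grid) gives the grid-search variant.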

Tuning the best performing methods is a good start, but not the only approach.

There may be some standard machine learning algorithms that perform well, but do not perform as well when used with data sampling or probability calibration. These algorithms could be tuned in concert with their imbalanced-classification augmentations to see if better performance can be achieved.

Additionally, there may be imbalanced-classification algorithms, such as a data sampling method that results in a dramatic lift in performance for one or more algorithms. These algorithms themselves may provide an interesting basis for further tuning to see if additional lifts in performance can be achieved.

This section provides more resources on the topic if you are looking to go deeper.

- Applied Machine Learning Process
- How to Use a Machine Learning Checklist to Get Accurate Predictions, Reliably

- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

In this tutorial, you discovered a systematic framework for working through an imbalanced classification dataset.

Specifically, you learned:

- The challenge of choosing an algorithm for imbalanced classification.
- A high-level framework for systematically working through an imbalanced classification project.
- Specific algorithm suggestions to try at each step of an imbalanced classification project.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Step-By-Step Framework for Imbalanced Classification Projects appeared first on Machine Learning Mastery.

The post Imbalanced Classification with the Adult Income Dataset appeared first on Machine Learning Mastery.

A popular example is the adult income dataset that involves predicting personal income levels as above or below $50,000 per year based on personal details such as relationship and education level. There are many more cases of incomes less than $50K than above $50K, although the skew is not severe.

This means that techniques for imbalanced classification can be used whilst model performance can still be reported using classification accuracy, as is used with balanced classification problems.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced adult income classification dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict class labels for specific cases.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Adult Income Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
- Make Prediction on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*Adult Income*” or simply the “*adult*” dataset.

The dataset is credited to Ronny Kohavi and Barry Becker and was drawn from the 1994 United States Census Bureau data and involves using personal details such as education level to predict whether an individual will earn more or less than $50,000 per year.

The Adult dataset is from the Census Bureau and the task is to predict whether a given adult makes more than $50,000 a year based on attributes such as education, hours of work per week, etc.

— Scaling Up The Accuracy Of Naive-bayes Classifiers: A Decision-tree Hybrid, 1996.

The dataset provides 14 input variables that are a mixture of categorical, ordinal, and numerical data types. The complete list of variables is as follows:

- Age.
- Workclass.
- Final Weight.
- Education.
- Education Number of Years.
- Marital-status.
- Occupation.
- Relationship.
- Race.
- Sex.
- Capital-gain.
- Capital-loss.
- Hours-per-week.
- Native-country.

The dataset contains missing values that are marked with a question mark character (?).

There are a total of 48,842 rows of data, and 3,620 with missing values, leaving 45,222 complete rows.

There are two class values ‘*>50K*‘ and ‘*<=50K*‘, meaning it is a binary classification task. The classes are imbalanced, with a skew toward the ‘*<=50K*‘ class label.

- **‘>50K’**: minority class, approximately 25%.
- **‘<=50K’**: majority class, approximately 75%.

Given that the class imbalance is not severe and that both class labels are equally important, it is common to use classification accuracy or classification error to report model performance on this dataset.

Using the predefined train and test sets, good reported classification error is approximately 14 percent, or a classification accuracy of about 86 percent. This might provide a target to aim for when working on this dataset.

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Adult dataset is a widely used standard machine learning dataset, used to explore and demonstrate many machine learning algorithms, both generally and those designed specifically for imbalanced classification.

First, download the dataset and save it in your current working directory with the name “*adult-all.csv*”.

Review the contents of the file.

The first few lines of the file should look as follows:

```
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...
```

We can see that the input variables are a mixture of numerical and categorical or ordinal data types, where the non-numerical columns are represented using strings. At a minimum, the categorical variables will need to be ordinal or one-hot encoded.

We can also see that the target variable is represented using strings. This column will need to be label encoded with 0 for the majority class and 1 for the minority class, as is the custom for binary imbalanced classification tasks.

Missing values are marked with a ‘*?*‘ character. These values will need to be imputed, or given the small number of examples, these rows could be deleted from the dataset.

The dataset can be loaded as a *DataFrame* using the read_csv() Pandas function, specifying the filename, that there is no header line, and that strings like ‘ *?*‘ should be parsed as *NaN* (missing) values.

```python
...
# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, na_values='?')
```

Once loaded, we can remove the rows that contain one or more missing values.

```python
...
# drop rows with missing values
dataframe = dataframe.dropna()
```

We can summarize the number of rows and columns by printing the shape of the DataFrame.

```python
...
# summarize the shape of the dataset
print(dataframe.shape)
```

We can also summarize the number of examples in each class using the Counter object.

```python
...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))
```

Tying this together, the complete example of loading and summarizing the dataset is listed below.

```python
# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None, na_values='?')
# drop rows with missing values
dataframe = dataframe.dropna()
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))
```

Running the example first loads the dataset and confirms the number of rows and columns: 45,222 rows without missing values, 14 input variables, and one target variable.

The class distribution is then summarized, confirming a modest class imbalance with approximately 75 percent for the majority class (<=50K) and approximately 25 percent for the minority class (>50K).

```
(45222, 15)
Class= <=50K, Count=34014, Percentage=75.216%
Class= >50K, Count=11208, Percentage=24.784%
```

We can also take a look at the distribution of the numerical input variables by creating a histogram for each.

First, we can select the columns with numeric variables by calling the select_dtypes() function on the DataFrame. We can then select just those columns from the DataFrame.

```python
...
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
```

We can then create histograms of each numeric input variable. The complete example is listed below.

```python
# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'adult-all.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None, na_values='?')
# drop rows with missing values
df = df.dropna()
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
# create a histogram plot of each numeric variable
subset.hist()
pyplot.show()
```

Running the example creates the figure with one histogram subplot for each of the six numeric input variables in the dataset. The title of each subplot indicates the column number in the DataFrame (zero-offset).

We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions. We can also see that they all appear to have a very different scale.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 45,222/10, or about 4,522 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 75 percent to 25 percent for the majority and minority classes respectively. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

We will predict a class label for each example and measure model performance using classification accuracy.

The *evaluate_model()* function below will take the loaded dataset and a defined model and will evaluate it using repeated stratified k-fold cross-validation, then return a list of accuracy scores that can later be summarized.

```python
# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores
```

We can define a function to load the dataset and label encode the target column.

We will also return a list of categorical and numeric columns in case we decide to transform them later when fitting models.

```python
# load the dataset
def load_dataset(full_path):
    # load the dataset as a data frame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing values
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix
```

Finally, we can evaluate a baseline model on the dataset using this test harness.

When using classification accuracy, a naive model will predict the majority class for all cases. This provides a baseline in model performance on this problem by which all other models can be compared.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “*strategy*” argument to ‘*most_frequent*‘.

```python
...
# define the reference model
model = DummyClassifier(strategy='most_frequent')
```

Once the model is evaluated, we can report the mean and standard deviation of the accuracy scores directly.

```python
...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Tying this together, the complete example of loading the Adult dataset, evaluating a baseline model, and reporting the performance is listed below.

```python
# test harness and baseline model evaluation for the adult dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a data frame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing values
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='most_frequent')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded. Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.

Next, the average classification accuracy score is reported.

In this case, we can see that the baseline algorithm achieves an accuracy of about 75.2%. This score provides a lower limit on model skill; any model that achieves an average accuracy above about 75.2% has skill, whereas models that achieve a score below this value do not have skill on this dataset.

```
(45222, 14) (45222,) Counter({0: 34014, 1: 11208})
Mean Accuracy: 0.752 (0.000)
```

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better classification accuracy performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the adult dataset:

- Decision Tree (CART)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 100.

The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

```python
# define models to test
def get_models():
    models, names = list(), list()
    # CART
    models.append(DecisionTreeClassifier())
    names.append('CART')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=100))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=100))
    names.append('GBM')
    return models, names
```

We will one-hot encode the categorical input variables using a OneHotEncoder, and we will normalize the numerical input variables using the MinMaxScaler. These operations must be performed within each train/test split during the cross-validation process, where the encoding and scaling operations are fit on the training set and applied to the train and test sets.

An easy way to implement this is to use a Pipeline where the first step is a ColumnTransformer that applies a *OneHotEncoder* to just the categorical variables, and a *MinMaxScaler* to just the numerical input variables. To achieve this, we need a list of the column indices for categorical and numerical input variables.

The *load_dataset()* function we defined in the previous section loads and returns both the dataset and lists of columns that have categorical and numerical data types. This can be used to prepare a *Pipeline* to wrap each model prior to evaluating it. First, the *ColumnTransformer* is defined, which specifies what transform to apply to each type of column, then this is used as the first step in a *Pipeline* that ends with the specific model that will be fit and evaluated.

```python
...
# define steps: one hot encode categorical, normalize numerical
steps = [('c', OneHotEncoder(handle_unknown='ignore'), cat_ix), ('n', MinMaxScaler(), num_ix)]
ct = ColumnTransformer(steps)
# wrap the model in a pipeline
pipeline = Pipeline(steps=[('t', ct), ('m', models[i])])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
```

We can summarize the mean accuracy for each algorithm, this will help to directly compare algorithms.

```python
...
# summarize performance
print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
```

At the end of the run, we will create a separate box and whisker plot for each algorithm’s sample of results. These plots will use the same y-axis scale so we can compare the distribution of results directly.

```python
...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the adult imbalanced dataset is listed below.

```python
# spot check machine learning algorithms on the adult imbalanced dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a data frame
    dataframe = read_csv(full_path, header=None, na_values='?')
    # drop rows with missing values
    dataframe = dataframe.dropna()
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # CART
    models.append(DecisionTreeClassifier())
    names.append('CART')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=100))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=100))
    names.append('GBM')
    return models, names

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # define steps: one hot encode categorical, normalize numerical
    steps = [('c', OneHotEncoder(handle_unknown='ignore'), cat_ix), ('n', MinMaxScaler(), num_ix)]
    ct = ColumnTransformer(steps)
    # wrap the model in a pipeline
    pipeline = Pipeline(steps=[('t', ct), ('m', models[i])])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize performance
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

**What scores did you get?**

Post your results in the comments below.

In this case, we can see that all of the chosen algorithms are skillful, achieving a classification accuracy above 75.2%. The ensemble decision tree algorithms perform well, with stochastic gradient boosting performing the best, achieving a classification accuracy of about 86.3%.

This is slightly better than the result reported in the original paper, albeit with a different model evaluation procedure.

>CART 0.812 (0.005)
>SVM 0.837 (0.005)
>BAG 0.852 (0.004)
>RF 0.849 (0.004)
>GBM 0.863 (0.004)

We can see that the distribution of scores for each algorithm appears to be above the baseline of about 75%, perhaps with a few outliers (circles on the plot). The distribution for each algorithm appears compact, with the median and mean aligning, suggesting the models are quite stable on this dataset and scores do not form a skewed distribution.

This highlights that it is not just the central tendency of model performance that is important, but also the spread, and even the worst-case result, that should be considered, especially with a limited number of examples of the minority class.

In this section, we can fit a final model and use it to make predictions on single rows of data.

We will use the GradientBoostingClassifier model as our final model, which achieved a classification accuracy of about 86.3%. Fitting the final model involves defining the ColumnTransformer to encode the categorical variables and scale the numerical variables, then constructing a Pipeline to perform these transforms on the training set prior to fitting the model.

The *Pipeline* can then be used to make predictions on new data directly, and will automatically encode and scale new data using the same operations as were performed on the training dataset.
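To make this concrete, the sketch below shows the same idea on tiny made-up rows (not the adult dataset): a *ColumnTransformer* wrapped in a *Pipeline* learns the encoding and scaling at fit time and applies them again automatically at prediction time.

```python
# minimal sketch: a ColumnTransformer inside a Pipeline prepares mixed columns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# toy rows: column 0 is categorical, column 1 is numerical (illustrative data only)
X = [['a', 10.0], ['b', 20.0], ['a', 30.0], ['b', 40.0]]
y = [0, 0, 1, 1]

# one hot encode column 0, min-max scale column 1
ct = ColumnTransformer([('c', OneHotEncoder(), [0]), ('n', MinMaxScaler(), [1])])
pipeline = Pipeline(steps=[('t', ct), ('m', DecisionTreeClassifier(random_state=1))])
pipeline.fit(X, y)

# new rows are transformed with the same fitted encoder and scaler automatically
print(pipeline.predict([['a', 12.0], ['b', 38.0]]))
```

The key point is that the encoder and scaler are fit once, on the training data, and the pipeline re-applies exactly those fitted transforms to any new rows passed to *predict()*.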

First, we can define the model as a pipeline.

...
# define model to evaluate
model = GradientBoostingClassifier(n_estimators=100)
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# define the pipeline
pipeline = Pipeline(steps=[('t',ct), ('m',model)])

Once defined, we can fit it on the entire training dataset.

...
# fit the model
pipeline.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the class label of 0 for “<=50K”, or 1 for “>50K”.

Importantly, we must use the *ColumnTransformer* within the *Pipeline* to correctly prepare new data using the same transforms.

For example:

...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])

The complete example is listed below.

# fit a model and make predictions on the adult dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
	# load the dataset as a data frame
	dataframe = read_csv(full_path, header=None, na_values='?')
	# drop rows with missing values
	dataframe = dataframe.dropna()
	# split into inputs and outputs
	last_ix = len(dataframe.columns) - 1
	X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
	# select categorical and numerical features
	cat_ix = X.select_dtypes(include=['object', 'bool']).columns
	num_ix = X.select_dtypes(include=['int64', 'float64']).columns
	# label encode the target variable to have the classes 0 and 1
	y = LabelEncoder().fit_transform(y)
	return X.values, y, cat_ix, num_ix

# define the location of the dataset
full_path = 'adult-all.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = GradientBoostingClassifier(n_estimators=100)
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# define the pipeline
pipeline = Pipeline(steps=[('t',ct), ('m',model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some <=50K cases (known class 0)
print('<=50K cases:')
data = [[24, 'Private', 161198, 'Bachelors', 13, 'Never-married', 'Prof-specialty', 'Not-in-family', 'White', 'Male', 0, 0, 25, 'United-States'],
	[23, 'Private', 214542, 'Some-college', 10, 'Never-married', 'Farming-fishing', 'Own-child', 'White', 'Male', 0, 0, 40, 'United-States'],
	[38, 'Private', 309122, '10th', 6, 'Divorced', 'Machine-op-inspct', 'Not-in-family', 'White', 'Female', 0, 0, 40, 'United-States']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 0)' % (label))
# evaluate on some >50K cases (known class 1)
print('>50K cases:')
data = [[55, 'Local-gov', 107308, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Husband', 'White', 'Male', 0, 0, 40, 'United-States'],
	[53, 'Self-emp-not-inc', 145419, '1st-4th', 2, 'Married-civ-spouse', 'Exec-managerial', 'Husband', 'White', 'Male', 7688, 0, 67, 'Italy'],
	[44, 'Local-gov', 193425, 'Masters', 14, 'Married-civ-spouse', 'Prof-specialty', 'Wife', 'White', 'Female', 4386, 0, 40, 'United-States']]
for row in data:
	# make prediction
	yhat = pipeline.predict([row])
	# get the label
	label = yhat[0]
	# summarize
	print('>Predicted=%d (expected 1)' % (label))

Running the example first fits the model on the entire training dataset.

The fit model is then used to predict labels for a few <=50K cases chosen from the dataset file. We can see that all cases are correctly predicted. Then some >50K cases are used as input to the model and their labels are predicted. As we might have hoped, the correct labels are again predicted.

<=50K cases:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>50K cases:
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)

This section provides more resources on the topic if you are looking to go deeper.

- pandas.DataFrame.select_dtypes API.
- sklearn.model_selection.RepeatedStratifiedKFold API.
- sklearn.dummy.DummyClassifier API.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced adult income classification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to systematically evaluate a suite of machine learning models with a robust test harness.
- How to fit a final model and use it to predict class labels for specific cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Classification with the Adult Income Dataset appeared first on Machine Learning Mastery.

The post Predictive Model for the Phoneme Imbalanced Classification Dataset appeared first on Machine Learning Mastery.

Nevertheless, accuracy is equally important in both classes.

An example is the classification of vowel sounds from European languages as either nasal or oral in speech recognition, where there are many more examples of nasal than oral vowels. Classification accuracy is important for both classes, although accuracy as a metric cannot be used directly. Additionally, data sampling techniques may be required to transform the training dataset to make it more balanced when fitting machine learning algorithms.

In this tutorial, you will discover how to develop and evaluate models for imbalanced binary classification of nasal and oral phonemes.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with data oversampling techniques.
- How to fit a final model and use it to predict class labels for specific cases.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Updated Jan/2021**: Updated links for API documentation.

This tutorial is divided into five parts; they are:

- Phoneme Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
  - Evaluate Machine Learning Algorithms
  - Evaluate Data Oversampling Algorithms
- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*Phoneme*” dataset.

This dataset is credited to the ESPRIT (European Strategic Program on Research in Information Technology) project titled “*ROARS*” (Robust Analytical Speech Recognition System) and described in progress reports and technical reports from that project.

The goal of the ROARS project is to increase the robustness of an existing analytical speech recognition system (i.e., one using knowledge about syllables, phonemes and phonetic features), and to use it as part of a speech understanding system with connected words and dialogue capability. This system will be evaluated for a specific application in two European languages.

— ESPRIT: The European Strategic Programme for Research and development in Information Technology.

The goal of the dataset was to distinguish between nasal and oral vowels.

Vowel sounds were spoken and recorded to digital files. Then audio features were automatically extracted from each sound.

Five different attributes were chosen to characterize each vowel: they are the amplitudes of the five first harmonics AHi, normalised by the total energy Ene (integrated on all the frequencies): AHi/Ene. Each harmonic is signed: positive when it corresponds to a local maximum of the spectrum and negative otherwise.

— Phoneme Dataset Description.

There are two classes for the two types of sounds; they are:

- **Class 0**: Nasal Vowels (majority class).
- **Class 1**: Oral Vowels (minority class).

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Phoneme dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.

One example is the popular SMOTE data oversampling technique.

First, download the dataset and save it in your current working directory with the name “*phoneme.csv*“.

Review the contents of the file.

The first few lines of the file should look as follows:

1.24,0.875,-0.205,-0.078,0.067,0
0.268,1.352,1.035,-0.332,0.217,0
1.567,0.867,1.3,1.041,0.559,0
0.279,0.99,2.555,-0.738,0.0,0
0.307,1.272,2.656,-0.946,-0.467,0
...

We can see that the given input variables are numeric and class labels are 0 and 1 for nasal and oral respectively.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

...
# define the dataset location
filename = 'phoneme.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

We can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'phoneme.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
	per = v / len(target) * 100
	print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns: 5,404 rows, with five input variables and one target variable.

The class distribution is then summarized, confirming a modest class imbalance with approximately 70 percent for the majority class (*nasal*) and approximately 30 percent for the minority class (*oral*).

(5404, 6)
Class=0.0, Count=3818, Percentage=70.651%
Class=1.0, Count=1586, Percentage=29.349%

We can also take a look at the distribution of the five numerical input variables by creating a histogram for each.

The complete example is listed below.

# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'phoneme.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# histograms of all variables
df.hist()
pyplot.show()

Running the example creates the figure with one histogram subplot for each of the five numerical input variables in the dataset, as well as the numerical class label.

We can see that the variables have differing scales, although most appear to have a Gaussian or Gaussian-like distribution.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps also the use of some power transforms.

We can also create a scatter plot for each pair of input variables, called a scatter plot matrix.

This can be helpful to see if any variables relate to each other or change in the same direction, e.g. are correlated.

We can also color the dots of each scatter plot according to the class label. In this case, the majority class (*nasal*) will be mapped to blue dots and the minority class (*oral*) will be mapped to red dots.

The complete example is listed below.

# create pairwise scatter plots of numeric input variables
from pandas import read_csv
from pandas import DataFrame
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# define the dataset location
filename = 'phoneme.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# define a mapping of class values to colors
color_dict = {0:'blue', 1:'red'}
# map each row to a color based on the class value
colors = [color_dict[x] for x in df.values[:, -1]]
# drop the target variable
inputs = DataFrame(df.values[:, :-1])
# pairwise scatter plots of all numerical variables
scatter_matrix(inputs, diagonal='kde', color=colors)
pyplot.show()

Running the example creates a figure showing the scatter plot matrix, with five plots by five plots, comparing each of the five numerical input variables with each other. The diagonal of the matrix shows the density distribution of each variable.

Each pairing appears twice, both above and below the top-left to bottom-right diagonal, providing two ways to review the same variable interactions.

We can see that the distributions for many variables do differ for the two class labels, suggesting that some reasonable discrimination between the classes will be feasible.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 5404/10 or about 540 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent to 30 percent nasal to oral vowels. Repetition indicates that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
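As a quick sanity check of the arithmetic above, the sketch below (using a small synthetic two-class array as a stand-in for the phoneme data) confirms that 10 folds with 3 repeats yields 30 train/test evaluations:

```python
# sketch: count the evaluations produced by repeated stratified k-fold CV
from numpy import zeros, array
from sklearn.model_selection import RepeatedStratifiedKFold

# synthetic placeholder data with a 70/30 class split, like the phoneme dataset
X = zeros((100, 5))
y = array([0] * 70 + [1] * 30)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
print(cv.get_n_splits(X, y))  # 30 evaluations: 10 folds x 3 repeats
```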

Class labels will be predicted and both class labels are equally important. Therefore, we will select a metric that quantifies the performance of a model on both classes separately.

You may remember that the sensitivity is a measure of the accuracy for the positive class and specificity is a measure of the accuracy of the negative class.

- Sensitivity = TruePositives / (TruePositives + FalseNegatives)
- Specificity = TrueNegatives / (TrueNegatives + FalsePositives)

The G-mean seeks a balance of these scores, the geometric mean, where poor performance for one or the other results in a low G-mean score.

- G-Mean = sqrt(Sensitivity * Specificity)
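The behavior of the metric can be checked with a few worked values (the numbers below are illustrative, not from the phoneme dataset): a classifier must do well on both classes to score well, and ignoring one class collapses the score to zero.

```python
# worked examples of the G-mean formula
from math import sqrt

def g_mean(sensitivity, specificity):
    # geometric mean of the two per-class accuracies
    return sqrt(sensitivity * specificity)

# reasonable accuracy on both classes gives a reasonable G-mean
print(g_mean(0.8, 0.9))  # about 0.849

# predicting only the majority class gives zero sensitivity, so G-mean is zero
print(g_mean(0.0, 1.0))  # 0.0

# random guessing achieves about 0.5 on each class, so G-mean is about 0.5
print(g_mean(0.5, 0.5))  # 0.5
```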

We can calculate the G-mean for a set of predictions made by a model using the geometric_mean_score() function provided by the imbalanced-learn library.

We can define a function to load the dataset and split the columns into input and output variables. The *load_dataset()* function below implements this.

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

We can then define a function that will evaluate a given model on the dataset and return a list of G-Mean scores for each fold and repeat. The *evaluate_model()* function below implements this, taking the dataset and model as arguments and returning the list of scores.

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(geometric_mean_score)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the majority class label (0) or the minority class label (1) for all cases will result in a G-mean of zero. As such, a good default strategy would be to randomly predict one class label or another with a 50 percent probability and aim for a G-mean of about 0.5.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “*strategy*” argument to ‘*uniform*‘.

...
# define the reference model
model = DummyClassifier(strategy='uniform')

Once the model is evaluated, we can report the mean and standard deviation of the G-mean scores directly.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean G-Mean: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.

# test harness and baseline model evaluation
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(geometric_mean_score)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define the location of the dataset
full_path = 'phoneme.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='uniform')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean G-Mean: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded and that we have five audio-derived input variables.

Next, the average of the G-Mean scores is reported.

In this case, we can see that the baseline algorithm achieves a G-Mean of about 0.509, close to the theoretical expectation of 0.5. This score provides a lower limit on model skill; any model that achieves an average G-Mean above about 0.509 (or really above 0.5) has skill, whereas models that achieve a score below this value do not have skill on this dataset.

(5404, 5) (5404,) Counter({0.0: 3818, 1.0: 1586})
Mean G-Mean: 0.509 (0.020)

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better G-mean performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the phoneme dataset:

- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 1,000.

The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='lbfgs'))
	names.append('LR')
	# SVM
	models.append(SVC(gamma='scale'))
	names.append('SVM')
	# Bagging
	models.append(BaggingClassifier(n_estimators=1000))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=1000))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=1000))
	names.append('ET')
	return models, names

We can then enumerate the list of models in turn and evaluate each, reporting the mean G-Mean and storing the scores for later plotting.

...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize and store
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the phoneme dataset is listed below.

# spot check machine learning algorithms on the phoneme dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(geometric_mean_score)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define models to test
def get_models():
	models, names = list(), list()
	# LR
	models.append(LogisticRegression(solver='lbfgs'))
	names.append('LR')
	# SVM
	models.append(SVC(gamma='scale'))
	names.append('SVM')
	# Bagging
	models.append(BaggingClassifier(n_estimators=1000))
	names.append('BAG')
	# RF
	models.append(RandomForestClassifier(n_estimators=1000))
	names.append('RF')
	# ET
	models.append(ExtraTreesClassifier(n_estimators=1000))
	names.append('ET')
	return models, names

# define the location of the dataset
full_path = 'phoneme.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# evaluate the model and store results
	scores = evaluate_model(X, y, models[i])
	results.append(scores)
	# summarize and store
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation G-Mean.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving a G-Mean above the default of 0.5. The results suggest that the ensembles of decision tree algorithms perform better on this dataset, with perhaps Extra Trees (ET) performing the best with a G-Mean of about 0.896.

>LR 0.637 (0.023)
>SVM 0.801 (0.022)
>BAG 0.888 (0.017)
>RF 0.892 (0.018)
>ET 0.896 (0.017)

We can see that all three tree ensemble algorithms (BAG, RF, and ET) have a tight distribution and a mean and median that closely align, perhaps suggesting a non-skewed and Gaussian distribution of scores, e.g. stable.

Now that we have a good first set of results, let’s see if we can improve them with data oversampling methods.

Data sampling provides a way to better prepare the imbalanced training dataset prior to fitting a model.

The simplest oversampling technique is to duplicate examples in the minority class, called random oversampling. Perhaps the most popular oversampling method is the SMOTE oversampling technique for creating new synthetic examples for the minority class.
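As a rough illustration of what random oversampling does (a simplified pure-Python sketch, not the imbalanced-learn implementation), minority-class rows are duplicated at random until the classes are balanced; the helper name `random_oversample` and the toy data are made up for this example:

```python
# simplified sketch of random oversampling: duplicate minority rows
from collections import Counter
from random import Random

def random_oversample(X, y, seed=1):
    # find the majority and minority classes and their counts
    (maj, n_maj), (mino, n_min) = Counter(y).most_common()
    rng = Random(seed)
    minority_rows = [x for x, label in zip(X, y) if label == mino]
    X_new, y_new = list(X), list(y)
    # append randomly chosen minority rows until the classes are balanced
    for _ in range(n_maj - n_min):
        X_new.append(rng.choice(minority_rows))
        y_new.append(mino)
    return X_new, y_new

X = [[i] for i in range(10)]
y = [0] * 7 + [1] * 3  # a 70/30 imbalance, like the phoneme data
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))  # classes are now balanced at 7 and 7
```

SMOTE and its variants go further: instead of duplicating rows, they synthesize new minority examples by interpolating between a minority row and its nearest minority-class neighbors.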

We will test five different oversampling methods; specifically:

- Random Oversampling (ROS)
- SMOTE (SMOTE)
- BorderLine SMOTE (BLSMOTE)
- SVM SMOTE (SVMSMOTE)
- ADASYN (ADASYN)

Each technique will be tested with the best performing algorithm from the previous section, specifically Extra Trees.

We will use the default hyperparameters for each oversampling algorithm, which will oversample the minority class to have the same number of examples as the majority class in the training dataset.

The expectation is that each oversampling technique will result in a lift in performance compared to the algorithm without oversampling with the smallest lift provided by Random Oversampling and perhaps the best lift provided by SMOTE or one of its variations.

We can update the *get_models()* function to return lists of oversampling algorithms to evaluate; for example:

# define oversampling models to test
def get_models():
	models, names = list(), list()
	# RandomOverSampler
	models.append(RandomOverSampler())
	names.append('ROS')
	# SMOTE
	models.append(SMOTE())
	names.append('SMOTE')
	# BorderlineSMOTE
	models.append(BorderlineSMOTE())
	names.append('BLSMOTE')
	# SVMSMOTE
	models.append(SVMSMOTE())
	names.append('SVMSMOTE')
	# ADASYN
	models.append(ADASYN())
	names.append('ADASYN')
	return models, names

We can then enumerate each and create a Pipeline from the imbalanced-learn library that is aware of how to oversample a training dataset. This will ensure that the training dataset within the cross-validation model evaluation is sampled correctly, without data leakage that could result in an optimistic evaluation of model performance.

First, we will normalize the input variables because most oversampling techniques will make use of a nearest neighbor algorithm and it is important that all variables have the same scale when using this technique. This will be followed by a given oversampling algorithm, then ending with the Extra Trees algorithm that will be fit on the oversampled training dataset.
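Why scaling matters for neighbor-based oversamplers can be seen in a small sketch with synthetic numbers (not phoneme features): before scaling, the feature with the largest range dominates the Euclidean distance, so the "nearest" neighbor can change once the features are put on the same scale.

```python
# sketch: scaling can change which point is the nearest neighbor
from math import dist  # Euclidean distance, Python 3.8+

# three synthetic points; the second feature has a much larger range
a, b, c = [0.0, 0.0], [1.0, 50.0], [0.05, 200.0]

def minmax(p):
    # scale each feature by its range across a, b and c (1.0 and 200.0)
    return [p[0] / 1.0, p[1] / 200.0]

# unscaled: the wide-range second feature dominates, so b looks nearest to a
print(dist(a, b) < dist(a, c))  # True

# scaled: both features count equally, and c becomes the nearest neighbor
print(dist(minmax(a), minmax(b)) < dist(minmax(a), minmax(c)))  # False
```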

...
# define the model
model = ExtraTreesClassifier(n_estimators=1000)
# define the pipeline steps
steps = [('s', MinMaxScaler()), ('o', models[i]), ('m', model)]
# define the pipeline
pipeline = Pipeline(steps=steps)
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)

Tying this together, the complete example of evaluating oversampling algorithms with Extra Trees on the phoneme dataset is listed below.

# data oversampling algorithms on the phoneme imbalanced dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from imblearn.metrics import geometric_mean_score
from sklearn.metrics import make_scorer
from sklearn.ensemble import ExtraTreesClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.over_sampling import SVMSMOTE
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
	# load the dataset as a numpy array
	data = read_csv(full_path, header=None)
	# retrieve numpy array
	data = data.values
	# split into input and output elements
	X, y = data[:, :-1], data[:, -1]
	return X, y

# evaluate a model
def evaluate_model(X, y, model):
	# define evaluation procedure
	cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
	# define the model evaluation metric
	metric = make_scorer(geometric_mean_score)
	# evaluate model
	scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
	return scores

# define oversampling models to test
def get_models():
	models, names = list(), list()
	# RandomOverSampler
	models.append(RandomOverSampler())
	names.append('ROS')
	# SMOTE
	models.append(SMOTE())
	names.append('SMOTE')
	# BorderlineSMOTE
	models.append(BorderlineSMOTE())
	names.append('BLSMOTE')
	# SVMSMOTE
	models.append(SVMSMOTE())
	names.append('SVMSMOTE')
	# ADASYN
	models.append(ADASYN())
	names.append('ADASYN')
	return models, names

# define the location of the dataset
full_path = 'phoneme.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
	# define the model
	model = ExtraTreesClassifier(n_estimators=1000)
	# define the pipeline steps
	steps = [('s', MinMaxScaler()), ('o', models[i]), ('m', model)]
	# define the pipeline
	pipeline = Pipeline(steps=steps)
	# evaluate the model and store results
	scores = evaluate_model(X, y, pipeline)
	results.append(scores)
	# summarize and store
	print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each oversampling method with the Extra Trees model on the dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, as we expected, each oversampling technique resulted in a lift in performance over the ET algorithm without any oversampling (0.896), except for the random oversampling technique.

The results suggest that the modified versions of SMOTE and ADASYN performed better than default SMOTE, and in this case, ADASYN achieved the best G-Mean score of 0.910.

```
>ROS 0.894 (0.018)
>SMOTE 0.906 (0.015)
>BLSMOTE 0.909 (0.013)
>SVMSMOTE 0.909 (0.014)
>ADASYN 0.910 (0.013)
```

The distribution of results can be compared with box and whisker plots.

We can see that the results all have a roughly similar, tight distribution and that the difference in the means of the results can be used to select a model.

Next, let’s see how we might use a final model to make predictions on new data.

In this section, we will fit a final model and use it to make predictions on single rows of data.

We will use the ADASYN oversampled version of the Extra Trees model as the final model and a normalization scaling on the data prior to fitting the model and making a prediction. Using the pipeline will ensure that the transform is always performed correctly.

First, we can define the model as a pipeline.

```python
...
# define the model
model = ExtraTreesClassifier(n_estimators=1000)
# define the pipeline steps
steps = [('s', MinMaxScaler()), ('o', ADASYN()), ('m', model)]
# define the pipeline
pipeline = Pipeline(steps=steps)
```

Once defined, we can fit it on the entire training dataset.

```python
...
# fit the model
pipeline.fit(X, y)
```

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the class label of 0 for “*nasal*”, or 1 for “*oral*”.

For example:

```python
...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])
```

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know if the case is nasal or oral.

The complete example is listed below.

```python
# fit a model and make predictions on the phoneme dataset
from pandas import read_csv
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import ADASYN
from sklearn.ensemble import ExtraTreesClassifier
from imblearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    return X, y

# define the location of the dataset
full_path = 'phoneme.csv'
# load the dataset
X, y = load_dataset(full_path)
# define the model
model = ExtraTreesClassifier(n_estimators=1000)
# define the pipeline steps
steps = [('s', MinMaxScaler()), ('o', ADASYN()), ('m', model)]
# define the pipeline
pipeline = Pipeline(steps=steps)
# fit the model
pipeline.fit(X, y)
# evaluate on some nasal cases (known class 0)
print('Nasal:')
data = [[1.24,0.875,-0.205,-0.078,0.067],
    [0.268,1.352,1.035,-0.332,0.217],
    [1.567,0.867,1.3,1.041,0.559]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 0)' % (label))
# evaluate on some oral cases (known class 1)
print('Oral:')
data = [[0.125,0.548,0.795,0.836,0.0],
    [0.318,0.811,0.818,0.821,0.86],
    [0.151,0.642,1.454,1.281,-0.716]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 1)' % (label))
```

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label of nasal cases chosen from the dataset file. We can see that all cases are correctly predicted.

Then some oral cases are used as input to the model and the label is predicted. As we might have hoped, the correct labels are predicted for all cases.

```
Nasal:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
Oral:
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
```

This section provides more resources on the topic if you are looking to go deeper.

- sklearn.model_selection.RepeatedStratifiedKFold API.
- sklearn.dummy.DummyClassifier API.
- imblearn.metrics.geometric_mean_score API.

- Phoneme Dataset
- Phoneme Dataset Description
- Phoneme Dataset on KEEL
- Phoneme Dataset on the ELENA Project

In this tutorial, you discovered how to develop and evaluate models for imbalanced binary classification of nasal and oral phonemes.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with data oversampling techniques.
- How to fit a final model and use it to predict class labels for specific cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Predictive Model for the Phoneme Imbalanced Classification Dataset appeared first on Machine Learning Mastery.

The post Imbalanced Classification Model to Detect Mammography Microcalcifications appeared first on Machine Learning Mastery.

A standard imbalanced classification dataset is the mammography dataset that involves detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram. This dataset was constructed by scanning the images, segmenting them into candidate objects, and using computer vision techniques to describe each candidate object.

It is a popular dataset for imbalanced classification because of the severe class imbalance, specifically where 98 percent of candidate microcalcifications are not cancer and only 2 percent were labeled as cancer by an experienced radiographer.

In this tutorial, you will discover how to develop and evaluate models for the imbalanced mammography cancer classification dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with data cost-sensitive techniques.
- How to fit a final model and use it to predict class labels for specific cases.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Mammography Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
- Evaluate Machine Learning Algorithms
- Evaluate Cost-Sensitive Algorithms

- Make Predictions on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “*mammography*” dataset or sometimes “*Woods Mammography*.”

The dataset is credited to Kevin Woods, et al. and the 1993 paper titled “Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography.”

The focus of the problem is on detecting breast cancer from radiological scans, specifically the presence of clusters of microcalcifications that appear bright on a mammogram.

The dataset started with 24 mammograms with a known cancer diagnosis that were scanned. The images were then pre-processed using image segmentation computer vision algorithms to extract candidate objects from the mammogram images. Once segmented, the objects were then manually labeled by an experienced radiologist.

A total of 29 features were extracted from the segmented objects thought to be most relevant to pattern recognition, which was reduced to 18, then finally to seven, as follows (taken directly from the paper):

- Area of object (in pixels).
- Average gray level of the object.
- Gradient strength of the object’s perimeter pixels.
- Root mean square noise fluctuation in the object.
- Contrast, average gray level of the object minus the average of a two-pixel wide border surrounding the object.
- A low order moment based on shape descriptor.

There are two classes and the goal is to distinguish between microcalcifications and non-microcalcifications using the features for a given segmented object.

- **Non-microcalcifications**: negative case, or majority class.
- **Microcalcifications**: positive case, or minority class.

A number of models were evaluated and compared in the original paper, such as neural networks, decision trees, and k-nearest neighbors. Models were evaluated using ROC Curves and compared using the area under ROC Curve, or ROC AUC for short.

ROC Curves and area under ROC Curves were chosen with the intent to minimize the false-positive rate (complement of the specificity) and maximize the true-positive rate (sensitivity), the two axes of the ROC Curve. The use of the ROC Curves also suggests the desire for a probabilistic model from which an operator can select a probability threshold as the cut-off between the acceptable false positive and true positive rates.
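As a minimal sketch of the metric itself (using toy labels and probabilities, not values from the mammography data), ROC AUC can be computed from predicted probabilities with scikit-learn’s *roc_auc_score()* function:

```python
from sklearn.metrics import roc_auc_score

# toy ground-truth labels and predicted probabilities (mock values only)
y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.3, 0.35, 0.8, 0.6]

# every positive is ranked above every negative, so the AUC is perfect
print(roc_auc_score(y_true, y_prob))  # → 1.0
```

Because the score is computed from probabilities rather than hard labels, a single model yields one AUC summarizing all possible thresholds.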

Their results suggested a “*linear classifier*” (seemingly a Gaussian Naive Bayes classifier) performed the best with a ROC AUC of 0.936 averaged over 100 runs.

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The Mammography dataset is a widely used standard machine learning dataset, used to explore and demonstrate many techniques designed specifically for imbalanced classification.

One example is the popular SMOTE data oversampling technique.

A version of this dataset was made available that has some differences to the dataset described in the original paper.

First, download the dataset and save it in your current working directory with the name “*mammography.csv*”.

Review the contents of the file.

The first few lines of the file should look as follows:

```
0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223,'-1'
0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324,'-1'
0.54608818,0.13141457,-0.45638679,-0.85955255,-0.37786573,-0.94572324,'-1'
-0.10298725,-0.3949941,-0.14081588,0.97970269,-0.37786573,1.0135658,'-1'
...
```

We can see that the dataset has six rather than seven input variables. It is possible that the first input variable listed in the paper (area in pixels) was removed from this version of the dataset.

The input variables are numerical (real-valued) and the target variable is the string with ‘-1’ for the majority class and ‘1’ for the minority class. These values will need to be encoded as 0 and 1 respectively to meet the expectations of classification algorithms on binary imbalanced classification problems.
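As a small sketch of the encoding (using the raw label strings as they appear in the file), scikit-learn’s *LabelEncoder* assigns integers to the sorted class values, which here maps ‘-1’ to 0 and ‘1’ to 1 as desired:

```python
from sklearn.preprocessing import LabelEncoder

# the raw target values as they appear in the CSV, quotes included
raw = ["'-1'", "'1'", "'-1'", "'-1'"]

# classes are sorted before encoding, so "'-1'" becomes 0 and "'1'" becomes 1
encoded = LabelEncoder().fit_transform(raw)
print(list(encoded))  # → [0, 1, 0, 0]
```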

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

```python
...
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
```

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

```python
...
# summarize the shape of the dataset
print(dataframe.shape)
```

We can also summarize the number of examples in each class using the Counter object.

Tying this together, the complete example of loading and summarizing the dataset is listed below.

```python
# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:,-1]
counter = Counter(target)
for k,v in counter.items():
    per = v / len(target) * 100
    print('Class=%s, Count=%d, Percentage=%.3f%%' % (k, v, per))
```

Running the example first loads the dataset and confirms the number of rows and columns, that is 11,183 rows and six input variables and one target variable.

The class distribution is then summarized, confirming the severe class imbalance with approximately 98 percent for the majority class (no cancer) and approximately 2 percent for the minority class (cancer).

```
(11183, 7)
Class='-1', Count=10923, Percentage=97.675%
Class='1', Count=260, Percentage=2.325%
```

The dataset appears to generally match the dataset described in the SMOTE paper, specifically in terms of the ratio of negative to positive examples.

A typical mammography dataset might contain 98% normal pixels and 2% abnormal pixels.

— SMOTE: Synthetic Minority Over-sampling Technique, 2002.

Also, the specific number of examples in the minority and majority classes also matches the paper.

The experiments were conducted on the mammography dataset. There were 10923 examples in the majority class and 260 examples in the minority class originally.

— SMOTE: Synthetic Minority Over-sampling Technique, 2002.

I believe this is the same dataset, although I cannot explain the mismatch in the number of input features, e.g. six compared to seven in the original paper.

We can also take a look at the distribution of the six numerical input variables by creating a histogram for each.

The complete example is listed below.

```python
# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# histograms of all variables
df.hist()
pyplot.show()
```

Running the example creates the figure with one histogram subplot for each of the six numerical input variables in the dataset.

We can see that the variables have differing scales and that most of the variables have an exponential distribution, e.g. most cases falling into one bin, and the rest falling into a long tail. The final variable appears to have a bimodal distribution.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.
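As a rough illustration of the effect (on mock exponential data, not the mammography features), a power transform such as scikit-learn’s *PowerTransformer* reshapes a long-tailed variable toward a more Gaussian, zero-mean, unit-variance distribution:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# mock long-tailed feature, similar in shape to the histograms described above
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=(1000, 1))

# yeo-johnson power transform; standardizes to zero mean, unit variance by default
xt = PowerTransformer(method='yeo-johnson').fit_transform(x)
print('mean=%.2f std=%.2f' % (xt.mean(), xt.std()))
```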

We can also create a scatter plot for each pair of input variables, called a scatter plot matrix.

This can be helpful to see if any variables relate to each other or change in the same direction, e.g. are correlated.

We can also color the dots of each scatter plot according to the class label. In this case, the majority class (no cancer) will be mapped to blue dots and the minority class (cancer) will be mapped to red dots.

The complete example is listed below.

```python
# create pairwise scatter plots of numeric input variables
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# define the dataset location
filename = 'mammography.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# define a mapping of class values to colors
color_dict = {"'-1'":'blue', "'1'":'red'}
# map each row to a color based on the class value
colors = [color_dict[str(x)] for x in df.values[:, -1]]
# pairwise scatter plots of all numerical variables
scatter_matrix(df, diagonal='kde', color=colors)
pyplot.show()
```

Running the example creates a figure showing the scatter plot matrix, with six plots by six plots, comparing each of the six numerical input variables with each other. The diagonal of the matrix shows the density distribution of each variable.

Each pairing appears twice both above and below the top-left to bottom-right diagonal, providing two ways to review the same variable interactions.

We can see that the distributions for many variables do differ for the two-class labels, suggesting that some reasonable discrimination between the cancer and no cancer cases will be feasible.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 11183/10 or about 1,118 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 98 percent to 2 percent no-cancer to cancer objects. Repetition indicates that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.

We will evaluate and compare models using the area under ROC Curve or ROC AUC calculated via the roc_auc_score() function.
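Tied together on a synthetic stand-in for the dataset (a *make_classification()* problem with a 98:2 split, used here purely for illustration), the harness looks as follows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# synthetic imbalanced data standing in for mammography.csv (assumption)
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.98, 0.02], random_state=1)

# 10-fold stratified cross-validation, repeated 3 times: 30 scores in total
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
print('%d folds, mean ROC AUC %.3f' % (len(scores), scores.mean()))
```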

We can define a function to load the dataset and split the columns into input and output variables. We will correctly encode the class labels as 0 and 1. The *load_dataset()* function below implements this.

```python
# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y
```

We can then define a function that will evaluate a given model on the dataset and return a list of ROC AUC scores for each fold and repeat.

The *evaluate_model()* function below implements this, taking the dataset and model as arguments and returning the list of scores.

```python
# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores
```

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts a random class in proportion to the base rate of each class will result in a ROC AUC of 0.5, the baseline in performance on this dataset. This is a so-called “no skill” classifier.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “*strategy*” argument to ‘*stratified*‘.

```python
...
# define the reference model
model = DummyClassifier(strategy='stratified')
```

Once the model is evaluated, we can report the mean and standard deviation of the ROC AUC scores directly.

```python
...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.

```python
# test harness and baseline model evaluation
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='stratified')
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean ROC AUC: %.3f (%.3f)' % (mean(scores), std(scores)))
```

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and that we have six computer vision derived input variables. Importantly, we can see that the class labels have the correct mapping to integers with 0 for the majority class and 1 for the minority class, customary for imbalanced binary classification datasets.

Next, the average of the ROC AUC scores is reported.

As expected, the no-skill classifier achieves the worst-case performance of a mean ROC AUC of approximately 0.5. This provides a baseline in performance, above which a model can be considered skillful on this dataset.

```
(11183, 6) (11183,) Counter({0: 10923, 1: 260})
Mean ROC AUC: 0.503 (0.016)
```

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better?** If you can achieve better ROC AUC performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the mammography dataset:

- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Bagged Decision Trees (BAG)
- Random Forest (RF)
- Gradient Boosting Machine (GBM)

The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

```python
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=1000))
    names.append('GBM')
    return models, names
```

We can then enumerate the list of models in turn and evaluate each, reporting the mean ROC AUC and storing the scores for later plotting.

```python
...
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
```

```python
...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the mammography dataset is listed below.

```python
# spot check machine learning algorithms on the mammography dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    # Bagging
    models.append(BaggingClassifier(n_estimators=1000))
    names.append('BAG')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    # GBM
    models.append(GradientBoostingClassifier(n_estimators=1000))
    names.append('GBM')
    return models, names

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # evaluate the model and store results
    scores = evaluate_model(X, y, models[i])
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Running the example evaluates each algorithm in turn and reports the mean and standard deviation ROC AUC.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving a ROC AUC above the default of 0.5.

The results suggest that the ensemble of decision tree algorithms performs better on this dataset with perhaps Random Forest performing the best, with a ROC AUC of about 0.950.

It is interesting to note that this is better than the ROC AUC described in the paper of 0.93, although we used a different model evaluation procedure.

The evaluation was a little unfair to the LR and SVM algorithms as we did not scale the input variables prior to fitting the model. We can explore this in the next section.

```
>LR 0.919 (0.040)
>SVM 0.880 (0.049)
>BAG 0.941 (0.041)
>RF 0.950 (0.036)
>GBM 0.918 (0.037)
```

We can see that both BAG and RF have a tight distribution and a mean and median that closely align, perhaps suggesting a non-skewed and Gaussian distribution of scores, e.g. stable.

Now that we have a good first set of results, let’s see if we can improve them with cost-sensitive classifiers.

Some machine learning algorithms can be adapted to pay more attention to one class than another when fitting the model.

These are referred to as cost-sensitive machine learning models and they can be used for imbalanced classification by specifying a cost that is inversely proportional to the class distribution. For example, with a 98 percent to 2 percent distribution for the majority and minority classes, we can specify to give errors on the minority class a weighting of 98 and errors for the majority class a weighting of 2.
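The inverse-proportion idea can be sketched directly (the helper *balanced_class_weights()* below is hypothetical, written to mirror the heuristic that scikit-learn uses for *class_weight='balanced'*):

```python
from collections import Counter

def balanced_class_weights(y):
    # mirrors scikit-learn's class_weight='balanced' heuristic:
    # weight = n_samples / (n_classes * class_count)
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# mock labels with the 98:2 split discussed above
y = [0] * 98 + [1] * 2
weights = balanced_class_weights(y)
print(weights)  # the minority-class weight is 49x the majority-class weight
```

The exact weight values differ from the 98-and-2 example above, but the ratio between the classes is the same, which is what matters to the learning algorithm.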

Three algorithms that offer this capability are:

- Logistic Regression (LR)
- Support Vector Machine (SVM)
- Random Forest (RF)

This can be achieved in scikit-learn by setting the “*class_weight*” argument to “*balanced*” to make these algorithms cost-sensitive.

For example, the updated *get_models()* function below defines the cost-sensitive version of these three algorithms to be evaluated on our dataset.

```python
# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale', class_weight='balanced'))
    names.append('SVM')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    return models, names
```

Additionally, when exploring the dataset, we noticed that many of the variables had a seemingly exponential data distribution. Sometimes we can better spread the data for a variable by using a power transform on each variable. This will be particularly helpful to the LR and SVM algorithm and may also help the RF algorithm.

We can implement this within each fold of the cross-validation model evaluation process using a Pipeline. The first step will learn the PowerTransformer on the training set folds and apply it to the training and test set folds. The second step will be the model that we are evaluating. The pipeline can then be evaluated directly using our *evaluate_model()* function, for example:

```python
...
# define pipeline steps
steps = [('p', PowerTransformer()), ('m', models[i])]
# define pipeline
pipeline = Pipeline(steps=steps)
# evaluate the pipeline and store results
scores = evaluate_model(X, y, pipeline)
```

Tying this together, the complete example of evaluating power transformed cost-sensitive machine learning algorithms on the mammography dataset is listed below.

```python
# cost-sensitive machine learning algorithms on the mammography dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='lbfgs', class_weight='balanced'))
    names.append('LR')
    # SVM
    models.append(SVC(gamma='scale', class_weight='balanced'))
    names.append('SVM')
    # RF
    models.append(RandomForestClassifier(n_estimators=1000))
    names.append('RF')
    return models, names

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # define pipeline steps
    steps = [('p', PowerTransformer()), ('m', models[i])]
    # define pipeline
    pipeline = Pipeline(steps=steps)
    # evaluate the pipeline and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()
```

Running the example evaluates each algorithm in turn and reports the mean and standard deviation ROC AUC.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that all three of the tested algorithms achieved a lift on ROC AUC compared to their non-transformed and cost-insensitive versions. It would be interesting to repeat the experiment without the transform to see if it was the transform or the cost-sensitive version of the algorithms, or both that resulted in the lifts in performance.

In this case, we can see the SVM achieved the best performance, performing better than RF in this and the previous section and achieving a mean ROC AUC of about 0.957.

```
>LR 0.922 (0.036)
>SVM 0.957 (0.024)
>RF 0.951 (0.035)
```

Box and whisker plots are then created comparing the distribution of ROC AUC scores.

The SVM distribution appears compact compared to the other two models. As such, its performance is likely stable and the model may make a good choice for a final model.

Next, let’s see how we might use a final model to make predictions on new data.

In this section, we will fit a final model and use it to make predictions on single rows of data.

We will use the cost-sensitive version of the SVM model as the final model and a power transform on the data prior to fitting the model and making a prediction. Using the pipeline will ensure that the transform is always performed correctly on input data.

First, we can define the model as a pipeline.

...
# define model to evaluate
model = SVC(gamma='scale', class_weight='balanced')
# power transform then fit model
pipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])

Once defined, we can fit it on the entire training dataset.

...
# fit the model
pipeline.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the class label of 0 for “*no cancer*”, or 1 for “*cancer*”.

For example:

...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])

To demonstrate this, we can use the fit model to predict labels for a few cases where we know whether the case is no cancer or cancer.

The complete example is listed below.

# fit a model and make predictions on the mammography dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    data = read_csv(full_path, header=None)
    # retrieve numpy array
    data = data.values
    # split into input and output elements
    X, y = data[:, :-1], data[:, -1]
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# define the location of the dataset
full_path = 'mammography.csv'
# load the dataset
X, y = load_dataset(full_path)
# define model to evaluate
model = SVC(gamma='scale', class_weight='balanced')
# power transform then fit model
pipeline = Pipeline(steps=[('t', PowerTransformer()), ('m', model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some no cancer cases (known class 0)
print('No Cancer:')
data = [[0.23001961,5.0725783,-0.27606055,0.83244412,-0.37786573,0.4803223],
    [0.15549112,-0.16939038,0.67065219,-0.85955255,-0.37786573,-0.94572324],
    [-0.78441482,-0.44365372,5.6747053,-0.85955255,-0.37786573,-0.94572324]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 0)' % (label))
# evaluate on some cancer cases (known class 1)
print('Cancer:')
data = [[2.0158239,0.15353258,-0.32114211,2.1923706,-0.37786573,0.96176503],
    [2.3191888,0.72860087,-0.50146835,-0.85955255,-0.37786573,-0.94572324],
    [0.19224721,-0.2003556,-0.230979,1.2003796,2.2620867,1.132403]]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 1)' % (label))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the labels for no cancer cases chosen from the dataset file. We can see that all cases are correctly predicted.

Then some cases of actual cancer are used as input to the model and the label is predicted. As we might have hoped, the correct labels are predicted for all cases.

No Cancer:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
Cancer:
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)

This section provides more resources on the topic if you are looking to go deeper.

- Comparative Evaluation Of Pattern Recognition Techniques For Detection Of Microcalcifications In Mammography, 1993.
- SMOTE: Synthetic Minority Over-sampling Technique, 2002.

- sklearn.model_selection.RepeatedStratifiedKFold API.
- sklearn.metrics.roc_auc_score API.
- sklearn.dummy.DummyClassifier API.
- sklearn.svm.SVC API.

In this tutorial, you discovered how to develop and evaluate models for the imbalanced mammography cancer classification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with cost-sensitive techniques.
- How to fit a final model and use it to predict class labels for specific cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Imbalanced Classification Model to Detect Mammography Microcalcifications appeared first on Machine Learning Mastery.

The post Develop a Model for the Imbalanced Classification of Good and Bad Credit appeared first on Machine Learning Mastery.

]]>One example is the problem of classifying bank customers as to whether they should receive a loan or not. Giving a loan to a bad customer marked as a good customer results in a greater cost to the bank than denying a loan to a good customer marked as a bad customer.

This requires careful selection of a performance metric that both promotes minimizing misclassification errors in general, and favors minimizing one type of misclassification error over another.

The **German credit dataset** is a standard imbalanced classification dataset that has this property of differing costs to misclassification errors. Models evaluated on this dataset can be evaluated using the Fbeta-Measure that provides a way of both quantifying model performance generally, and captures the requirement that one type of misclassification error is more costly than another.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced German credit classification dataset.

After completing this tutorial, you will know:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with data undersampling techniques.
- How to fit a final model and use it to predict class labels for specific cases.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Feb/2020**: Added section on further model improvements.
**Update Jan/2021**: Updated links for API documentation.

This tutorial is divided into five parts; they are:

- German Credit Dataset
- Explore the Dataset
- Model Test and Baseline Result
- Evaluate Models
  - Evaluate Machine Learning Algorithms
  - Evaluate Undersampling
  - Further Model Improvements
- Make Prediction on New Data

In this project, we will use a standard imbalanced machine learning dataset referred to as the “German Credit” dataset or simply “*German*.”

The dataset was used as part of the Statlog project, a European-based initiative in the 1990s to evaluate and compare a large number (at the time) of machine learning algorithms on a range of different classification tasks. The dataset is credited to Hans Hofmann.

The fragmentation amongst different disciplines has almost certainly hindered communication and progress. The StatLog project was designed to break down these divisions by selecting classification procedures regardless of historical pedigree, testing them on large-scale and commercially important problems, and hence to determine to what extent the various techniques met the needs of industry.

— Page 4, Machine Learning, Neural and Statistical Classification, 1994.

The German credit dataset describes financial and banking details for customers and the task is to determine whether the customer is good or bad. The assumption is that the task involves predicting whether a customer will pay back a loan or credit.

The dataset includes 1,000 examples and 20 input variables, 7 of which are numerical (integer) and 13 are categorical.

- Status of existing checking account
- Duration in month
- Credit history
- Purpose
- Credit amount
- Savings account
- Present employment since
- Installment rate in percentage of disposable income
- Personal status and sex
- Other debtors
- Present residence since
- Property
- Age in years
- Other installment plans
- Housing
- Number of existing credits at this bank
- Job
- Number of dependents
- Telephone
- Foreign worker

Some of the categorical variables have an ordinal relationship, such as “*Savings account*,” although most do not.

There are two classes, 1 for good customers and 2 for bad customers. Good customers are the default or negative class, whereas bad customers are the exception or positive class. A total of 70 percent of the examples are good customers, whereas the remaining 30 percent of examples are bad customers.

- **Good Customers**: Negative or majority class (70%).
- **Bad Customers**: Positive or minority class (30%).

A cost matrix is provided with the dataset that gives a different penalty to each misclassification error for the positive class. Specifically, a cost of five is applied to a false negative (marking a bad customer as good) and a cost of one is assigned for a false positive (marking a good customer as bad).

- **Cost for False Negative**: 5
- **Cost for False Positive**: 1

This suggests that the positive class is the focus of the prediction task and that it is more costly to the bank or financial institution to give money to a bad customer than to not give money to a good customer. This must be taken into account when selecting a performance metric.
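To make the cost matrix concrete, a sketch of how the total misclassification cost could be computed from predictions is shown below. The helper name *misclassification_cost* and the toy labels are made up for illustration; only the costs (5 for a false negative, 1 for a false positive) come from the dataset description.

```python
from sklearn.metrics import confusion_matrix

def misclassification_cost(y_true, y_pred, fn_cost=5, fp_cost=1):
    # confusion_matrix returns [[TN, FP], [FN, TP]] for labels (0, 1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # apply the dataset's penalty of 5 per false negative, 1 per false positive
    return fn * fn_cost + fp * fp_cost

# toy example: two false negatives and one false positive
y_true = [1, 1, 0, 0, 1]
y_pred = [0, 0, 1, 0, 1]
print(misclassification_cost(y_true, y_pred))  # 2*5 + 1*1 = 11
```

A metric like this could be used to compare models directly in units of cost, although the tutorial instead captures the asymmetry through the F2-Measure.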

Next, let’s take a closer look at the data.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

First, download the dataset and save it in your current working directory with the name “*german.csv*“.

Review the contents of the file.

The first few lines of the file should look as follows:

A11,6,A34,A43,1169,A65,A75,4,A93,A101,4,A121,67,A143,A152,2,A173,1,A192,A201,1
A12,48,A32,A43,5951,A61,A73,2,A92,A101,2,A121,22,A143,A152,1,A173,1,A191,A201,2
A14,12,A34,A46,2096,A61,A74,2,A93,A101,3,A121,49,A143,A152,1,A172,2,A191,A201,1
A11,42,A32,A42,7882,A61,A74,2,A93,A103,4,A122,45,A143,A153,1,A173,2,A191,A201,1
A11,24,A33,A40,4870,A61,A73,3,A93,A101,4,A124,53,A143,A153,2,A173,2,A191,A201,2
...

We can see that the categorical columns are encoded with an *Axxx* format, where “*x*” are integers for different labels. A one-hot encoding of the categorical variables will be required.

We can also see that the numerical variables have different scales, e.g. 6, 48, and 12 in column 2, and 1169, 5951, etc. in column 5. This suggests that scaling of the integer columns will be needed for those algorithms that are sensitive to scale.
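As a small illustration of the scaling issue, the snippet below normalizes a toy column using values in the range seen in the credit amount variable; the specific values are taken from the sample rows above, purely for illustration.

```python
from numpy import array
from sklearn.preprocessing import MinMaxScaler

# toy column with the wide range seen in the credit amount variable
data = array([[1169.0], [5951.0], [2096.0]])
# rescale to the range [0, 1]: (x - min) / (max - min)
scaled = MinMaxScaler().fit_transform(data)
print(scaled.ravel())
```

The minimum maps to 0.0 and the maximum to 1.0, putting the column on the same scale as the other numerical inputs.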

The target variable or class is the last column and contains values of 1 and 2. These will need to be label encoded to 0 and 1, respectively, to meet the general expectation for imbalanced binary classification tasks where 0 represents the negative case and 1 represents the positive case.
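The mapping performed by the LabelEncoder can be seen on a toy target with the dataset's original 1 (good) and 2 (bad) labels:

```python
from sklearn.preprocessing import LabelEncoder

# toy target using the dataset's original labels: 1 (good), 2 (bad)
y = [1, 2, 1, 1, 2]
encoder = LabelEncoder()
# classes are mapped in sorted order: 1 -> 0, 2 -> 1
y_enc = encoder.fit_transform(y)
print(y_enc)
print(encoder.classes_)
```

This gives the expected 0 for the negative (good) class and 1 for the positive (bad) class.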

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

...
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

...
# summarize the shape of the dataset
print(dataframe.shape)

We can also summarize the number of examples in each class using the Counter object.

...
# summarize the class distribution
target = dataframe.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Tying this together, the complete example of loading and summarizing the dataset is listed below.

# load and summarize the dataset
from pandas import read_csv
from collections import Counter
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
dataframe = read_csv(filename, header=None)
# summarize the shape of the dataset
print(dataframe.shape)
# summarize the class distribution
target = dataframe.values[:, -1]
counter = Counter(target)
for k, v in counter.items():
    per = v / len(target) * 100
    print('Class=%d, Count=%d, Percentage=%.3f%%' % (k, v, per))

Running the example first loads the dataset and confirms the number of rows and columns, that is, 1,000 rows with 20 input variables and 1 target variable.

The class distribution is then summarized, confirming the number of good and bad customers and the percentage of cases in the minority and majority classes.

(1000, 21)
Class=1, Count=700, Percentage=70.000%
Class=2, Count=300, Percentage=30.000%

We can also take a look at the distribution of the seven numerical input variables by creating a histogram for each.

First, we can select the columns with numeric variables by calling the select_dtypes() function on the DataFrame. We can then select just those columns from the DataFrame. We would expect there to be seven, plus the numerical class label.

...
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]

We can then create histograms of each numeric input variable. The complete example is listed below.

# create histograms of numeric input variables
from pandas import read_csv
from matplotlib import pyplot
# define the dataset location
filename = 'german.csv'
# load the csv file as a data frame
df = read_csv(filename, header=None)
# select columns with numerical data types
num_ix = df.select_dtypes(include=['int64', 'float64']).columns
# select a subset of the dataframe with the chosen columns
subset = df[num_ix]
# create a histogram plot of each numeric variable
ax = subset.hist()
# disable axis labels to avoid the clutter
for axis in ax.flatten():
    axis.set_xticklabels([])
    axis.set_yticklabels([])
# show the plot
pyplot.show()

Running the example creates the figure with one histogram subplot for each of the seven input variables and one class label in the dataset. The title of each subplot indicates the column number in the DataFrame (e.g. zero-offset from 0 to 20).

We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 1000/10 or 100 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 70 percent to 30 percent good to bad customers. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
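The stratification and repetition described above can be checked on a toy target with the same 70/30 class mixture as the dataset; the placeholder features here are made up, since only the target drives stratification.

```python
from collections import Counter
from numpy import hstack, ones, zeros
from sklearn.model_selection import RepeatedStratifiedKFold

# toy imbalanced target: 70 negative, 30 positive, like the German credit data
y = hstack((zeros(70), ones(30)))
X = zeros((100, 1))  # placeholder features
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
splits = list(cv.split(X, y))
# 10 folds x 3 repeats = 30 train/test evaluations
print(len(splits))
# each held-out fold of 10 examples preserves the 70/30 mixture (7 vs 3)
train_ix, test_ix = splits[0]
print(Counter(y[test_ix]))
```

This confirms that every model evaluated with this procedure is scored 30 times, each on a fold with the same class proportions as the full dataset.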

We will predict class labels of whether a customer is good or not. Therefore, we need a measure that is appropriate for evaluating the predicted class labels.

The focus of the task is on the positive class (bad customers). Precision and recall are a good place to start. Maximizing precision will minimize the false positives and maximizing recall will minimize the false negatives in the predictions made by a model.

- Precision = TruePositives / (TruePositives + FalsePositives)
- Recall = TruePositives / (TruePositives + FalseNegatives)

The F-Measure calculates the harmonic mean of precision and recall. This is a good single number that can be used to compare and select a model on this problem. The issue is that false negatives are more damaging than false positives.

- F-Measure = (2 * Precision * Recall) / (Precision + Recall)

Remember that false negatives on this dataset are cases of a bad customer being marked as a good customer and being given a loan. False positives are cases of a good customer being marked as a bad customer and not being given a loan.

- **False Negative**: Bad Customer (class 1) predicted as a Good Customer (class 0).
- **False Positive**: Good Customer (class 0) predicted as a Bad Customer (class 1).

False negatives are more costly to the bank than false positives.

- Cost(False Negatives) > Cost(False Positives)

Put another way, we are interested in an F-measure that will summarize a model’s ability to minimize misclassification errors for the positive class, but we want to favor models that are better at minimizing false negatives over false positives.

This can be achieved by using a version of the F-measure that calculates a weighted harmonic mean of precision and recall but favors higher recall scores over precision scores. This is called the Fbeta-measure, a generalization of F-measure, where “*beta*” is a parameter that defines the weighting of the two scores.

- Fbeta-Measure = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)

A beta value of 2 will weight more attention on recall than precision and is referred to as the F2-measure.

- F2-Measure = ((1 + 2^2) * Precision * Recall) / (2^2 * Precision + Recall)

We will use this measure to evaluate models on the German credit dataset. This can be achieved using the fbeta_score() scikit-learn function.
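We can confirm that the *fbeta_score()* function matches the F2-Measure formula above on a toy set of predictions (the labels here are made up, purely to exercise the formula):

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# toy predictions, just to illustrate the formula
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
# F2-Measure from the formula; beta=2 weights recall over precision
f2_manual = (1 + 2**2) * p * r / (2**2 * p + r)
print(f2_manual, fbeta_score(y_true, y_pred, beta=2))
```

Both calculations agree, so the scikit-learn function can be used directly in the test harness.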

We can define a function to load the dataset and split the columns into input and output variables. We will one-hot encode the categorical variables and label encode the target variable. You might recall that a one-hot encoding replaces the categorical variable with one new column for each value of the variable and marks values with a 1 in the column for that value.
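The effect of a one-hot encoding can be seen on a toy column in the dataset's *Axxx* style (the specific values are illustrative only):

```python
from numpy import array
from sklearn.preprocessing import OneHotEncoder

# toy categorical column with three distinct values
data = array([['A11'], ['A12'], ['A14'], ['A11']])
encoder = OneHotEncoder()
# one new binary column per distinct value, a 1 marking that value
onehot = encoder.fit_transform(data).toarray()
print(encoder.categories_)
print(onehot)
```

One input column with three distinct values becomes three binary columns, which is why the number of input variables grows substantially after encoding.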

First, we must split the DataFrame into input and output variables.

...
# split into inputs and outputs
last_ix = len(dataframe.columns) - 1
X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]

Next, we need to select all input variables that are categorical, then apply a one-hot encoding and leave the numerical variables untouched.

This can be achieved using a ColumnTransformer and defining the transform as a OneHotEncoder applied only to the column indices for categorical variables.

...
# select categorical features
cat_ix = X.select_dtypes(include=['object', 'bool']).columns
# one hot encode cat features only
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
X = ct.fit_transform(X)

We can then label encode the target variable.

...
# label encode the target variable to have the classes 0 and 1
y = LabelEncoder().fit_transform(y)

The *load_dataset()* function below ties all of this together and loads and prepares the dataset for modeling.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    # one hot encode cat features only
    ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
    X = ct.fit_transform(X)
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

Next, we need a function that will evaluate a set of predictions using the *fbeta_score()* function with *beta* set to 2.

# calculate f2 score
def f2(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

We can then define a function that will evaluate a given model on the dataset and return a list of F2-Measure scores for each fold and repeat.

The *evaluate_model()* function below implements this, taking the dataset and model as arguments and returning the list of scores.

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the minority class for all examples will achieve a maximum recall score and a baseline precision score. This provides a baseline in model performance on this problem by which all other models can be compared.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “*strategy*” argument to “*constant*” and the “*constant*” argument to “*1*” for the minority class.
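The behavior of this baseline can be sketched on a toy dataset (the feature values are placeholders; the classifier ignores the inputs entirely and always predicts the constant class):

```python
from numpy import array
from sklearn.dummy import DummyClassifier

# toy data: features are ignored, the model always predicts class 1
X = array([[0.0], [1.0], [2.0]])
y = array([0, 0, 1])
model = DummyClassifier(strategy='constant', constant=1)
model.fit(X, y)
print(model.predict(X))
```

Predicting 1 for every example gives perfect recall and a precision equal to the positive class rate, which is what makes it a useful floor for the F2-Measure.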

...
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)

Once the model is evaluated, we can report the mean and standard deviation of the F2-Measure scores directly.

...
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))

Tying this together, the complete example of loading the German Credit dataset, evaluating a baseline model, and reporting the performance is listed below.

# test harness and baseline model evaluation for the german credit dataset
from collections import Counter
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.dummy import DummyClassifier

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    # one hot encode cat features only
    ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
    X = ct.fit_transform(X)
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y

# calculate f2 score
def f2(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y = load_dataset(full_path)
# summarize the loaded dataset
print(X.shape, y.shape, Counter(y))
# define the reference model
model = DummyClassifier(strategy='constant', constant=1)
# evaluate the model
scores = evaluate_model(X, y, model)
# summarize performance
print('Mean F2: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and through the one-hot encoding of the categorical input variables, we have increased the number of input variables from 20 to 61. That suggests that the 13 categorical variables were encoded into a total of 54 columns.

Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.

Next, the average of the F2-Measure scores is reported.

In this case, we can see that the baseline algorithm achieves an F2-Measure of about 0.682. This score provides a lower limit on model skill; any model that achieves an average F2-Measure above about 0.682 has skill, whereas models that achieve a score below this value do not have skill on this dataset.

(1000, 61) (1000,) Counter({0: 700, 1: 300})
Mean F2: 0.682 (0.000)
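The baseline score can be verified analytically: predicting class 1 for every example gives a recall of 1.0 and a precision equal to the positive class rate of 0.3, so the F2-Measure follows directly from the formula.

```python
# baseline: predicting class 1 for every example
# recall = 1.0, precision = positive class rate = 300/1000 = 0.3
precision, recall = 0.3, 1.0
f2 = (1 + 2**2) * precision * recall / (2**2 * precision + recall)
print('%.3f' % f2)  # 0.682
```

This matches the reported mean F2 of about 0.682, and the zero standard deviation follows because every stratified fold has the same class mixture.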

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

**Can you do better? **If you can achieve better F2-Measure performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Let’s start by evaluating a mixture of probabilistic machine learning models on the dataset.

It can be a good idea to spot check a suite of different linear and nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention, and what doesn’t.

We will evaluate the following machine learning models on the German credit dataset:

- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- Naive Bayes (NB)
- Gaussian Process Classifier (GPC)
- Support Vector Machine (SVM)

We will use mostly default model hyperparameters.

The *get_models()* function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='liblinear'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # NB
    models.append(GaussianNB())
    names.append('NB')
    # GPC
    models.append(GaussianProcessClassifier())
    names.append('GPC')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    return models, names

We will one-hot encode the categorical input variables as we did in the previous section, and in this case, we will normalize the numerical input variables. This is best performed using the MinMaxScaler within each fold of the cross-validation evaluation process.

An easy way to implement this is to use a Pipeline where the first step is a ColumnTransformer that applies a OneHotEncoder to just the categorical variables, and a *MinMaxScaler* to just the numerical input variables. To achieve this, we need a list of the column indices for categorical and numerical input variables.

We can update the *load_dataset()* to return the column indexes as well as the input and output elements of the dataset. The updated version of this function is listed below.

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

We can then call this function to get the data and the list of categorical and numerical variables.

...
# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)

This can be used to prepare a *Pipeline* to wrap each model prior to evaluating it.

First, the *ColumnTransformer* is defined, which specifies what transform to apply to each type of column, then this is used as the first step in a Pipeline that ends with the specific model that will be fit and evaluated.

...
# evaluate each model
for i in range(len(models)):
    # one hot encode categorical, normalize numerical
    ct = ColumnTransformer([('c', OneHotEncoder(), cat_ix), ('n', MinMaxScaler(), num_ix)])
    # wrap the model in a pipeline
    pipeline = Pipeline(steps=[('t', ct), ('m', models[i])])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)

We can summarize the mean F2-Measure for each algorithm; this will help to directly compare algorithms.

...
# summarize and store
print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))

At the end of the run, we will create a separate box and whisker plot for each algorithm’s sample of results.

These plots will use the same y-axis scale so we can compare the distribution of results directly.

...
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Tying this all together, the complete example of evaluating a suite of machine learning algorithms on the German credit dataset is listed below.

# spot check machine learning algorithms on the german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from matplotlib import pyplot
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2_measure)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define models to test
def get_models():
    models, names = list(), list()
    # LR
    models.append(LogisticRegression(solver='liblinear'))
    names.append('LR')
    # LDA
    models.append(LinearDiscriminantAnalysis())
    names.append('LDA')
    # NB
    models.append(GaussianNB())
    names.append('NB')
    # GPC
    models.append(GaussianProcessClassifier())
    names.append('GPC')
    # SVM
    models.append(SVC(gamma='scale'))
    names.append('SVM')
    return models, names

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # one hot encode categorical, normalize numerical
    ct = ColumnTransformer([('c', OneHotEncoder(), cat_ix), ('n', MinMaxScaler(), num_ix)])
    # wrap the model in a pipeline
    pipeline = Pipeline(steps=[('t', ct), ('m', models[i])])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates each algorithm in turn and reports the mean and standard deviation F2-Measure.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that none of the tested models has an F2-measure above the default of predicting the minority class in all cases (0.682). None of the models are skillful. This is surprising, although it suggests that perhaps the decision boundary between the two classes is noisy.
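This baseline can be reproduced directly from the definition of the F-beta measure. The sketch below is a pure-Python illustration using the German credit dataset's known split of 700 good and 300 bad customers; it scores the no-skill strategy of predicting the minority class (class 1, "bad customer") for every example:

```python
# compute the no-skill baseline F2-measure for the german credit dataset
# by predicting the minority class (class 1, 'bad customer') everywhere

def fbeta(precision, recall, beta):
    # general F-beta: weighted harmonic mean of precision and recall
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# the dataset contains 700 good customers (class 0) and 300 bad (class 1)
n_good, n_bad = 700, 300
# predicting class 1 everywhere finds every bad customer (recall=1.0),
# but only 300 of 1000 positive predictions are correct (precision=0.3)
precision = n_bad / (n_bad + n_good)
recall = 1.0
baseline_f2 = fbeta(precision, recall, beta=2)
print('Baseline F2: %.3f' % baseline_f2)  # matches the 0.682 quoted above
```

Because beta=2 weights recall four times as heavily as precision, a strategy with perfect recall but poor precision still scores 0.682, which is why it makes a demanding baseline for this dataset.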

>LR 0.497 (0.072)
>LDA 0.519 (0.072)
>NB 0.639 (0.049)
>GPC 0.219 (0.061)
>SVM 0.436 (0.077)

Now that we have some results, let’s see if we can improve them with some undersampling.

Undersampling is perhaps the least widely used technique when addressing an imbalanced classification task, as most of the focus is put on oversampling the minority class with SMOTE.

Undersampling can help to remove examples from the majority class along the decision boundary that make the problem challenging for classification algorithms.

In this experiment we will test the following undersampling algorithms:

- Tomek Links (TL)
- Edited Nearest Neighbors (ENN)
- Repeated Edited Nearest Neighbors (RENN)
- One Sided Selection (OSS)
- Neighborhood Cleaning Rule (NCR)

The Tomek Links and ENN methods select examples from the majority class to delete, whereas OSS and NCR both select examples to keep and examples to delete. We will use the balanced version of the logistic regression algorithm to test each undersampling method, to keep things simple.
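To make the keep/delete distinction concrete, the core of the Tomek Links method can be sketched in a few lines of pure Python (the toy points and class labels below are invented for illustration; on real data the imbalanced-learn TomekLinks class used later does this for you). A Tomek link is a pair of opposite-class examples that are each other's nearest neighbor, and undersampling deletes the majority-class member of each pair:

```python
# minimal sketch of Tomek Link detection on a toy 2-class dataset
# (illustrative points only; use imblearn's TomekLinks on real data)

def squared_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_neighbor(X, i):
    # index of the closest point to X[i], excluding itself
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: squared_dist(X[i], X[j]))

def tomek_links(X, y):
    # pairs (i, j) that are mutual nearest neighbors with different labels
    links = []
    for i in range(len(X)):
        j = nearest_neighbor(X, i)
        if i < j and y[i] != y[j] and nearest_neighbor(X, j) == i:
            links.append((i, j))
    return links

X = [[0, 0], [1, 0], [1.2, 0], [5, 5], [6, 5], [10, 10]]
y = [0, 0, 1, 0, 0, 1]  # class 0 is the majority class here
links = tomek_links(X, y)
print(links)  # the border pair (1, 2) forms the only Tomek link
# undersampling deletes the majority-class member of each link
to_remove = [i if y[i] == 0 else j for i, j in links]
print(to_remove)
```

Only the pair straddling the class boundary qualifies; the well-separated clusters are untouched, which is exactly the "clean the decision boundary" behavior described above.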

The *get_models()* function from the previous section can be updated to return a list of undersampling techniques to test with the logistic regression algorithm. We use the implementations of these algorithms from the imbalanced-learn library.

The updated version of the *get_models()* function defining the undersampling methods is listed below.

# define undersampling models to test
def get_models():
    models, names = list(), list()
    # TL
    models.append(TomekLinks())
    names.append('TL')
    # ENN
    models.append(EditedNearestNeighbours())
    names.append('ENN')
    # RENN
    models.append(RepeatedEditedNearestNeighbours())
    names.append('RENN')
    # OSS
    models.append(OneSidedSelection())
    names.append('OSS')
    # NCR
    models.append(NeighbourhoodCleaningRule())
    names.append('NCR')
    return models, names

The Pipeline provided by scikit-learn does not know about undersampling algorithms. Therefore, we must use the Pipeline implementation provided by the imbalanced-learn library.

As in the previous section, the first step of the pipeline will be one hot encoding of categorical variables and normalization of numerical variables, and the final step will be fitting the model. Here, the middle step will be the undersampling technique, correctly applied within the cross-validation evaluation on the training dataset only.

...
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)

We would expect undersampling to result in a lift in skill for logistic regression, ideally above the baseline performance of predicting the minority class in all cases.

Tying this together, the complete example of evaluating logistic regression with different undersampling methods on the German credit dataset is listed below.

# evaluate undersampling with logistic regression on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from matplotlib import pyplot
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import TomekLinks
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.under_sampling import RepeatedEditedNearestNeighbours
from imblearn.under_sampling import NeighbourhoodCleaningRule
from imblearn.under_sampling import OneSidedSelection

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2_measure)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define undersampling models to test
def get_models():
    models, names = list(), list()
    # TL
    models.append(TomekLinks())
    names.append('TL')
    # ENN
    models.append(EditedNearestNeighbours())
    names.append('ENN')
    # RENN
    models.append(RepeatedEditedNearestNeighbours())
    names.append('RENN')
    # OSS
    models.append(OneSidedSelection())
    names.append('OSS')
    # NCR
    models.append(NeighbourhoodCleaningRule())
    names.append('NCR')
    return models, names

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define models
models, names = get_models()
results = list()
# evaluate each model
for i in range(len(models)):
    # define model to evaluate
    model = LogisticRegression(solver='liblinear', class_weight='balanced')
    # one hot encode categorical, normalize numerical
    ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
    # scale, then undersample, then fit model
    pipeline = Pipeline(steps=[('t',ct), ('s', models[i]), ('m',model)])
    # evaluate the model and store results
    scores = evaluate_model(X, y, pipeline)
    results.append(scores)
    # summarize and store
    print('>%s %.3f (%.3f)' % (names[i], mean(scores), std(scores)))
# plot the results
pyplot.boxplot(results, labels=names, showmeans=True)
pyplot.show()

Running the example evaluates the logistic regression algorithm with five different undersampling techniques.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that three of the five undersampling techniques resulted in an F2-measure that improves over the baseline of 0.682: specifically ENN, RENN, and NCR, with repeated edited nearest neighbors (RENN) achieving the best performance with an F2-measure of about 0.716.


>TL 0.669 (0.057)
>ENN 0.706 (0.048)
>RENN 0.714 (0.041)
>OSS 0.670 (0.054)
>NCR 0.693 (0.052)

Box and whisker plots are created for each evaluated undersampling technique, showing that they generally have the same spread.

It is encouraging to see that for the well-performing methods, the boxes spread up to around 0.8, and the mean and median for all three methods are around 0.7. This highlights that the distributions are skewed high and are let down on occasion by a few bad evaluations.

Next, let’s see if we can lift performance further with specific model configurations, before using a final model to make predictions on new data.

This is a new section that provides a minor departure from the sections above. Here, we will test specific models that result in a further lift in F2-measure performance, and I will update this section as new models are reported or discovered.

An F2-measure of about **0.727** can be achieved using balanced Logistic Regression with InstanceHardnessThreshold undersampling.

The complete example is listed below.

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import InstanceHardnessThreshold

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2_measure)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# define the data sampling
sampling = InstanceHardnessThreshold()
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example gives the following results.

0.727 (0.033)

An F2-measure of about **0.730** can be achieved using LDA with SMOTEENN, where the ENN parameter is set to an ENN instance with sampling_strategy set to majority.

The complete example is listed below.

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2_measure)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LinearDiscriminantAnalysis()
# define the data sampling
sampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example gives the following results.

0.730 (0.046)

An F2-measure of about **0.741** can be achieved with further improvements to the SMOTEENN using a RidgeClassifier instead of LDA and using a StandardScaler for the numeric inputs instead of a MinMaxScaler.

The complete example is listed below.

# improve performance on the imbalanced german credit dataset
from numpy import mean
from numpy import std
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import fbeta_score
from sklearn.metrics import make_scorer
from sklearn.linear_model import RidgeClassifier
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# calculate f2-measure
def f2_measure(y_true, y_pred):
    return fbeta_score(y_true, y_pred, beta=2)

# evaluate a model
def evaluate_model(X, y, model):
    # define evaluation procedure
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    # define the model evaluation metric
    metric = make_scorer(f2_measure)
    # evaluate model
    scores = cross_val_score(model, X, y, scoring=metric, cv=cv, n_jobs=-1)
    return scores

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = RidgeClassifier()
# define the data sampling
sampling = SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='majority'))
# one hot encode categorical, standardize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',StandardScaler(),num_ix)])
# scale, then sample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', sampling), ('m',model)])
# evaluate the model and store results
scores = evaluate_model(X, y, pipeline)
print('%.3f (%.3f)' % (mean(scores), std(scores)))

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example gives the following results.

0.741 (0.034)

**Can you do even better?**

Let me know in the comments below.

Given the variance in results, a selection of any of the undersampling methods is probably sufficient. In this case, we will select logistic regression with Repeated ENN.

This model had an F2-measure of about 0.716 on our test harness.

We will use this as our final model and use it to make predictions on new data.

First, we can define the model as a pipeline.

...
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])

Once defined, we can fit it on the entire training dataset.

...
# fit the model
pipeline.fit(X, y)

Once fit, we can use it to make predictions for new data by calling the *predict()* function. This will return the class label of 0 for “*good customer*”, or 1 for “*bad customer*”.

Importantly, we must use the *ColumnTransformer* that was fit on the training dataset in the *Pipeline* to correctly prepare new data using the same transforms.

For example:

...
# define a row of data
row = [...]
# make prediction
yhat = pipeline.predict([row])

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know if the case is a good customer or bad.

The complete example is listed below.

# fit a model and make predictions for the german credit dataset
from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RepeatedEditedNearestNeighbours

# load the dataset
def load_dataset(full_path):
    # load the dataset as a numpy array
    dataframe = read_csv(full_path, header=None)
    # split into inputs and outputs
    last_ix = len(dataframe.columns) - 1
    X, y = dataframe.drop(last_ix, axis=1), dataframe[last_ix]
    # select categorical and numerical features
    cat_ix = X.select_dtypes(include=['object', 'bool']).columns
    num_ix = X.select_dtypes(include=['int64', 'float64']).columns
    # label encode the target variable to have the classes 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X.values, y, cat_ix, num_ix

# define the location of the dataset
full_path = 'german.csv'
# load the dataset
X, y, cat_ix, num_ix = load_dataset(full_path)
# define model to evaluate
model = LogisticRegression(solver='liblinear', class_weight='balanced')
# one hot encode categorical, normalize numerical
ct = ColumnTransformer([('c',OneHotEncoder(),cat_ix), ('n',MinMaxScaler(),num_ix)])
# scale, then undersample, then fit model
pipeline = Pipeline(steps=[('t',ct), ('s', RepeatedEditedNearestNeighbours()), ('m',model)])
# fit the model
pipeline.fit(X, y)
# evaluate on some good customers cases (known class 0)
print('Good Customers:')
data = [['A11', 6, 'A34', 'A43', 1169, 'A65', 'A75', 4, 'A93', 'A101', 4, 'A121', 67, 'A143', 'A152', 2, 'A173', 1, 'A192', 'A201'],
    ['A14', 12, 'A34', 'A46', 2096, 'A61', 'A74', 2, 'A93', 'A101', 3, 'A121', 49, 'A143', 'A152', 1, 'A172', 2, 'A191', 'A201'],
    ['A11', 42, 'A32', 'A42', 7882, 'A61', 'A74', 2, 'A93', 'A103', 4, 'A122', 45, 'A143', 'A153', 1, 'A173', 2, 'A191', 'A201']]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 0)' % (label))
# evaluate on some bad customers (known class 1)
print('Bad Customers:')
data = [['A13', 18, 'A32', 'A43', 2100, 'A61', 'A73', 4, 'A93', 'A102', 2, 'A121', 37, 'A142', 'A152', 1, 'A173', 1, 'A191', 'A201'],
    ['A11', 24, 'A33', 'A40', 4870, 'A61', 'A73', 3, 'A93', 'A101', 4, 'A124', 53, 'A143', 'A153', 2, 'A173', 2, 'A191', 'A201'],
    ['A11', 24, 'A32', 'A43', 1282, 'A62', 'A73', 4, 'A92', 'A101', 2, 'A123', 32, 'A143', 'A152', 1, 'A172', 1, 'A191', 'A201']]
for row in data:
    # make prediction
    yhat = pipeline.predict([row])
    # get the label
    label = yhat[0]
    # summarize
    print('>Predicted=%d (expected 1)' % (label))

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for some cases of known good customers chosen from the dataset file. We can see that all of these cases are correctly predicted.

Then some cases of actual bad customers are used as input to the model and the label is predicted. We can see that most, but not all, of these cases are correctly predicted. This highlights that although we chose a good model, it is not perfect.

Good Customers:
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
>Predicted=0 (expected 0)
Bad Customers:
>Predicted=0 (expected 1)
>Predicted=1 (expected 1)
>Predicted=1 (expected 1)

This section provides more resources on the topic if you are looking to go deeper.

- pandas.DataFrame.select_dtypes API.
- sklearn.metrics.fbeta_score API.
- sklearn.compose.ColumnTransformer API.
- sklearn.preprocessing.OneHotEncoder API.
- imblearn.pipeline.Pipeline API.

- Statlog (German Credit Data) Dataset, UCI Machine Learning Repository.
- German Credit Dataset.
- German Credit Dataset Description

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced German credit classification dataset.

Specifically, you learned:

- How to load and explore the dataset and generate ideas for data preparation and model selection.
- How to evaluate a suite of machine learning models and improve their performance with data undersampling techniques.
- How to fit a final model and use it to predict class labels for specific cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post Develop a Model for the Imbalanced Classification of Good and Bad Credit appeared first on Machine Learning Mastery.

]]>The post How to Calibrate Probabilities for Imbalanced Classification appeared first on Machine Learning Mastery.

]]>Probabilities provide a required level of granularity for evaluating and comparing models, especially on imbalanced classification problems where tools like ROC Curves are used to interpret predictions and the ROC AUC metric is used to compare model performance, both of which use probabilities.

Unfortunately, the probabilities or probability-like scores predicted by many models are not calibrated. This means that they may be over-confident in some cases and under-confident in other cases. Worse still, the severely skewed class distribution present in imbalanced classification tasks may result in even more bias in the predicted probabilities as they over-favor predicting the majority class.

As such, it is often a good idea to calibrate the predicted probabilities for nonlinear machine learning models prior to evaluating their performance. Further, it is good practice to calibrate probabilities in general when working with imbalanced datasets, even for models like logistic regression that predict well-calibrated probabilities when the class labels are balanced.

In this tutorial, you will discover how to calibrate predicted probabilities for imbalanced classification.

After completing this tutorial, you will know:

- Calibrated probabilities are required to get the most out of models for imbalanced classification problems.
- How to calibrate predicted probabilities for nonlinear models like SVMs, decision trees, and KNN.
- How to grid search different probability calibration methods on a dataset with a skewed class distribution.

**Kick-start your project** with my new book Imbalanced Classification with Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

This tutorial is divided into five parts; they are:

- Problem of Uncalibrated Probabilities
- How to Calibrate Probabilities
- SVM With Calibrated Probabilities
- Decision Tree With Calibrated Probabilities
- Grid Search Probability Calibration with KNN

Many machine learning algorithms can predict a probability or a probability-like score that indicates class membership.

For example, logistic regression can predict the probability of class membership directly and support vector machines can predict a score that is not a probability but could be interpreted as a probability.

The probability can be used as a measure of uncertainty on those problems where a probabilistic prediction is required. This is particularly the case in imbalanced classification, where crisp class labels are often insufficient both in terms of evaluating and selecting a model. The predicted probability provides the basis for more granular model evaluation and selection, such as through the use of ROC and Precision-Recall diagnostic plots, metrics like ROC AUC, and techniques like threshold moving.

As such, using machine learning models that predict probabilities is generally preferred when working on imbalanced classification tasks. The problem is that few machine learning models have calibrated probabilities.

… to be usefully interpreted as probabilities, the scores should be calibrated.

— Page 57, Learning from Imbalanced Data Sets, 2018.

Calibrated probabilities mean that the predicted probability reflects the true likelihood of the event.

This might be confusing if you consider that in classification, we have class labels that are correct or not instead of probabilities. To clarify, recall that in binary classification, we are predicting a negative or positive case as class 0 or 1. If 100 examples are predicted with a probability of 0.8, then 80 percent of the examples will have class 1 and 20 percent will have class 0, if the probabilities are calibrated. Here, calibration is the concordance of predicted probabilities with the occurrence of positive cases.

Uncalibrated probabilities suggest that there is a bias in the probability scores, meaning the probabilities are overconfident or under-confident in some cases.

- **Calibrated Probabilities**. Probabilities match the true likelihood of events.
- **Uncalibrated Probabilities**. Probabilities are over-confident and/or under-confident.

This is common for machine learning models that are not trained using a probabilistic framework and for training data that has a skewed distribution, like imbalanced classification tasks.

There are two main causes for uncalibrated probabilities; they are:

- Algorithms not trained using a probabilistic framework.
- Biases in the training data.

Few machine learning algorithms produce calibrated probabilities. This is because for a model to predict calibrated probabilities, it must explicitly be trained under a probabilistic framework, such as maximum likelihood estimation. Some examples of algorithms that provide calibrated probabilities include:

- Logistic Regression.
- Linear Discriminant Analysis.
- Naive Bayes.
- Artificial Neural Networks.

Many algorithms either predict a probability-like score or a class label and must be coerced in order to produce a probability-like score. As such, these algorithms often require their “*probabilities*” to be calibrated prior to use. Examples include:

- Support Vector Machines.
- Decision Trees.
- Ensembles of Decision Trees (bagging, random forest, gradient boosting).
- k-Nearest Neighbors.

A bias in the training dataset, such as a skew in the class distribution, means that the model will naturally predict a higher probability for the majority class than the minority class on average.

The problem is, models may overcompensate and give too much focus to the majority class. This even applies to models that typically produce calibrated probabilities like logistic regression.

… class probability estimates attained via supervised learning in imbalanced scenarios systematically underestimate the probabilities for minority class instances, despite ostensibly good overall calibration.

— Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them), 2012.


Probabilities are calibrated by rescaling their values so they better match the distribution observed in the training data.

… we desire that the estimated class probabilities are reflective of the true underlying probability of the sample. That is, the predicted class probability (or probability-like value) needs to be well-calibrated. To be well-calibrated, the probabilities must effectively reflect the true likelihood of the event of interest.

— Page 249, Applied Predictive Modeling, 2013.

Probability predictions are made on training data and the distribution of probabilities is compared to the expected probabilities and adjusted to provide a better match. This often involves splitting a training dataset and using one portion to train the model and another portion as a validation set to scale the probabilities.

There are two main techniques for scaling predicted probabilities; they are Platt scaling and isotonic regression.

- **Platt Scaling**. Logistic regression model to transform probabilities.
- **Isotonic Regression**. Weighted least-squares regression model to transform probabilities.

Platt scaling is a simpler method and was developed to scale the output from a support vector machine to probability values. It involves learning a logistic regression model to perform the transform of scores to calibrated probabilities. Isotonic regression is a more complex weighted least squares regression model. It requires more training data, although it is also more powerful and more general. Here, isotonic simply refers to a monotonically increasing mapping of the original probabilities to the rescaled values.
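Platt scaling can be sketched directly from its definition: fit a one-dimensional sigmoid p = 1 / (1 + exp(A*s + B)) that maps raw classifier scores s to probabilities, with A and B chosen to minimize the log loss on held-out (score, label) pairs. The gradient-descent fit below is a simplified illustration on invented toy scores; in practice the CalibratedClassifierCV class covered next performs this fitting for you:

```python
# simplified sketch of Platt scaling: fit p = 1/(1+exp(A*s + B)) to
# (score, label) pairs by gradient descent on the log loss
# (toy scores and labels for illustration; prefer CalibratedClassifierCV)
from math import exp

def sigmoid(s, A, B):
    # Platt's parametric form; A is typically negative so p rises with s
    return 1.0 / (1.0 + exp(A * s + B))

def platt_fit(scores, labels, lr=0.1, steps=2000):
    A, B = -1.0, 0.0  # start from a plain sigmoid of the score
    n = len(scores)
    for _ in range(steps):
        gA = gB = 0.0
        for s, t in zip(scores, labels):
            err = sigmoid(s, A, B) - t  # prediction minus target
            gA += err * s
            gB += err
        # descend the averaged log-loss gradient
        A += lr * gA / n
        B += lr * gB / n
    return A, B

scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]  # raw decision scores
labels = [0, 0, 0, 1, 1, 1]
A, B = platt_fit(scores, labels)
calibrated = [sigmoid(s, A, B) for s in scores]
print(['%.3f' % p for p in calibrated])  # rises monotonically with the score
```

The fitted sigmoid squashes the unbounded scores into [0, 1] while preserving their ranking, which is why Platt scaling works best when the distortion in the raw scores is itself sigmoid-shaped.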

Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion.

— Predicting Good Probabilities With Supervised Learning, 2005.

The scikit-learn library provides access to both Platt scaling and isotonic regression methods for calibrating probabilities via the CalibratedClassifierCV class.

This is a wrapper for a model (like an SVM). The preferred scaling technique is defined via the “*method*” argument, which can be ‘*sigmoid*‘ (Platt scaling) or ‘*isotonic*‘ (isotonic regression).

Cross-validation is used to scale the predicted probabilities from the model, set via the “*cv*” argument. This means that for each of the k folds, the model is fit on the training folds and calibrated on the held-out fold, and the predicted probabilities are averaged across the k fitted models.

Setting the “*cv*” argument depends on the amount of data available, although values such as 3 or 5 can be used. Importantly, the split is stratified, which is important when using probability calibration on imbalanced datasets that often have very few examples of the positive class.

```python
...
# example of wrapping a model with probability calibration
model = ...
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=3)
```

Now that we know how to calibrate probabilities, let’s look at some examples of calibrating probability for models on an imbalanced classification dataset.

In this section, we will review how to calibrate the probabilities for an SVM model on an imbalanced classification dataset.

First, let’s define a dataset using the make_classification() function. We will generate 10,000 examples, 99 percent of which will belong to the negative case (class 0) and 1 percent will belong to the positive case (class 1).

```python
...
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
```

Next, we can define an SVM with default hyperparameters. This means that the model is not tuned to the dataset, but will provide a consistent basis of comparison.

```python
...
# define model
model = SVC(gamma='scale')
```

We can then evaluate this model on the dataset using repeated stratified k-fold cross-validation with three repeats of 10-folds.

We will evaluate the model using ROC AUC and calculate the mean score across all repeats and folds. The ROC AUC will make use of the uncalibrated probability-like scores provided by the SVM.

```python
...
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Tying this together, the complete example is listed below.

```python
# evaluate svm with uncalibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.svm import SVC
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = SVC(gamma='scale')
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the SVM with uncalibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the SVM achieved a ROC AUC of about 0.804.

Mean ROC AUC: 0.804

Next, we can try using the *CalibratedClassifierCV* class to wrap the SVM model and predict calibrated probabilities.

We are using stratified 10-fold cross-validation to evaluate the model; that means 9,000 examples are used for train and 1,000 for test on each fold.

With *CalibratedClassifierCV* and 3-folds, the 9,000 examples of one fold will be split into 6,000 for training the model and 3,000 for calibrating the probabilities. This does not leave many examples of the minority class: of the 100 minority examples, about 90 land in each training fold (10 held out for evaluation), and those 90 are then split into roughly 60 for fitting the model and 30 for calibration.

When using calibration, it is important to work through these numbers based on your chosen model evaluation scheme and either adjust the number of folds to ensure the datasets are sufficiently large or even switch to a simpler train/test split instead of cross-validation if needed. Experimentation might be required.
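The arithmetic above can be sketched directly, so you can plug in your own dataset size, imbalance ratio, and fold counts before committing to an evaluation scheme (the numbers below are the tutorial's 10,000-example, 1 percent minority setup):

```python
# Sketch: minority-class counts under nested evaluation (10-fold) and
# calibration (3-fold) splits, using this tutorial's dataset as the example.
n_samples, minority_fraction = 10000, 0.01
eval_folds, calib_folds = 10, 3

minority_total = int(n_samples * minority_fraction)            # 100 minority examples
eval_train = minority_total * (eval_folds - 1) // eval_folds   # per evaluation training fold
eval_test = minority_total // eval_folds                       # held out for evaluation
calib_fit = eval_train * (calib_folds - 1) // calib_folds      # used to fit the model
calib_hold = eval_train // calib_folds                         # used to calibrate
print(eval_train, eval_test, calib_fit, calib_hold)  # 90 10 60 30
```

If `calib_hold` gets too small to estimate a reliable mapping, reduce the number of calibration folds or switch to a single train/test split.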

We will define the SVM model as before, then define the *CalibratedClassifierCV* with isotonic regression, then evaluate the calibrated model via repeated stratified k-fold cross-validation.

```python
...
# define model
model = SVC(gamma='scale')
# wrap the model
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=3)
```

Because SVM probabilities are not calibrated by default, we would expect that calibrating them would improve the ROC AUC, a metric that evaluates a model based on its predicted probabilities.

Tying this together, the complete example of evaluating an SVM with calibrated probabilities is listed below.

```python
# evaluate svm with calibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = SVC(gamma='scale')
# wrap the model
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=3)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(calibrated, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the SVM with calibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the SVM achieved a lift in ROC AUC from about 0.804 to about 0.875.

Mean ROC AUC: 0.875

Probability calibration can be evaluated in conjunction with other modifications to the algorithm or dataset to address the skewed class distribution.

For example, SVM provides the “*class_weight*” argument that can be set to “*balanced*” to adjust the margin to favor the minority class. We can include this change to SVM and calibrate the probabilities, and we might expect to see a further lift in model skill; for example:

```python
...
# define model
model = SVC(gamma='scale', class_weight='balanced')
```
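For context, the “*balanced*” heuristic weights each class inversely to its frequency, computed as `n_samples / (n_classes * bincount(y))`. A minimal sketch of what this produces on a 99:1 class distribution like the tutorial's (the tiny 100-example `y` array here is just for illustration):

```python
# Sketch: what class_weight='balanced' computes under the hood,
# i.e. n_samples / (n_classes * bincount(y)). The toy labels are illustrative.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 99 + [1] * 1)   # 99:1 imbalance, mirroring the tutorial's dataset
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# manual check: 100 / (2 * 99) for the majority, 100 / (2 * 1) for the minority
print(weights)  # approximately [0.505, 50.0]
```

The minority class ends up weighted about 99 times more heavily, which is what pushes the SVM margin toward it.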

Tying this together, the complete example of a class weighted SVM with calibrated probabilities is listed below.

```python
# evaluate weighted svm with calibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = SVC(gamma='scale', class_weight='balanced')
# wrap the model
calibrated = CalibratedClassifierCV(model, method='isotonic', cv=3)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(calibrated, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the class-weighted SVM with calibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the SVM achieved a further lift in ROC AUC from about 0.875 to about 0.966.

Mean ROC AUC: 0.966

Decision trees are another highly effective machine learning algorithm that does not naturally produce probabilities.

Instead, class labels are predicted directly and a probability-like score can be estimated based on the distribution of examples in the training dataset that fall into the leaf of the tree that is predicted for the new example. As such, the probability scores from a decision tree should be calibrated prior to being evaluated and used to select a model.
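This leaf-distribution behavior can be verified directly. A minimal sketch (the shallow `max_depth=2` tree and small dataset are illustrative choices, and `tree_.value` stores class counts or class fractions depending on the scikit-learn version, so the sketch normalizes either way):

```python
# Sketch: a decision tree's "probability" is the class distribution in the
# predicted leaf. Dataset and depth are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
# a shallow tree keeps leaves impure, so scores are not all 0 or 1
tree = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
leaf = tree.apply(X[:1])[0]         # index of the leaf the first example falls into
dist = tree.tree_.value[leaf][0]    # per-class counts (or fractions) in that leaf
proba = tree.predict_proba(X[:1])[0]
# predict_proba equals the normalized leaf class distribution
print(proba, dist / dist.sum())
```

Because only a handful of distinct leaf distributions exist, the scores are coarse and often poorly calibrated, which motivates the calibration step below.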

We can define a decision tree using the DecisionTreeClassifier scikit-learn class.

The model can be evaluated with uncalibrated probabilities on our synthetic imbalanced classification dataset.

The complete example is listed below.

```python
# evaluate decision tree with uncalibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.tree import DecisionTreeClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = DecisionTreeClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the decision tree with uncalibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the decision tree achieved a ROC AUC of about 0.842.

Mean ROC AUC: 0.842

We can then evaluate the same model using the calibration wrapper.

In this case, we will use the Platt Scaling method configured by setting the “*method*” argument to “*sigmoid*“.

```python
...
# wrap the model
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=3)
```

The complete example of evaluating the decision tree with calibrated probabilities for imbalanced classification is listed below.

```python
# decision tree with calibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = DecisionTreeClassifier()
# wrap the model
calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=3)
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(calibrated, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the decision tree with calibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the decision tree achieved a lift in ROC AUC from about 0.842 to about 0.859.

Mean ROC AUC: 0.859

Probability calibration can be sensitive to both the method and the way in which the method is employed.

As such, it is a good idea to test a suite of different probability calibration methods on your model in order to discover what works best for your dataset. One approach is to treat the calibration method and cross-validation folds as hyperparameters and tune them. In this section, we will look at using a grid search to tune these hyperparameters.

The k-nearest neighbor, or KNN, algorithm is another nonlinear machine learning algorithm that predicts a class label directly and must be modified to produce a probability-like score. This often involves using the distribution of class labels in the neighborhood.
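Concretely, with uniform weights scikit-learn's KNN reports the fraction of the k neighbors in each class. A tiny sketch (the five 1-D training points are an illustrative toy, not part of the tutorial's dataset):

```python
# Sketch: KNN's probability-like score is the fraction of the k nearest
# neighbors in each class. The toy 1-D points are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [0.9], [1.0]])
y = np.array([0, 0, 0, 1, 1])
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
# with k=5 over 5 points, every query sees 3 zeros and 2 ones
proba = knn.predict_proba([[0.5]])[0]
print(proba)  # [0.6 0.4]
```

Such scores take only a few discrete values (multiples of 1/k), so they are rarely calibrated out of the box.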

We can evaluate a KNN with uncalibrated probabilities on our synthetic imbalanced classification dataset using the KNeighborsClassifier class with a default neighborhood size of 5.

The complete example is listed below.

```python
# evaluate knn with uncalibrated probabilities for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = KNeighborsClassifier()
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print('Mean ROC AUC: %.3f' % mean(scores))
```

Running the example evaluates the KNN with uncalibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the KNN achieved a ROC AUC of about 0.864.

Mean ROC AUC: 0.864

Knowing that the probabilities are dependent on the neighborhood size and are uncalibrated, we would expect that some calibration would improve the performance of the model using ROC AUC.

Rather than spot-checking one configuration of the *CalibratedClassifierCV* class, we will instead use the GridSearchCV to grid search different configurations.

First, the model and calibration wrapper are defined as before.

```python
...
# define model
model = KNeighborsClassifier()
# wrap the model
calibrated = CalibratedClassifierCV(model)
```

We will test both “*sigmoid*” and “*isotonic*” “method” values, and different “*cv*” values in [2,3,4]. Recall that “*cv*” controls the split of the training dataset that is used to estimate the calibrated probabilities.

We can define the grid of parameters as a dict with the names of the arguments to the *CalibratedClassifierCV* we want to tune and provide lists of values to try. This will test 3 * 2 or 6 different combinations.

```python
...
# define grid
param_grid = dict(cv=[2,3,4], method=['sigmoid','isotonic'])
```

We can then define the *GridSearchCV* with the model and grid of parameters and use the same repeated stratified k-fold cross-validation we used before to evaluate each parameter combination.

```python
...
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=calibrated, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
```

Once evaluated, we will then summarize the configuration found with the highest ROC AUC, then list the results for all combinations.

```python
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
```

Tying this together, the complete example of grid searching probability calibration for imbalanced classification with a KNN model is listed below.

```python
# grid search probability calibration with knn for imbalanced classification
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.calibration import CalibratedClassifierCV
# generate dataset
X, y = make_classification(n_samples=10000, n_features=2, n_redundant=0, n_clusters_per_class=1, weights=[0.99], flip_y=0, random_state=4)
# define model
model = KNeighborsClassifier()
# wrap the model
calibrated = CalibratedClassifierCV(model)
# define grid
param_grid = dict(cv=[2,3,4], method=['sigmoid','isotonic'])
# define evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid search
grid = GridSearchCV(estimator=calibrated, param_grid=param_grid, n_jobs=-1, cv=cv, scoring='roc_auc')
# execute the grid search
grid_result = grid.fit(X, y)
# report the best configuration
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
# report all configurations
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
	print("%f (%f) with: %r" % (mean, stdev, param))
```

Running the example evaluates the KNN with a suite of different types of calibrated probabilities on the imbalanced classification dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the best result was achieved with a “*cv*” of 2 and an “*isotonic*” value for “*method*”, achieving a mean ROC AUC of about 0.895, a lift from the 0.864 achieved with no calibration.

Best: 0.895120 using {'cv': 2, 'method': 'isotonic'}
0.895084 (0.062358) with: {'cv': 2, 'method': 'sigmoid'}
0.895120 (0.062488) with: {'cv': 2, 'method': 'isotonic'}
0.885221 (0.061373) with: {'cv': 3, 'method': 'sigmoid'}
0.881924 (0.064351) with: {'cv': 3, 'method': 'isotonic'}
0.881865 (0.065708) with: {'cv': 4, 'method': 'sigmoid'}
0.875320 (0.067663) with: {'cv': 4, 'method': 'isotonic'}

This provides a template that you can use to evaluate different probability calibration configurations on your own models.

This section provides more resources on the topic if you are looking to go deeper.

- Predicting Good Probabilities With Supervised Learning, 2005.
- Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them), 2012.

- Learning from Imbalanced Data Sets, 2018.
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Applied Predictive Modeling, 2013.

- sklearn.calibration.CalibratedClassifierCV API.
- sklearn.svm.SVC API.
- sklearn.tree.DecisionTreeClassifier API.
- sklearn.neighbors.KNeighborsClassifier API.
- sklearn.model_selection.GridSearchCV API.

- Calibration (statistics), Wikipedia.
- Probabilistic classification, Wikipedia.
- Platt scaling, Wikipedia.
- Isotonic regression, Wikipedia.

In this tutorial, you discovered how to calibrate predicted probabilities for imbalanced classification.

Specifically, you learned:

- Calibrated probabilities are required to get the most out of models for imbalanced classification problems.
- How to calibrate predicted probabilities for nonlinear models like SVMs, decision trees, and KNN.
- How to grid search different probability calibration methods on datasets with a skewed class distribution.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Calibrate Probabilities for Imbalanced Classification appeared first on Machine Learning Mastery.
