A Gentle Introduction to Probability Metrics for Imbalanced Classification

Classification predictive modeling involves predicting a class label for examples, although some problems require the prediction of a probability of class membership.

For these problems, crisp class labels are not required; instead, the likelihood of each example belonging to each class is required and later interpreted. As such, small relative probabilities can carry a lot of meaning, and specialized metrics are required to quantify the predicted probabilities.

In this tutorial, you will discover metrics for evaluating probabilistic predictions for imbalanced classification.

After completing this tutorial, you will know:

  • Probability predictions are required for some classification predictive modeling problems.
  • Log loss quantifies the average difference between predicted and expected probability distributions.
  • Brier score quantifies the average difference between predicted and expected probabilities.

Discover SMOTE, one-class classification, cost-sensitive learning, threshold moving, and much more in my new book, with 30 step-by-step tutorials and full Python source code.

Let’s get started.

A Gentle Introduction to Probability Metrics for Imbalanced Classification
Photo by a4gpa, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Probability Metrics
  2. Log Loss for Imbalanced Classification
  3. Brier Score for Imbalanced Classification

Probability Metrics

Classification predictive modeling involves predicting a class label for an example.

On some problems, a crisp class label is not required, and instead a probability of class membership is preferred. The probability summarizes the likelihood (or uncertainty) of an example belonging to each class label. Probabilities are more nuanced and can be interpreted by a human operator or a system in decision making.

Probability metrics are those specifically designed to quantify the skill of a classifier model using the predicted probabilities instead of crisp class labels. They are typically scores that provide a single value that can be used to compare different models based on how well the predicted probabilities match the expected class probabilities.

In practice, a dataset will not have target probabilities. Instead, it will have class labels.

For example, a two-class (binary) classification problem will have the class labels 0 for the negative case and 1 for the positive case. When an example has the class label 0, then the probability of the class labels 0 and 1 will be 1 and 0 respectively. When an example has the class label 1, then the probability of class labels 0 and 1 will be 0 and 1 respectively.

  • Example with Class=0: P(class=0) = 1, P(class=1) = 0
  • Example with Class=1: P(class=0) = 0, P(class=1) = 1

We can see how this would scale to three classes or more; for example:

  • Example with Class=0: P(class=0) = 1, P(class=1) = 0, P(class=2) = 0
  • Example with Class=1: P(class=0) = 0, P(class=1) = 1, P(class=2) = 0
  • Example with Class=2: P(class=0) = 0, P(class=1) = 0, P(class=2) = 1

In the case of binary classification problems, this representation can be simplified to just focus on the positive class.

That is, we only require the probability of an example belonging to class 1 to represent the probabilities for binary classification (the so-called Bernoulli distribution); for example:

  • Example with Class=0: P(class=1) = 0
  • Example with Class=1: P(class=1) = 1

Probability metrics will summarize how well the predicted distribution of class membership matches the known class probability distribution.

This focus on predicted probabilities means that the crisp class labels predicted by a model are ignored. It also means that a model that predicts probabilities may appear to have terrible performance when evaluated on its crisp class labels, such as with accuracy or a similar score. This is because, although the predicted probabilities may show skill, they must be interpreted with an appropriate threshold before being converted into crisp class labels.

Additionally, the focus on predicted probabilities may also require that the probabilities predicted by some nonlinear models be calibrated prior to being used or evaluated. Some models will learn calibrated probabilities as part of the training process (e.g. logistic regression), but many will not and will require calibration (e.g. support vector machines, decision trees, and neural networks).

A given probability metric is typically calculated for each example, then averaged across all examples in the dataset.

There are two popular metrics for evaluating predicted probabilities; they are:

  • Log Loss
  • Brier Score

Let’s take a closer look at each in turn.

Want to Get Started With Imbalanced Classification?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Download Your FREE Mini-Course

Log Loss for Imbalanced Classification

Logarithmic loss, or log loss for short, is a loss function best known for training the logistic regression classification algorithm.

The log loss function calculates the negative log likelihood for probability predictions made by a binary classification model. Most notably, this is logistic regression, but the function can be used by other models, such as neural networks, and is known by other names, such as cross-entropy.

Generally, the log loss can be calculated using the expected probabilities for each class and the natural logarithm of the predicted probabilities for each class; for example:

  • LogLoss = -(y_0 * log(yhat_0) + y_1 * log(yhat_1)), where y_c is the expected probability and yhat_c is the predicted probability for class c

The best possible log loss is 0.0, and values range from 0.0 toward positive infinity for progressively worse scores.

If you are just predicting the probability for the positive class, then the log loss function can be calculated for one binary classification prediction (yhat) compared to the expected probability (y) as follows:

  • LogLoss = -((1 – y) * log(1 – yhat) + y * log(yhat))

For example, if the expected probability was 1.0 and the model predicted 0.8, the log loss would be:

  • LogLoss = -((1 – y) * log(1 – yhat) + y * log(yhat))
  • LogLoss = -((1 – 1.0) * log(1 – 0.8) + 1.0 * log(0.8))
  • LogLoss = -(0.0 + -0.223)
  • LogLoss = 0.223

This calculation can be scaled up for multiple classes by adding additional terms; for example:

  • LogLoss = -( sum c in C y_c * log(yhat_c))

This generalization is also known as cross-entropy and calculates the number of bits (if log base-2 is used) or nats (if log base-e is used) by which two probability distributions differ.

Specifically, it builds upon the idea of entropy from information theory and calculates the average number of bits required to represent or transmit an event from one distribution compared to the other distribution.

… the cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q …

— Page 57, Machine Learning: A Probabilistic Perspective, 2012.

The intuition for this definition follows if we consider a target or underlying probability distribution P and an approximation Q of that target distribution: the cross-entropy of Q from P is the average number of bits needed to represent an event when Q is used in place of P.

We will stick with log loss for now, as it is the term most commonly used when using this calculation as an evaluation metric for classifier models.

When calculating the log loss for a set of predictions compared to a set of expected probabilities in a test dataset, the average of the log loss across all samples is calculated and reported; for example:

  • AverageLogLoss = 1/N * sum i in N -((1 – y_i) * log(1 – yhat_i) + y_i * log(yhat_i))

The average log loss for a set of predictions on a given dataset is often simply referred to as the log loss.

We can demonstrate calculating log loss with a worked example.

First, let’s define a synthetic binary classification dataset. We will use the make_classification() function to create 1,000 examples, with a 99%/1% split for the two classes. The complete example of creating and summarizing the dataset is listed below.
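
A minimal sketch of what such an example might look like is shown below; the specific arguments (such as flip_y and random_state) and the use of Counter to summarize the class distribution are assumptions rather than a definitive listing.

```python
# sketch: create and summarize a synthetic imbalanced binary classification dataset
from collections import Counter
from sklearn.datasets import make_classification
# generate 1,000 examples with an approximate 99%/1% class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
# summarize the shape and class distribution of the generated dataset
print(X.shape, y.shape)
print(Counter(y))
```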

Running the example creates the dataset and reports the distribution of examples in each class.

Next, we will develop an intuition for naive predictions of probabilities.

A naive prediction strategy would be to predict certainty for the majority class, or P(class=0) = 1. An alternative strategy would be to predict the minority class, or P(class=1) = 1.

Log loss can be calculated using the log_loss() scikit-learn function. It takes the probability for each class as input and returns the average log loss. Specifically, each example must have a prediction with one probability per class, meaning a prediction for one example for a binary classification problem must have a probability for class 0 and class 1.

Therefore, predicting certain probabilities for class 0 for all examples would be implemented as follows:
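
A sketch of this step might look as follows, assuming testy holds the true test labels and log_loss has been imported from scikit-learn, as in the complete example further below:

```python
# sketch: be certain of the majority class for every example, i.e. [P(class=0), P(class=1)] = [1, 0]
probabilities = [[1, 0] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('P(class0=1): Log Loss=%.3f' % avg_logloss)
```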

We can do the same thing for P(class=1) = 1.
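
A sketch of the same step for the minority class:

```python
# sketch: be certain of the minority class for every example, i.e. [P(class=0), P(class=1)] = [0, 1]
probabilities = [[0, 1] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('P(class1=1): Log Loss=%.3f' % avg_logloss)
```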

These two strategies are expected to perform terribly.

A better naive strategy would be to predict the class distribution for each example. For example, because our dataset has a 99%/1% class distribution for the majority and minority classes, this distribution can be “predicted” for each example to give a baseline for probability predictions.
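
A sketch of this baseline strategy, under the same assumptions as above:

```python
# sketch: predict the prior class distribution (99%/1%) for every example
probabilities = [[0.99, 0.01] for _ in range(len(testy))]
avg_logloss = log_loss(testy, probabilities)
print('Baseline: Log Loss=%.3f' % avg_logloss)
```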

Finally, we can also calculate the log loss for perfectly predicted probabilities by taking the target values for the test set as predictions.
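
A sketch of this step; for binary problems, log_loss() also accepts a one-dimensional array of positive class probabilities, so the true labels can be passed directly:

```python
# sketch: use the true labels as "perfect" positive class probability predictions
avg_logloss = log_loss(testy, testy)
print('Perfect: Log Loss=%.3f' % avg_logloss)
```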

Tying this all together, the complete example is listed below.
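
A minimal, self-contained sketch of the complete example is shown below; the train/test split configuration and the random seeds are assumptions.

```python
# sketch: log loss for naive probability prediction strategies on an imbalanced dataset
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# generate a synthetic dataset with an approximate 99%/1% class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
print(Counter(y))
# split into train and test sets, preserving the class distribution
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# naive strategy: certain of the majority class for every example
probabilities = [[1, 0] for _ in range(len(testy))]
print('P(class0=1): Log Loss=%.3f' % log_loss(testy, probabilities))
# naive strategy: certain of the minority class for every example
probabilities = [[0, 1] for _ in range(len(testy))]
print('P(class1=1): Log Loss=%.3f' % log_loss(testy, probabilities))
# baseline: predict the prior class distribution for every example
probabilities = [[0.99, 0.01] for _ in range(len(testy))]
print('Baseline: Log Loss=%.3f' % log_loss(testy, probabilities))
# perfect predictions: use the true labels as positive class probabilities
print('Perfect: Log Loss=%.3f' % log_loss(testy, testy))
```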

Running the example reports the log loss for each naive strategy.

As expected, predicting certainty for each class label is punished with large log loss scores, with being certain of the minority class for all examples resulting in a much larger score.

We can see that predicting the distribution of examples in the dataset as the baseline results in a better score than either of the other naive measures. This baseline represents the no skill classifier and log loss scores below this strategy represent a model that has some skill.

Finally, we can see that a log loss for perfectly predicted probabilities is 0.0, indicating no difference between actual and predicted probability distributions.

Now that we are familiar with log loss, let’s take a look at the Brier score.

Brier Score for Imbalanced Classification

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts and is designed for binary classification problems. It is focused on evaluating the probabilities for the positive class. Nevertheless, it can be adapted for problems with multiple classes.

As such, it is an appropriate probabilistic metric for imbalanced classification problems.

The evaluation of probabilistic scores is generally performed by means of the Brier Score. The basic idea is to compute the mean squared error (MSE) between predicted probability scores and the true class indicator, where the positive class is coded as 1, and negative class 0.

— Page 57, Learning from Imbalanced Data Sets, 2018.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

The Brier score can be calculated for positive predicted probabilities (yhat) compared to the expected probabilities (y) as follows:

  • BrierScore = 1/N * Sum i to N (yhat_i – y_i)^2

For example, if a predicted positive class probability is 0.8 and the expected probability is 1.0, then the Brier score is calculated as:

  • BrierScore = (yhat_i – y_i)^2
  • BrierScore = (0.8 – 1.0)^2
  • BrierScore = 0.04

We can demonstrate calculating Brier score with a worked example using the same dataset and naive predictive models as were used in the previous section.

The Brier score can be calculated using the brier_score_loss() scikit-learn function. It takes the probabilities for the positive class only, and returns an average score.

As in the previous section, we can evaluate naive strategies of predicting the certainty for each class label. In this case, as the score only considers the probability for the positive class, this involves predicting 0.0 for every example to be certain of the negative class (P(class=1) = 0) and 1.0 for every example to be certain of the positive class (P(class=1) = 1). For example:
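
A sketch of these two strategies, assuming testy holds the true test labels and brier_score_loss has been imported, as in the complete example further below:

```python
# sketch: certain of the negative class for every example, i.e. P(class=1) = 0.0
probabilities = [0.0 for _ in range(len(testy))]
print('P(class1=0): Brier Score=%.4f' % brier_score_loss(testy, probabilities))
# sketch: certain of the positive class for every example, i.e. P(class=1) = 1.0
probabilities = [1.0 for _ in range(len(testy))]
print('P(class1=1): Brier Score=%.4f' % brier_score_loss(testy, probabilities))
```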

We can also test the no skill classifier that predicts the ratio of positive examples in the dataset, which in this case is 1 percent or 0.01.
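
A sketch of this baseline, under the same assumptions:

```python
# sketch: the no skill baseline predicts the positive class ratio (0.01) for every example
probabilities = [0.01 for _ in range(len(testy))]
print('Baseline: Brier Score=%.4f' % brier_score_loss(testy, probabilities))
```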

Finally, we can also confirm the Brier score for perfectly predicted probabilities.
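
A sketch of this step:

```python
# sketch: use the true labels as "perfect" positive class probabilities
print('Perfect: Brier Score=%.4f' % brier_score_loss(testy, testy))
```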

Tying this together, the complete example is listed below.
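
A minimal, self-contained sketch of the complete example is shown below; as before, the split configuration and random seeds are assumptions.

```python
# sketch: Brier score for naive probability prediction strategies on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

# generate a synthetic dataset with an approximate 99%/1% class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train and test sets, preserving the class distribution
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# naive strategy: certain of the negative class for every example
probabilities = [0.0 for _ in range(len(testy))]
print('P(class1=0): Brier Score=%.4f' % brier_score_loss(testy, probabilities))
# naive strategy: certain of the positive class for every example
probabilities = [1.0 for _ in range(len(testy))]
print('P(class1=1): Brier Score=%.4f' % brier_score_loss(testy, probabilities))
# baseline: predict the ratio of positive examples (0.01) for every example
probabilities = [0.01 for _ in range(len(testy))]
print('Baseline: Brier Score=%.4f' % brier_score_loss(testy, probabilities))
# perfect predictions: use the true labels as positive class probabilities
print('Perfect: Brier Score=%.4f' % brier_score_loss(testy, testy))
```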

Running the example, we can see the scores for the naive models and the baseline no skill classifier.

As we might expect, we can see that predicting a 0.0 for all examples results in a low score, as the mean squared error between all 0.0 predictions and mostly 0 classes in the test set results in a small value. Conversely, the error between 1.0 predictions and mostly 0 class values results in a larger error score.

Importantly, we can see that the default no skill classifier results in a lower score than predicting all 0.0 values. Again, this represents the baseline score, below which models will demonstrate skill.

The Brier scores can become very small, and the focus will be on fractions well below 1.0. For example, the difference between the Baseline and Perfect scores in the example above is slight, appearing only at the fourth decimal place.

A common practice is to transform the score using a reference score, such as the no skill classifier. This is called a Brier Skill Score, or BSS, and is calculated as follows:

  • BrierSkillScore = 1 – (BrierScore / BrierScore_ref)

We can see that if the reference score were evaluated, it would result in a BSS of 0.0. This represents a no skill prediction. Values below this will be negative and represent worse than no skill. Values above 0.0 represent skillful predictions, with a perfect prediction scoring 1.0.

We can demonstrate this by developing a function to calculate the Brier skill score, listed below.
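
A sketch of such a function, taking the true labels, the predicted positive class probabilities, and a precomputed reference Brier score; the exact signature is an assumption.

```python
from sklearn.metrics import brier_score_loss

# calculate the Brier skill score: 1 - (BrierScore / BrierScore_ref)
def brier_skill_score(y_true, y_prob, brier_ref):
    # Brier score of the candidate probability forecast
    bs = brier_score_loss(y_true, y_prob)
    # skill relative to the reference (no skill) Brier score
    return 1.0 - (bs / brier_ref)
```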

We can then calculate the BSS for each of the naive forecasts, as well as for a perfect prediction.

The complete example is listed below.
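
A minimal, self-contained sketch of the complete example is shown below; the dataset configuration, split, and helper function signature are assumptions.

```python
# sketch: Brier skill score (BSS) for naive strategies on an imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

# calculate the Brier skill score relative to a precomputed reference Brier score
def brier_skill_score(y_true, y_prob, brier_ref):
    bs = brier_score_loss(y_true, y_prob)
    return 1.0 - (bs / brier_ref)

# generate a synthetic dataset with an approximate 99%/1% class distribution
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.99, 0.01], flip_y=0, random_state=1)
# split into train and test sets, preserving the class distribution
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2, stratify=y)

# reference forecast: the no skill strategy of predicting the positive class ratio
ref_probs = [0.01 for _ in range(len(testy))]
brier_ref = brier_score_loss(testy, ref_probs)
print('Reference: Brier Score=%.4f' % brier_ref)
# certainty for the negative class is slightly worse than no skill (small negative BSS)
print('P(class1=0): BSS=%.4f' % brier_skill_score(testy, [0.0 for _ in range(len(testy))], brier_ref))
# certainty for the positive class is much worse than no skill (large negative BSS)
print('P(class1=1): BSS=%.4f' % brier_skill_score(testy, [1.0 for _ in range(len(testy))], brier_ref))
# the reference forecast itself scores 0.0 (no skill)
print('Baseline: BSS=%.4f' % brier_skill_score(testy, ref_probs, brier_ref))
# perfect predictions score 1.0
print('Perfect: BSS=%.4f' % brier_skill_score(testy, testy, brier_ref))
```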

Running the example first calculates the reference Brier score used in the BSS calculation.

We can then see that predicting certainty scores for each class results in a negative BSS score, indicating that they are worse than no skill. Finally, we can see that evaluating the reference forecast itself results in a BSS of 0.0, indicating no skill, and that evaluating the true values as predictions results in a perfect score of 1.0.

As such, the Brier Skill Score is a best practice for evaluating probability predictions and is widely used where probabilistic classification predictions are evaluated routinely, such as in weather forecasting (e.g. rain or not).

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

  • Machine Learning: A Probabilistic Perspective, 2012.
  • Learning from Imbalanced Data Sets, 2018.

API

  • sklearn.metrics.log_loss API.
  • sklearn.metrics.brier_score_loss API.

Summary

In this tutorial, you discovered metrics for evaluating probabilistic predictions for imbalanced classification.

Specifically, you learned:

  • Probability predictions are required for some classification predictive modeling problems.
  • Log loss quantifies the average difference between predicted and expected probability distributions.
  • Brier score quantifies the average difference between predicted and expected probabilities.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Get a Handle on Imbalanced Classification!

Imbalanced Classification with Python

Develop Imbalanced Learning Models in Minutes

...with just a few lines of python code

Discover how in my new Ebook:
Imbalanced Classification with Python

It provides self-study tutorials and end-to-end projects on:
Performance Metrics, Undersampling Methods, SMOTE, Threshold Moving, Probability Calibration, Cost-Sensitive Algorithms
and much more...

Bring Imbalanced Classification Methods to Your Machine Learning Projects

See What's Inside

26 Responses to A Gentle Introduction to Probability Metrics for Imbalanced Classification

  1. marco January 11, 2020 at 3:41 am #

    Hello Jason,
    I have a problem. I have a hundred observations for training. Features are related to workers. y values are city names. In the future I have to predict the city name where to move workers given a new input. It seems a classification problem, right? The first question is how to encode city names? Then is it better to use Keras or Sklearn? In the case of Sklearn, which algorithm? And finally, how to decode the y value? Thanks
    Marco

  2. marco January 13, 2020 at 6:29 am #

    Hello Jason,
    Is it possible to also use one-hot encoding for encoding cities? What are the differences between one-hot and label encoding? When should each be used?
    Thanks,
    Marco

  3. marco January 13, 2020 at 8:08 pm #

    Hello Jason,
    I’m trying XGBClassifier on the IRIS dataset.
    1) Is it a machine learning algorithm or a deep learning algorithm?
    2) When I encode IRIS y using LabelEncoder(), why don’t I need to normalize it? (More generally, do I have to normalize y?)
    3) IRIS X_train values are > 1, do I need to normalize them?
    Thanks,
    Marco

    • Jason Brownlee January 14, 2020 at 7:21 am #

      XGBoost is machine learning, not deep learning.

      Encoded labels are integers. No need to normalize the class labels.

      Generally, you don’t need to normalize inputs for tree algorithms, like xgboost.

  4. marco January 14, 2020 at 8:14 pm #

    Hello Jason,
    it seems that XGBClassifier is a great algorithm. I tried it with the churn dataset.
    I also built a Keras MLP (with 100 epochs).
    XGBClassifier is MUCH faster and MORE accurate than the Keras one.
    1. So is it possible that a machine learning algorithm is better than a deep learning one?
    2. You said “Generally, you don’t need to normalize inputs for tree algorithms, like xgboost”
    that is great news; what are the other two?
    3. Does XGBClassifier work for linear regression as well? Do you have any example?
    Thanks,
    Marco

    • Jason Brownlee January 15, 2020 at 8:24 am #

      Yes, each dataset is different and you must use controlled experiments to discover what works best for each dataset.

      There is no “best” algorithm for all problems.

      XGBoost can be used for regression. I may have an example on the blog, try searching.

  5. Markus January 15, 2020 at 8:18 am #

    Could you please in simple terms explain why one should favour Brier Skill Score to Log Loss? What is the benefit of using it compared to Log Loss?

    Thanks

    • Jason Brownlee January 15, 2020 at 8:34 am #

      Log loss is great for comparing distributions.

      Brier score, specifically Brier Skill Score is great for presenting scores that are relative to a baseline.

  6. marco January 17, 2020 at 7:00 pm #

    Hello Jason,
    one more question about XGBoost.
    I’ve seen that scikit-learn also has two versions of gradient boosting (ensemble.GradientBoostingClassifier / ensemble.HistGradientBoostingClassifier and ensemble.GradientBoostingRegressor / ensemble.HistGradientBoostingRegressor).
    What are the differences from the XGBoost functions? Which should I use? Are they faster?
    Can I replace the XGBoost functions (XGBClassifier and XGBRegressor) with them without major changes?
    Thanks

    • Jason Brownlee January 18, 2020 at 8:41 am #

      Excellent question!

      GBM in sklearn is the standard algorithm.

      Hist-based-GBM in sklearn is experimental and is a faster version of the algorithm based on lightgbm by microsoft.

      xgboost is an efficient implementation of GBM and is way faster than the sklearn implementation.

      Generally, speed improvements in xgboost, lightgbm and catboost often also lead to model skill improvements.

  7. marco January 19, 2020 at 9:16 pm #

    Hello Jason,
    I have a couple of questions. Are the parameters inside the function called hyperparameters? Or what are the hyperparameters in scikit-learn functions?

    GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='deprecated', validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)

    The second question is about sklearn.model_selection.GridSearchCV (I’d like to use it with XGBClassifier and XGBRegressor).
    Can it be used for cross validation? Do you have an example?
    Thanks,
    Marco

  8. Venkatesh Gandi January 20, 2020 at 6:29 am #

    Greatly explained, Thanks 🙂

  9. marco January 21, 2020 at 12:35 am #

    Hello Jason,
    I’ve seen that among ensemble methods there are AdaBoostClassifier and AdaBoostRegressor.
    What are the differences between AdaBoostClassifier and GradientBoostingClassifier, and between AdaBoostRegressor and GradientBoostingRegressor?
    When is it better to use the Ada functions?
    Thanks,
    Marco

  10. marco January 21, 2020 at 3:04 am #

    Hello Jason,
    one more question is how to navigate the scikit-learn map (https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).
    The flow up to Ensemble Classifiers is clear to me (i.e. in case of classification: yes -> SVC -> not working -> text data -> no -> KNeighbors Classifier -> not working -> then Ensemble Classifiers). Then how do I choose the right flow within Ensemble Classifiers?
    Is there any overall picture or map somewhere to help choose the right ensemble classifier (or ensemble regressor in case of regression)?
    Thanks,
    Marco

  11. nini June 4, 2020 at 7:07 pm #

    Hi Jason, actually I have a dataset with 7 types of defect. Right now I am done with classifying all the types of defect, which is only one output here. But do you have an idea on how to know the percentage chance the defect will have? This means I have to make it multiple input, which is first:
    1. the type of defect
    2. The percentage of the defect chances.

    Thanks,

    Nini

    • nini June 4, 2020 at 7:08 pm #

      Correction:
      This means I have to do multiple output, which is first:
      1. the type of defect
      2. The percentage of the defect chances.

      Thanks,

      Nini

      • Jason Brownlee June 5, 2020 at 8:08 am #

        No, a model can predict the probability for each class directly. E.g. an LDA, logistic regression, naive bayes, and many more.

    • Jason Brownlee June 5, 2020 at 8:08 am #

      Yes, you can use a model that predicts a class membership probability. Or a model that predicts a probability like score and use model calibration.

      • nini June 5, 2020 at 10:39 am #

        Does that mean the multiclass model for imbalanced classification cannot be used to find the percentage of the defect type? :(
