It can be more flexible to predict probabilities of an observation belonging to each class in a classification problem rather than predicting classes directly.

This flexibility comes from the way that probabilities may be interpreted using different thresholds, which allow the operator of the model to trade off concerns in the errors made by the model, such as the number of false positives compared to the number of false negatives. This is required when using models where the cost of one error outweighs the cost of other types of errors.

Two diagnostic tools that help in the interpretation of probabilistic forecasts for binary (two-class) classification predictive modeling problems are ROC Curves and Precision-Recall curves.

In this tutorial, you will discover ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems.

After completing this tutorial, you will know:

- ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
- Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
- ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Let’s get started.

- **Update Aug/2018**: Fixed bug in the representation of the no skill line for the precision-recall plot. Also fixed typo where I referred to ROC as relative rather than receiver (thanks spellcheck).
- **Update Nov/2018**: Fixed description on interpreting size of values on each axis, thanks Karl Humphries.
- **Update Jun/2019**: Fixed typo when interpreting imbalanced results.

## Tutorial Overview

This tutorial is divided into 6 parts; they are:

- Predicting Probabilities
- What Are ROC Curves?
- ROC Curves and AUC in Python
- What Are Precision-Recall Curves?
- Precision-Recall Curves in Python
- When to Use ROC vs. Precision-Recall Curves?

## Predicting Probabilities

In a classification problem, we may decide to predict the class values directly.

Alternatively, it can be more flexible to predict the probabilities for each class instead. The reason for this is to provide the capability to choose and even calibrate the threshold for how to interpret the predicted probabilities.

For example, a default might be to use a threshold of 0.5, meaning that a probability in [0.0, 0.5) is a negative outcome (0) and a probability in [0.5, 1.0] is a positive outcome (1).

This threshold can be adjusted to tune the behavior of the model for a specific problem, for example to reduce one type of error at the cost of making more of the other.
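As a minimal sketch of what applying a threshold means, the snippet below maps positive-class probabilities to crisp 0/1 labels. The helper name `to_labels` and the probability values are made up for illustration.

```python
def to_labels(probs, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels at a given threshold."""
    return [1 if p >= threshold else 0 for p in probs]

# hypothetical predicted probabilities for the positive class
probs = [0.10, 0.40, 0.50, 0.65, 0.90]

# default threshold of 0.5
print(to_labels(probs))
# a stricter threshold yields fewer positive predictions
print(to_labels(probs, 0.7))
```

Raising the threshold makes the model more conservative about predicting the positive class; lowering it does the opposite.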

When making a prediction for a binary or two-class classification problem, there are two types of errors that we could make.

- **False Positive**. Predict an event when there was no event.
- **False Negative**. Predict no event when in fact there was an event.

By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model.

For example, in a smog prediction system, we may be far more concerned with having low false negatives than low false positives. A false negative would mean not warning about a smog day when in fact it is a high smog day, leading to health issues for members of the public who were unable to take precautions. A false positive means the public would take precautionary measures when they didn’t need to.
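To see the trade-off in numbers, here is a small sketch that counts both error types at a few thresholds. The labels and probabilities are invented (1 stands for a smog day), and `count_errors` is an illustrative helper, not a library function.

```python
def count_errors(y_true, probs, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    return fp, fn

# hypothetical true outcomes (1 = smog day) and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.8):
    fp, fn = count_errors(y_true, probs, threshold)
    print('threshold=%.1f false positives=%d false negatives=%d' % (threshold, fp, fn))
```

On this toy data, lowering the threshold removes the false negatives at the price of more false alarms, which is the direction we would likely prefer for a smog warning.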

A common way to compare models that predict probabilities for two-class problems is to use a ROC curve.

## What Are ROC Curves?

A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic curve, or ROC curve.

It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive.

```
True Positive Rate = True Positives / (True Positives + False Negatives)
```

The true positive rate is also referred to as sensitivity.

```
Sensitivity = True Positives / (True Positives + False Negatives)
```

The false positive rate is calculated as the number of false positives divided by the sum of the number of false positives and the number of true negatives.

It is also called the false alarm rate as it summarizes how often a positive class is predicted when the actual outcome is negative.

```
False Positive Rate = False Positives / (False Positives + True Negatives)
```

The false positive rate is also referred to as the inverted specificity where specificity is the total number of true negatives divided by the sum of the number of true negatives and false positives.

```
Specificity = True Negatives / (True Negatives + False Positives)
```

Where:

```
False Positive Rate = 1 - Specificity
```
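The formulas above can be checked with a few lines of Python. This is only a sketch: the function name `rates` and the confusion-matrix counts are made up for illustration.

```python
def rates(tp, fn, fp, tn):
    """Compute true positive rate, false positive rate and specificity from counts."""
    tpr = tp / (tp + fn)          # sensitivity / hit rate
    fpr = fp / (fp + tn)          # false alarm rate
    specificity = tn / (tn + fp)
    return tpr, fpr, specificity

# hypothetical counts from a confusion matrix
tpr, fpr, specificity = rates(tp=40, fn=10, fp=5, tn=45)
print('TPR=%.2f FPR=%.2f Specificity=%.2f' % (tpr, fpr, specificity))
# the false positive rate and 1 - specificity agree (up to floating point error)
print(abs(fpr - (1.0 - specificity)) < 1e-12)
```

Note that the inputs are counts of examples, so each threshold gives one (FPR, TPR) pair; the curve comes from repeating the calculation across thresholds.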

The ROC curve is a useful tool for a few reasons:

- The curves of different models can be compared directly in general or for different thresholds.
- The area under the curve (AUC) can be used as a summary of the model skill.

The shape of the curve contains a lot of information, including what we might care about most for a problem, the expected false positive rate, and the false negative rate.

To make this clear:

- Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
- Larger values on the y-axis of the plot indicate higher true positives and lower false negatives.

If you are confused, remember, when we predict a binary outcome, it is either a correct prediction (true positive) or not (false positive). There is a tension between these options, the same with true negative and false negative.

A skilful model will assign a higher probability to a randomly chosen real positive occurrence than a negative occurrence on average. This is what we mean when we say that the model has skill. Generally, skilful models are represented by curves that bow up to the top left of the plot.
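This ranking view of skill can be computed directly. The sketch below compares every positive example with every negative example and reports the fraction of pairs where the positive gets the higher score, counting ties as half. The function name `pairwise_auc` and the data are illustrative; to the best of my understanding this quantity matches the area under the ROC curve.

```python
def pairwise_auc(y_true, probs):
    """Fraction of (positive, negative) pairs where the positive is scored higher."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# small made-up example
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, probs))

# a model that gives every example the same score has no skill
print(pairwise_auc(y_true, [0.5, 0.5, 0.5, 0.5]))
```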

A model with no skill is represented at the point [0.5, 0.5]. A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5.

A model with perfect skill is represented at a point [0.0, 1.0]. A model with perfect skill is represented by a line that travels from the bottom left of the plot to the top left and then across the top to the top right.
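To make the construction of the curve concrete, here is a from-scratch sketch that sweeps the threshold over the observed scores, records one (false positive rate, true positive rate) point per threshold, and approximates the area with the trapezoidal rule. The helper names `roc_points` and `trapezoid_area` and the data are assumptions for illustration; in practice the scikit-learn functions would be used instead.

```python
def roc_points(y_true, probs):
    """Return (fpr, tpr) lists with one point per candidate threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # start above every score: nothing is predicted positive
    fpr, tpr = [0.0], [0.0]
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def trapezoid_area(x, y):
    """Approximate the area under the points (x, y) with the trapezoidal rule."""
    return sum((x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0 for i in range(1, len(x)))

# small made-up example
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, probs)
area = trapezoid_area(fpr, tpr)
print(fpr)
print(tpr)
print('AUC: %.3f' % area)
```

The curve always starts at the bottom left (nothing predicted positive) and ends at the top right (everything predicted positive).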

An operator may plot the ROC curve for the final model and choose a threshold that gives a desirable balance between the false positives and false negatives.
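One possible way to turn that choice into code is sketched below: pick the highest threshold that still reaches a required true positive rate, which also gives the lowest false positive rate among the thresholds that qualify. The requirement `min_tpr`, the helper `pick_threshold` and the data are hypothetical; other selection rules are equally valid.

```python
def pick_threshold(y_true, probs, min_tpr):
    """Return (threshold, tpr, fpr) for the highest threshold with tpr >= min_tpr."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # thresholds in decreasing order: TPR and FPR can only grow as we go
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
        if tp / n_pos >= min_tpr:
            return threshold, tp / n_pos, fp / n_neg
    return None  # only reachable if min_tpr is greater than 1.0

# hypothetical true outcomes and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90]

# require that every positive case is caught
print(pick_threshold(y_true, probs, min_tpr=1.0))
# accept missing some positives in exchange for fewer false alarms
print(pick_threshold(y_true, probs, min_tpr=0.6))
```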

## ROC Curves and AUC in Python

We can plot a ROC curve for a model in Python using the *roc_curve()* scikit-learn function.

The function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. The function returns the false positive rates for each threshold, true positive rates for each threshold and thresholds.

```python
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y, probs)
```

The AUC for the ROC can be calculated using the *roc_auc_score()* function.

Like the *roc_curve()* function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. It returns the AUC score between 0.0 and 1.0, where 0.5 indicates no skill and 1.0 indicates perfect skill.

```python
# calculate AUC
auc = roc_auc_score(y, probs)
print('AUC: %.3f' % auc)
```

A complete example of calculating the ROC curve and AUC for a kNN model on a small test problem is listed below.

```python
# roc curve and auc
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(testy, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(testy, probs)
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
```

Running the example prints the area under the ROC curve.

```
AUC: 0.895
```

A plot of the ROC curve for the model is also created showing that the model has skill.

## What Are Precision-Recall Curves?

There are many ways to evaluate the skill of a prediction model.

An approach in the related field of information retrieval (finding documents based on queries) measures precision and recall.

These measures are also useful in applied machine learning for evaluating binary classification models.

Precision is the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Precision is also referred to as the positive predictive value.

```
Positive Predictive Value = True Positives / (True Positives + False Positives)
```

or

```
Precision = True Positives / (True Positives + False Positives)
```

Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

```
Recall = True Positives / (True Positives + False Negatives)
```

or

```
Sensitivity = True Positives / (True Positives + False Negatives)
```

```
Recall == Sensitivity
```

Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).

The reason for this is that typically the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives.

Key to the calculation of precision and recall is that the calculations do not make use of the true negatives. They are only concerned with the correct prediction of the minority class, class 1.
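A short sketch makes the point about true negatives explicit: the function below needs only three of the four confusion-matrix counts. The name `precision_recall` and the counts are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from counts; true negatives are not needed."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# hypothetical counts from a confusion matrix
precision, recall = precision_recall(tp=30, fp=10, fn=20)
print('Precision=%.2f Recall=%.2f' % (precision, recall))
```

Adding any number of correctly rejected class 0 examples to the test set would leave both values unchanged.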

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.

The no-skill line is defined by the total number of positive cases divided by the total number of positive and negative cases. For a dataset with an equal number of positive and negative cases, this is a horizontal line at a precision of 0.5. Points above this line show skill.

A model with perfect skill is depicted as a point at [1.0,1.0]. A skilful model is represented by a curve that bows towards [1.0,1.0] above the flat line of no skill.
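The height of the no-skill line can be computed from the labels alone, as in this small sketch. The helper name `no_skill_precision` and the two label lists are illustrative assumptions.

```python
def no_skill_precision(y_true):
    """Precision of a no-skill model: the fraction of positive examples."""
    return sum(y_true) / len(y_true)

# hypothetical label sets
balanced = [0] * 500 + [1] * 500
imbalanced = [0] * 900 + [1] * 100

print('balanced no-skill precision: %.2f' % no_skill_precision(balanced))
print('imbalanced no-skill precision: %.2f' % no_skill_precision(imbalanced))
```

This is why the baseline for a precision-recall curve moves with the class balance, unlike the fixed diagonal of a ROC curve.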

There are also composite scores that attempt to summarize the precision and recall; three examples include:

- **F score** or F1 score: that calculates the harmonic mean of the precision and recall (harmonic mean because the precision and recall are ratios).
- **Average precision**: that summarizes the weighted increase in precision with each change in recall for the thresholds in the precision-recall curve.
- **Area Under Curve**: like the AUC, summarizes the integral or an approximation of the area under the precision-recall curve.

In terms of model selection, F1 summarizes model skill for a specific probability threshold, whereas average precision and area under curve summarize the skill of a model across thresholds, like ROC AUC.
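The difference between a single-threshold score and a cross-threshold score can be sketched by hand. Below, `f1` works on one precision/recall pair, while `average_precision` sums precision weighted by the increase in recall at each threshold, which is how I understand the usual definition. All helper names and data are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall at one threshold."""
    return 2.0 * precision * recall / (precision + recall)

def pr_points(y_true, probs):
    """Return (recall, precision) lists for thresholds in decreasing order."""
    n_pos = sum(y_true)
    recall, precision = [], []
    for threshold in sorted(set(probs), reverse=True):
        tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
        fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
        recall.append(tp / n_pos)
        precision.append(tp / (tp + fp))
    return recall, precision

def average_precision(recall, precision):
    """Sum of precision values weighted by the increase in recall."""
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# F1 for one hypothetical operating point
print('F1: %.3f' % f1(0.75, 0.6))

# average precision across all thresholds of a small made-up example
y_true = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
recall, precision = pr_points(y_true, probs)
ap = average_precision(recall, precision)
print('AP: %.3f' % ap)
```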

This makes precision, recall, a plot of precision vs. recall, and the summary measures useful tools for binary classification problems that have an imbalance in the observations for each class.

## Precision-Recall Curves in Python

Precision and recall can be calculated in scikit-learn via the *precision_score()* and *recall_score()* functions.

The precision and recall can be calculated for thresholds using the *precision_recall_curve()* function, which takes the true output values and the probabilities for the positive class as input and returns the precision, recall and threshold values.

```python
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, probs)
```

The F1 score can be calculated by calling the *f1_score()* function that takes the true class values and the predicted class values as arguments.

```python
# calculate F1 score
f1 = f1_score(testy, yhat)
```

The area under the precision-recall curve can be approximated by calling the *auc()* function and passing it the recall and precision values calculated for each threshold.

```python
# calculate precision-recall AUC (stored under a new name so auc() is not overwritten)
pr_auc = auc(recall, precision)
```

Finally, the average precision can be calculated by calling the *average_precision_score()* function and passing it the true class values and the predicted probabilities for the positive class.

The complete example is listed below.

```python
# precision-recall curve and f1
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
yhat = model.predict(testX)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, probs)
# calculate F1 score
f1 = f1_score(testy, yhat)
# calculate precision-recall AUC (stored under a new name so auc() is not overwritten)
pr_auc = auc(recall, precision)
# calculate average precision score
ap = average_precision_score(testy, probs)
print('f1=%.3f auc=%.3f ap=%.3f' % (f1, pr_auc, ap))
# plot no skill
pyplot.plot([0, 1], [0.5, 0.5], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(recall, precision, marker='.')
# show the plot
pyplot.show()
```

Running the example first prints the F1, area under curve (AUC) and average precision (AP) scores.

```
f1=0.836 auc=0.892 ap=0.840
```

The precision-recall curve plot is then created showing the precision/recall for each threshold compared to a no skill model.

## When to Use ROC vs. Precision-Recall Curves?

Generally, the use of ROC curves and precision-recall curves is as follows:

- ROC curves should be used when there are roughly equal numbers of observations for each class.
- Precision-Recall curves should be used when there is a moderate to large class imbalance.

The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance.

However, ROC curves can present an overly optimistic view of an algorithm’s performance if there is a large skew in the class distribution. […] Precision-Recall (PR) curves, often used in Information Retrieval, have been cited as an alternative to ROC curves for tasks with a large skew in the class distribution.

— The Relationship Between Precision-Recall and ROC Curves, 2006.

Some go further and suggest that using a ROC curve with an imbalanced dataset might be deceptive and lead to incorrect interpretations of the model skill.

[…] the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. [Precision-recall curve] plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions

— The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.

The main reason for this optimistic picture is the use of true negatives in the False Positive Rate in the ROC Curve and the careful avoidance of this rate in the Precision-Recall curve.

If the proportion of positive to negative instances changes in a test set, the ROC curves will not change. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. As a class distribution changes these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions.

— ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.
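The claim in the quote can be checked with a small sketch: keep the scores fixed, add ten times as many negative examples with the same score distribution, and compare the rates at one threshold. The helper `rates_and_precision`, the threshold of 0.5 and the data are assumptions for illustration.

```python
def rates_and_precision(y_true, probs, threshold=0.5):
    """Return (tpr, fpr, precision) at the given threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < threshold)
    return tp / (tp + fn), fp / (fp + tn), tp / (tp + fp)

# small made-up balanced test set
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1]

# skewed copy: the same negatives repeated so there are 10x as many
neg_probs = [p for y, p in zip(y_true, probs) if y == 0]
y_skew = y_true + [0] * (len(neg_probs) * 9)
probs_skew = probs + neg_probs * 9

balanced = rates_and_precision(y_true, probs)
skewed = rates_and_precision(y_skew, probs_skew)
print('balanced: tpr=%.3f fpr=%.3f precision=%.3f' % balanced)
print('skewed:   tpr=%.3f fpr=%.3f precision=%.3f' % skewed)
```

The true and false positive rates are unchanged by the skew, so the ROC point does not move, while precision falls because the same false positive rate now produces many more false positives.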

We can make this concrete with a short example.

Below is the same ROC Curve example with a modified problem where there is a 10:1 ratio of class=0 to class=1 observations.

```python
# roc curve and auc on imbalanced dataset
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9,0.09], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# calculate AUC
auc = roc_auc_score(testy, probs)
print('AUC: %.3f' % auc)
# calculate roc curve
fpr, tpr, thresholds = roc_curve(testy, probs)
# plot no skill
pyplot.plot([0, 1], [0, 1], linestyle='--')
# plot the roc curve for the model
pyplot.plot(fpr, tpr, marker='.')
# show the plot
pyplot.show()
```

Running the example suggests that the model has skill.

```
AUC: 0.713
```

Indeed, it has skill, but much of that skill is measured as making correct true negative predictions and there are a lot of negative predictions to make.

A plot of the ROC Curve confirms the AUC interpretation of a skilful model for most probability thresholds.

We can also repeat the test of the same model on the same dataset and calculate a precision-recall curve and statistics instead.

The complete example is listed below.

```python
# precision-recall curve and auc
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9,0.09], random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(trainX, trainy)
# predict probabilities
probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
probs = probs[:, 1]
# predict class values
yhat = model.predict(testX)
# calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(testy, probs)
# calculate F1 score
f1 = f1_score(testy, yhat)
# calculate precision-recall AUC (stored under a new name so auc() is not overwritten)
pr_auc = auc(recall, precision)
# calculate average precision score
ap = average_precision_score(testy, probs)
print('f1=%.3f auc=%.3f ap=%.3f' % (f1, pr_auc, ap))
# plot no skill
pyplot.plot([0, 1], [0.1, 0.1], linestyle='--')
# plot the precision-recall curve for the model
pyplot.plot(recall, precision, marker='.')
# show the plot
pyplot.show()
```

Running the example first prints the F1, AUC and AP scores.

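At first glance the scores do not look encouraging. Note, however, that the no-skill baseline for a precision-recall curve is the fraction of positive examples (about 0.1 for this dataset), not 0.5 as with ROC AUC, so the scores should be judged against that lower baseline.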

```
f1=0.278 auc=0.302 ap=0.236
```

From the plot, we can see that precision and recall crash fast.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- A critical investigation of recall and precision as measures of retrieval system performance, 1989.
- The Relationship Between Precision-Recall and ROC Curves, 2006.
- The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.
- ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.

### API

- sklearn.metrics.roc_curve API
- sklearn.metrics.roc_auc_score API
- sklearn.metrics.precision_recall_curve API
- sklearn.metrics.auc API
- sklearn.metrics.average_precision_score API
- Precision-Recall, scikit-learn
- Precision, recall and F-measures, scikit-learn

### Articles

- Receiver operating characteristic on Wikipedia
- Sensitivity and specificity on Wikipedia
- Precision and recall on Wikipedia
- Information retrieval on Wikipedia
- F1 score on Wikipedia
- ROC and precision-recall with imbalanced datasets, blog.

## Summary

In this tutorial, you discovered ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems.

Specifically, you learned:

- ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
- Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
- ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

I don’t think a diagonal straight line is the right baseline for P/R curve. The baseline “dumb” classifier should be a straight line with precision=positive%

You’re right, thanks!

Fixed.

Thanks for the intro on the topic:

The following line raises questions:

>>> The scores do not look encouraging, given skilful models are generally above 0.5.

The baseline of a random model is n_positive/(n_positive+n_negative). Or just the fraction of positives, so it makes sense to compare auc of precision-recall curve to that.

Sorry, I don’t follow your question. Can you elaborate?

I’m sorry I was not clear enough above.

Here’s what I meant:

for ROC the auc of the random model is 0.5.

for PR curve the auc of the random model is n_positive/(n_positive+n_negative).

Perhaps it would make sense to highlight that the PR auc should be compared to n_positive/(n_positive+n_negative)?

In the first reading the phrase

>>> The scores do not look encouraging, given skilful models are generally above 0.5.

in the context of PR curve auc looked ambiguous.

Thank you!

Thanks.

I second Alexander on this.

The random classifier in PR curve gives an AUPR of 0.1

The AUPR is better for imbalanced datasets because it shows that there is still room for improvement, while AUROC seems saturated.

Great tutorial.

Thanks!

How about the Matthews Correlation Coefficient?

I’ve not used it, got some refs?

This is a nice simple explanation

https://lettier.github.io/posts/2016-08-05-matthews-correlation-coefficient.html

I have also been advised that in the field of horse racing ratings produced using ML if you already have probabilistic outputs, then it makes much more sense to use a metric directly on the probabilities themselves (eg: McFadden’s pseudo-R^2, Brier score, etc).

Thanks.

Do you not think that a model with no skill (which I assume means a random coin toss) should have an AUC of 0.5 and not 0.0?

A ROC AUC of 0.0 means that the model is perfectly incorrect.

A ROC AUC of 0.5 would be a naive model.

I do not know what you mean by a naive model. Going by what you’ve used to describe a model with no skill, it should have an AUC of 0.5 while a model that perfectly misclassifies every point will have an AUC of 0.

Perfectly misclassifying every point is just as hard as perfectly classifying every point.

A naive model is still right sometimes. The most common naive model always predicts the most common class, and such a model will have a minimum AUC of 0.5.

Excellent point, thanks!

Thanks for explaining the difference in simpler way.

I’m happy it helped.

there’s a typo here, should be “is”:

“A common way to compare models that predict probabilities for two-class problems us to use a ROC curve.”

nice article 🙂 thanks for sharing!

Thanks, fixed!

“To make this clear:

Larger values on the x-axis of the plot indicate higher true positives and lower false negatives.

Smaller values on the y-axis of the plot indicate lower false positives and higher true negatives.”

Should swap x & y in this description of ROC curves??

You’re right, fixed. Thanks!

Hi, Thanks for the nice tutorial 🙂

I have one comment though.

you have written that ‘A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.0.’

I think AUC is the area under the curve of ROC. According to your explanation (diagonal line from the bottom left of the plot to the top right), the area under the diagonal line that passes through (0.5, 0.5) is 0.5 and not 0. Thus in this case AUC = 0.5(?)

Maybe I misunderstood sth here.

You’re correct, fixed.

Hi Jason.

I went through your nice tutorial again and a question came to my mind.

Within sklearn, it is possible that we use the average precision score to evaluate the skill of the model (applied on highly imbalanced dataset) and perform cross validation. For some ML algorithms like Lightgbm we can not use such a metric for cross validation, instead there are other metrics such as binary logloss.

The question is: is binary logloss as good a metric as average precision score for this kind of imbalanced problem?

Yes, log loss (cross entropy) can be a good measure for imbalanced classes. It captures the difference in the predicted and actual probability distributions.

Hi Jason,

Thank you for a summary.

Your statement

“Generally, the use of ROC curves and precision-recall curves are as follows:

* ROC curves should be used when there are roughly equal numbers of observations for each class.

* Precision-Recall curves should be used when there is a moderate to large class imbalance.”

…is misleading, if not just wrong. Even articles you cite do not say that.

Usually it is advised to use PRC in addition to ROC for highly imbalanced datasets, which means for datasets with a ratio of positives to negatives less than 1:100 or so. Moreover, high ideas around PRC are aimed at having no negatives for high values of scores, only positives. It just might not be the goal of the study and classifier. Also, as mentioned in one of the articles you cite, AUROC can be misleading even for balanced datasets, as it “weights” equally true positives and true negatives. I would also mention that AUROC is an estimator of the “probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one” and that it is related to Mann–Whitney U test.

To sum it up, I would always recommend to

1) Use AUROC, AUPRC, accuracy and any other metrics which are relevant to the goals of the study

2) Plot distributions of positives and negatives and analyse it

Let me know what you think

Thanks for the note.

Hi Jason,

in these examples, you always use APIs, so all of them have calculated functions. But I don’t understand how to use the equations, for example:

True Positive Rate = True Positives / (True Positives + False Negatives)

this ‘True Positives’ are all single float numbers, then how we have array to plot?

(True Positives + False Negatives): is sum of total final predicted of test data?

I really get confused when calculating by hand

They are counts, e.g. the number of examples that were true positives, etc.

Can you please explain how to plot roc curve for multilabel classification.

Generally, ROC Curves are not used for multi-label classification, as far as I know.

Hi Jason,

I’ve plotted ROC which you can see in the following link but I don’t know why it’s not like a real ROC.

Could you please check it out and let me know what could be my mistake?

https://imgur.com/a/WWq0bl2

```python
hist = model.fit(x_train, y_train, batch_size=10, epochs=10, verbose=2)
y_predic = model.predict(x_test)
y_predic = (y_predic > 0.5)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predic)
plt.figure()
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False positive rate', fontsize=16)
plt.ylabel('True positive rate', fontsize=16)
plt.title('ROC curve', fontsize=16)
plt.legend(loc='best', fontsize=14)
plt.show()
```

I’m happy to answer questions, but I don’t have the capacity to debug your code sorry.

Thanks a lot for your reply.

No, I meant if it’s possible please check the plot and let me know your idea about it.

Hi Jason,

sorry, there’s a little confusing here,

we generate 2 classes dataset, why we use n_neighbors=3?

appreciate your help.

Alex

Yes, 2 classes is unrelated to the number of samples (k=3) used in kNN.

A dataset is comprised of many examples or rows of data, some will belong to class 0 and some to class 1. We will look at 3 samples in kNN to choose the class of a new example.

Hi, Jason, on top of this part of the code, you mentioned that “A complete example of calculating the ROC curve and AUC for a logistic regression model on a small test problem is listed below”. Is the KNN considered a “logistic regression”? I’m a little confused.

Looks like a typo. Fixed. Thanks!

Hi Jason, thank you for your excellent tutorials!

Is it EXACTLY the same to judge a model by PR-AUC vs F1-score? since both metrics rely exclusively on Precision and Recall? or am I missing something here?

thanks!

I don’t think so, off the cuff.

Nice post — what inferences may we make for a particular segment of a PR curve that is monotonically increasing (i.e. as recall increases, precision increases) vs another segment where the PR curve is monotonically decreasing (i.e. as recall increases, precision decreases)?

In the PR curve, it should be decreasing, never increasing – it will always have the same general shape downward.

If not, it might be a case of poorly calibrated predictions/model or highly imbalanced data (e.g. like in the tutorial) resulting in an artefact in the precision/recall relationship.

I have been thinking about the same, https://stats.stackexchange.com/questions/183504/are-precision-and-recall-supposed-to-be-monotonic-to-classification-threshold the first answer here has a simple demonstration of why the y-axis (precision) is not monotonically decreasing while x-axis(recall) is monotonically increasing while threshold decreases, because at each threshold step, either the numerator or denominator may grow for precision, but only the numerator may grow for recall.
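The argument in that answer can be checked with a small sketch. As the threshold is lowered one score at a time, recall never decreases, but precision can move in either direction. The helper `precision_recall_at` and the data are made up for illustration.

```python
def precision_recall_at(y_true, probs, threshold):
    """Return (precision, recall) at the given threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    return tp / (tp + fp), tp / (tp + fn)

# hypothetical labels and scores, already sorted by decreasing score
y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.5]

precisions, recalls = [], []
for threshold in sorted(set(probs), reverse=True):
    precision, recall = precision_recall_at(y_true, probs, threshold)
    precisions.append(precision)
    recalls.append(recall)
    print('threshold=%.1f precision=%.3f recall=%.3f' % (threshold, precision, recall))
```

On this toy data precision drops when a negative is admitted and rises again when the next positives are admitted, so a precision-recall curve can contain rising segments.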

Thanks for sharing.

Hi Jason,

Great stuff as usual. Just a small thing that may cause slight confusion: in the code for all the precision-recall curves, the comment refers to a ROC curve.

# plot the roc curve for the model

pyplot.plot(recall, precision, marker='.')

Regards

Gerry

Thanks, fixed!

Hi Jason,

Thanks for the article! You always write articles on topics I have trouble finding answers to anywhere else. This is an awesome summary! A quick question: when you used the ‘smog system’ as an example to describe FP vs. FN cost, did you mean we should be more concerned about HIGH FN than HIGH FP? Correct me if I did not get what you meant.

Regards,

Sunny

Thanks.

Yes, it might be confusing. I was saying we want (are concerned with) low false neg, not false pos.

High false neg is a problem, high false pos is less of a problem.

Hi Jason,

How do we decide on what is a good operating point for precision% and recall %? I know it depends on the use case, but can you give your thoughts on how to approach it?

Thanks!

Yes, establish a baseline score with a naive method, and compare more sophisticated methods to the baseline.

Great post. Thank you Jason.

One query: what is the difference between the area under the PR curve and the average precision score? Both have similar definitions, I guess.

Also what approach do you recommend for selecting a threshold from the precision-recall curve, like the way we can use Youden’s index for ROC curve?

I’d recommend looking at the curve for your model and choose a point where the trade off makes sense for your domain/stakeholders.

Great question. They are similar.

I can’t give a good answer off the cuff; I’d have to write about it and go through worked examples.

I am guessing that the average precision score and the area under the precision-recall curve measure the same thing; the difference arises in the way the two metrics are calculated. The documentation page for auc says:

“Compute Area Under the Curve (AUC) using the trapezoidal rule

This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score. For an alternative way to summarize a precision-recall curve, see average_precision_score.”

So I guess auc finds the area under any curve using the trapezoidal rule, which is not how average_precision_score is computed.
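The two ways of summarizing a precision-recall curve can be compared side by side in a rough pure-Python sketch. The helpers `pr_points`, `trapezoid_area` and `average_precision` are hypothetical and written only for illustration. The step-wise sum follows the usual definition of average precision (each precision value weighted by the increase in recall), and the curve is assumed to start at recall 0 with precision 1. Values from scikit-learn may differ in detail.

```python
def pr_points(y_true, scores):
    """(recall, precision) at each distinct score threshold, highest threshold first."""
    n_pos = sum(y_true)
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((tp / n_pos, tp / (tp + fp)))
    return points

def trapezoid_area(points):
    """Area under the PR curve by linear interpolation between neighboring points."""
    pts = [(0.0, 1.0)] + points  # assumed starting point: recall 0, precision 1
    return sum((r1 - r0) * (p0 + p1) / 2 for (r0, p0), (r1, p1) in zip(pts, pts[1:]))

def average_precision(points):
    """Step-wise sum: each precision weighted by the recall gained at that threshold."""
    total, prev_recall = 0.0, 0.0
    for recall, precision in points:
        total += (recall - prev_recall) * precision
        prev_recall = recall
    return total

# Hypothetical labels and scores.
y_true = [1, 0, 1, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

points = pr_points(y_true, scores)
area = trapezoid_area(points)
ap = average_precision(points)
print(f"trapezoidal area={area:.4f} average precision={ap:.4f}")
```

On this toy data the two numbers are close but not equal, because the trapezoidal rule interpolates linearly between points while the step-wise sum does not interpolate.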

Thanks for the nice and clear post 🙂

Shouldn’t it be “false negatives” instead of “false positives” in the following phrase:

“here is a tension between these options, the same with true negative and false positives.”

I think you’re right. Fixed.

Thanks for the nice and clear article.

I used a GaussianNB model and got the thresholds [2.00000e+000, 1.00000e+000, 1.00000e+000, 9.59632e-018 ].

Is it normal for the thresholds to have such very small values?

Thanks in advance.

This line makes no sense to me at all: “Indeed, it has skill, but much of that skill is measured as making correct false negative predictions”

What is a “correct false negative”? To my current understanding, the “correct” predictions consist of TP and TN, not FP or FN. If it’s correct, why is it false? If it’s false, how can it be correct?

Could you explain: correct according to what? y_true or something else?

Looks like a typo, I believe I wanted to talk about true negatives, e.g. the abundant class.

Fixed. Thanks.

Great post, I found this very intuitive.

But why keep probabilities for the positive outcome only for the precision_recall_curve?

I tried with the probabilities for the negative class and the plot was weird. Could you please explain the intuition behind using the probabilities for the positive outcome and not those for the negative outcome?

Actually, scikit-learn’s predict_proba() predicts a probability for each class for every row, and the probabilities in a row sum to 1. In the binary classification case, it predicts the probability that an example is negative and the probability that it is positive; the second column holds the probability that the example belongs to the positive class.

When we pass only the positive-class probability, the ROC evaluation sweeps over different thresholds: if a given probability > threshold (say 0.5), the example is assigned to the positive class; otherwise it is assigned to the negative class. Repeating this over the different thresholds gives the ROC curve and, from it, the roc_auc score.
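A small sketch of that thresholding step may help. The `probs` list below is hypothetical data laid out like the output of predict_proba() for a binary problem, and `classify` is an illustrative helper, not part of scikit-learn.

```python
# Hypothetical rows shaped like predict_proba() output: [P(class 0), P(class 1)].
probs = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
y_true = [0, 1, 0, 1]

def classify(scores, threshold):
    """Label an example positive (1) when its score exceeds the threshold."""
    return [1 if s > threshold else 0 for s in scores]

# Second column: probability of the positive class.
pos_scores = [row[1] for row in probs]
# First column: probability of the negative class.
neg_scores = [row[0] for row in probs]

pred_from_pos = classify(pos_scores, 0.5)
pred_from_neg = classify(neg_scores, 0.5)

print("using positive-class column:", pred_from_pos)
print("using negative-class column:", pred_from_neg)
```

Because each row sums to 1, scoring with the negative-class column reverses the ranking of the examples, so the thresholded labels come out flipped. That reversal is a likely reason a curve built from the negative-class probabilities looks odd.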

Thanks for explaining the ROC curve. I would like to ask how I can compare the ROC curves of many algorithms, e.g. SVM, kNN, RandomForest and so on.

Typically they are all plotted together.

You can also compare the Area under the ROC Curve for each algorithm.
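As a rough sketch of the second suggestion, the snippet below computes the area under the ROC curve for two sets of scores and picks the larger. The model names and scores are invented for illustration, and `roc_auc` is a hypothetical pure-Python helper. It relies on the known equivalence between ROC AUC and the probability that a randomly chosen positive example is scored above a randomly chosen negative one, rather than on the scikit-learn functions used in the tutorial.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair counts 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical test labels and the positive-class scores from two models.
y_true = [0, 0, 1, 1, 0, 1]
model_scores = {
    "model_a": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9],
    "model_b": [0.5, 0.6, 0.4, 0.7, 0.3, 0.55],
}

# One AUC per model, computed on the same test labels.
aucs = {name: roc_auc(y_true, s) for name, s in model_scores.items()}
best = max(aucs, key=aucs.get)

for name, value in aucs.items():
    print(f"{name}: AUC={value:.3f}")
print("highest AUC:", best)
```

For the comparison to be fair, every model should be scored on the same held-out examples.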

Can anyone explain the significance of the average precision score?

Yes, see this:

https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision

Thanks a lot for this tutorial. There are actually not a lot of resources like this.

Thanks, I’m glad it helped!