When you build a model for a classification problem you almost always want to look at the accuracy of that model as the number of correct predictions from all predictions made.
This is the classification accuracy.
In a previous post, we have looked at evaluating the robustness of a model for making predictions on unseen data using cross-validation and multiple cross-validation where we used classification accuracy and average classification accuracy.
Once you have a model that you believe can make robust predictions you need to decide whether it is a good enough model to solve your problem. Classification accuracy alone is typically not enough information to make this decision.
In this post, we will look at Precision and Recall performance measures you can use to evaluate your model for a binary classification problem.
Recurrence of Breast Cancer
The breast cancer dataset is a standard machine learning dataset. It contains 9 attributes describing 286 women that have suffered and survived breast cancer and whether or not breast cancer recurred within 5 years.
It is a binary classification problem. Of the 286 women, 201 did not suffer a recurrence of breast cancer, leaving the remaining 85 that did.
I think that False Negatives are probably worse than False Positives for this problem. Do you agree? More detailed screening can clear the False Positives, but False Negatives are sent home and lost to follow-up evaluation.
Classification accuracy is our starting point. It is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.
All No Recurrence
A model that only predicted no recurrence of breast cancer would achieve an accuracy of (201/286)*100 or 70.28%. We’ll call this our “All No Recurrence”. This is a high accuracy, but a terrible model. If it was used alone for decision support to inform doctors (impossible, but play along), it would send home 85 women with incorrectly thinking their breast cancer was not going to reoccur (high False Negatives).
A model that only predicted the recurrence of breast cancer would achieve an accuracy of (85/286)*100 or 29.72%. We’ll call this our “All Recurrence”. This model has terrible accuracy and would send home 201 women thinking that had a recurrence of breast cancer but really didn’t (high False Positives).
CART or Classification And Regression Trees is a powerful yet simple decision tree algorithm. On this problem, CART can achieve an accuracy of 69.23%. This is lower than our “All No Recurrence” model, but is this model more valuable?
We can see that classification accuracy alone is not sufficient to select a model for this problem.
For a binary classification problem the table has 2 rows and 2 columns. Across the top is the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.
In this case, a perfect classifier would correctly predict 201 no recurrence and 85 recurrence which would be entered into the top left cell no recurrence/no recurrence (True Negatives) and bottom right cell recurrence/recurrence (True Positives).
Incorrect predictions are clearly broken down into the two other cells. False Negatives which are recurrence that the classifier has marked as no recurrence. We do not have any of those. False Positives are no recurrence that the classifier has marked as recurrence.
This is a useful table that presents both the class distribution in the data and the classifiers predicted class distribution with a breakdown of error types.
All No Recurrence Confusion Matrix
The confusion matrix highlights the large number (85) of False Negatives.
All Recurrence Confusion Matrix
The confusion matrix highlights the large number (201) of False Positives.
CART Confusion Matrix
This looks like a more valuable classifier because it correctly predicted 10 recurrence events as well as 188 no recurrence events. The model also shows a modest number of False Negatives (75) and False Positives (13).
As we can see in this example, accuracy can be misleading. Sometimes it may be desirable to select a model with a lower accuracy because it has a greater predictive power on the problem.
For example, in a problem where there is a large class imbalance, a model can predict the value of the majority class for all predictions and achieve a high classification accuracy, the problem is that this model is not useful in the problem domain. As we saw in our breast cancer example.
This is called the Accuracy Paradox. For problems like, this additional measures are required to evaluate a classifier.
Precision is the number of True Positives divided by the number of True Positives and False Positives. Put another way, it is the number of positive predictions divided by the total number of positive class values predicted. It is also called the Positive Predictive Value (PPV).
Precision can be thought of as a measure of a classifiers exactness. A low precision can also indicate a large number of False Positives.
- The precision of the All No Recurrence model is 0/(0+0) or not a number, or 0.
- The precision of the All Recurrence model is 85/(85+201) or 0.30.
- The precision of the CART model is 10/(10+13) or 0.43.
The precision suggests CART is a better model and that the All Recurrence is more useful than the All No Recurrence model even though it has a lower accuracy. The difference in precision between the All Recurrence model and the CART can be explained by the large number of False Positives predicted by the All Recurrence model.
Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives. Put another way it is the number of positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.
Recall can be thought of as a measure of a classifiers completeness. A low recall indicates many False Negatives.
- The recall of the All No Recurrence model is 0/(0+85) or 0.
- The recall of the All Recurrence model is 85/(85+0) or 1.
- The recall of CART is 10/(10+75) or 0.12.
As you would expect, the All Recurrence model has a perfect recall because it predicts “recurrence” for all instances. The recall for CART is lower than that of the All Recurrence model. This can be explained by the large number (75) of False Negatives predicted by the CART model.
The F1 Score is the 2*((precision*recall)/(precision+recall)). It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between the precision and the recall.
- The F1 for the All No Recurrence model is 2*((0*0)/0+0) or 0.
- The F1 for the All Recurrence model is 2*((0.3*1)/0.3+1) or 0.46.
- The F1 for the CART model is 2*((0.43*0.12)/0.43+0.12) or 0.19.
If we were looking to select a model based on a balance between precision and recall, the F1 measure suggests that All Recurrence model is the one to beat and that CART model is not yet sufficiently competitive.
In this post, you learned about the Accuracy Paradox and problems with a class imbalance when Classification Accuracy alone cannot be trusted to select a well-performing model.
Through example, you learned about the Confusion Matrix as a way of describing the breakdown of errors in predictions for an unseen dataset. You learned about measures that summarize the precision (exactness) and recall (completeness) of a model and a description of the balance between the two in the F1 Score.