Classification Accuracy is Not Enough: More Performance Measures You Can Use

When you build a model for a classification problem, you almost always want to look at its accuracy: the number of correct predictions as a proportion of all predictions made.

This is the classification accuracy.

In a previous post, we have looked at evaluating the robustness of a model for making predictions on unseen data using cross-validation and multiple cross-validation where we used classification accuracy and average classification accuracy.

Once you have a model that you believe can make robust predictions you need to decide whether it is a good enough model to solve your problem. Classification accuracy alone is typically not enough information to make this decision.

Classification Accuracy
Photo by Nina Matthews Photography, some rights reserved

In this post, we will look at Precision and Recall performance measures you can use to evaluate your model for a binary classification problem.

Recurrence of Breast Cancer

The breast cancer dataset is a standard machine learning dataset. It contains 9 attributes describing 286 women that have suffered and survived breast cancer and whether or not breast cancer recurred within 5 years.

It is a binary classification problem. Of the 286 women, 201 did not suffer a recurrence of breast cancer, leaving the remaining 85 that did.

I think that False Negatives are probably worse than False Positives for this problem. Do you agree? More detailed screening can clear the False Positives, but False Negatives are sent home and lost to follow-up evaluation.

Classification Accuracy

Classification accuracy is our starting point. It is the number of correct predictions made divided by the total number of predictions made, multiplied by 100 to turn it into a percentage.

All No Recurrence

A model that only predicted no recurrence of breast cancer would achieve an accuracy of (201/286)*100 or 70.28%. We’ll call this our “All No Recurrence” model. This is a high accuracy, but a terrible model. If it were used alone for decision support to inform doctors (impossible, but play along), it would send home 85 women incorrectly thinking their breast cancer was not going to recur (high False Negatives).

All Recurrence

A model that only predicted the recurrence of breast cancer would achieve an accuracy of (85/286)*100 or 29.72%. We’ll call this our “All Recurrence” model. This model has terrible accuracy and would send home 201 women thinking that they had a recurrence of breast cancer when they really didn’t (high False Positives).

CART

CART or Classification And Regression Trees is a powerful yet simple decision tree algorithm. On this problem, CART can achieve an accuracy of 69.23%. This is lower than our “All No Recurrence” model, but is this model more valuable?

We can see that classification accuracy alone is not sufficient to select a model for this problem.
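
The three accuracy figures above can be reproduced with a few lines of Python. This is a minimal sketch: the function and variable names are illustrative, and the number of correct CART predictions (188 + 10) is taken from the CART confusion matrix discussed further down in this post.

```python
# Class counts from the dataset description: 201 "no recurrence", 85 "recurrence".
NO_RECURRENCE = 201
RECURRENCE = 85
TOTAL = NO_RECURRENCE + RECURRENCE


def accuracy_percent(correct, total):
    """Return classification accuracy as a percentage."""
    return correct / total * 100


# Number of correct predictions made by each model.
correct_by_model = {
    "All No Recurrence": NO_RECURRENCE,  # right on every "no recurrence" case
    "All Recurrence": RECURRENCE,        # right on every "recurrence" case
    "CART": 188 + 10,                    # counts quoted for CART in this post
}

accuracies = {
    name: accuracy_percent(correct, TOTAL)
    for name, correct in correct_by_model.items()
}

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}%")
```

The printed values match the percentages quoted in the text, which shows how close the trivial All No Recurrence baseline and CART are on accuracy alone.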

Confusion Matrix

A clean and unambiguous way to present the prediction results of a classifier is to use a confusion matrix (also called a contingency table).

For a binary classification problem the table has 2 rows and 2 columns. Across the top are the observed class labels and down the side are the predicted class labels. Each cell contains the number of predictions made by the classifier that fall into that cell.

Truth Table Confusion Matrix

In this case, a perfect classifier would correctly predict 201 no recurrence and 85 recurrence which would be entered into the top left cell no recurrence/no recurrence (True Negatives) and bottom right cell recurrence/recurrence (True Positives).

Incorrect predictions are clearly broken down into the two other cells. False Negatives are recurrence cases that the classifier has marked as no recurrence; a perfect classifier would not have any of those. False Positives are no recurrence cases that the classifier has marked as recurrence.

This is a useful table that presents both the class distribution in the data and the classifier’s predicted class distribution with a breakdown of error types.
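
As a rough illustration, a confusion matrix can be tallied from two lists of labels using only the Python standard library. The sketch below uses the same layout as described above (predicted labels down the side, observed labels across the top). The label lists are hypothetical, sized to mimic a perfect classifier on this dataset, and the helper names are made up for this example.

```python
from collections import Counter

NEG, POS = "no recurrence", "recurrence"


def confusion_counts(observed, predicted):
    """Count how often each (predicted, observed) label pair occurs."""
    return Counter(zip(predicted, observed))


def print_matrix(counts):
    """Print the 2x2 table: predicted labels as rows, observed as columns."""
    header = ["predicted | observed", NEG, POS]
    print("".join(f"{cell:>22}" for cell in header))
    for pred in (NEG, POS):
        cells = [pred] + [str(counts[(pred, obs)]) for obs in (NEG, POS)]
        print("".join(f"{cell:>22}" for cell in cells))


# Hypothetical labels: a perfect classifier predicts exactly what was observed.
observed = [NEG] * 201 + [POS] * 85
predicted = list(observed)

counts = confusion_counts(observed, predicted)
print_matrix(counts)
```

Replacing `predicted` with the output of a real model fills the two off-diagonal cells with its False Negatives and False Positives.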

All No Recurrence Confusion Matrix

The confusion matrix highlights the large number (85) of False Negatives.

All No Recurrence Confusion Matrix

All Recurrence Confusion Matrix

The confusion matrix highlights the large number (201) of False Positives.

All Recurrence Confusion Matrix

CART Confusion Matrix

This looks like a more valuable classifier because it correctly predicted 10 recurrence events as well as 188 no recurrence events. The model also shows a modest number of False Negatives (75) and False Positives (13).

CART Confusion Matrix

Accuracy Paradox

As we can see in this example, accuracy can be misleading. Sometimes it may be desirable to select a model with a lower accuracy because it has a greater predictive power on the problem.

For example, in a problem where there is a large class imbalance, a model can predict the value of the majority class for all predictions and achieve a high classification accuracy. The problem is that this model is not useful in the problem domain, as we saw in our breast cancer example.

This is called the Accuracy Paradox. For problems like this, additional measures are required to evaluate a classifier.

Precision

Precision is the number of True Positives divided by the sum of True Positives and False Positives. Put another way, it is the number of correct positive predictions divided by the total number of positive predictions made. It is also called the Positive Predictive Value (PPV).

Precision can be thought of as a measure of a classifier’s exactness. A low precision can also indicate a large number of False Positives.

  • The precision of the All No Recurrence model is 0/(0+0), which is undefined (not a number); we treat it as 0.
  • The precision of the All Recurrence model is 85/(85+201) or 0.30.
  • The precision of the CART model is 10/(10+13) or 0.43.
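
The precision calculations above can be sketched in Python as follows. The function name is illustrative, and returning 0 when no positive predictions were made is a convention assumed here, since the ratio is strictly undefined in that case.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        # Assumed convention: no positive predictions means precision is 0.
        return 0.0
    return tp / predicted_positive


# (True Positives, False Positives) for each model, from the confusion matrices.
models = {
    "All No Recurrence": (0, 0),
    "All Recurrence": (85, 201),
    "CART": (10, 13),
}

precisions = {name: precision(tp, fp) for name, (tp, fp) in models.items()}

for name, value in precisions.items():
    print(f"{name}: {value:.2f}")
```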

The precision suggests CART is a better model and that the All Recurrence is more useful than the All No Recurrence model even though it has a lower accuracy. The difference in precision between the All Recurrence model and the CART can be explained by the large number of False Positives predicted by the All Recurrence model.

Recall

Recall is the number of True Positives divided by the sum of True Positives and False Negatives. Put another way, it is the number of correct positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate.

Recall can be thought of as a measure of a classifier’s completeness. A low recall indicates many False Negatives.

  • The recall of the All No Recurrence model is 0/(0+85) or 0.
  • The recall of the All Recurrence model is 85/(85+0) or 1.
  • The recall of CART is 10/(10+75) or 0.12.
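
A matching sketch for recall, under the same assumptions as before: the function name is illustrative, and the zero-denominator guard is a convention chosen for this example (it is not triggered by any of the three models, because the data always contains 85 positive cases).

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    actual_positive = tp + fn
    if actual_positive == 0:
        # Assumed convention: no positive cases in the data means recall is 0.
        return 0.0
    return tp / actual_positive


# (True Positives, False Negatives) for each model, from the confusion matrices.
models = {
    "All No Recurrence": (0, 85),
    "All Recurrence": (85, 0),
    "CART": (10, 75),
}

recalls = {name: recall(tp, fn) for name, (tp, fn) in models.items()}

for name, value in recalls.items():
    print(f"{name}: {value:.2f}")
```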

As you would expect, the All Recurrence model has a perfect recall because it predicts “recurrence” for all instances. The recall for CART is lower than that of the All Recurrence model. This can be explained by the large number (75) of False Negatives predicted by the CART model.

F1 Score

The F1 Score is 2*((precision*recall)/(precision+recall)), the harmonic mean of precision and recall. It is also called the F Score or the F Measure. Put another way, the F1 score conveys the balance between the precision and the recall.

  • The F1 for the All No Recurrence model is 2*((0*0)/(0+0)), which is undefined; we treat it as 0.
  • The F1 for the All Recurrence model is 2*((0.3*1)/(0.3+1)) or 0.46.
  • The F1 for the CART model is 2*((0.43*0.12)/(0.43+0.12)) or 0.19.

If we were looking to select a model based on a balance between precision and recall, the F1 measure suggests that the All Recurrence model is the one to beat and that the CART model is not yet sufficiently competitive.
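
The F1 values can be checked with a short Python sketch that works directly from the confusion matrix counts. The function name is illustrative, and treating an undefined precision, recall or F1 as 0 is an assumed convention. Because the sketch uses unrounded precision and recall, its intermediate values differ slightly from the rounded ones shown above, but the results agree to two decimal places.

```python
def f1_score(tp, fp, fn):
    """F1 = 2 * (precision * recall) / (precision + recall)."""
    # Assumed convention: any undefined ratio is treated as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


# (True Positives, False Positives, False Negatives) for each model.
models = {
    "All No Recurrence": (0, 0, 85),
    "All Recurrence": (85, 201, 0),
    "CART": (10, 13, 75),
}

f1_scores = {name: f1_score(*counts) for name, counts in models.items()}

for name, value in f1_scores.items():
    print(f"{name}: {value:.2f}")
```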

Summary

In this post, you learned about the Accuracy Paradox and problems with a class imbalance where Classification Accuracy alone cannot be trusted to select a well-performing model.

Through example, you learned about the Confusion Matrix as a way of describing the breakdown of errors in predictions for an unseen dataset. You learned about measures that summarize the precision (exactness) and recall (completeness) of a model and a description of the balance between the two in the F1 Score.

85 Responses to Classification Accuracy is Not Enough: More Performance Measures You Can Use

  1. Yonglin October 5, 2015 at 7:03 pm #

    Hey Jason,

    There is a spelling mistake in the first paragraph of “Confusion Matrix” section where you wrote “A clean and unambiguous way to present the prediction results of a classifier is to use a use a confusion matrix (also called a contingency table).”.

    You see two “use a”. 🙂

    Thank you.

    • mubarak December 13, 2017 at 3:30 am #

      you also made a typo by repeating “use a”

  2. vedika November 12, 2015 at 10:02 am #

    In the last point that you made,

    If we were looking to select a model based on a balance between precision and recall, the F1 measure suggests that All Recurrence model is the one to beat and that CART model is not yet sufficiently competitive.

    here do we then select the all recurrence model because it is giving a better balance of precision and recall? or do we try to get a model which performs at least better than cart?

  3. Asim November 15, 2015 at 12:28 pm #

    The F1 metric is not a suitable method of combining precision and recall if there is a class imbalance, as there is here. More appropriate would be to use the Matthew’s Correlation Coefficient (https://en.wikipedia.org/wiki/Matthews_correlation_coefficient). By my calculations the results are:

    All No Recurrence = 0
    All Recurrence = 0
    CART = 0.089

    Meaning the “All No Recurrence” and “All Recurrence” models are no better than randomly guessing, and CART is only marginally better. Unfortunately no useful models were presented in this article, but using MCC it’s possible to catch this. Here’s how to interpret MCC:

    http://stats.stackexchange.com/questions/118219/how-to-interpret-matthews-correlation-coefficient-mcc

  4. Moloy November 20, 2015 at 5:40 am #

    We know 0 is the worst value and 1 is the best value for F1 Score while choosing among the models. Is there any standard value of F1 Score (like p-value) above which we accept the model and below which we reject the model?

  5. ashish March 20, 2016 at 8:54 pm #

    what is the R Code for calculating accuracy of decision tree of cancer data

  6. Hichame Moriceau May 21, 2016 at 12:55 am #

    Hi Jason,

    [Reporting mistake in article]

    If we look at the examples of F1 score calculations we can see that there are missing parentheses at the denominator. Just reporting, you might want to update this according to the correct formula you previously stated! 🙂

    https://en.wikipedia.org/wiki/F1_score

    Best,
    Hichame

  7. Mrinal October 7, 2016 at 2:03 pm #

    I found, the precision and recall value given by the caret package in R are different from the actual definition of them in https://en.wikipedia.org/wiki/Precision_and_recall. Could you tell me why it is? In fact , I got an online confusion matrix where both results are showing. http://www.marcovanetti.com/pages/cfmatrix/ . I can’t understand which one I should use.

  8. seggs November 25, 2016 at 11:55 am #

    Balanced accuracy can be used as a better metrics than accuracy for a multi class imbalanced dataset classification task. have you tried to review that to affirm if that is correct or not. If yes, can you drop your implementation on your blog

  9. Society of Data Scientists January 5, 2017 at 8:24 am #

    It is helpful to know that the F1/F Score is a measure of how accurate a model is by using Precision and Recall following the formula of:

    F1_Score = 2 * ((Precision * Recall) / (Precision + Recall))

    Precision is commonly called positive predictive value. It is also interesting to note that the PPV can be derived using Bayes’ theorem as well.

    Precision = True Positives / (True Positives + False Positives)

    Recall is also known as the True Positive Rate and is defined as the following:

    Recall = True Positives / (True Positives + False Negatives)

  10. kara larson February 2, 2017 at 5:50 am #

    is auc better?

  11. Ankur J March 3, 2017 at 11:23 am #

    A very good explanation to a very common analytics scenario!

  12. MTHead April 5, 2017 at 7:46 pm #

    It would be useful to see how the F-beta (specifically the F2) measure would perform in your scenario, particularly as we are seeking to minimise false negatives … and how that would compare with AUC

  13. Zhenghong Lai April 21, 2017 at 1:08 am #

    Hello, I am a beginner for ML. Recently, I’m doing a project about Feature Selection. I have fineshed the most part of it. And writing the code with the help of the matlab toolboxs is OK. Now I haved learned that we can build a decision tree with the class classregtree in matlab. And we can get the Cost of Misclassification with the method test of classregtree. BUT what should I do next to get the classification accuracy? Is there any methods can get the classification accuracy? or we can calculate it by the Cost of Misclassification? Any help that you can give it to me will be appreciated.

    • Jason Brownlee April 21, 2017 at 8:38 am #

      You can make predictions on unseen data (data not used to fit the model). This will give you an estimate of the skill of the model when making predictions on new data.

      • Zhenghong Lai April 21, 2017 at 11:38 am #

        Thanks a lot

  14. Kaleb April 29, 2017 at 4:36 am #

    This article should describe Balanced Accuracy = (Recall + Specificity)/2, which addresses the data set imbalance problem. Using the 3 models above:

    The balanced accuracy of the All No Recurrence model is ((0/85)+(201/201))/2 or 0.5.
    The balanced accuracy of the All Recurrence model is ((85/85)+(0/201))/2 or 0.5.
    The balanced accuracy of the CART model is ((10/85)+(188/201))/2 or 0.53.

    Making the CART the one to choose if there are no preferences for minimizing the false positive or false negative rates

  15. Aryo Pradipta Gema April 29, 2017 at 3:48 pm #

    Hi, i’m considerably a beginner at ML especially when dealing with measuring its performance. I’ve recently tried to measure the performance of a deep learning architecture in doing a classification task. The dataset used on that task is highly imbalanced. In proportion, the first class only take 33% of the entire data in terms of amount. I tried to use Accuracy, F1, and Area Under ROC Curve. I also used StratifiedKFold for the cross validation algorithm. But, the F1 value is higher than the accuracy with 3-5% margin. The Area Under ROC Curve value is still under the accuracy. Me, and my research supervisor never saw something like this. But, i, personally, believe that it is possible. One of my hypothesis is because of the imbalance dataset that gives a smaller true negative value in the accuracy calculation. Is it possible? Is there any explanation for it?

  16. Aryo Pradipta Gema April 29, 2017 at 8:42 pm #

    Is it possible to have a lower f measure value than the accuracy if the data is imbalanced (divided into 2 classes, 33% for first class, and 67% for the second class)? I assume it’s because of the low true negative value, is it correct?

  17. Eduardo May 11, 2017 at 3:17 am #

    Hi,

    I think it would be nice including those informations in your ebook “Machine Learning Mastery with R”

    I am afraid there is a lot missing there about this subject,

    Thanks,

  18. Ali May 20, 2017 at 5:48 pm #

    I would love to see the same explanation for multi-class problems.Defining confusion matrix for binary classes is old now.

  19. Wojtek June 14, 2017 at 6:09 pm #

    When training a classifier (e.g. DNN based) with a continuous output p is it possible to specifically optimize for e.g. high recall? I am not talking about just making a cut on p at a point where recall is high but for instance setting a high recall goal e.g. 95% and minimizing the false positive rate obtainable at that recall goal.

    • Jason Brownlee June 15, 2017 at 8:44 am #

      A real-valued outcome makes your prediction problem regression not classification.

      Recall and precision cannot be measured on regression problems.

  20. Richard June 22, 2017 at 10:29 am #

    The confusion matrix you presented shows predictions as rows and observed classes as columns. Isn’t this the wrong way round?

  21. Vic July 11, 2017 at 9:37 am #

    When we examine how well the classifier is, do we care about True Negatives? Also, how to apply these measures on a multi-class problem? To me it feels like the Positive and Negative are two classes. But if we extend this to, say a 3 classes problem. Wouldn’t we need to extend this idea of Precision and Recall to all three classes to find the best classifier?

    • Jason Brownlee July 11, 2017 at 10:41 am #

      It really depends on your specific problem and on the area of your predictions that are most important to you.

  22. Marcos Marx July 17, 2017 at 8:52 pm #

    Great article Jason!

    I didn’t know the difference of accuracy and f1-score. And here presenting the paradox of accuracy, it was explicitly clear the importance of validation models.

  23. NAGARAJA M S August 25, 2017 at 9:40 pm #

    Hello Jason Brownlee
    Great explaination!!
    Please could you me on how to calculate model classification and predication ability in case of multiclass respones variable. help me in provide relavent infornation on this topic.

    Thank you

  24. aquaq August 28, 2017 at 7:27 pm #

    Hi Jason, thanks for this article. May I ask if you could reflect to the problem I asked in this question regarding MCC?
    https://stats.stackexchange.com/questions/299333/question-about-imbalanced-training-and-test-sets

    Also, do you have any opinion about using G-mean as a performance metric for imbalanced datasets?

    Thanks for all your help!

  25. Efendi November 13, 2017 at 1:10 am #

    In that case, what will be the good Precision and Recall values to determine that our model prediction is good?
    Thanks.

    • Jason Brownlee November 13, 2017 at 10:17 am #

      Great question.

      You want results that are better relative to a baseline model, such as the Zero Rule algorithm.

  26. Maciej November 29, 2017 at 2:26 am #

    Is ‘Confusion matrix’ useful in detection of incorrect value in timeseries data ? If yes, how to classify output of neural network as true positive, true negative etc.
    For example, when I put value ‘x’ to an input of NN it says ‘y’, but I can see that in my test data the value is ‘z’ (let’s say that ‘z’ is the incorrect value and value ‘y’ is the correct one). Should I consider it as a true positive ( assuming that incorrect values are represented by ‘positives’).

    • Jason Brownlee November 29, 2017 at 8:27 am #

      No, it is for classification problems, and time series are often regression problems.

  27. Abdul December 5, 2017 at 9:42 pm #

    Thanks Jason,
    great article.
    Is it possible to compare different binary based classification models (using imbalanced data set) in terms of 7 different performance measures (recall, specificity, balanced accuracy, precision, F-score, MCC and AUC) and how can we decide which model is the best?

    Thanks

    • Jason Brownlee December 6, 2017 at 9:01 am #

      Yes. Model selection will be specific to your project goals.

  28. Abdul December 8, 2017 at 2:18 am #

    Hi Jason,

    I have a question about decision making.

    How can we interpret the following results when there is a conflict between different measures and what decision can we make?

    For example, in terms of high balanced accuracy, the Kernel-SVM was the best model with 98.09%, followed by RBF-NN with 97.74% and CART DT with 95.26%. For the F-score, MCC and AUC measures , the RBF-NN model achieved the highest results (99.21 %, 92.82 and 0.98) followed by the CART DT (98.43%, 85.34% and 0.91) and the Kernel-SVM model (98.05%, 81.32% and 0.97).

    Thanks

    • Jason Brownlee December 8, 2017 at 5:43 am #

      It comes down to the measure that best reflects your goals and the simplest-skillful model on that measure.

  29. Abdul December 8, 2017 at 10:04 pm #

    Thanks Jason,

    Is it possible to compare different classification models based on the overall mean of different performance measures?

    • Jason Brownlee December 9, 2017 at 5:41 am #

      Sure, you can compare algorithms anyway you wish for your specific requirements.

  30. Rizwan Mian December 28, 2017 at 12:39 pm #

    Jason, thanks.

    I see a string of useful evaluation metrics. Take classification for example, we see accuracy, F-measure, area under ROC etc.

    dumb question: is there a utility score metric that combines many (all?) of them and give us a universal score? probably not, why not?

    For example, F-measure combines precision and recall.

    ps. How can I get email notifications when somebody replies to my questions or comments.

    • Jason Brownlee December 28, 2017 at 2:11 pm #

      Model skill is really a balance of trade-offs. You must find the right trade-off for your specific problem (e.g. by talking to stakeholders).

      Sorry, I don’t have notifications yet, I hope to add them in the future. Thanks for the prompt!

  31. Vishnu Priya January 29, 2018 at 4:11 am #

    Hi!Mr.Jason. I did a multiclass classification and found the confusion matrix for it.Then I found precision and sensitivity for each class and now I want to calculate Fscore.So what do I do?Should I calculate Fscore for each class and then average???or find average precision and sensitivity and find Fscore? or something

  32. Jesús Martínez March 15, 2018 at 11:04 am #

    Thanks for the article! One of the biggest and first mistakes I made when I was starting out in machine learning was to think that accuracy tells the whole story. In fact, I found that more complex metric such as AUC, F1 and Precision and Recall are used more frequently than accuracy. In particular, Kaggle tends to favor AUC over accuracy or F1 in classification competitions.

  33. 3mer April 14, 2018 at 12:00 pm #

    Hi Jason! Thanks for this information. Does your book have this content?

    • Jason Brownlee April 15, 2018 at 6:19 am #

      I do cover measures in some of my books, but not in great detail.

      What do you need help with exactly?

  34. Krishna April 18, 2018 at 3:51 am #

    Typo: The recall metrics in the CART F-score calculation is missing the decimals i.e. reads as 12 instead of 0.12

    “The F1 for the CART model is 2*((0.43*12)/0.43+12) or 0.19.”

    Great blog Jason!

  35. Narendra Chintala June 25, 2018 at 11:17 pm #

    I am doing a binary classification on images and I am fine-tuning the resnet50 pre-trained on imagenet dataset and fine tuning the all layers but I can only get upto 91% validation accuracy.How can I achieve higher than this?

    • Jason Brownlee June 26, 2018 at 6:37 am #

      I have some general ideas here:
      http://machinelearningmastery.com/improve-deep-learning-performance/

      • Narendra Chintala June 27, 2018 at 5:02 pm #

        Here is the total code that I have used.

        import os
        import glob
        import numpy as np
        import json
        import pickle
        import cv2
        import ntpath
        import random
        import pdb
        import datetime
        from sklearn.preprocessing import LabelEncoder
        import h5py
        import time

        # keras imports
        #from keras.applications.mobilenet import MobileNet, preprocess_input
        from keras.applications.resnet50 import ResNet50, preprocess_input
        from keras.preprocessing import image
        from keras.models import Model , load_model
        from keras.models import model_from_json
        from keras.layers import Input , Dense , Dropout , GlobalAveragePooling2D
        from tensorflow.python.keras._impl.keras.layers import Conv2D , Reshape
        from keras.preprocessing.image import ImageDataGenerator
        from keras.optimizers import SGD,Adam
        from keras import models
        from keras import layers
        from keras.callbacks import ReduceLROnPlateau , ModelCheckpoint , Callback
        from keras import regularizers

        #print ("[STATUS] start time - {}".format(datetime.datetime.now().strftime("%Y-%m-%d %H:%M")))
        #start = time.time()

        image_size = 224
        #prepare the data
        train_datagen = ImageDataGenerator(
            rescale=1./255,
            vertical_flip=True,
            horizontal_flip=True,
            rotation_range=20)

        validation_datagen = ImageDataGenerator(rescale=1./255)

        # Change the batchsize according to your system RAM
        train_batchsize = 16
        val_batchsize = 16

        train_generator = train_datagen.flow_from_directory(
            train_dir,
            target_size=(224, 224),
            batch_size=train_batchsize,
            class_mode='categorical')

        validation_generator = validation_datagen.flow_from_directory(
            validation_dir,
            target_size=(224, 224),
            batch_size=val_batchsize,
            class_mode='categorical',
            shuffle=False)

        #Model
        resnet50 = ResNet50()
        resnet50.layers.pop()

        #x = mobilenet.layers[-6].output
        #x = Dense(512 , activation = "relu")(x)
        #x = Dropout(0.2)(x)
        #predictions = Dense(2 , activation = "softmax")(x)

        #model = Model(inputs = mobilenet.inputs , outputs = predictions)
        #print(model.summary())

        x = resnet50.layers[-1].output
        x = Dropout(0.5)(x)
        predictions = Dense(2 , activation = "softmax")(x)
        model = Model(inputs = resnet50.inputs , outputs = predictions)

        for layer in resnet50.layers:
            layer.trainable = True

        filepath="weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
        #Compile the model
        model.compile(optimizer=Adam(lr=0.000001), loss='categorical_crossentropy', metrics=['accuracy'])

        #Callbacks
        checkpointer = ModelCheckpoint(filepath, monitor='val_loss' , verbose=1, save_best_only=True , mode = 'min')

        history = model.fit_generator(
            train_generator,
            steps_per_epoch=train_generator.samples/train_generator.batch_size ,
            epochs=200,
            validation_data=validation_generator,
            validation_steps=validation_generator.samples/validation_generator.batch_size,
            verbose=1,
            callbacks=[checkpointer]
        )

        Can you share any ideas based on this.

        • Jason Brownlee June 28, 2018 at 6:11 am #

          I’m eager to help, but I don’t have the capacity to debug your code.

          • Narendra Chintala June 28, 2018 at 3:16 pm #

            Sorry for that…I haven’t asked you to debug.I just wanted you to look at the parameters and augmentation techniques and suggest any ideas.
            My training data is 6000 images and validation data is 1600 images.

  36. Anam July 21, 2018 at 10:11 pm #

    Dear Jason,
    A very informative article but here I have a query that if values of precision and recall are identical(means same)that what does it shows?Thanks for your precious time.

  37. Anam July 22, 2018 at 12:23 pm #

    Dear Jason,
    Below is an example of identical precision and recall values.

    Precision Recall
    0.82 0.85
    0.85 0.81

    avg 0.83 0.83

    I want to know that what is the reason that the precision and recall values appear to be same.

  38. Elisa July 31, 2018 at 7:14 am #

    Dear Jason, thank you for your clear post. I am new in ML and I have a question on this topic. For the classification I divide my dataset into training and test sets. Iam wondering if it is proper or not to iterate the prediction of che classification (and the related confusion matrices) several time to assess the robustness of the model, namely to see what appens when the parameters used to build the model on the training data change. I hope I’ve been clear enough, thank you for your help.

  39. Sahil Sharma August 14, 2018 at 1:46 am #

    Which is more important to you– model accuracy, or model performance?

    • Jason Brownlee August 14, 2018 at 6:22 am #

      Accuracy is a performance metric.

      Perhaps I don’t follow your question? Do you mean performance as in computational complexity?

  40. Duc August 19, 2018 at 2:20 am #

    I think All Recurrence accuracy should be (85/286) instead of (75/286)

  41. Tyler August 29, 2018 at 9:33 am #

    Thanks for this write-up, it’s a helpful example that makes it easier for me to communicate this stuff to the bosses.

    Questions:
    Why doesn’t a prediction of No Recurrence count as a true-positive for the outcome of “No Recurrence”? Is that just because this is a binary example? And how is a binary problem really any different from classification of two classes that are mutually exclusive of each other…say apples and bananas?

  42. James September 16, 2018 at 1:37 pm #

    I would think even a metric as simple as (TPR + TNR)/2 would be useful for evaluating accuracy. What makes F1 better?

    • Jason Brownlee September 17, 2018 at 6:29 am #

      There is no “better”, just different approaches to try, one might be a good fit for your problem.

  43. Lee Fischer October 8, 2018 at 3:15 am #

    Great article! I find myself referring to the F1 score a lot in statistical modeling of disease diagnosis. Besides balancing precision and recall, it also corresponds to the lowest false detection rate (FDR), which is something we have to be aware of in the real world. AUROC and F1 similarly describe performance, but sometimes a high AUROC can also have a high FDR (not usually true with F1). But as you say, there is no better, it really depends on what the problem is, and what types are errors are more acceptable. It’s all about trade-offs 🙂

    • Jason Brownlee October 8, 2018 at 9:27 am #

      For sure. You really need to know what is important in measuring performance of a model on your problem and focus on that like a laser.

  44. Shabana November 2, 2018 at 6:25 am #

    hi can anybody help me how the values of “CART Confusion Matrix” are calculated/displayed please help me I tried my best but did not understand. please explain in detail
