How to Report Classifier Performance with Confidence Intervals

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.

This is important so that you can set expectations for how the model will perform on new data.

A common mistake is to report the classification accuracy of the model alone.

In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.

Discover statistical hypothesis testing, resampling methods, estimation statistics and nonparametric methods in my new book, with 29 step-by-step tutorials and full source code.

Let’s get started.


Classification Accuracy

The skill of a classification machine learning algorithm is often reported as classification accuracy.

This is the percentage of correct predictions out of all predictions made. It is calculated as follows:

classification accuracy = (correct predictions / total predictions) * 100

A classifier may have an accuracy such as 60% or 90%; whether that is good only has meaning in the context of the problem domain.
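
For example, here is a minimal sketch in Python; the y_true and y_pred lists are made-up values, purely for illustration:

# Sketch: classification accuracy from two lists of labels.
# y_true and y_pred are made-up example values, not from the post.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true) * 100.0
print('Accuracy: %.1f%%' % accuracy)  # prints: Accuracy: 80.0%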

Classification Error

When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.

This is because stakeholders often assume that models perform well; what they really want to know is how prone a model is to making mistakes.

You can calculate classification error as the ratio of incorrect predictions to the total number of predictions made, expressed as a value between 0 and 1.

A classifier may have an error of 0.25 or 0.02.

This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
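
As a sketch of the arithmetic (the counts below are made up, chosen only to show the calculation):

# Sketch: classification error as the proportion of incorrect predictions.
incorrect_predictions = 2
total_predictions = 10
error = incorrect_predictions / total_predictions
print('Error: %.2f' % error)              # prints: Error: 0.20
print('Error: %.0f%%' % (error * 100.0))  # prints: Error: 20%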

Validation Dataset

What dataset do you use to calculate model skill?

It is a good practice to hold out a validation dataset from the modeling process.

This means a sample of the available data is randomly selected and set aside, such that it is not used during model selection or model configuration.

After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.
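
A common way to hold out a validation dataset is a single random split before any modeling begins. Here is a minimal sketch using scikit-learn; the synthetic dataset and the 70/30 split are assumptions for illustration only:

# Sketch: hold out a validation dataset before any model selection or tuning.
# make_classification creates a synthetic dataset purely for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
# Keep 70% for training and model selection; hold back 30% for the final evaluation.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_val.shape)  # prints: (140, 10) (60, 10)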


Confidence Interval

Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.

A confidence interval comprises two things:

  • Range. This is the lower and upper limit on the skill that can be expected from the model.
  • Probability. This is the probability that the skill of the model will fall within the range.

In general, the confidence interval for classification error can be calculated as follows:

error +/- const * sqrt( (error * (1 - error)) / n )

Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is the binomial proportion confidence interval calculated with the normal approximation (the Wald interval), rather than the Wilson score interval.

The values for const are critical values from the standard Gaussian distribution, and commonly used values are listed below (a short sketch for computing others follows the list):

  • 1.64 (90%)
  • 1.96 (95%)
  • 2.33 (98%)
  • 2.58 (99%)
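
If you need a value for a different confidence level, it can be computed directly; a short sketch, assuming SciPy is available:

# Sketch: compute the two-sided critical value for a chosen confidence level.
from scipy.stats import norm
for confidence in [0.90, 0.95, 0.98, 0.99]:
    const = norm.ppf(1.0 - (1.0 - confidence) / 2.0)
    print('%.0f%%: %.2f' % (confidence * 100, const))
# prints: 90%: 1.64, 95%: 1.96, 98%: 2.33, 99%: 2.58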

Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

  • Observations in the validation dataset were drawn from the domain independently (i.e. they are independent and identically distributed).
  • At least 30 observations were used to evaluate the model.

This is based on sampling theory: the error of a classifier can be treated as a binomial proportion, with sufficient observations the binomial distribution can be approximated by a normal distribution, and, by the central limit theorem, the more observations we use to evaluate the model, the closer our estimate will get to the true, but unknown, model skill.
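
The interval is straightforward to compute directly. Here is a minimal sketch; error_confidence_interval is a made-up helper name, not a library function:

# Sketch: normal-approximation confidence interval for classification error.
from math import sqrt
def error_confidence_interval(error, n, const=1.96):
    # Half-width of the interval: const * sqrt( (error * (1 - error)) / n )
    interval = const * sqrt((error * (1.0 - error)) / n)
    return error - interval, error + interval
# Example: 10% error measured on 100 validation examples, 95% confidence.
lower, upper = error_confidence_interval(0.10, 100)
print('%.3f, %.3f' % (lower, upper))  # prints: 0.041, 0.159

For small validation sets or error rates close to 0 or 1, the Wilson score interval is a more accurate alternative; if statsmodels is installed, it is available as proportion_confint(count, nobs, method='wilson').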

Confidence Interval Example

Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).

We can calculate the 95% confidence interval (const = 1.96) as follows:

error +/- const * sqrt( (error * (1 - error)) / n )
0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50 )
0.02 +/- 1.96 * sqrt( 0.000392 )
0.02 +/- 0.0388

Or, stated another way:

There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.

Notice that confidence intervals on the classification error must be clipped to the range 0.0 to 1.0. It is impossible to have a negative error (less than 0.0) or an error greater than 1.0.
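
For completeness, the worked example above (error = 0.02, n = 50, const = 1.96), including clipping the lower bound at 0.0, looks like this as a sketch:

# Sketch: reproduce the worked example, clipping the interval to [0.0, 1.0].
from math import sqrt
error, n, const = 0.02, 50, 1.96
interval = const * sqrt((error * (1.0 - error)) / n)
lower = max(0.0, error - interval)
upper = min(1.0, error + interval)
print('[%.4f, %.4f]' % (lower, upper))  # prints: [0.0000, 0.0588]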


Summary

In this post, you discovered how to calculate confidence intervals for your classifier.

Specifically, you learned:

  • How to calculate classification accuracy and classification error when reporting results.
  • What dataset to use when calculating model skill that is to be reported.
  • How to calculate a lower and upper bound on classification error for a chosen level of likelihood.

Do you have any questions about classifier confidence intervals?
Ask your questions in the comments below.


55 Responses to How to Report Classifier Performance with Confidence Intervals

  1. Birkey June 2, 2017 at 3:12 pm #

    How does this (confidence interval) differ from the F1 score, which is widely used and, IMHO, easier to comprehend, since it's one score that covers both precision and recall?

    • Jason Brownlee June 3, 2017 at 7:20 am #

      The F1 is a skill measure for the model. It could be accuracy or anything else.

      In this post, we are talking about the confidence (uncertainty) on the calculated skill score.

  2. Elie Kawerk June 3, 2017 at 3:17 am #

    Hi Jason,

    Thank you for the nice post. This error confidence interval that you report corresponds to binary classification only. How about multi-class classification?

    Regards

    • Jason Brownlee June 3, 2017 at 7:25 am #

      Really great question. I expect you would use logloss or AUC and report confidence on that.

      • Elie Kawerk June 3, 2017 at 5:03 pm #

        I see,

        But then the expression of the confidence interval (for AUC or any other metric) would be different I presume since the process wouldn’t be described using the binomial distribution.

        For multi-class classification, wouldn’t the distribution be a multinomial distribution? And in this case the expression for the error confidence interval would change I presume.

        Regards
        Elie

        • Jason Brownlee June 4, 2017 at 7:49 am #

          I see, yes you are correct. I would recommend an empirical approach to summarizing the distribution using the bootstrap method (a post is scheduled).

  3. Jonad June 4, 2017 at 1:23 am #

    Hi Jason,
    Really good post. But I have a question. Does the classification error differ if we use a different skill – for instance F1-score – for our model?
    Thanks

    • Jason Brownlee June 4, 2017 at 7:54 am #

      Hi Jonad,

      Different measures will evaluate skill in different ways. They will provide different perspectives on the same underlying model error.

      Does that make sense?

      • jonad June 5, 2017 at 2:45 am #

        Yes, I was thinking that the classification error formula ( incorrect predictions / total predictions) might differ depending on the evaluation metrics. Now I understand it better.
        Thanks

  4. Simone June 7, 2017 at 9:25 pm #

    Great post!
    How could I use confidence intervals and cross-validation together?

    • Jason Brownlee June 8, 2017 at 7:42 am #

      It’s a tough one, we are generally interested in the variance of model skill during model selection and during the presentation of the final model.

      Often standard deviation of CV score is used to capture model skill variance, perhaps that is generally sufficient and we can leave confidence intervals for presenting the final model or specific predictions?

      I’m open to better ideas.

      • Simone June 9, 2017 at 6:32 pm #

        Ok, Thanks!
        The last question: when I’m using k-fold cv, the value of ‘n’ is equal to the number of all observations or all observations – k?

        • yerart September 15, 2019 at 7:41 am #

          @Simone the value of n is AFAIK an empirical value that is chosen to be 5 or 10. Jason explains that very well (as usual) in this post:

          https://machinelearningmastery.com/k-fold-cross-validation/

        • yerartdev September 15, 2019 at 7:45 am #

          Ah @Simone, by the way, if ‘n’ is equal to the number of all observations that is a type of cross-validation that is called LOOCV (Leave-one-out cross-validation) and uses a single observation from the original sample as the validation data, and the remaining observations as the training data.

      • yerart September 15, 2019 at 8:00 am #

        @Jason what about this? (I haven’t fully read it yet and I’m struggling to understand it but .. well, I think it might work as well):

        Mach Learn. 2018; 107(12): 1895–1922. Published online 2018 May 9. doi: 10.1007/s10994-018-5714-4. PMCID: PMC6191021, PMID: 30393425. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Ioannis Tsamardinos, Elissavet Greasidou (corresponding author), and Giorgos Borboudakis.

        https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191021/

    • yerart September 15, 2019 at 7:36 am #

      Well @Simone, from the point of view of a developer if you take a look at the scikit-learn documentation and go over the section “3.1.1. Computing cross-validated metrics” (https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics) you will see that the 95% confidence interval of the score estimate is reported as Jason states in this post.

  5. Sathish February 6, 2018 at 8:47 am #

    Hi Jason
    Is there R code for calculating the CI and graphing them?

    Thanks

    • Jason Brownlee February 6, 2018 at 9:27 am #

      I bet there is, I don’t have it on hand, sorry.

  6. Thomas February 12, 2018 at 11:25 pm #

    The error is just the reverse of the accuracy, wouldn’t that be a simpler statement to make?

    This leads to the fundamental problem that accuracy or classification error itself is often a mediocre-to-useless metric, because datasets are usually imbalanced. And hence the confidence on that error is just as useless.

    I found this post for a different reason, as I wanted to find out if anyone else does what I do, namely provide metrics grouped by class probability. What is the precision if the model has 0.9 class probability vs 0.6, for example? That can be very useful information for end users because the metric will often vary greatly based on class probability.

  7. Peipei May 3, 2018 at 12:02 am #

    Hi Jason,

    Nice post. When calculating the confidence interval for error, AUC or other metrics, the standard error of the metric is needed. How should I calculate the standard error?

    • Jason Brownlee May 3, 2018 at 6:34 am #

      Great question, here is the equation:
      https://en.wikipedia.org/wiki/Standard_error

      • Peipei May 3, 2018 at 7:48 pm #

        Thanks for replying. Does this mean I need to get multiple errors by running multiple times (bootstrap or cross-validation) to calculate the standard error?

        • Jason Brownlee May 4, 2018 at 7:43 am #

          Yes, if you are looking to calculate the standard error of the bootstrap result distribution.

  8. Manoj May 3, 2018 at 1:33 pm #

    Hi Jason,

    I am trying to group my customers. Say GAP HK, GAP US should be under the group customer GAP.

    Few of the customers are already grouped. Say GAP HK is grouped under GAP but GAP US is not.

    I am using random forest classifier. I used already grouped customer name as training data. Group customer code is the label that I am trying to predict.

    The classifier is assigning labels as expected. The problem I am facing is that the classifier is also assigning labels or group customer codes to customers even though the customer name does not match closely with the training data. It is doing the best possible match. It is a problem for me because I need to manually ungroup these customers. Can you suggest how to overcome this problem? Is it possible to know the classifier's probability for each predicted label? If yes, then I can ignore the ones with low probability.

    Thank you in advance for advice.

  9. Anish July 12, 2018 at 3:31 am #

    Hi Jason,
    I am not sure if anyone else brought this up, but I've found one issue here. The confidence interval measure you suggested is not the “Wilson score interval”, according to the Wikipedia page (which is cited in that link). It's actually the “normal approximation interval”, which is described above the Wilson score paragraph. Correct me if I am wrong.

    Thanks
    -Anish

  10. AB September 4, 2018 at 1:15 am #

    Hi Jason,
    I’m interested on the relation of Cross Validation and this approach.

    With 150 examples I decide to use a 100 repeated 5-fold Cross Validation to understand the behavior of my classifier. At this point I have 100×5 results and I can use the mean and std dev of the error rates to estimate the variance of the model skills:

    mean(errorRate) +/- 1.96*(std(errorRate))

    I could estimate the Confidence Interval of the True Error (that I would obtain on the unseen data) using the average Error rate:

    mean(errorRate) +/- const * sqrt( (mean(errorRate) * (1 – mean(errorRate))) / n)

    Two questions:
    1. Do you think this approach is correct?
    2. Is it correct to set n=150 in the second equation, or should I use the average number of observations used as the test set in each fold of CV?

  11. Kostas Theodor September 15, 2018 at 2:22 am #

    Hi Jason, thanks for the great posts on confidence intervals/ bootstraps for machine learning.

    Suppose you use
    A) 5-fold CV
    B) 30-fold CV

    for model evaluation. You pick the final model and train it on all the data at hand.

    What are the options one has for reporting on final model skill with a range for uncertainty in each case?
    Should one have still held out a number of datapoints for validation+binomial confidence interval?
    Is it too late to use the bootstrap confidence intervals as the final model was trained?

    Thanks

    • Jason Brownlee September 15, 2018 at 6:16 am #

      Not sure I follow your question?

      Pick a final model and use a preferred method to report expected performance. It is unrelated to how you chose that model.

      • Kostas Theodor September 16, 2018 at 10:29 pm #

        Thanks Jason. I found your other post https://machinelearningmastery.com/difference-test-validation-datasets/ very helpful.
        Can I confirm that the above procedure of reporting classifier performance with confidence intervals is relevant for the final trained model? If that is so, it seems that the validation dataset mentioned should be called test set to align with the definitions of the linked post?

  12. mars October 6, 2018 at 6:30 am #

    Hi Jason,

    Thank you for the post!

    In your example you use accuracy and error rate and calculate a confidence interval.

    Can one replace “error rate” with, say, precision, recall or f1? Why and why not?

    For example, say we have a sample size=50, f1=0.02
    Does that mean …

    there is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true F1 of the model on unseen data?

    Thanks!

    • Jason Brownlee October 6, 2018 at 11:42 am #

      Perhaps for some scores. The example in this post is specific to a ratio (proportion). I believe you can use it for other ratios like precision, recall and F1.

  13. jecy November 1, 2018 at 12:49 pm #

    Hi Jason

    Thank you for your post.

    How do I get the standard error of the AUC curve in Python?

    • Jason Brownlee November 1, 2018 at 2:33 pm #

      Not sure I follow. Standard error refers to a statistical quantity on a distribution, not sure how you would calculate it for a curve.

  14. Lorenzo Famiglini December 29, 2018 at 1:17 am #

    Hello Jason,
    I was wondering if I can compute confidence interval for Recall and Precision. If yes can you explain how can I do this?
    Thank you so much,
    best regards

    Lorenzo

    • Jason Brownlee December 29, 2018 at 5:54 am #

      Yes, I expect the bootstrap would be a good place to start.

  15. Franco Arda April 14, 2019 at 7:23 am #

    Another excellent post Jason. Thank you.

    There might a minor typo in “[0.0, 0.0588]” – should be [0.0, 0.0388] I think.

  16. Daniel Wigmore April 19, 2019 at 12:12 am #

    I am running a classifier with a training set of 41 and a validation set of 14 (55 total observations). I rerun this 50 times with different random slices of the data as training and test. Obviously I cannot make confidence intervals with this small validation set.

    However, because I am rerunning it with different training and validation slices, can I get the mean error rate over the 50 tests and calculate a confidence interval?

    const * sqrt( (error * (1 – error)) / n)

    N would be 700 (14*50). If I had 50 tests which averaged out to an accuracy of 77.4% (error is 0.226), the confidence intervals would be 0.26 and 0.2.

    Does this work? or would these confidence intervals be unreliable?

    Thanks for your excellent article

    Dan

    • Jason Brownlee April 19, 2019 at 6:12 am #

      Why would they be unreliable?

      What are you worried about exactly – the dataset selection for each trial?

      • Daniel April 23, 2019 at 5:37 am #

        Hi Mr Brownlee,

        Thank you for the quick reply and apologies for my late response. I am dealing with social science data and the validation set is rather limited. I am worried about ascertaining confidence intervals for a limited validation sample.

        A friend of mine came up with a solution in which I keep all the accuracy outputs in a vector and plot them like a histogram (I can’t seem to paste one into this reply window but can send it over if necessary by email).

        Would I be able to get the confidence intervals by looking at the 5th and 95th percentile of the accuracy vector?

        Would there be an advantage to randomly sampling (bootstrapping) with replacement over without replacement?

        Thanks again

        Dan

        • Jason Brownlee April 23, 2019 at 7:58 am #

          Just repeats without bootstrap? I think the distribution would only capture the variance in the model, not the data.

          I would encourage you to use the bootstrap to calculate the distribution of accuracy scores.

  17. Daniel April 24, 2019 at 10:03 pm #

    Thanks Jason,

    Final question,

    My trained model has the below output and the best tuned number of neighbours is 5.

    I am creating confidence intervals through creating a histogram of the accuracies in the resample. Is there a way to subset the resample results in which the model is best tuned (in this case k = 5)?

    No pre-processing
    Resampling: Bootstrapped (1000 reps)
    Summary of sample sizes: 32, 32, 32, 32, 32, 32, …
    Resampling results across tuning parameters:

    k Accuracy Kappa
    5 0.7262690 0.4593792
    7 0.6830904 0.3830819
    9 0.6655405 0.3522427

    Accuracy was used to select the optimal model using the largest value.
    The final value used for the model was k = 5.

    • Jason Brownlee April 25, 2019 at 8:15 am #

      I recommend using a new procedure with the chosen config and the bootstrap to estimate the confidence intervals on model performance.

  18. Daniel April 25, 2019 at 5:50 am #

    sorry as an update, I am selecting ‘final’ as the returnResamp argument in the train control method. I believe that this should retain the resamples from only the best-tuned model

    But typically when I check the mean of the resample, mean(model$resample$Accuracy), the mean is lower than the k=5 accuracy (typically 0.65). Is there a reason for this? I would have thought that the mean accuracy of the best tune resamples would equal the model accuracy in the results.

    After this I promise to leave you alone (and thanks for your patience so far)

  19. Kalen Gordon June 18, 2019 at 7:19 am #

    This was very helpful for classification.

    How would I go about calculating confidence intervals for regression analysis?
    Could I use the same formula, but instead of classification error use MAE or RMSE?
