# How to Report Classifier Performance with Confidence Intervals

Last Updated on August 14, 2020

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.

This is important so that you can set the expectations for the model on new data.

A common mistake is to report the classification accuracy of the model alone.

In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.

Let’s get started.

## Classification Accuracy

The skill of a classification machine learning algorithm is often reported as classification accuracy.

This is the percentage of correct predictions out of all predictions made. It is calculated as follows:

    classification accuracy = correct predictions / total predictions * 100.0

A classifier may have an accuracy such as 60% or 90%, and how good this is only has meaning in the context of the problem domain.

## Classification Error

When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.

This is because stakeholders often assume that models perform well; what they may really want to know is how prone a model is to making mistakes.

You can calculate classification error as the proportion of incorrect predictions out of the number of predictions made, expressed as a value between 0 and 1.

    classification error = incorrect predictions / total predictions

A classifier may have an error of 0.25 or 0.02.

This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
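The two measures above can be sketched in a few lines of plain Python. The function names and the labels below are made up for illustration; they are not from any particular library.

```python
def classification_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) * 100.0


def classification_error(y_true, y_pred):
    """Proportion of predictions that do not match, between 0 and 1."""
    incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return incorrect / len(y_true)


# Made-up labels for illustration only: 2 of the 10 predictions are wrong.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]

accuracy = classification_accuracy(y_true, y_pred)
error = classification_error(y_true, y_pred)
print(f"accuracy: {accuracy:.1f}%")
print(f"error: {error:.2f} ({error * 100.0:.1f}%)")
```

Note that the error as a percentage and the accuracy always sum to 100%, so reporting one implies the other.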

## Validation Dataset

What dataset do you use to calculate model skill?

It is a good practice to hold out a validation dataset from the modeling process.

This means a sample of the available data is randomly selected and removed from the available data, such that it is not used during model selection or configuration.

After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.
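A minimal sketch of holding out a validation dataset is shown below, using only the standard library. The helper name `holdout_split`, the 20% fraction, and the fixed seed are assumptions chosen for illustration; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def holdout_split(rows, validation_fraction=0.2, seed=1):
    """Randomly split rows into (train, validation) with no overlap."""
    shuffled = list(rows)                      # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)      # seeded shuffle for repeatability
    n_validation = int(round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]       # set aside, untouched until the end
    train = shuffled[n_validation:]            # used for model selection and fitting
    return train, validation


# Stand-in dataset: 100 row identifiers.
rows = list(range(100))
train, validation = holdout_split(rows, validation_fraction=0.2)
print(len(train), len(validation))
```

The important property is that no row appears in both parts, so the validation rows play no role in choosing or configuring the model.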

## Confidence Interval

Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.

A confidence interval is comprised of two things:

• Range. This is the lower and upper limit on the skill that can be expected of the model.
• Probability. This is the probability that the skill of the model will fall within the range.

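In general, the confidence interval for classification error can be calculated as follows:

    error +/- const * sqrt( (error * (1 - error)) / n)

Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is called the normal approximation interval (or Wald interval) for a binomial proportion. It is sometimes mislabeled as the Wilson score interval, which is a different, asymmetric interval.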

The values for const are provided from statistics, and common values used are:

• 1.64 (90%)
• 1.96 (95%)
• 2.33 (98%)
• 2.58 (99%)
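These constants are the two-sided critical values of the standard normal distribution. As a quick check, the sketch below derives them with `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `critical_value` is made up for illustration.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided critical value of the standard normal distribution."""
    tail = (1.0 - confidence) / 2.0            # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)    # quantile that cuts off the upper tail


for confidence in (0.90, 0.95, 0.98, 0.99):
    print(f"{confidence:.0%}: {critical_value(confidence):.2f}")
```

Rounded to two decimal places, the values agree with the list above.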

Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

• Observations in the validation data set were drawn from the domain independently (e.g. they are independent and identically distributed).
• At least 30 observations were used to evaluate the model.

This is based on sampling theory. Each prediction is either correct or incorrect, so the number of errors follows a binomial distribution. With sufficient observations, the binomial distribution can be approximated by a normal distribution (via the central limit theorem), and the more observations we classify, the closer the estimated error gets to the true, but unknown, model skill.

## Confidence Interval Example

Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).

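We can calculate the 95% confidence interval (const = 1.96) as follows:

    error +/- const * sqrt( (error * (1 - error)) / n)
    0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50)
    0.02 +/- 1.96 * sqrt(0.0196 / 50)
    0.02 +/- 1.96 * 0.0198
    0.02 +/- 0.0388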

Or, stated another way:

There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.

Notice that the confidence intervals on the classification error must be clipped to the values 0.0 and 1.0. It is impossible to have a negative error (e.g. less than 0.0) or an error more than 1.0.
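The worked example, including the clipping step, can be sketched as a small Python function. The function name is an assumption for illustration; the arithmetic follows the normal approximation formula used in this post.

```python
from math import sqrt


def error_confidence_interval(error, n, const=1.96):
    """Normal approximation interval for a classification error, clipped to [0, 1]."""
    margin = const * sqrt((error * (1.0 - error)) / n)
    lower = max(0.0, error - margin)   # an error cannot be negative
    upper = min(1.0, error + margin)   # an error cannot exceed 1.0
    return lower, upper


lower, upper = error_confidence_interval(error=0.02, n=50, const=1.96)
print(f"[{lower:.4f}, {upper:.4f}]")   # [0.0000, 0.0588]
```

Without clipping, the lower limit would be 0.02 - 0.0388 = -0.0188, which is not a valid error.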

## Summary

In this post, you discovered how to calculate confidence intervals for your classifier.

Specifically, you learned:

• How to calculate classification accuracy and classification error when reporting results.
• What dataset to use when calculating model skill that is to be reported.
• How to calculate a lower and upper bound on classification error for a chosen level of likelihood.

Do you have any questions about classifier confidence intervals?

### 83 Responses to How to Report Classifier Performance with Confidence Intervals

1. Birkey June 2, 2017 at 3:12 pm #

How does this (confidence interval) differ from the F1 score, which is widely used and, IMHO, easier to comprehend, since it is one score that covers both precision and recall?

• Jason Brownlee June 3, 2017 at 7:20 am #

The F1 is a skill measure for the model. It could be accuracy or anything else.

In this post, we are talking about the confidence (uncertainty) on the calculated skill score.

2. Elie Kawerk June 3, 2017 at 3:17 am #

Hi Jason,

Thank you for the nice post. This error confidence interval that you report corresponds to binary classification only. How about multi-class classification?

Regards

• Jason Brownlee June 3, 2017 at 7:25 am #

Really great question. I expect you would use logloss or AUC and report confidence on that.

• Elie Kawerk June 3, 2017 at 5:03 pm #

I see,

But then the expression of the confidence interval (for AUC or any other metric) would be different I presume since the process wouldn’t be described using the binomial distribution.

For multi-class classification, wouldn’t the distribution be a multinomial distribution? And in this case the expression for the error confidence interval would change I presume.

Regards
Elie

• Jason Brownlee June 4, 2017 at 7:49 am #

I see, yes you are correct. I would recommend an empirical approach to summarizing the distribution using the bootstrap method (a post is scheduled).

3. Jonad June 4, 2017 at 1:23 am #

Hi Jason,
Really good post. But I have a question. Does the classification error differ if we use a different skill – for instance F1-score – for our model?
Thanks

• Jason Brownlee June 4, 2017 at 7:54 am #

Different measures will evaluate skill in different ways. They will provide different perspectives on the same underlying model error.

Does that make sense?

• jonad June 5, 2017 at 2:45 am #

Yes, I was thinking that the classification error formula ( incorrect predictions / total predictions) might differ depending on the evaluation metrics. Now I understand it better.
Thanks

4. Simone June 7, 2017 at 9:25 pm #

Great post!
How could I use confidence intervals and cross-validation together?

• Jason Brownlee June 8, 2017 at 7:42 am #

It’s a tough one, we are generally interested in the variance of model skill during model selection and during the presentation of the final model.

Often standard deviation of CV score is used to capture model skill variance, perhaps that is generally sufficient and we can leave confidence intervals for presenting the final model or specific predictions?

I’m open to better ideas.

• Simone June 9, 2017 at 6:32 pm #

Ok, Thanks!
The last question: when I’m using k-fold cv, the value of ‘n’ is equal to the number of all observations or all observations – k?

• yerart September 15, 2019 at 7:41 am #

@Simone the value of n is AFAIK an empirical value that is chosen to be 5 or 10. Jason explains that very well (as usual) in this post:

https://machinelearningmastery.com/k-fold-cross-validation/

• yerartdev September 15, 2019 at 7:45 am #

Ah @Simone, by the way, if ‘n’ is equal to the number of all observations that is a type of cross-validation that is called LOOCV (Leave-one-out cross-validation) and uses a single observation from the original sample as the validation data, and the remaining observations as the training data.

• yerart September 15, 2019 at 8:00 am #

@Jason what about this? (I haven’t fully read it yet and I’m struggling to understand it but .. well, I think it might work as well):

Mach Learn. 2018; 107(12): 1895–1922. Published online 2018 May 9. doi: 10.1007/s10994-018-5714-4. PMCID: PMC6191021, PMID: 30393425. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Ioannis Tsamardinos, Elissavet Greasidou (corresponding author), and Giorgos Borboudakis.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191021/

• yerart September 15, 2019 at 7:36 am #

Well @Simone, from the point of view of a developer if you take a look at the scikit-learn documentation and go over the section “3.1.1. Computing cross-validated metrics” (https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics) you will see that the 95% confidence interval of the score estimate is reported as Jason states in this post.

• Manuel Gonçalves October 3, 2020 at 11:39 am #

I don’t understand why in https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics they just use (std * 2) to give the CI on the mean of the k-fold results:

The mean score and the 95% confidence interval of the score estimate are hence given by:
>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)

• Jason Brownlee October 3, 2020 at 12:30 pm #

Yes, that is a quick and dirty way to calculate an interval assuming a Gaussian distribution.

5. Sathish February 6, 2018 at 8:47 am #

Hi Jason
Is there R code for calculating the CI and graphing them?

Thanks

• Jason Brownlee February 6, 2018 at 9:27 am #

I bet there is, I don’t have it on hand, sorry.

6. Thomas February 12, 2018 at 11:25 pm #

The error is just the reverse of the accuracy, wouldn’t that be a simpler statement to make?

This leads to the fundamental problem that accuracy or classification error itself is often a mediocre to useless metric because data sets usually are imbalanced. And hence the confidence on that error is just as useless.

I found this post for a different reason, as I wanted to find out if anyone else does what I do, namely provide metrics grouped by class probability. What is the precision if the model has 0.9 class probability vs 0.6, for example? That can be very useful information for end users because the metric will often vary greatly based on class probability.

• Jason Brownlee February 13, 2018 at 8:02 am #

Yes, the classification error is the complement of the classification accuracy (error = 1 - accuracy).

You can use a different measure to overcome imbalance:
https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/

• Tommy May 25, 2018 at 9:00 am #

Thomas, I think I’ve done what you described. I wrote a function to calculate a handful of different performance metrics at different probability cutoffs and had it stored in a data frame. This helped me choose a probability cutoff that balanced the needs of the business. I can share the code if it’s what you’re looking for.

7. Peipei May 3, 2018 at 12:02 am #

Hi Jason,

Nice post. When calculating the confidence interval for error, AUC or other metrics, the standard error of the metric is needed. How should I calculate the standard error?

• Jason Brownlee May 3, 2018 at 6:34 am #

Great question, here is the equation:
https://en.wikipedia.org/wiki/Standard_error

• Peipei May 3, 2018 at 7:48 pm #

Thanks for replying. Does this mean I need to get multiple errors by running multiple times (bootstrap or cross-validation) to calculate the standard error?

• Jason Brownlee May 4, 2018 at 7:43 am #

Yes, if you are looking to calculate the standard error of the bootstrap result distribution.

8. Manoj May 3, 2018 at 1:33 pm #

Hi Jason,

I am trying to group my customers. Say GAP HK, GAP US should be under the group customer GAP.

Few of the customers are already grouped. Say GAP HK is grouped under GAP but GAP US is not.

I am using random forest classifier. I used already grouped customer name as training data. Group customer code is the label that I am trying to predict.

The classifier is assigning labels as expected. The problem I am facing is that the classifier is also assigning labels or group customer codes to customers whose names do not closely match the training data. It is doing the best possible match. It is a problem for me because I need to manually ungroup these customers. Can you suggest how to overcome this problem? Is it possible to know the classifier’s probability of being correct for each predicted label? If yes, then I can ignore the ones with low probability.

9. Anish July 12, 2018 at 3:31 am #

Hi Jason,
I am not sure if anyone else brought this up but I’ve found one issue here. The confidence interval based measure you suggested is not the “Wilson score interval” according to the Wikipedia page (which is cited in that link). It’s actually the “Normal approximation interval”, which is above the Wilson score paragraph. Correct me if I am wrong.

Thanks
-Anish

10. AB September 4, 2018 at 1:15 am #

Hi Jason,
I’m interested in the relation between Cross Validation and this approach.

With 150 examples I decide to use a 100 repeated 5-fold Cross Validation to understand the behavior of my classifier. At this point I have 100×5 results and I can use the mean and std dev of the error rates to estimate the variance of the model skills:

mean(errorRate) +/- 1.96*(std(errorRate))

I could estimate the Confidence Interval of the True Error (that I would obtain on the unseen data) using the average Error rate:

mean(errorRate) +/- const * sqrt( (mean(errorRate) * (1 – mean(errorRate))) / n)

Two questions:
1. Do you think this approach is correct?
2. Is it correct to set n=150 in the second equation, or should I use the average number of examples used as the Test Set in each fold of CV?

11. Kostas Theodor September 15, 2018 at 2:22 am #

Hi Jason, thanks for the great posts on confidence intervals/ bootstraps for machine learning.

Suppose you use
A) 5-fold CV
B) 30-fold CV

for model evaluation. You pick the final model and train it on all the data at hand.

What are the options one has for reporting on final model skill with a range for uncertainty in each case?
Should one have still held out a number of datapoints for validation+binomial confidence interval?
Is it too late to use the bootstrap confidence intervals as the final model was trained?

Thanks

• Jason Brownlee September 15, 2018 at 6:16 am #

Pick a final model and use a preferred method to report expected performance. It is unrelated to how you chose that model.

• Kostas Theodor September 16, 2018 at 10:29 pm #

Can I confirm that the above procedure of reporting classifier performance with confidence intervals is relevant for the final trained model? If that is so, it seems that the validation dataset mentioned should be called test set to align with the definitions of the linked post?

12. mars October 6, 2018 at 6:30 am #

Hi Jason,

Thank you for the post!

In your example you use accuracy and error rate and calculate a confidence interval.

Can one replace “error rate” with, say, precision, recall or f1? Why and why not?

For example, say we have a sample size=50, f1=0.02
Does that mean …

there is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true F1 of the model on unseen data?

Thanks!

• Jason Brownlee October 6, 2018 at 11:42 am #

Perhaps for some scores. The example in this post is specific to a ratio. I believe you can use it for other ratios like precision, recall and f1.

13. jecy November 1, 2018 at 12:49 pm #

Hi Jason

How do I get the standard error of the AUC curve in Python?

• Jason Brownlee November 1, 2018 at 2:33 pm #

Not sure I follow. Standard error refers to a statistical quantity on a distribution, not sure how you would calculate it for a curve.

14. Lorenzo Famiglini December 29, 2018 at 1:17 am #

Hello Jason,
I was wondering if I can compute confidence interval for Recall and Precision. If yes can you explain how can I do this?
Thank you so much,
best regards

Lorenzo

• Jason Brownlee December 29, 2018 at 5:54 am #

Yes, I expect the bootstrap would be a good place to start.

15. Franco Arda April 14, 2019 at 7:23 am #

Another excellent post Jason. Thank you.

There might be a minor typo in “[0.0, 0.0588]” – should be [0.0, 0.0388] I think.

• Jason Brownlee April 15, 2019 at 7:47 am #

Thanks.

Nope, 0.02 + 0.0388 = 0.0588

16. Daniel Wigmore April 19, 2019 at 12:12 am #

I am running a classifier with a training set of 41 and a validation set of 14 (55 total observations). I rerun this 50 times with different random slices of the data as training and test. Obviously I cannot make confidence intervals with this small validation set.

However, because I am rerunning it with different training and validation slices, can I get the mean error rate over the 50 tests and calculate a confidence interval?

const * sqrt( (error * (1 – error)) / n)

N would be 700 (14*50). If I had 50 tests which averaged out to an accuracy of 77.4% (error is 0.226), the confidence intervals would be 0.26 and 0.2

Does this work? or would these confidence intervals be unreliable?

Dan

• Jason Brownlee April 19, 2019 at 6:12 am #

Why would they be unreliable?

What are you worried about exactly – the dataset selection for each trial?

• Daniel April 23, 2019 at 5:37 am #

Hi Mr Brownlee,

Thank you for the quick reply and apologies for my late response. I am dealing with social science data and the validation set is rather limited. I am worried about ascertaining confidence intervals for a limited validation sample.

A friend of mine came up with a solution in which I keep all the accuracy outputs in a vector and plot them like a histogram (I can’t seem to paste one into this reply window but can send it over if necessary by email).

Would I be able to get the confidence intervals by looking at the 5th and 95th percentile of the accuracy vector?

Would there be an advantage to randomly sampling (bootstrapping) with replacement over without replacement?

Thanks again

Dan

• Jason Brownlee April 23, 2019 at 7:58 am #

Just repeats without bootstrap? I think the distribution would only capture the variance in the model, not the data.

I would encourage you to use the bootstrap to calculate the distribution of accuracy scores.
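As a rough illustration of the bootstrap idea, the sketch below computes a percentile interval for accuracy by resampling per-example outcomes (1 = correct, 0 = incorrect) with replacement. This is a simplified assumption: it resamples only the evaluation outcomes of one fitted model and does not refit the model on each resample, so it captures sampling variation in the evaluation data only. The function name, the made-up outcomes, and the 1,000 resamples are all illustrative choices.

```python
import random


def bootstrap_accuracy_interval(outcomes, n_resamples=1000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for accuracy from 0/1 per-example outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    scores = []
    for _ in range(n_resamples):
        # Draw n outcomes with replacement and score the resample.
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)
    scores.sort()
    # Take the empirical alpha/2 and 1 - alpha/2 percentiles of the scores.
    lower_index = int((alpha / 2.0) * n_resamples)
    upper_index = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return scores[lower_index], scores[upper_index]


# Made-up outcomes: 77 correct and 23 incorrect predictions.
outcomes = [1] * 77 + [0] * 23
lower, upper = bootstrap_accuracy_interval(outcomes)
print(f"95% bootstrap interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is read directly from the resampled scores, it needs no normality assumption and always stays within 0 and 1.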

• Daniel April 23, 2019 at 6:38 pm #

Thank you for the quick reply. The method I am currently using is subsampling. Randomly selecting different observations for the training set and the validation set 50 times, and collecting the accuracy scores to make a distribution of accuracy scores. I believe this is called subsampling. But I am happy to use bootstrapping instead.

Just to clarify, I am using the bootstrap on the data for partitioning the training and validation set correct? This means that an observation in the training set can also end up in the validation set.

Is there an article you could recommend which explains why bootstrapping is better than subsampling (I have taken up enough of your time and really appreciate all of your help so far)

• Jason Brownlee April 24, 2019 at 7:56 am #
17. Daniel April 24, 2019 at 10:03 pm #

Thanks Jason,

Final question,

My trained model has the below output and the best tuned number of neighbours is 5.

I am creating confidence intervals through creating a histogram of the accuracies in the resample. Is there a way to subset the resample results in which the model is best tuned (in this case k = 5)?

No pre-processing
Resampling: Bootstrapped (1000 reps)
Summary of sample sizes: 32, 32, 32, 32, 32, 32, …
Resampling results across tuning parameters:

k Accuracy Kappa
5 0.7262690 0.4593792
7 0.6830904 0.3830819
9 0.6655405 0.3522427

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 5.

• Jason Brownlee April 25, 2019 at 8:15 am #

I recommend using a new procedure with the chosen config and the bootstrap to estimate the confidence intervals on model performance.

18. Daniel April 25, 2019 at 5:50 am #

sorry as an update, I am selecting ‘final’ as the returnResamp argument in the train control method. I believe that this should retain the resamples from only the best-tuned model

But typically when I check the mean of the resample, mean(model$resample$Accuracy), the mean is lower than the k=5 accuracy (typically 0.65). Is there a reason for this? I would have thought that the mean accuracy of the best tune resamples would equal the model accuracy in the results.

After this I promise to leave you alone (and thanks for your patience so far)

19. Kalen Gordon June 18, 2019 at 7:19 am #

This was very helpful for classification.

How would I go about calculating confidence intervals for regression analysis?
Could I use the same formula but with MAE instead of classification error, or should I use RMSE?

20. Fajar October 28, 2019 at 9:25 pm #

Hi Jason, my question is not too related to this topic, only slightly

I have a neural network(MLP) for binary classification with a logistic output between 0 and 1. With each run, I have to adjust my threshold on test set for minimizing the misclassifications. My question is to present my results, should I run it multiple times, adjust threshold each time and then take the average of other metrics eg F1 score or I don’t optimize for the threshold at all?

• Jason Brownlee October 29, 2019 at 5:24 am #

Hmmm, good question.

I would take the test as an evaluation of the “system” that includes the model and automatic threshold adjusting procedure. In that case, averaging the results of the whole system is reasonable, as long as you clearly state that is what you are doing.

Very cool!

21. dave February 14, 2020 at 1:06 am #

Hi @Jason I have a question related to this topic:

In the following thesis http://arno.uvt.nl/show.cgi?fid=147278 the author computes the AUC standard deviation as a measure of robustness.

Let’s say I have run a repeated (10) 10-cross validation experiment with predictions implemented via a Markov chain model. As a measure of robustness, I want to compute the SD of the AUC across the runs/folders for the test set.

Intuitively, a relatively small standard deviation implies that the model produces stable results in distinguishing conversion from non-conversion.

The project is based on a 10x 10 cross-validated procedure, which as a consequence generate 100 AUCs.

Now, to derive the AUC’s SD summarizing the model I understand the process should be (it is not well defined in the paper):

a. the SD is computed across folders based on the actual folders’ AUC values (we derive 10 SDs, one for each repetition).

b. The SDs obtained at step b1 are then averaged across runs (it leaves us with 1SD).

however, as there is not so much literature on the topic I want to know If someone can validate the reasoning above

Any help or suggestion appreciated

• Jason Brownlee February 14, 2020 at 6:38 am #

Yes.

You would collect a sample of AUC scores across all repeats and all folds. Then calculate summary stats like the mean and standard deviation.

22. John April 9, 2020 at 9:59 am #

Wilson score is different, the one you’re describing is “Normal approximation interval” according to Wikipedia https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Normal_approximation_interval

Wilson score interval is asymmetric
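To make the distinction concrete, the sketch below computes both intervals for the post's example (error = 0.02, n = 50). It uses the standard textbook form of the Wilson score interval; the function names are made up for illustration and the normal approximation is left unclipped so the difference is visible.

```python
from math import sqrt


def normal_approximation_interval(p, n, z=1.96):
    """Symmetric interval p +/- z * sqrt(p * (1 - p) / n), left unclipped."""
    margin = z * sqrt(p * (1.0 - p) / n)
    return p - margin, p + margin


def wilson_score_interval(p, n, z=1.96):
    """Wilson score interval; its center is pulled toward 0.5, so it is asymmetric around p."""
    denominator = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denominator
    margin = (z / denominator) * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n))
    return center - margin, center + margin


normal_lower, normal_upper = normal_approximation_interval(0.02, 50)
wilson_lower, wilson_upper = wilson_score_interval(0.02, 50)
print(f"normal approximation: [{normal_lower:.4f}, {normal_upper:.4f}]")
print(f"Wilson score:         [{wilson_lower:.4f}, {wilson_upper:.4f}]")
```

For a small error on a small sample, the normal approximation dips below zero before clipping, while the Wilson interval stays inside 0 and 1 and extends further above the observed error than below it.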

23. charu June 19, 2020 at 8:22 pm #

How to calculate the confidence interval for a binary classification problem with an imbalanced dataset, where it is not possible to balance the data?

24. Siva Karthik Gade September 4, 2020 at 4:25 pm #

Thank you Jason – great article!

One follow up question regarding minimum sample size requirement for (sample) error rate to satisfy normal distribution approximation ->

I have come across a few posts/slides around CLT which state that, in order for the sample proportion (or mean or error rate) of a binomial distribution to approximate a normal distribution (to compute a confidence interval), it should satisfy the 2 conditions below:
np > 10
n(1-p) > 10
ex ref. http://homepages.math.uic.edu/~bpower6/stat101/Sampling%20Distributions.pdf

In the current example, for sample size (n) = 50 & error rate = 0.02, n*p (50 * 0.02 = 1) is not satisfying np > 10.

In this case, should we increase n to satisfy above requirements (or) Is there something else I am missing here?

• Jason Brownlee September 5, 2020 at 6:40 am #

Thanks for sharing.

Perhaps try increasing the sample size and see if it influences the result.

25. Manuel Gonçalves October 3, 2020 at 11:49 am #

How to compute CI with cross-validation? Do we use the CI on mean results? Is the n value the test portion or the sum of them? The formula used in https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics , multiplying std * 2, was unclear.

• Jason Brownlee October 3, 2020 at 12:30 pm #

You can use the bootstrap method to get a reliable estimate.

• Manuel Gonçalves October 4, 2020 at 11:12 am #

Perhaps an unreliable estimate. Is there a reference or paper/book where this formula came from? Cross-val will give me a list of accuracies [acc1, acc2, acc3, … acc30] and I just compute the mean +/- 2 * std to represent the C.I. (CI = 2 * std). What about the “n” value, or the 1.96 z value?

• Jason Brownlee October 4, 2020 at 2:58 pm #
• Manuel Gonçalves October 24, 2020 at 1:03 am #

Thanks a lot for the comments… So what do you think of this scenario?

Step one: Split train/validation 80/20 and use the train portion (80%) in cross-validation to get performance metrics to show as means and std.
Step two: Train a final model with bootstrap on the 20% left and compute performance with confidence intervals.

Is this a valid scenario? Any recommendations, or can I use nested cross-validation for this? Is it valid to do bootstrap inside a CV loop?

• Jason Brownlee October 24, 2020 at 7:05 am #

Not sure I like it – unless you have TONS of data. But if it works for you, go for it!

Nested CV is good for choosing a model and hyperparameters in a single test harness.

Final evaluation for reporting could be a bootstrap on the entire dataset you have available.

• Manuel Gonçalves October 4, 2020 at 11:24 am #

There is an open discussion concerning this formula here: https://github.com/scikit-learn/scikit-learn/issues/6059

26. Manuel October 24, 2020 at 1:54 am #

This post concludes that there is no way to compute a reliable CI inside a cross-validation scheme. What about the dozens of papers that show CIs from CV results?

• Jason Brownlee October 24, 2020 at 7:07 am #

They are likely summarizing the mean and standard deviation of the CV process itself. This is very common.

27. Manuel October 27, 2020 at 3:55 am #

In your tutorial, the CI was computed on a single execution of train/test split, but what about repeated executions, e.g., 30 times?

28. Sourabh April 19, 2021 at 9:39 pm #

Hi Jason ,
What does the term error * (1 - error) tell us?

• Jason Brownlee April 20, 2021 at 5:57 am #