
Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.

This is important so that you can set the expectations for the model on new data.

A common mistake is to report the classification accuracy of the model alone.

In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.


Let’s get started.

## Classification Accuracy

The skill of a classification machine learning algorithm is often reported as classification accuracy.

This is the percentage of the correct predictions from all predictions made. It is calculated as follows:

    classification accuracy = correct predictions / total predictions * 100.0

A classifier may have an accuracy such as 60% or 90%, and how good this is only has meaning in the context of the problem domain.
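As a minimal Python sketch of the calculation above (the function name and the sample labels are illustrative assumptions, not part of the original post):

```python
def classification_accuracy(y_true, y_pred):
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) * 100.0

# Hypothetical labels, for illustration only: 8 of 10 predictions are correct
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
accuracy = classification_accuracy(y_true, y_pred)
print(f"accuracy = {accuracy:.1f}%")
```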

## Classification Error

When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.

This is because stakeholders often assume that models perform well; what they may really want to know is how prone a model is to making mistakes.

You can calculate classification error as the proportion of incorrect predictions out of all predictions made, expressed as a value between 0 and 1.

    classification error = incorrect predictions / total predictions

A classifier may have an error of 0.25 or 0.02.

This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
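The same idea can be sketched in Python. The function name and the toy labels here are hypothetical and only serve to show the arithmetic:

```python
def classification_error(y_true, y_pred):
    """Return the fraction of predictions that do not match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return incorrect / len(y_true)

# Hypothetical labels, for illustration only: 2 of 8 predictions are wrong
y_true = ["cat", "dog", "dog", "cat", "bird", "dog", "cat", "bird"]
y_pred = ["cat", "dog", "cat", "cat", "bird", "dog", "dog", "bird"]
error = classification_error(y_true, y_pred)
print(f"error = {error:.2f} ({error * 100.0:.0f}%)")
```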

## Validation Dataset

What dataset do you use to calculate model skill?

It is a good practice to hold out a validation dataset from the modeling process.

This means a sample of the available data is randomly selected and removed from the available data, such that it is not used during model selection or configuration.

After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.
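A hold-out split can be sketched with the standard library alone. The function name, the 20% validation fraction and the seed below are assumptions chosen for illustration; the post does not prescribe a split size.

```python
import random

def holdout_split(rows, validation_fraction=0.2, seed=1):
    """Randomly hold out a fraction of the rows as a validation set."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    # The validation rows are removed from the data used for modeling
    return shuffled[n_validation:], shuffled[:n_validation]

# Hypothetical dataset of 100 row identifiers
rows = list(range(100))
train, validation = holdout_split(rows)
print(len(train), len(validation))
```

The validation rows are set aside before any model selection or tuning, and are only used once the final model has been prepared.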

### Need help with Statistics for Machine Learning?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

## Confidence Interval

Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.

A confidence interval is comprised of two things:

- **Range**. This is the lower and upper limit on the skill that can be expected on the model.
- **Probability**. This is the probability that the skill of the model will fall within the range.

In general, the confidence interval for classification error can be calculated as follows:

    error +/- const * sqrt( (error * (1 - error)) / n)

Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is called the normal approximation interval (also known as the Wald interval) for a binomial proportion, not the Wilson score interval.

The values for const are provided from statistics, and common values used are:

- 1.64 (90%)
- 1.96 (95%)
- 2.33 (98%)
- 2.58 (99%)
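These constants are two-sided critical values of the standard normal distribution. As a sketch, assuming Python 3.8 or later, they can be recovered with `statistics.NormalDist` from the standard library (the helper name `critical_value` is an illustrative choice):

```python
from statistics import NormalDist

def critical_value(confidence):
    """Two-sided standard normal critical value for a confidence level in (0, 1)."""
    # Half of the excluded probability sits in each tail
    return NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {critical_value(level):.3f}")
```

The values in the list above are these critical values rounded to two decimal places.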

Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

- Observations in the validation data set were drawn from the domain independently (e.g. they are independent and identically distributed).
- At least 30 observations were used to evaluate the model.

This is based on sampling theory. The number of errors a classifier makes is treated as following a binomial distribution, the central limit theorem justifies approximating that binomial distribution with a normal distribution once there are enough observations, and the more observations we classify, the closer the estimate gets to the true, but unknown, model skill.

## Confidence Interval Example

Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).

We can calculate the 95% confidence interval (const = 1.96) as follows:

    error +/- const * sqrt( (error * (1 - error)) / n)
    0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50)
    0.02 +/- 1.96 * sqrt(0.0196 / 50)
    0.02 +/- 1.96 * 0.0198
    0.02 +/- 0.0388

Or, stated another way:

There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.

Notice that the confidence intervals on the classification error must be clipped to the values 0.0 and 1.0. It is impossible to have a negative error (e.g. less than 0.0) or an error more than 1.0.
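The worked example, including the clipping step, can be reproduced with a short Python sketch. The function name is an illustrative assumption; the formula is the one given in this post.

```python
from math import sqrt

def error_confidence_interval(error, n, const=1.96):
    """Interval for classification error: error +/- const * sqrt(error * (1 - error) / n)."""
    margin = const * sqrt(error * (1.0 - error) / n)
    # Clip to the valid range for an error rate
    lower = max(0.0, error - margin)
    upper = min(1.0, error + margin)
    return lower, upper

# error = 0.02 on a validation dataset with n = 50, at 95% (const = 1.96)
lower, upper = error_confidence_interval(0.02, 50)
print(f"[{lower:.4f}, {upper:.4f}]")
```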

## Further Reading

- Chapter 5, Machine Learning, 1997
- Binomial proportion confidence interval on Wikipedia
- Confidence Interval on Wikipedia

## Summary

In this post, you discovered how to calculate confidence intervals for your classifier.

Specifically, you learned:

- How to calculate classification accuracy and classification error when reporting results.
- What dataset to use when calculating model skill that is to be reported.
- How to calculate a lower and upper bound on classification error for a chosen level of likelihood.

Do you have any questions about classifier confidence intervals?

Ask your questions in the comments below.

How does this (confidence interval) differ from the F1 score, which is widely used and, IMHO, easier to comprehend, since it is one score that covers both precision and recall?

The F1 is a skill measure for the model. It could be accuracy or anything else.

In this post, we are talking about the confidence (uncertainty) on the calculated skill score.

Hi Jason,

Thank you for the nice post. This error confidence interval that you report corresponds to binary classification only. How about multi-class classification?

Regards

Really great question. I expect you would use logloss or AUC and report confidence on that.

I see,

But then the expression of the confidence interval (for AUC or any other metric) would be different I presume since the process wouldn’t be described using the binomial distribution.

For multi-class classification, wouldn’t the distribution be a multinomial distribution? And in this case the expression for the error confidence interval would change I presume.

Regards

Elie

I see, yes you are correct. I would recommend an empirical approach to summarizing the distribution using the bootstrap method (a post is scheduled).
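A minimal sketch of that empirical approach, assuming a percentile bootstrap over the validation predictions (the function name, the number of resamples and the toy three-class labels are all illustrative assumptions):

```python
import random

def bootstrap_accuracy_interval(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for accuracy; works for any number of classes."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        # Resample row indices with replacement, keeping each
        # (truth, prediction) pair together
        indices = [rng.randrange(n) for _ in range(n)]
        correct = sum(1 for i in indices if y_true[i] == y_pred[i])
        scores.append(correct / n)
    scores.sort()
    lower = scores[int(alpha / 2.0 * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1.0 - alpha / 2.0) * n_resamples))]
    return lower, upper

# Hypothetical three-class labels with 12 errors in 60 predictions (accuracy 0.8)
y_true = [0, 1, 2] * 20
y_pred = list(y_true)
for i in range(0, 60, 5):
    y_pred[i] = (y_true[i] + 1) % 3

lower, upper = bootstrap_accuracy_interval(y_true, y_pred)
print(f"95% bootstrap interval for accuracy: [{lower:.3f}, {upper:.3f}]")
```

Because the interval is read off the resampled scores directly, it does not rely on a binomial or multinomial formula.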

Hi Jason,

Really good post. But I have a question. Does the classification error differ if we use a different skill – for instance F1-score – for our model?

Thanks

Hi Jonad,

Different measures will evaluate skill in different ways. They will provide different perspectives on the same underlying model error.

Does that make sense?

Yes, I was thinking that the classification error formula ( incorrect predictions / total predictions) might differ depending on the evaluation metrics. Now I understand it better.

Thanks

Great post!

How could I use confidence intervals and cross-validation together?

It’s a tough one, we are generally interested in the variance of model skill during model selection and during the presentation of the final model.

Often standard deviation of CV score is used to capture model skill variance, perhaps that is generally sufficient and we can leave confidence intervals for presenting the final model or specific predictions?

I’m open to better ideas.

Ok, Thanks!

The last question: when I’m using k-fold CV, is the value of ‘n’ equal to the number of all observations, or to all observations - k?

@Simone the value of n is AFAIK an empirical value that is chosen to be 5 or 10. Jason explains that very well (as usual) in this post:

https://machinelearningmastery.com/k-fold-cross-validation/

Ah @Simone, by the way, if ‘n’ is equal to the number of all observations that is a type of cross-validation that is called LOOCV (Leave-one-out cross-validation) and uses a single observation from the original sample as the validation data, and the remaining observations as the training data.

@Jason what about this? (I haven’t fully read it yet and I’m struggling to understand it but .. well, I think it might work as well):

Mach Learn. 2018; 107(12): 1895–1922. Published online 2018 May 9. doi: 10.1007/s10994-018-5714-4. PMCID: PMC6191021, PMID: 30393425. Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Ioannis Tsamardinos, Elissavet Greasidou (corresponding author), and Giorgos Borboudakis.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6191021/

Yes, it is a great method. I have many posts on it, you can start here:

https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/

I recommend it for summarizing final model performance, rather than model selection. CV is better for model selection.

Well @Simone, from the point of view of a developer if you take a look at the scikit-learn documentation and go over the section “3.1.1. Computing cross-validated metrics” (https://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics) you will see that the 95% confidence interval of the score estimate is reported as Jason states in this post.

Hi Jason

Is there R code for calculating the CI and graphing them?

Thanks

I bet there is, I don’t have it on hand, sorry.

The error is just the reverse of the accuracy, wouldn’t that be a simpler statement to make?

This leads to the fundamental problem that accuracy or classification error itself is often mediocre to useless metric because data sets usually are imbalanced. And hence the confidence on that error is just as useless.

I found this post for a different reason, as I wanted to find out if anyone else does what I do, namely provide metrics grouped by class probability. What is the precision if the model has 0.9 class probability vs 0.6, for example? That can be very useful information for end users because the metric will often vary greatly based on class probability.

Yes, the classification error is the inverse of the classification accuracy.

You can use a different measure to overcome imbalance:

https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/

Thomas, I think I’ve done what you described. I wrote a function to calculate a handful of different performance metrics at different probability cutoffs and stored the results in a data frame. This helped me choose a probability cutoff that balanced the needs of the business. I can share the code if it’s what you’re looking for.

Hi Jason,

Nice post. When calculating the confidence interval for error, AUC or other metrics, the standard error of the metric is needed. How should I calculate the standard error?

Great question, here is the equation:

https://en.wikipedia.org/wiki/Standard_error

Thanks for replying. Does this mean I need to get multiple errors by running multiple times (bootstrap or cross-validation) to calculate the standard error?

Yes, if you are looking to calculate the standard error of the bootstrap result distribution.

Hi Jason,

I am trying to group my customers. Say GAP HK, GAP US should be under the group customer GAP.

Few of the customers are already grouped. Say GAP HK is grouped under GAP but GAP US is not.

I am using random forest classifier. I used already grouped customer name as training data. Group customer code is the label that I am trying to predict.

The classifier is assigning labels as expected. The problem I am facing is that the classifier also assigns a group customer code to customers whose names do not closely match the training data. It is doing the best possible match. This is a problem for me because I need to manually ungroup these customers. Can you suggest how to overcome this problem? Is it possible to know the classifier’s probability of being correct for each predicted label? If yes, then I can ignore the ones with low probability.

Thank you in advance for advice.

Perhaps you can predict probabilities instead and only accept the high probability predictions?
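As a rough sketch of that idea: the customer names, the probabilities and the 0.8 cutoff below are made up for illustration. With a scikit-learn random forest, the per-class probabilities would typically come from `predict_proba`, which is an assumption about the setup and not something stated in the question.

```python
def accept_confident(predictions, threshold=0.8):
    """Split (item, label, probability) triples by a probability cutoff."""
    accepted, rejected = [], []
    for item, label, probability in predictions:
        if probability >= threshold:
            accepted.append((item, label, probability))
        else:
            rejected.append((item, label, probability))
    return accepted, rejected

# Hypothetical predictions: (customer name, predicted group, predicted probability)
predictions = [
    ("GAP US", "GAP", 0.93),
    ("GAP HK LTD", "GAP", 0.88),
    ("GALLOP TRADING", "GAP", 0.41),
]
accepted, rejected = accept_confident(predictions)
print("grouped automatically:", [name for name, _, _ in accepted])
print("left for manual review:", [name for name, _, _ in rejected])
```

The cutoff itself is a judgment call and is best chosen by checking error rates at several candidate values on held-out data.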

No model is perfect, we must expect some error.

https://machinelearningmastery.com/faq/single-faq/why-cant-i-get-100-accuracy-or-zero-error-with-my-model

Nevertheless, these ideas may help you lift the skill of your model:

http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/

Hi Jason,

I am not sure if anyone else brought this up, but I’ve found one issue here. The confidence interval you suggested is not the “Wilson score interval” according to the Wikipedia page (which is cited in that link). It’s actually the “Normal approximation interval”, which is described above the Wilson score paragraph. Correct me if I am wrong.

Thanks

-Anish

Thanks Anish.
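To make the distinction concrete, here is a sketch comparing the two intervals on the post's example (error = 0.02, n = 50). The Wilson formula follows the standard definition as given on the Wikipedia page on binomial proportion confidence intervals; the function names are illustrative.

```python
from math import sqrt

def normal_approximation_interval(p, n, z=1.96):
    """The formula used in the post: p +/- z * sqrt(p * (1 - p) / n), clipped to [0, 1]."""
    margin = z * sqrt(p * (1.0 - p) / n)
    return max(0.0, p - margin), min(1.0, p + margin)

def wilson_score_interval(p, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    denominator = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denominator
    margin = z * sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denominator
    return centre - margin, centre + margin

normal = normal_approximation_interval(0.02, 50)
wilson = wilson_score_interval(0.02, 50)
print(f"normal approximation: [{normal[0]:.4f}, {normal[1]:.4f}]")
print(f"Wilson score:         [{wilson[0]:.4f}, {wilson[1]:.4f}]")
```

The Wilson interval stays inside [0, 1] without clipping and is not centred on the observed error, which matters most when the error is close to 0 or 1 and n is small.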

Hi Jason,

I’m interested on the relation of Cross Validation and this approach.

With 150 examples I decide to use a 100 repeated 5-fold Cross Validation to understand the behavior of my classifier. At this point I have 100×5 results and I can use the mean and std dev of the error rates to estimate the variance of the model skills:

mean(errorRate) +/- 1.96*(std(errorRate))

I could estimate the confidence interval of the true error (that I would obtain on the unseen data) using the average error rate:

mean(errorRate) +/- const * sqrt( (mean(errorRate) * (1 - mean(errorRate))) / n)

Two questions:

1. Do you think this approach is correct?

2. Is it correct to set n=150 in the second equation, or should I use the average number of examples used as the test set in each fold of CV?

You have 5 results from a 5-fold CV. The results are somewhat dependent, they are not iid.

You can use the Gaussian confidence interval, with a grain of salt. You could also use the bootstrap.

I explain more here:

https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

Hi Jason, thanks for the great posts on confidence intervals/ bootstraps for machine learning.

Suppose you use

A) 5-fold CV

B) 30-fold CV

for model evaluation. You pick the final model and train it on all the data at hand.

What are the options one has for reporting on final model skill with a range for uncertainty in each case?

Should one have still held out a number of datapoints for validation+binomial confidence interval?

Is it too late to use the bootstrap confidence intervals as the final model was trained?

Thanks

Not sure I follow your question?

Pick a final model and use a preferred method to report expected performance. It is unrelated to how you chose that model.

Thanks Jason. I found your other post https://machinelearningmastery.com/difference-test-validation-datasets/ very helpful.

Can I confirm that the above procedure of reporting classifier performance with confidence intervals is relevant for the final trained model? If that is so, it seems that the validation dataset mentioned should be called test set to align with the definitions of the linked post?

Yes.

Hi Jason,

Thank you for the post!

In your example you use accuracy and error rate and calculate a confidence interval.

Can one replace “error rate” with, say, precision, recall or f1? Why and why not?

For example, say we have a sample size=50, f1=0.02

Does that mean …

there is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true F1 of the model on unseen data?

Thanks!

Perhaps for some scores. The example in this post is specific to a ratio. I believe you can use it for other ratios like precision, recall and F1.

Hi Jason

Thank you for your post.

How do I get the standard error of the AUC curve in Python?

Not sure I follow. Standard error refers to a statistical quantity on a distribution, not sure how you would calculate it for a curve.

Hello Jason,

I was wondering if I can compute confidence interval for Recall and Precision. If yes can you explain how can I do this?

Thank you so much,

best regards

Lorenzo

Yes, I expect the bootstrap would be a good place to start.
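A minimal sketch of what that could look like for a binary problem, assuming a percentile bootstrap over the validation predictions. The function names, the resample count and the toy labels are illustrative assumptions.

```python
import random

def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall); a score is None when it is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

def bootstrap_intervals(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=1):
    """Percentile bootstrap intervals for precision and recall."""
    rng = random.Random(seed)
    n = len(y_true)
    precisions, recalls = [], []
    for _ in range(n_resamples):
        indices = [rng.randrange(n) for _ in range(n)]
        p, r = precision_recall([y_true[i] for i in indices],
                                [y_pred[i] for i in indices])
        # Skip the rare resample where a score is undefined
        if p is not None:
            precisions.append(p)
        if r is not None:
            recalls.append(r)

    def percentile_interval(scores):
        scores = sorted(scores)
        lower = scores[int(alpha / 2.0 * len(scores))]
        upper = scores[min(len(scores) - 1, int((1.0 - alpha / 2.0) * len(scores)))]
        return lower, upper

    return percentile_interval(precisions), percentile_interval(recalls)

# Hypothetical predictions: 30 true positives, 10 false negatives, 6 false positives
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 30 + [0] * 10 + [1] * 6 + [0] * 54
precision, recall = precision_recall(y_true, y_pred)
precision_ci, recall_ci = bootstrap_intervals(y_true, y_pred)
print(f"precision {precision:.3f}, 95% interval {precision_ci}")
print(f"recall {recall:.3f}, 95% interval {recall_ci}")
```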

Another excellent post Jason. Thank you.

There might be a minor typo in “[0.0, 0.0588]” – should be [0.0, 0.0388] I think.

Thanks.

Nope, 0.02 + 0.0388 = 0.0588.

I am running a classifier with a training set of 41 and a validation set of 14 (55 total observations). I rerun this 50 times with different random slices of the data as training and test. Obviously I cannot make confidence intervals with this small validation set.

However, because I am rerunning it with different training and validation slices, can I get the mean error rate over the 50 tests and calculate a confidence interval?

const * sqrt( (error * (1 - error)) / n)

N would be 700 (14*50). If I had 50 tests which averaged out to an accuracy of 77.4% (error is 0.226), the confidence intervals would be 0.26 and 0.2

Does this work? or would these confidence intervals be unreliable?

Thanks for your excellent article

Dan

Why would they be unreliable?

What are you worried about exactly – the dataset selection for each trial?

Hi Mr Brownlee,

Thank you for the quick reply and apologies for my late response. I am dealing with social science data and the validation set is rather limited. I am worried about ascertaining confidence intervals for a limited validation sample.

A friend of mine came up with a solution in which I keep all the accuracy outputs in a vector and plot them like a histogram (I can’t seem to paste one into this reply window but can send it over if necessary by email).

Would I be able to get the confidence intervals by looking at the 5th and 95th percentile of the accuracy vector?

Would there be an advantage to randomly sampling (bootstrapping) with replacement over without replacement?

Thanks again

Dan

Just repeats without bootstrap? I think the distribution would only capture the variance in the model, not the data.

I would encourage you to use the bootstrap to calculate the distribution of accuracy scores.

Thank you for the quick reply. The method I am currently using is subsampling. Randomly selecting different observations for the training set and the validation set 50 times, and collecting the accuracy scores to make a distribution of accuracy scores. I believe this is called subsampling. But I am happy to use bootstrapping instead.

Just to clarify, I am using the bootstrap on the data for partitioning the training and validation set correct? This means that an observation in the training set can also end up in the validation set.

Is there an article you could recommend which explains why bootstrapping is better than subsampling (I have taken up enough of your time and really appreciate all of your help so far)

Perhaps start here:

https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/

Then here:

https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/

Thanks Jason,

Final question,

My trained model has the below output and the best tuned number of neighbours is 5.

I am creating confidence intervals through creating a histogram of the accuracies in the resample. Is there a way to subset the resample results in which the model is best tuned (in this case k = 5)?

No pre-processing

Resampling: Bootstrapped (1000 reps)

Summary of sample sizes: 32, 32, 32, 32, 32, 32, …

Resampling results across tuning parameters:

k Accuracy Kappa

5 0.7262690 0.4593792

7 0.6830904 0.3830819

9 0.6655405 0.3522427

Accuracy was used to select the optimal model using the largest value.

The final value used for the model was k = 5.

I recommend using a new procedure with the chosen config and the bootstrap to estimate the confidence intervals on model performance.

sorry as an update, I am selecting ‘final’ as the returnResamp argument in the train control method. I believe that this should retain the resamples from only the best-tuned model

But typically when I check the mean of the resample, mean(model$resample$Accuracy), the mean is lower than the k=5 accuracy (typically 0.65). Is there a reason for this? I would have thought that the mean accuracy of the best tune resamples would equal the model accuracy in the results.

After this I promise to leave you alone (and thanks for your patience so far)

This was very helpful for classification.

How would I go about calculating confidence intervals for regression analysis?

Could I use the same formula, but with MAE or RMSE in place of classification error?

Good question

Probably using the standard deviation of the error score around the mean, e.g. +/- 2 or 3 standard deviations will cover the expected behaviour of the system:

https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
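A rough sketch of that suggestion, assuming the prediction errors are approximately Gaussian (the target values and predictions below are made up for illustration):

```python
from statistics import mean, stdev

# Hypothetical regression targets and predictions
y_true = [3.1, 2.4, 5.0, 4.2, 3.8, 6.1, 2.9, 4.7]
y_pred = [2.9, 2.8, 4.6, 4.5, 3.5, 6.4, 3.0, 4.1]

# Signed errors, their centre and their spread
errors = [p - t for t, p in zip(y_true, y_pred)]
centre = mean(errors)
spread = stdev(errors)

# Mean absolute error, reported alongside the band
mae = mean([abs(e) for e in errors])

# Roughly 95% of errors fall within 2 standard deviations if they are Gaussian
lower = centre - 2.0 * spread
upper = centre + 2.0 * spread
print(f"MAE = {mae:.3f}")
print(f"expected error range: [{lower:.3f}, {upper:.3f}]")
```

This describes the spread of individual prediction errors. It is not a confidence interval on the MAE or RMSE itself, for which a bootstrap over the validation rows would be one option.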

Hi Jason, my question is not too related to this topic, only slightly

I have a neural network (MLP) for binary classification with a logistic output between 0 and 1. With each run, I have to adjust my threshold on the test set to minimize the misclassifications. My question is: to present my results, should I run it multiple times, adjust the threshold each time and then take the average of other metrics, e.g. F1 score, or should I not optimize the threshold at all?

Hmmm, good question.

I would take the test as an evaluation of the “system” that includes the model and automatic threshold adjusting procedure. In that case, averaging the results of the whole system is reasonable, as long as you clearly state that is what you are doing.

Very cool!

Hi @Jason I have a question related to this topic:

In the following thesis http://arno.uvt.nl/show.cgi?fid=147278 the author computes the AUC standard deviation as a measure of robustness.

Let’s say I have run a 10-times repeated 10-fold cross-validation experiment with predictions implemented via a Markov chain model. As a measure of robustness, I want to compute the SD of the AUC across the runs/folds for the test set.

Intuitively, a relatively small standard deviation implies that the model produces stable results in distinguishing conversion from non-conversion.

The project is based on a 10x10 cross-validation procedure, which as a consequence generates 100 AUCs.

Now, to derive the AUC’s SD summarizing the model I understand the process should be (it is not well defined in the paper):

a. The SD is computed across folds based on the actual folds’ AUC values (we derive 10 SDs, one for each repetition).

b. The SDs obtained at step a are then averaged across runs (it leaves us with 1 SD).

However, as there is not much literature on the topic, I want to know if someone can validate the reasoning above.

Any help or suggestion appreciated

Yes.

You would collect a sample of AUC scores across all repeats and all folds. Then calculate summary stats like the mean and standard deviation.
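As a short sketch of that pooling step: the AUC values below are made up, and the layout is 3 repeats of 4 folds only to keep the example small.

```python
from statistics import mean, stdev

# Hypothetical AUC scores: one inner list per repeat, one value per fold
auc_by_repeat = [
    [0.81, 0.79, 0.84, 0.80],
    [0.78, 0.83, 0.82, 0.79],
    [0.80, 0.85, 0.77, 0.82],
]

# Pool every fold of every repeat into a single sample of scores
all_scores = [auc for repeat in auc_by_repeat for auc in repeat]
auc_mean = mean(all_scores)
auc_sd = stdev(all_scores)
print(f"AUC: mean = {auc_mean:.3f}, sd = {auc_sd:.3f}, n = {len(all_scores)}")
```

Pooling gives one standard deviation over all scores, which differs from averaging a separate standard deviation per repeat.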