How to Report Classifier Performance with Confidence Intervals

Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.

This is important so that you can set the expectations for the model on new data.

A common mistake is to report the classification accuracy of the model alone.

In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.

Let’s get started.


Classification Accuracy

The skill of a classification machine learning algorithm is often reported as classification accuracy.

This is the percentage of correct predictions out of all predictions made. It is calculated as follows:

accuracy = (correct predictions / total predictions) * 100

A classifier may have an accuracy such as 60% or 90%, and how good this is only has meaning in the context of the problem domain.
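As a sketch, accuracy can be computed directly from the true and predicted labels (the label values below are illustrative):

```python
def classification_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) * 100.0

# 8 of 10 predictions are correct -> 80% accuracy
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
accuracy = classification_accuracy(y_true, y_pred)
```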

Classification Error

When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.

This is because stakeholders assume that models perform well; what they may really want to know is how prone the model is to making mistakes.

You can calculate classification error as the ratio of incorrect predictions to the total number of predictions made, expressed as a value between 0 and 1.

error = incorrect predictions / total predictions

A classifier may have an error of 0.25 or 0.02.

This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
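A minimal sketch of the error calculation (the label values are illustrative):

```python
def classification_error(y_true, y_pred):
    """Fraction of predictions that do not match the true labels (0.0 to 1.0)."""
    incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return incorrect / len(y_true)

# 1 mistake in 50 predictions -> error of 0.02, or 2%
y_true = [0] * 49 + [1]
y_pred = [0] * 50
error = classification_error(y_true, y_pred)
percent_error = error * 100.0
```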

Validation Dataset

What dataset do you use to calculate model skill?

It is a good practice to hold out a validation dataset from the modeling process.

This means a sample is randomly selected and removed from the available data, such that it is not used during model selection or configuration.

After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.
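A minimal sketch of the holdout split itself, using only the standard library (the dataset rows are illustrative):

```python
import random

# illustrative dataset: a list of (features, label) rows
data = [([i, i % 3], i % 2) for i in range(100)]

# randomly select 30% of rows as a held-out validation set;
# the remaining rows are used for training and model selection
random.seed(1)
random.shuffle(data)
n_val = int(len(data) * 0.3)
validation, training = data[:n_val], data[n_val:]

# the final model is fit on `training` only; predictions on
# `validation` are then used to calculate accuracy or error
```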

Confidence Interval

Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.

A confidence interval is comprised of two things:

  • Range. This is the lower and upper limit on the skill that can be expected from the model.
  • Probability. This is the probability that the skill of the model will fall within the range.

In general, the confidence interval for classification error can be calculated as follows:

error +/- const * sqrt( (error * (1 - error)) / n )

Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is the normal approximation interval (also called the Wald interval) for a binomial proportion.

The values for const are provided from statistics, and common values used are:

  • 1.64 (90%)
  • 1.96 (95%)
  • 2.33 (98%)
  • 2.58 (99%)

Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:

  • Observations in the validation data set were drawn from the domain independently (e.g. they are independent and identically distributed).
  • At least 30 observations were used to evaluate the model.

This is based on sampling theory: the errors of a classifier follow a binomial distribution, with sufficient observations the binomial distribution can be approximated by a normal distribution, and by the central limit theorem, the more observations we use to evaluate the model, the closer our estimate will be to the true, but unknown, model skill.

Confidence Interval Example

Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).

We can calculate the 95% confidence interval (const = 1.96) as follows:

error +/- const * sqrt( (error * (1 - error)) / n )
0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50 )
0.02 +/- 1.96 * sqrt(0.000392)
0.02 +/- 0.0388

Or, stated another way:

There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.

Notice that the confidence intervals on the classification error must be clipped to the values 0.0 and 1.0. It is impossible to have a negative error (e.g. less than 0.0) or an error more than 1.0.
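Putting the pieces together, a minimal sketch of the interval calculation, including the clipping step (the function name is illustrative):

```python
from math import sqrt

def error_confidence_interval(error, n, const=1.96):
    """Normal-approximation confidence interval on classification error,
    clipped to the valid range [0.0, 1.0]."""
    interval = const * sqrt((error * (1.0 - error)) / n)
    lower = max(0.0, error - interval)
    upper = min(1.0, error + interval)
    return lower, upper

# worked example from the text: error = 0.02, n = 50, 95% confidence
lower, upper = error_confidence_interval(0.02, 50)
```

Without the clipping, the lower bound here would be negative, which is impossible for a classification error.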

Summary

In this post, you discovered how to calculate confidence intervals for your classifier.

Specifically, you learned:

  • How to calculate classification accuracy and classification error when reporting results.
  • What dataset to use when calculating model skill that is to be reported.
  • How to calculate a lower and upper bound on classification error for a chosen level of likelihood.

Do you have any questions about classifier confidence intervals?
Ask your questions in the comments below.

12 Responses to How to Report Classifier Performance with Confidence Intervals

  1. Birkey June 2, 2017 at 3:12 pm #

    How does this (confidence interval) differ from the F1 score, which is widely used and, IMHO, easier to comprehend, since it's one score covering both precision and recall?

    • Jason Brownlee June 3, 2017 at 7:20 am #

      The F1 is a skill measure for the model. It could be accuracy or anything else.

      In this post, we are talking about the confidence (uncertainty) on the calculated skill score.

  2. Elie Kawerk June 3, 2017 at 3:17 am #

    Hi Jason,

    Thank you for the nice post. This error confidence interval that you report corresponds to binary classification only. How about multi-class classification?


    • Jason Brownlee June 3, 2017 at 7:25 am #

      Really great question. I expect you would use logloss or AUC and report confidence on that.

      • Elie Kawerk June 3, 2017 at 5:03 pm #

        I see,

        But then the expression of the confidence interval (for AUC or any other metric) would be different I presume since the process wouldn’t be described using the binomial distribution.

        For multi-class classification, wouldn’t the distribution be a multinomial distribution? And in this case the expression for the error confidence interval would change I presume.


        • Jason Brownlee June 4, 2017 at 7:49 am #

          I see, yes you are correct. I would recommend an empirical approach to summarizing the distribution using the bootstrap method (a post is scheduled).

  3. Jonad June 4, 2017 at 1:23 am #

    Hi Jason,
    Really good post. But I have a question. Does the classification error differ if we use a different skill – for instance F1-score – for our model?

    • Jason Brownlee June 4, 2017 at 7:54 am #

      Hi Jonad,

      Different measures will evaluate skill in different ways. They will provide different perspectives on the same underlying model error.

      Does that make sense?

      • jonad June 5, 2017 at 2:45 am #

        Yes, I was thinking that the classification error formula ( incorrect predictions / total predictions) might differ depending on the evaluation metrics. Now I understand it better.

  4. Simone June 7, 2017 at 9:25 pm #

    Great post!
    How could I use confidence intervals and cross-validation together?

    • Jason Brownlee June 8, 2017 at 7:42 am #

      It’s a tough one, we are generally interested in the variance of model skill during model selection and during the presentation of the final model.

      Often standard deviation of CV score is used to capture model skill variance, perhaps that is generally sufficient and we can leave confidence intervals for presenting the final model or specific predictions?

      I’m open to better ideas.

      • Simone June 9, 2017 at 6:32 pm #

        Ok, Thanks!
        The last question: when I'm using k-fold CV, is the value of 'n' equal to the number of all observations, or all observations minus k?
