Once you choose a machine learning algorithm for your classification problem, you need to report the performance of the model to stakeholders.
This is important so that you can set the expectations for the model on new data.
A common mistake is to report the classification accuracy of the model alone.
In this post, you will discover how to calculate confidence intervals on the performance of your model to provide a calibrated and robust indication of your model’s skill.
Let’s get started.
The skill of a classification machine learning algorithm is often reported as classification accuracy.
This is the percentage of all predictions made that are correct. It is calculated as follows:
classification accuracy = correct predictions / total predictions * 100.0
A classifier may have an accuracy such as 60% or 90%, and how good this is only has meaning in the context of the problem domain.
When talking about a model to stakeholders, it may be more relevant to talk about classification error or just error.
This is because stakeholders often assume that models perform well; what they may really want to know is how prone a model is to making mistakes.
You can calculate classification error as the proportion of incorrect predictions out of all predictions made, expressed as a value between 0 and 1.
classification error = incorrect predictions / total predictions
A classifier may have an error of 0.25 or 0.02.
This value too can be converted to a percentage by multiplying it by 100. For example, 0.02 would become (0.02 * 100.0) or 2% classification error.
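The two calculations above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names and the example labels are made up for this sketch and do not come from any particular library.

```python
def classification_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels (0 to 100)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true) * 100.0


def classification_error(y_true, y_pred):
    """Proportion of predictions that differ from the true labels (0 to 1)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    incorrect = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return incorrect / len(y_true)


# Hypothetical labels, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = classification_accuracy(y_true, y_pred)
error = classification_error(y_true, y_pred)
print(f"accuracy: {accuracy:.1f}%")                    # accuracy: 75.0%
print(f"error: {error:.2f} ({error * 100.0:.0f}%)")    # error: 0.25 (25%)
```

Note that accuracy as a percentage and error as a percentage always sum to 100 for the same set of predictions.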
What dataset do you use to calculate model skill?
It is a good practice to hold out a validation dataset from the modeling process.
This means a sample of the available data is randomly selected and removed from the available data, such that it is not used during model selection or configuration.
After the final model has been prepared on the training data, it can be used to make predictions on the validation dataset. These predictions are used to calculate a classification accuracy or classification error.
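As a rough sketch of the hold-out step, the snippet below randomly sets aside a fraction of the rows before any modeling happens. The function name `holdout_split`, the 20% fraction, and the seed are assumptions chosen for illustration; in practice you might use a library utility instead.

```python
import random


def holdout_split(rows, validation_fraction=0.2, seed=1):
    """Randomly split rows into (train, validation) lists with no overlap."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)  # copy so the caller's data is not reordered
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    validation = shuffled[:n_validation]
    train = shuffled[n_validation:]
    return train, validation


# Stand-in for 100 labeled examples; real rows would hold features and labels.
rows = list(range(100))
train, validation = holdout_split(rows)
print(len(train), len(validation))  # 80 20
```

The validation rows are then left alone until the final model is ready, at which point they are used once to estimate accuracy or error.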
Rather than presenting just a single error score, a confidence interval can be calculated and presented as part of the model skill.
A confidence interval consists of two things:
- Range. This is the lower and upper limit on the skill that can be expected of the model.
- Probability. This is the probability that the skill of the model will fall within the range.
In general, the confidence interval for classification error can be calculated as follows:
error +/- const * sqrt( (error * (1 - error)) / n)
Where error is the classification error, const is a constant value that defines the chosen probability, sqrt is the square root function, and n is the number of observations (rows) used to evaluate the model. Technically, this is the normal approximation interval (also called the Wald interval) for a binomial proportion; the Wilson score interval is a related but different formula.
The values for const are provided from statistics, and common values used are:
- 1.64 (90%)
- 1.96 (95%)
- 2.33 (98%)
- 2.58 (99%)
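If you need a constant for a confidence level that is not in the list, it can be derived from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `critical_value` is an assumption made for this example.

```python
from statistics import NormalDist


def critical_value(confidence):
    """Two-sided standard normal critical value for a confidence level in (0, 1)."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    tail = (1.0 - confidence) / 2.0  # probability left in each tail
    return NormalDist().inv_cdf(1.0 - tail)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}: {critical_value(level):.3f}")
```

The printed values agree with the list above up to rounding (for example, the 90% value is closer to 1.645 than to 1.64).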
Use of these confidence intervals makes some assumptions that you need to ensure you can meet. They are:
- Observations in the validation dataset were drawn from the domain independently (i.e. they are independent and identically distributed).
- At least 30 observations were used to evaluate the model.
These assumptions come from sampling theory. The number of errors a classifier makes is treated as following a binomial distribution. With enough observations, the binomial distribution can be approximated by a normal distribution (via the central limit theorem), and the more observations we classify, the closer the estimate gets to the true, but unknown, model skill.
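One way to build intuition for these assumptions is a small simulation: pretend the true error is known, draw many validation sets, and count how often the interval covers the true value. This is only a sketch under made-up settings (a true error of 0.25, 100 observations, 2,000 trials, and a fixed seed are all assumptions for illustration).

```python
import math
import random


def covers_true_error(true_error, n, const, rng):
    """Simulate one validation set and check if the interval covers true_error."""
    # Each prediction is wrong independently with probability true_error.
    mistakes = sum(1 for _ in range(n) if rng.random() < true_error)
    error = mistakes / n
    half_width = const * math.sqrt(error * (1.0 - error) / n)
    return error - half_width <= true_error <= error + half_width


def estimate_coverage(true_error=0.25, n=100, const=1.96, trials=2000, seed=1):
    """Fraction of simulated intervals that contain the true error."""
    rng = random.Random(seed)
    hits = sum(covers_true_error(true_error, n, const, rng) for _ in range(trials))
    return hits / trials


coverage = estimate_coverage()
print(f"fraction of intervals covering the true error: {coverage:.3f}")
```

With settings like these, the fraction should land near the nominal 95%. Trying a much smaller n or a true error close to 0 or 1 is a quick way to see the approximation start to degrade.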
Confidence Interval Example
Consider a model with an error of 0.02 (error = 0.02) on a validation dataset with 50 examples (n = 50).
We can calculate the 95% confidence interval (const = 1.96) as follows:
error +/- const * sqrt( (error * (1 - error)) / n)
0.02 +/- 1.96 * sqrt( (0.02 * (1 - 0.02)) / 50)
0.02 +/- 1.96 * sqrt(0.0196 / 50)
0.02 +/- 1.96 * 0.0198
0.02 +/- 0.0388
Or, stated another way:
There is a 95% likelihood that the confidence interval [0.0, 0.0588] covers the true classification error of the model on unseen data.
Notice that the confidence intervals on the classification error must be clipped to the values 0.0 and 1.0. It is impossible to have a negative error (e.g. less than 0.0) or an error more than 1.0.
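The worked example, including the clipping step, can be reproduced with a short function. This is a minimal sketch of the formula used in this post; the function name `error_confidence_interval` is an assumption made for this example.

```python
import math


def error_confidence_interval(error, n, const=1.96):
    """Interval for classification error, clipped to the range [0.0, 1.0]."""
    if not 0.0 <= error <= 1.0:
        raise ValueError("error must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    half_width = const * math.sqrt(error * (1.0 - error) / n)
    lower = max(0.0, error - half_width)  # error cannot be negative
    upper = min(1.0, error + half_width)  # error cannot exceed 1.0
    return lower, upper


lower, upper = error_confidence_interval(error=0.02, n=50, const=1.96)
print(f"[{lower:.4f}, {upper:.4f}]")  # [0.0000, 0.0588]
```

Without clipping, the lower bound here would be about -0.0188, which is why it is reported as 0.0.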
- Chapter 5, Machine Learning, 1997
- Binomial proportion confidence interval on Wikipedia
- Confidence Interval on Wikipedia
In this post, you discovered how to calculate confidence intervals for your classifier.
Specifically, you learned:
- How to calculate classification accuracy and classification error when reporting results.
- What dataset to use when calculating model skill that is to be reported.
- How to calculate a lower and upper bound on classification error for a chosen level of likelihood.
Do you have any questions about classifier confidence intervals?
Ask your questions in the comments below.