How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python

It is important to present both the expected skill of a machine learning model and a confidence interval for that model skill.

Confidence intervals provide a range of likely model skill scores and the likelihood that the model's skill on new data falls within that range. For example, a 95% likelihood that classification accuracy lies between 70% and 75%.

A robust way to calculate confidence intervals for machine learning algorithms is to use the bootstrap. This is a general technique for estimating statistics that can be used to calculate empirical confidence intervals, regardless of the distribution of skill scores (e.g. non-Gaussian).

In this post, you will discover how to use the bootstrap to calculate confidence intervals for the performance of your machine learning algorithms.

After reading this post, you will know:

  • How to estimate confidence intervals of a statistic using the bootstrap.
  • How to apply this method to evaluate machine learning algorithms.
  • How to implement the bootstrap method for estimating confidence intervals in Python.

Let’s get started.

  • Update June/2017: Fixed a bug where the wrong values were provided to numpy.percentile(). Thanks Elie Kawerk.
  • Update March/2018: Updated link to dataset file.
Photo by Hendrik Wieduwilt, some rights reserved.

Bootstrap Confidence Intervals

Calculating confidence intervals with the bootstrap involves two steps:

  1. Calculate a Population of Statistics
  2. Calculate Confidence Intervals


1. Calculate a Population of Statistics

The first step is to use the bootstrap procedure to resample the original data a number of times and calculate the statistic of interest.

The dataset is sampled with replacement. This means that each time an item is selected from the original dataset, it is not removed from the pool, so it may be selected again for the same sample.

The statistic is calculated on the sample and is stored so that we build up a population of the statistic of interest.

The number of bootstrap repeats governs the variance of the estimate; more is better, often hundreds or thousands of repeats.

We can demonstrate this step with the following pseudocode.
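For example, here is a minimal runnable sketch of this step in Python, using the sample mean as the statistic of interest; the data and number of repeats are purely illustrative:

from numpy import mean
from numpy.random import rand, seed
from sklearn.utils import resample

# illustrative data and settings; in practice, use your own dataset and statistic
seed(1)
data = rand(1000)
n_iterations = 100
statistics = list()
for i in range(n_iterations):
    # draw a bootstrap sample with replacement, the same size as the original data
    sample = resample(data, n_samples=len(data))
    # calculate the statistic of interest on the sample and store it
    statistics.append(mean(sample))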

2. Calculate Confidence Interval

Now that we have a population of the statistics of interest, we can calculate the confidence intervals.

This is done by first ordering the statistics, then selecting the values at the percentiles that correspond to the chosen confidence level. The chosen confidence level in this case is called alpha.

For example, if we were interested in a confidence interval of 95%, then alpha would be 0.95 and we would select the value at the 2.5th percentile as the lower bound and the 97.5th percentile as the upper bound on the statistic of interest.

For example, if we calculated 1,000 statistics from 1,000 bootstrap samples, then the lower bound would be the 25th value and the upper bound would be the 975th value, assuming the list of statistics was ordered.

In this way, we are calculating a non-parametric confidence interval that does not make any assumption about the functional form of the distribution of the statistic. This confidence interval is often called the empirical confidence interval.

We can demonstrate this with pseudocode below.
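Continuing the sketch from step 1, the following shows this step in Python; a placeholder population of statistics is generated here so the snippet runs on its own:

from numpy import percentile
from numpy.random import normal, seed

# placeholder population of statistics standing in for the output of step 1
seed(1)
statistics = normal(loc=0.7, scale=0.02, size=1000)

alpha = 0.95                                      # chosen confidence level
lower_p = ((1.0 - alpha) / 2.0) * 100             # 2.5th percentile
upper_p = (alpha + ((1.0 - alpha) / 2.0)) * 100   # 97.5th percentile
lower = percentile(statistics, lower_p)
upper = percentile(statistics, upper_p)
print('%.0f%% confidence interval: %.3f to %.3f' % (alpha * 100, lower, upper))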

Bootstrap Model Performance

The bootstrap can be used to evaluate the performance of machine learning algorithms.

The size of the sample taken each iteration may be limited to 60% or 80% of the available data. This means that some rows will not be included in the bootstrap sample. These are called out-of-bag (OOB) samples.

A model can then be trained on the bootstrap sample each iteration and evaluated on the out-of-bag samples to give a performance statistic. These statistics are collected, and confidence intervals can be calculated from them.

We can demonstrate this process with the following pseudocode.
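The following is a small runnable stand-in for that pseudocode, using synthetic data and a decision tree purely for illustration:

from numpy import array
from numpy.random import rand, randint, seed
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# synthetic dataset: 3 input columns and a binary class label in the last column
seed(1)
values = rand(500, 4)
values[:, -1] = randint(0, 2, 500)
scores = list()
for i in range(100):
    # bootstrap sample (60% of the data); out-of-bag rows become the test set
    train = resample(values, n_samples=int(len(values) * 0.6))
    test = array([x for x in values if x.tolist() not in train.tolist()])
    # fit and evaluate a model for this iteration
    model = DecisionTreeClassifier()
    model.fit(train[:, :-1], train[:, -1])
    scores.append(accuracy_score(test[:, -1], model.predict(test[:, :-1])))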

Calculate Classification Accuracy Confidence Interval

This section demonstrates how to use the bootstrap to calculate an empirical confidence interval for a machine learning algorithm on a real-world dataset using the Python machine learning library scikit-learn.

This section assumes you have Pandas, NumPy, and Matplotlib installed. If you need help setting up your environment, see the tutorial:

First, download the Pima Indians dataset and place it in your current working directory with the filename “pima-indians-diabetes.data.csv” (update: download here).

We will load the dataset using Pandas.
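For example (a sketch, assuming the filename above and that the file has no header row):

from pandas import read_csv

# load the dataset; the file has no header row
data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values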

Next, we will configure the bootstrap. We will use 1,000 bootstrap iterations and select a sample that is 50% the size of the dataset.
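For example, continuing from the loading snippet above:

# configure the bootstrap: 1,000 iterations, each sample 50% of the dataset size
n_iterations = 1000
n_size = int(len(data) * 0.50)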

Next, we will iterate over the bootstrap.

The sample will be selected with replacement using the resample() function from sklearn. Any rows that were not included in the sample are retrieved and used as the test dataset. Next, a decision tree classifier is fit on the sample and evaluated on the test set, a classification accuracy score is calculated, and the score is added to a list of scores collected across all bootstrap iterations.
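A sketch of this loop, continuing from the configuration above (not necessarily identical to the original listing):

from numpy import array
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

stats = list()
for i in range(n_iterations):
    # prepare the train set (bootstrap sample) and test set (out-of-bag rows)
    train = resample(values, n_samples=n_size)
    test = array([x for x in values if x.tolist() not in train.tolist()])
    # fit a decision tree on the bootstrap sample
    model = DecisionTreeClassifier()
    model.fit(train[:, :-1], train[:, -1])
    # evaluate on the out-of-bag rows and store the accuracy
    predictions = model.predict(test[:, :-1])
    score = accuracy_score(test[:, -1], predictions)
    print(score)
    stats.append(score)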

Once the scores are collected, a histogram is created to give an idea of the distribution of scores. We would generally expect this distribution to be roughly Gaussian, perhaps with some skew, spread symmetrically around the mean.
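For example, continuing from the loop above:

from matplotlib import pyplot

# plot the distribution of accuracy scores collected across the bootstrap
pyplot.hist(stats)
pyplot.show()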

Finally, we can calculate the empirical confidence intervals using the percentile() NumPy function. A 95% confidence interval is used, so the values at the 2.5 and 97.5 percentiles are selected.
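For example, continuing from the scores collected above:

from numpy import percentile

alpha = 0.95
# lower bound at the 2.5th percentile, clipped at 0.0
p = ((1.0 - alpha) / 2.0) * 100
lower = max(0.0, percentile(stats, p))
# upper bound at the 97.5th percentile, clipped at 1.0
p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = min(1.0, percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha * 100, lower * 100, upper * 100))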

Putting this all together, the complete example is listed below.
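The listing below is a reconstruction consistent with the snippets above; it may differ in small details from the original code:

# bootstrap confidence intervals for classification accuracy on the Pima Indians dataset
from numpy import array, percentile
from pandas import read_csv
from sklearn.utils import resample
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from matplotlib import pyplot

# load dataset
data = read_csv('pima-indians-diabetes.data.csv', header=None)
values = data.values
# configure bootstrap
n_iterations = 1000
n_size = int(len(data) * 0.50)
# run bootstrap
stats = list()
for i in range(n_iterations):
    # prepare train and test sets
    train = resample(values, n_samples=n_size)
    test = array([x for x in values if x.tolist() not in train.tolist()])
    # fit model
    model = DecisionTreeClassifier()
    model.fit(train[:, :-1], train[:, -1])
    # evaluate model
    predictions = model.predict(test[:, :-1])
    score = accuracy_score(test[:, -1], predictions)
    print(score)
    stats.append(score)
# plot the distribution of scores
pyplot.hist(stats)
pyplot.show()
# calculate the empirical confidence interval
alpha = 0.95
p = ((1.0 - alpha) / 2.0) * 100
lower = max(0.0, percentile(stats, p))
p = (alpha + ((1.0 - alpha) / 2.0)) * 100
upper = min(1.0, percentile(stats, p))
print('%.1f confidence interval %.1f%% and %.1f%%' % (alpha * 100, lower * 100, upper * 100))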

Running the example prints the classification accuracy for each bootstrap iteration.

A histogram of the 1,000 accuracy scores is created showing a Gaussian-like distribution.

Distribution of Classification Accuracy Using the Bootstrap

Finally, the confidence intervals are reported, showing that there is a 95% likelihood that the interval of 64.4% to 73.0% covers the true skill of the model.

This same method can be used to calculate confidence intervals for any other error score, such as root mean squared error for regression algorithms.

Further Reading

This section provides additional resources on the bootstrap and bootstrap confidence intervals.

Summary

In this post, you discovered how to use the bootstrap to calculate confidence intervals for machine learning algorithms.

Specifically, you learned:

  • How to calculate the bootstrap estimate of confidence intervals of a statistic from a dataset.
  • How to apply the bootstrap to evaluate machine learning algorithms.
  • How to calculate bootstrap confidence intervals for machine learning algorithms in Python.

Do you have any questions about confidence intervals?
Ask your questions in the comments below.


43 Responses to How to Calculate Bootstrap Confidence Intervals For Machine Learning Results in Python

  1. benson dube June 5, 2017 at 6:40 pm #

    Thank you Jason

  2. Elie Kawerk June 5, 2017 at 8:25 pm #

    Thank you Jason,

    Are you considering to include the posts on confidence intervals in a previous (or new) book?

    Regards
    Elie

    • Jason Brownlee June 6, 2017 at 9:33 am #

      Hi Elie,

      Not especially. I have a soft idea of “statistical methods for machine learning” or something to that effect that could cover topics like sampling theory and confidence intervals. Posts like this are a test to see if there is interest.

      Why do you ask? Is this a topic for which you’re looking for more help?

      • Elie Kawerk June 6, 2017 at 3:31 pm #

        Yes,

        I actually find posts like these very useful for reporting statistically meaningful results for machine learning. Here’s a nice free book link on the subject: https://www.otexts.org/book/sfml

        Regards

        • Elie June 6, 2017 at 5:22 pm #

          Jason,

          I have another question. How do we report confidence intervals on an evaluation done on a hold-out set? We cannot apply the bootstrap with training here since this would contaminate our results.

          Regards

          • Jason Brownlee June 7, 2017 at 7:09 am #

            Generally, you can repeat the holdout process many times with different random samples and use the outcomes as your population of results.

        • Jason Brownlee June 7, 2017 at 7:08 am #

          I agree, it’s a great book!

          • Elie Kawerk June 7, 2017 at 3:21 pm #

            Wouldn’t it be suitable to apply the bootstrap by sampling from one hold out set for a certain number of iterations without any training involved?

          • Jason Brownlee June 8, 2017 at 7:38 am #

            As in only resampling the set used to evaluate the skill of the model?

            No, I don’t think this would be valid.

  3. Elie Kawerk June 10, 2017 at 4:09 am #

    Jason,

    Something seems not to make sense in the confidence interval you obtain. The one you calculate is [61.5, 63.9]. However the mean (that I obtain) is about 69% (similar to the one you get from the graph) and visually one can inspect and estimate the 95% CI to be [64, 74]. I think there is something wrong.

    Please correct me if I am wrong.

    Regards
    Elie

    • Jason Brownlee June 10, 2017 at 8:40 am #

      You are correct Elie, thank you so much.

      The values provided to numpy.percentile() must be in [0,100] and I was providing them in [0,1].

      I have fixed the example and made it clearer how the percentile values are calculated (so they can be debugged by the skeptical developers we should be).

      I really appreciate you finding this and the comment. It makes the example better for everyone!

  4. Dawit August 24, 2017 at 7:41 am #

    I love the piece of cake notes about ML.

  5. Malcolm August 27, 2017 at 6:31 am #

    Hi Jason, thanks for this post. Is the concept of fitting your classifier using a training set and calculating its accuracy on unseen data (a holdout set) now outdated? It seems to me that CV is testing how good your algorithm (with optimized parameters) is at building a model, in that the model is re-fitted in each fold prior to testing it on the held-out fold. It doesn’t seem to predict how a trained model will work once it is put into production and starts having to classify completely new instances. For that, I can’t see any option other than training on the full training set and testing against the hold out set – am I missing something?

    • Jason Brownlee August 28, 2017 at 6:45 am #

      Cross Validation (CV) is doing the same thing, just repeated 10 times, for 10-fold CV.

      CV and train/test splits are both resampling methods intended to estimate the skill of the model on unseen data. Perhaps this post will clear things up for you:
      https://machinelearningmastery.com/train-final-machine-learning-model/

      On many problems where we have millions of examples (e.g. deep learning) we often do not have the resources for CV.

  6. Malcolm August 28, 2017 at 8:50 pm #

    Thanks for the link – that clarifies a lot of things ! In summary, is it correct to say where you have the processing power and time to perform it, k-fold validation is always going to be the superior option as it provides means, standard deviations and confidence intervals ?

    • Jason Brownlee August 29, 2017 at 5:04 pm #

      CV will give a less biased estimate of model skill, in general. This is the key objective of the method.

      I often use a bootstrap to then present the final confidence interval for the chosen configuration.

  7. Rob February 15, 2018 at 2:55 am #

    Hi, thanks for this article. I’m after calculating confidence intervals for sensitivity and specificity. I’ve used your code, but changed the prep step to use the ‘train_test_split’ (with a random seed), to create samples of the data. Then, once per iteration, I’ve fitted a model and made predictions, then created a confusion matrix, and then worked out the sensitivity and specificity. Do you think that sounds reasonable? The results look sensible.

    Thanks.

    • Jason Brownlee February 15, 2018 at 8:48 am #

      I think the bootstrap approach would work for other accuracy related measures.

  8. Ian March 28, 2018 at 10:28 pm #

    Nice post (:

    Might be useful to mention the bootstrapped package in case people are feeling lazy 😉

    https://pypi.python.org/pypi/bootstrapped/0.0.1

  9. Sabrina April 17, 2018 at 8:11 am #

    Great post Jason. I was wondering if you think it might be feasible to use bootstrap method for neural nets to estimate confidence level?

  10. Vladislav Gladkikh May 25, 2018 at 5:48 pm #

    How can I estimate a 95% likelihood of classification accuracy for a particular unseen data point?

    Suppose, I have three new data points x1, x2, x3. How can I estimate that my method has a 95% likelihood of classification accuracy between 80% and 85% for predicting, say, x2, but between 60% and 65% for predicting x1, and between 90% and 95% for predicting x3?

    • Jason Brownlee May 26, 2018 at 5:50 am #

      It sounds like you might be asking about a prediction interval instead of a confidence interval.

      I cover this in my book on stats and I have a post on the topic scheduled for soon.

      Until then see this article:
      https://en.wikipedia.org/wiki/Prediction_interval

      • Vladislav Gladkikh May 26, 2018 at 12:05 pm #

        Thanks for the link!

  11. Anthony The Koala June 3, 2018 at 11:30 pm #

    Dear Dr Jason,
    While I understand the general concept presented above, I am lost on one fine detail on
    lines 21 and 23. It is about selection array [:,:-1] and [:,-1].

    In other words, when there is an array, what is the difference between the selection methods [:, :-1] and [:, -1]?

    Thank you,
    Anthony of Sydney

  12. VK July 31, 2018 at 1:24 am #

    Hello,
    can you please point to a reference for the expressions in lines 32 and 34 of the complete example?

    p = ((1.0-alpha)/2.0) * 100
    p = (alpha+((1.0-alpha)/2.0)) * 100

    Thanks

    • Jason Brownlee July 31, 2018 at 6:09 am #

      Perhaps see the references at the end of the post.

  13. vk July 31, 2018 at 2:56 am #

    Please ignore my previous comment. I just observed that alpha was set to .95.
    They now make sense.

  14. Santiago September 15, 2018 at 2:49 am #

    Hi! Thanks very much for your post!
    However I don’t understand your confidence intervals. If you wanted a 95% confidence interval shouldn’t you be setting your alpha to 0.05?
    Cheers,
    Santiago

  15. khanh ha December 12, 2018 at 1:35 am #

    it’s very intuitive. Thanks for writing it

  16. Chris December 23, 2018 at 4:56 pm #

    Jason,

    Thank you for writing this article!!

    During my PhD in Physics, we had to fit unwieldy (highly) nonlinear functions to real data and it was essential for us (as “budding” scientists) to at least attempt to report on both the standard and systematic errors of our results.

    While learning about and using propagation of error methods throughout school, it surprised me to learn so late in my studies that the bootstrap method was a (kosher) go-to method for (a simulated) estimate of standard error of parameters from such complicated functions.

    When nature’s constituent parts at various scales appear to often exhibit non-linear interaction dynamics, one would think our engineer and science teaching forefathers (e.g. for engineers, physicists, chemists, biologists, etc.) would have emphasized merit and further study of this method, particularly in an era now heavily (re)focused more on phenomenological studies and less so on pure analytical theories.

    To end on a high note, I’m still thankful for the advances in DS and ML for making such phenomenological studies, and basic predictive analytics accessible to the masses, and namely a laymen like me.

    Thank you again for another great article!!

    • Jason Brownlee December 24, 2018 at 5:26 am #

      Thanks for sharing, I could not agree with you more.

      In fact, more simulation/monte carlo methods should be taught in general.
