The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.

It can be used to estimate summary statistics such as the mean or standard deviation. It is used in applied machine learning to estimate the skill of machine learning models when making predictions on data not included in the training data.

A desirable property of bootstrap estimates of model skill is that they can be presented with confidence intervals, a feature not readily available with other methods such as cross-validation.

In this tutorial, you will discover the bootstrap resampling method for estimating the skill of machine learning models on unseen data.

After completing this tutorial, you will know:

- That the bootstrap method involves iteratively resampling a dataset with replacement.
- That when using the bootstrap you must choose the size of the sample and the number of repeats.
- That scikit-learn provides a function that you can use to resample a dataset for the bootstrap method.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- Bootstrap Method
- Configuration of the Bootstrap
- Worked Example
- Bootstrap API

## Bootstrap Method

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

The process for building one sample can be summarized as follows:

- Choose the size of the sample.
- While the size of the sample is less than the chosen size:
  - Randomly select an observation from the dataset.
  - Add it to the sample.
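
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `draw_sample`, the small dataset, and the seed are all assumptions made for the example.

```python
import random

def draw_sample(dataset, size, rng):
    """Draw one sample of `size` observations from `dataset` with replacement."""
    sample = []
    while len(sample) < size:
        # Pick an observation at random. The dataset itself is never modified,
        # so the same observation can be picked again on a later pass.
        observation = rng.choice(dataset)
        sample.append(observation)
    return sample

rng = random.Random(1)  # fixed seed (an arbitrary choice) so the run is repeatable
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # hypothetical dataset
sample = draw_sample(data, 4, rng)
print(sample)
```

Because the dataset is left untouched between draws, "returning" the observation happens implicitly.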

The bootstrap method can be used to estimate a quantity of a population. This is done by repeatedly taking small samples, calculating the statistic, and taking the average of the calculated statistics. We can summarize this procedure as follows:

- Choose a number of bootstrap samples to perform.
- Choose a sample size.
- For each bootstrap sample:
  - Draw a sample with replacement with the chosen size.
  - Calculate the statistic on the sample.
- Calculate the mean of the calculated sample statistics.
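
As a rough sketch, the procedure might look like this using only the Python standard library. The helper name `bootstrap_estimate`, the dataset, and the choice of 100 repeats are assumptions for illustration; `random.choices()` draws with replacement.

```python
import random
from statistics import mean

def bootstrap_estimate(dataset, statistic, n_repeats, sample_size, seed=None):
    """Return the mean of `statistic` over bootstrap samples, plus each estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        # random.choices() samples with replacement
        sample = rng.choices(dataset, k=sample_size)
        estimates.append(statistic(sample))
    return mean(estimates), estimates

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]  # hypothetical dataset
estimate, estimates = bootstrap_estimate(
    data, mean, n_repeats=100, sample_size=len(data), seed=1)
print('Bootstrap estimate of the mean: %.3f' % estimate)
```

Any function that maps a list of observations to a number can be passed as `statistic`, for example `statistics.median` or `statistics.stdev`.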

The procedure can also be used to estimate the skill of a machine learning model.

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with a given estimator or statistical learning method.

— Page 187, An Introduction to Statistical Learning, 2013.

This is done by training the model on the sample and evaluating its skill on the observations not included in the sample. These excluded observations are called the out-of-bag samples, or OOB for short.

This procedure of using the bootstrap method to estimate the skill of the model can be summarized as follows:

- Choose a number of bootstrap samples to perform.
- Choose a sample size.
- For each bootstrap sample:
  - Draw a sample with replacement with the chosen size.
  - Fit a model on the data sample.
  - Estimate the skill of the model on the out-of-bag sample.
- Calculate the mean of the sample of model skill estimates.
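
To make this concrete, here is a minimal sketch with NumPy. Everything specific in it is an assumption for illustration: the synthetic regression data, the straight-line model fit with `np.polyfit()`, mean squared error as the skill measure, and 50 repeats. Sampling row indices rather than values makes the out-of-bag rows easy to identify.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (assumed for illustration): y = 2x + 1 + noise
n = 100
X = rng.uniform(0, 10, size=n)
y = 2.0 * X + 1.0 + rng.normal(0, 1.0, size=n)

n_repeats = 50
sample_size = n
scores = []
for _ in range(n_repeats):
    # Draw row indices with replacement
    train_idx = rng.integers(0, n, size=sample_size)
    # Rows that were never drawn form the out-of-bag set
    oob_idx = np.setdiff1d(np.arange(n), train_idx)
    # Fit a straight line on the bootstrap sample only
    slope, intercept = np.polyfit(X[train_idx], y[train_idx], deg=1)
    # Evaluate on the out-of-bag rows using mean squared error
    predictions = slope * X[oob_idx] + intercept
    scores.append(np.mean((y[oob_idx] - predictions) ** 2))

print('Mean OOB MSE: %.3f' % np.mean(scores))
```

The same loop structure applies to any model: only the fit and evaluate lines change.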

The samples not selected are usually referred to as the “out-of-bag” samples. For a given iteration of bootstrap resampling, a model is built on the selected samples and is used to predict the out-of-bag samples.

— Page 72, Applied Predictive Modeling, 2013.

Importantly, any data preparation prior to fitting the model, and any tuning of the model’s hyperparameters, must occur within the for-loop on the data sample. This is to avoid data leakage, where knowledge of the test dataset is used to improve the model. Leakage, in turn, can result in an optimistic estimate of the model skill.

A useful feature of the bootstrap method is that the resulting sample of estimates often forms an approximately Gaussian distribution. In addition to summarizing this distribution with a measure of central tendency, measures of variance can be given, such as the standard deviation and standard error. Further, a confidence interval can be calculated and used to bound the presented estimate. This is useful when presenting the estimated skill of a machine learning model.
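
As a hedged sketch of what this summary might look like, the example below bootstraps the mean of a hypothetical dataset and reports a central estimate, the spread of the bootstrap estimates, and a 95% interval taken from the 2.5th and 97.5th percentiles. The percentile interval is only one of several ways to form a bootstrap confidence interval, and the dataset, seed, and 1,000 repeats are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=50.0, scale=5.0, size=200)  # hypothetical dataset

n_repeats = 1000
estimates = np.array([
    # Each bootstrap sample is the same size as the dataset, drawn with replacement
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(n_repeats)
])

center = estimates.mean()
# The standard deviation of the bootstrap estimates serves as an
# estimate of the standard error of the statistic
std_error = estimates.std(ddof=1)
# Percentile interval: the middle 95% of the bootstrap estimates
lower, upper = np.percentile(estimates, [2.5, 97.5])

print('Estimate: %.3f' % center)
print('Standard error: %.3f' % std_error)
print('95%% interval: [%.3f, %.3f]' % (lower, upper))
```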

## Configuration of the Bootstrap

There are two parameters that must be chosen when performing the bootstrap: the size of the sample and the number of repetitions of the procedure to perform.

### Sample Size

In machine learning, it is common to use a sample size that is the same as the original dataset.

The bootstrap sample is the same size as the original dataset. As a result, some samples will be represented multiple times in the bootstrap sample while others will not be selected at all.

— Page 72, Applied Predictive Modeling, 2013.

If the dataset is enormous and computational efficiency is an issue, smaller samples can be used, such as 50% or 80% of the size of the dataset.
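
A quick simulation can show how the sample size affects how much of the dataset ends up in the sample versus out-of-bag. This is a small sketch with an assumed dataset of 10,000 rows and an arbitrary seed. When the sample is the same size as the dataset, roughly 63.2% of the distinct observations are expected to appear in it (about 1 - 1/e), leaving roughly 36.8% out-of-bag.

```python
import random

n = 10000  # size of a hypothetical dataset
rng = random.Random(1)

in_sample_fraction = {}
for fraction in (1.0, 0.8, 0.5):
    k = int(fraction * n)
    # Sample row indices with replacement
    indices = rng.choices(range(n), k=k)
    # Fraction of distinct rows that were drawn at least once
    in_sample_fraction[fraction] = len(set(indices)) / n
    print('size=%d%%: in sample %.3f, out-of-bag %.3f' % (
        fraction * 100, in_sample_fraction[fraction],
        1 - in_sample_fraction[fraction]))
```

Smaller samples leave more observations out-of-bag, which means less data to fit on but more data to evaluate on.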

### Repetitions

The number of repetitions must be large enough to ensure that meaningful statistics, such as the mean, standard deviation, and standard error, can be calculated on the sample.

A minimum might be 20 or 30 repetitions. Smaller values can be used, but they will add further variance to the statistics calculated on the sample of estimated values.

Ideally, the sample of estimates would be as large as possible given the time available, with hundreds or thousands of repeats.
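
The effect of the number of repetitions can be sketched by running the whole bootstrap procedure several times at each setting and looking at how much the final estimate moves between runs. The dataset, the helper name `bootstrap_mean`, the 20 reruns, and the seed below are all assumptions for illustration.

```python
import random
from statistics import fmean, stdev

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]  # hypothetical dataset

def bootstrap_mean(dataset, n_repeats, rng):
    """Bootstrap estimate of the mean using `n_repeats` resamples."""
    return fmean(
        fmean(rng.choices(dataset, k=len(dataset))) for _ in range(n_repeats))

spread = {}
for n_repeats in (10, 100, 1000):
    # Rerun the whole procedure 20 times to see how stable the estimate is
    runs = [bootstrap_mean(data, n_repeats, rng) for _ in range(20)]
    spread[n_repeats] = stdev(runs)
    print('repeats=%4d: spread of final estimate = %.4f' % (
        n_repeats, spread[n_repeats]))
```

With more repetitions, the final estimate should vary less from run to run.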

## Worked Example

We can make the bootstrap procedure concrete with a small worked example. We will work through one iteration of the procedure.

Imagine we have a dataset with 6 observations:

```python
[0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
```

The first step is to choose the size of the sample. Here, we will use 4.

Next, we must randomly choose the first observation from the dataset. Let’s choose 0.2.

```python
sample = [0.2]
```

This observation is returned to the dataset and we repeat this step 3 more times.

```python
sample = [0.2, 0.1, 0.2, 0.6]
```

We now have our data sample. The example purposefully demonstrates that the same value can appear zero, one or more times in the sample. Here the observation 0.2 appears twice.

An estimate can then be calculated on the drawn sample.

```python
statistic = calculation([0.2, 0.1, 0.2, 0.6])
```

Those observations not chosen for the sample may be used as out-of-bag observations.

```python
oob = [0.3, 0.4, 0.5]
```

In the case of evaluating a machine learning model, the model is fit on the drawn sample and evaluated on the out-of-bag sample.

```python
train = [0.2, 0.1, 0.2, 0.6]
test = [0.3, 0.4, 0.5]
model = fit(train)
statistic = evaluate(model, test)
```

That concludes one repeat of the procedure. It can be repeated 30 or more times to give a sample of calculated statistics.

```python
statistics = [...]
```

This sample of statistics can then be summarized by calculating a mean, standard deviation, or other summary values to give a final usable estimate of the statistic.

```python
estimate = mean([...])
```

## Bootstrap API

We do not have to implement the bootstrap method manually. The scikit-learn library provides an implementation that will create a single bootstrap sample of a dataset.

The scikit-learn resample() function can be used. It takes as arguments the data array, whether or not to sample with replacement, the size of the sample, and the seed for the pseudorandom number generator used prior to the sampling.

For example, we can create a bootstrap sample with replacement that contains 4 observations and uses a seed of 1 for the pseudorandom number generator.

```python
boot = resample(data, replace=True, n_samples=4, random_state=1)
```

Unfortunately, the API does not include any mechanism to easily gather the out-of-bag observations that could be used as a test set to evaluate a fit model.

At least in the univariate case we can gather the out-of-bag observations using a simple Python list comprehension.

```python
# out of bag observations
oob = [x for x in data if x not in boot]
```

We can tie all of this together with our small dataset used in the worked example of the prior section.

```python
# scikit-learn bootstrap
from sklearn.utils import resample
# data sample
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
# prepare bootstrap sample
boot = resample(data, replace=True, n_samples=4, random_state=1)
print('Bootstrap Sample: %s' % boot)
# out of bag observations
oob = [x for x in data if x not in boot]
print('OOB Sample: %s' % oob)
```

Running the example prints the observations in the bootstrap sample and those observations in the out-of-bag sample.

```
Bootstrap Sample: [0.6, 0.4, 0.5, 0.1]
OOB Sample: [0.2, 0.3]
```
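
The list comprehension compares observations by value, so it can misclassify rows when the dataset contains repeated values, and it does not extend cleanly to rows with multiple columns. One possible workaround, sketched below, is to resample row indices instead of the rows themselves and take the set difference. The two-column array is a hypothetical dataset; `np.setdiff1d()` returns the indices that were never drawn.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical two-column dataset that contains a repeated row
X = np.array([[0.1, 1.0], [0.2, 2.0], [0.2, 2.0],
              [0.4, 4.0], [0.5, 5.0], [0.6, 6.0]])

indices = np.arange(len(X))
# Resample the row indices rather than the rows themselves
boot_idx = resample(indices, replace=True, n_samples=len(X), random_state=1)
# Indices never drawn are the out-of-bag rows
oob_idx = np.setdiff1d(indices, boot_idx)

boot, oob = X[boot_idx], X[oob_idx]
print('Bootstrap rows: %s' % boot_idx)
print('OOB rows: %s' % oob_idx)
```

Working with indices also keeps inputs and targets aligned, since the same index arrays can be applied to both.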

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- List 3 summary statistics that you could estimate using the bootstrap method.
- Find 3 research papers that use the bootstrap method to evaluate the performance of machine learning models.
- Implement your own function to create a sample and an out-of-bag sample with the bootstrap method.

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Books

- Applied Predictive Modeling, 2013.
- An Introduction to Statistical Learning, 2013.
- An Introduction to the Bootstrap, 1994.

### Articles

- Resampling (statistics) on Wikipedia
- Bootstrapping (statistics) on Wikipedia
- Rule of thumb for number of bootstrap samples, CrossValidated.

## Summary

In this tutorial, you discovered the bootstrap resampling method for estimating the skill of machine learning models on unseen data.

Specifically, you learned:

- That the bootstrap method involves iteratively resampling a dataset with replacement.
- That when using the bootstrap you must choose the size of the sample and the number of repeats.
- That scikit-learn provides a function that you can use to resample a dataset for the bootstrap method.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Great post, Jason! Helped me a lot

I’m glad to hear that.

One more book:

Michael R. Chernick, Robert A. LaBudde. An Introduction to Bootstrap Methods with Applications to R (2011) https://www.amazon.com/Introduction-Bootstrap-Methods-Applications/dp/0470467045

Papers:

Yoram Reich, S.V.Barai. Evaluating machine learning models for engineering problems https://www.sciencedirect.com/science/article/pii/S0954181098000211

Gordon C. S. Smith, Shaun R. Seaman, Angela M. Wood, Patrick Royston, Ian R. White. Correcting for Optimistic Prediction in Small Data Sets https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4108045/

Wonderful references, thanks Vladislav.

Thanks to this post I can finally understand the difference between k-fold cross-validation and the bootstrap. Thanks for the clear explanation.

I’m glad to hear that.

Hi Jason,

a very good post. Could you extend it with a bit of explanation/example on how to calculate confidence intervals at the end, e.g. for a bootstrap-calculated mean?

See this post:

https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

And this post:

https://machinelearningmastery.com/calculate-bootstrap-confidence-intervals-machine-learning-results-python/

Thank you very much, Jason, for a wonderful topic. It helped me a lot to understand the concept.

I’m glad to hear that Mahmood.

Thanks for this post. Going over ISLR’s bootstrap labs, I was expecting a bootstrap method in sklearn (or numpy, pandas). Thanks for the explanation. You may also want to mention the pandas resample method, useful for converting monthly to quarterly observations.

Not sure what the sklearn.cross-validation.bootstrap is doing.

Thanks Jerry.

Hi Jason,

Thanks for the post. I understand what bootstrapping is in machine learning. I am confused about the difference between bootstrapping and repeated random sub-sampling cross-validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics)#Repeated_random_sub-sampling_validation). To me both seem the same. First, randomly create a sub-sample from the given data and train the model on it. Next, validate the model on the left-out sample. Repeat the process some number of times. The final validation error would be an estimate from each of these iterations. Please let me know what the difference is.

One difference I can think of is that bootstrapping samples with replacement, while the repeated random sub-sampling method does not repeat observations. Is this the only difference?

Thanks,

Gaurav

Selection with replacement might be the main difference.