How to Implement Resampling Methods From Scratch In Python

The goal of predictive modeling is to create models that make good predictions on new data.

We don’t have access to this new data at the time of training, so we must use statistical methods to estimate the performance of a model on new data.

This class of methods is called resampling methods, as they resample your available training data.

In this tutorial, you will discover how to implement resampling methods from scratch in Python.

After completing this tutorial, you will know:

  • How to implement a train and test split of your data.
  • How to implement a k-fold cross validation split of your data.

Kick-start your project with my new book Machine Learning Algorithms From Scratch, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Jan/2017: Changed the calculation of fold_size in cross_validation_split() to always be an integer. Fixes issues with Python 3.
  • Update May/2018: Fixed typo re LOOCV.
  • Update Aug/2018: Tested and updated to work with Python 3.6.

Description

The goal of resampling methods is to make the best use of your training data in order to accurately estimate the performance of a model on new unseen data.

Accurate estimates of performance can then be used to help you choose which set of model parameters to use or which model to select.

Once you have chosen a model, you can train a final model on the entire training dataset and start using it to make predictions.

There are two common resampling methods that you can use:

  • A train and test split of your data.
  • k-fold cross validation.

In this tutorial, we will look at how to use each method and when to choose one over the other.

Tutorial

This tutorial is divided into 3 parts:

  1. Train and Test Split.
  2. k-fold Cross Validation Split.
  3. How to Choose a Resampling Method.

These steps will provide the foundations you need to handle resampling your dataset to estimate algorithm performance on new data.

1. Train and Test Split

The train and test split is the easiest resampling method.

As such, it is the most widely used.

The train and test split involves separating a dataset into two parts:

  • Training Dataset.
  • Test Dataset.

The training dataset is used by the machine learning algorithm to train the model. The test dataset is held back and is used to evaluate the performance of the model.

The rows assigned to each dataset are randomly selected. This is an attempt to ensure that the training and evaluation of a model are objective.

If multiple algorithms are compared or multiple configurations of the same algorithm are compared, the same train and test split of the dataset should be used. This is to ensure that the comparison of performance is consistent or apples-to-apples.

We can achieve this by seeding the random number generator the same way before splitting the data, or by saving the same split of the dataset for reuse across multiple algorithms.

We can implement the train and test split of a dataset in a single function.

Below is a function named train_test_split() to split a dataset into a train and test split. It accepts two arguments, the dataset to split as a list of lists and an optional split percentage.

A default split percentage of 0.6 or 60% is used. This will assign 60% of the dataset to the training dataset and leave the remaining 40% for the test dataset. A 60/40 train/test split is a good default.

The function first calculates how many rows the training set requires from the provided dataset. A copy of the original dataset is made. Random rows are selected and removed from the copied dataset and added to the train dataset until the train dataset contains the target number of rows.

The rows that remain in the copy of the dataset are then returned as the test dataset.

The randrange() function from the random module is used to generate a random integer in the range between 0 and the size of the list.
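A minimal sketch of the function, following the steps described above, might look like this:

from random import randrange

# Split a dataset into a train and test set
def train_test_split(dataset, split=0.60):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        # draw a random row from the copy and move it to the training set
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy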

We can test this function using a contrived dataset of 10 rows, each with a single column.

The complete example is listed below.
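Assuming the function above, the full program might read:

from random import seed
from random import randrange

# Split a dataset into a train and test set
def train_test_split(dataset, split=0.60):
    train = list()
    train_size = split * len(dataset)
    dataset_copy = list(dataset)
    while len(train) < train_size:
        index = randrange(len(dataset_copy))
        train.append(dataset_copy.pop(index))
    return train, dataset_copy

# test the train/test split on a contrived dataset of 10 rows
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
train, test = train_test_split(dataset)
print(train)
print(test)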

The example fixes the random seed before splitting the training dataset. This is to ensure the exact same split of the data is made every time the code is executed. This is handy if we want to use the same split many times to evaluate and compare the performance of different algorithms.

Running the example produces the output below.
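The output will have the following form (the exact rows shown here are illustrative; the split is always 6 training rows and 4 test rows):

[[3], [2], [7], [1], [8], [9]]
[[4], [5], [6], [10]]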

The data in the train and test set is printed, showing that 6/10 or 60% of the records were assigned to the training dataset and 4/10 or 40% of the records were assigned to the test set.

2. k-fold Cross Validation Split

A limitation of using the train and test split method is that you get a noisy estimate of algorithm performance.

The k-fold cross validation method (also called just cross validation) is a resampling method that provides a more accurate estimate of algorithm performance.

It does this by first splitting the data into k groups. The algorithm is then trained and evaluated k times and the performance summarized by taking the mean performance score. Each group of data is called a fold, hence the name k-fold cross-validation.

It works by first training the algorithm on k-1 of the groups and evaluating it on the remaining hold-out group as the test set. This is repeated so that each of the k groups is given an opportunity to be held out and used as the test set.

As such, the number of rows in your training dataset should be divisible by k, to ensure each of the k groups has the same number of rows.

You should choose a value for k that splits the data into groups with enough rows that each group is still representative of the original dataset. A good default to use is k=3 for a small dataset or k=10 for a larger dataset. A quick way to check if the fold sizes are representative is to calculate summary statistics such as mean and standard deviation and see how much the values differ from the same statistics on the whole dataset.

We can reuse what we learned about creating a train and test split in the previous section when implementing k-fold cross validation.

Instead of two groups, we must return k folds, or k groups of data.

Below is a function named cross_validation_split() that implements the cross validation split of data.

As before, we create a copy of the dataset from which to draw randomly chosen rows.

We calculate the size of each fold as the size of the dataset divided by the number of folds required, using integer division so that the fold size is always a whole number.

If the dataset does not cleanly divide by the number of folds, there may be some remainder rows and they will not be used in the split.

We then create a list of rows with the required size and add them to a list of folds which is then returned at the end.
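A minimal sketch of the function, following these steps, might look like this:

from random import randrange

# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
    dataset_split = list()
    dataset_copy = list(dataset)
    # integer division ensures the fold size is always an integer (Python 3)
    fold_size = len(dataset) // folds
    for _ in range(folds):
        fold = list()
        while len(fold) < fold_size:
            # draw a random row from the copy and add it to the current fold
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split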

We can test this resampling method on the same small contrived dataset as above. Each row has only a single column value, but we can imagine how this might scale to a standard machine learning dataset.

The complete example is listed below.
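Assuming the function above, the full program might read:

from random import seed
from random import randrange

# Split a dataset into k folds
def cross_validation_split(dataset, folds=3):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = len(dataset) // folds
    for _ in range(folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split

# test the cross validation split on a contrived dataset of 10 rows
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, 4)
print(folds)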

As before, we fix the seed for the random number generator to ensure that each time the code is executed, the same rows are used in the same folds.

A k value of 4 is used for demonstration purposes. We would expect that the 10 rows divided into 4 folds will result in 2 rows per fold, with a remainder of 2 that will not be used in the split.

Running the example produces the output below. The list of folds is printed, showing that, as expected, there are two rows per fold.
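The output will have the following form (the exact rows shown here are illustrative):

[[[3], [2]], [[7], [1]], [[8], [9]], [[10], [6]]]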

3. How to Choose a Resampling Method

The gold standard for estimating the performance of machine learning algorithms on new data is k-fold cross validation.

When well-configured, k-fold cross validation gives a more robust estimate of performance than other methods such as the train and test split.

The downside of cross-validation is that it can be time-consuming to run, requiring k different models to be trained and evaluated. This is a problem if you have a very large dataset or if you are evaluating a model that takes a long time to train.

The train and test split resampling method is the most widely used. This is because it is easy to understand and implement, and because it gives a quick estimate of algorithm performance.

Only a single model is constructed and evaluated.

Although the train and test split method can give a noisy or unreliable estimate of the performance of a model on new data, this becomes less of a problem if you have a very large dataset.

Large datasets are those in the hundreds of thousands or millions of records, large enough that splitting one in half results in two datasets with nearly equivalent statistical properties.

In such cases, there may be little need to use k-fold cross validation as an evaluation of the algorithm and a train and test split may be just as reliable.

Extensions

In this tutorial, we have looked at the two most common resampling methods.

There are other methods you may want to investigate and implement as extensions to this tutorial.

For example:

  • Repeated Train and Test. This is where the train and test split is used, but the process is repeated many times.
  • LOOCV or Leave One Out Cross Validation. This is a form of k-fold cross-validation where the value of k is fixed at n (the number of training examples); a minimal sketch is given after this list.
  • Stratification. In classification problems, this is where the balance of class values in each group is forced to match the original dataset.
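As a hint for the LOOCV extension: it can be implemented by reusing the cross_validation_split() function from Section 2 with the number of folds set to the number of rows. A minimal sketch, assuming that function and the contrived dataset are already defined:

# LOOCV is k-fold cross validation with k equal to the number of rows
seed(1)
dataset = [[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
folds = cross_validation_split(dataset, folds=len(dataset))
print(len(folds))  # prints 10: one fold per row, each holding a single row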

Did you implement an extension?
Share your experiences in the comments below.

Review

In this tutorial, you discovered how to implement resampling methods in Python from scratch.

Specifically, you learned:

  • How to implement the train and test split method.
  • How to implement the k-fold cross validation method.
  • When to use each method.

Do you have any questions about resampling methods or about this post?
Ask your questions in the comments and I will do my best to answer.


25 Responses to How to Implement Resampling Methods From Scratch In Python

  1. Dr Alan Beckles, October 18, 2016 at 11:29 am

    When using stratified k fold cross validation why is the shuffle argument set to True?

  2. Isauro, November 18, 2016 at 1:37 pm

    Hi, I receive this error when running the cross_validation_split function. I’m using python 3.4

    line 26, in cross_validation_split
    index = randrange(len(dataset_copy))
    File “C:\Python34\lib\random.py”, line 186, in randrange
    raise ValueError(“empty range for randrange()”)
    ValueError: empty range for randrange()

    I’m running the same code example on my end and receive this error. If I use randrange() with len(dataset) outside of the function, it works fine. Any help would be great!

    • Isauro, November 18, 2016 at 2:24 pm

      I figured it out. I needed fold_size = len(dataset) / folds to have double // to turn it into an integer. Should be: fold_size = len(dataset) // folds

      • Ayushi, October 24, 2019 at 7:52 pm

        I am getting the same error, even though I tried with the double //. Is there any way I can resolve it?

    • Jason Brownlee, November 19, 2016 at 8:42 am

      This might be a Python 3 thing, I’ll look into it.

      • Brandon, December 8, 2016 at 1:43 pm

        I just ran into this as well. As soon as I added what Isauro mentioned that method worked for me. I am using python 3.5.

  3. Ronak, November 3, 2017 at 8:28 am

    How will the function change if there are multiple rows? I know we can use cross validation package from sklearn for bigger datasets but I am trying to code the logic of cross validation for bigger datasets. Can you or someone please show how to do that? Maybe take the iris dataset for example.

    • Jason Brownlee, November 3, 2017 at 2:16 pm

      What do you mean by multiple rows? Cross-validation requires multiple rows to select from.

      • Talha Mahboob Alam, May 17, 2020 at 7:24 pm

        1. What is the best way to resample the data: on the full dataset, or after splitting into training and test sets?
        2. What is the ratio of balancing for various oversampling and undersampling techniques? How is the balancing ratio determined?

        • Jason Brownlee, May 18, 2020 at 6:10 am

          It depends on your data – you must use experimentation to discover what works best.

  4. Rosângela, November 24, 2017 at 3:50 am

    Hi, how can I implement a metric to measure the cross validation approach?

  5. Tanya, May 6, 2018 at 4:35 pm

    Hey, how can I use 10-fold cross validation for a Naive Bayes classifier? I saw your post regarding the Naive Bayes classifier here (https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/),
    but you use a normal training and testing split there.

    Please guide me as I am fairly fresh to machine learning.

  6. Rajesh, May 21, 2018 at 11:29 pm

    Hi Jason,

    You mentioned as below.

    LOOCV or Leave One Out Cross Validation. This is a form of k-fold cross-validation where the value of k is fixed at 1.

    Should it be 1 or N?

  7. Pelumi, January 13, 2019 at 3:52 am

    In the k-fold cross validation method, the formula for calculating the fold size is total rows / total folds, which means the total rows should be divisible by the total folds (k).
    But in your write-up, you said “the value of k should be divisible by the number of rows”. I’m kind of confused. Am I reading it the wrong way, or is the statement incorrect?

    • Jason Brownlee, January 13, 2019 at 5:43 am

      Yes, it should be the other way around: the number of rows should be divisible by k.

  8. Jordan, November 24, 2019 at 10:04 am

    Hello,

    Attempting to implement LOOCV from scratch for a multilabel classification problem. I’m curious to see if I am doing this correctly, and whether it is a problem or not that I am fitting the model on the data with each iteration. Does the following code result in data leakage? If the model is trained on all data except for “X” sample, then the next iteration it is tested on “Y” sample, it was previously fit to a training set that included “Y” sample. Is that problematic? Hopefully I’m making myself clear.

    Happy to purchase one of the books if there is an example of this. The sklearn library for cross val doesn't seem to work with multilabel data

    • Jason Brownlee, November 25, 2019 at 6:16 am

      I’m eager to help, but I don’t have the capacity to review/debug your code, sorry.

      I’m surprised to hear that LOOCV in sklearn does not work for your use case. It should be agnostic to the problem type. What is the problem exactly?

    • Arinoid, November 25, 2019 at 8:51 am

      Hey, are you from Manchester Uni?)

  9. Divi, July 29, 2021 at 6:49 pm

    Informative and descriptive content. Thanks to the author!
