Nested Cross-Validation for Machine Learning with Python

Last Updated on July 31, 2020

The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.

In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.

After completing this tutorial, you will know:

  • Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
  • Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
  • How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Combined Hyperparameter Tuning and Model Selection
  2. What Is Nested Cross-Validation
  3. Nested Cross-Validation With Scikit-Learn

Combined Hyperparameter Tuning and Model Selection

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.

For more on the k-fold cross-validation procedure, see the tutorial:

The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training. It is less biased than some other techniques, such as a single train-test split, for small to modestly sized datasets. Common values for k are k=3, k=5, and k=10.

Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics for how to configure the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that performs well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.

This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.

The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.

Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset, specifically an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.

A procedure is required that allows both well-performing hyperparameters to be selected for each model and a model to be selected from among a collection of well-configured models on the dataset.

One approach to this problem is called nested cross-validation.

What Is Nested Cross-Validation

Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.

In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.

As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”

Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”

Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.

This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.

In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.

On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

What Is the Cost of Nested Cross-Validation?

A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.

If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model (n hyperparameter combinations, each evaluated on k folds), then this increases to k * n * k with nested cross-validation, as the whole search is repeated once for each of the k folds in the outer loop. The inner and outer loops may use different values of k.

To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 10 * 500 or 5,000 models, a 10x increase in this case.
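The arithmetic can be checked with a few lines of Python. This is only a sketch: the helper name count_model_evaluations() is hypothetical, and it counts cross-validation evaluations only, ignoring any final refit performed by a search.

```python
# count model evaluations for a plain and a nested hyperparameter search
# (count_model_evaluations is a hypothetical helper, not a scikit-learn function)
def count_model_evaluations(n_combinations, k_inner, k_outer=None):
    # each hyperparameter combination is evaluated once per inner fold
    total = n_combinations * k_inner
    # nested cross-validation repeats the whole search once per outer fold
    if k_outer is not None:
        total *= k_outer
    return total

# traditional search: 100 combinations, k=5
print(count_model_evaluations(100, 5))      # 500
# nested search: the same search repeated for 10 outer folds
print(count_model_evaluations(100, 5, 10))  # 5000
```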

How Do You Set k?

The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.

You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against the bias of the model performance estimate.

It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.

How Do You Configure the Final Model?

The final model is configured and fit using the procedure applied internally to the outer loop.

As follows:

  1. An algorithm is selected based on its performance on the outer loop of nested cross-validation.
  2. Then the inner procedure is applied to the entire dataset.
  3. The hyperparameters found during this final search are then used to configure a final model.
  4. The final model is fit on the entire dataset.

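This model can then be used to make predictions on new data. We know how well it will perform on average based on the score estimated by the outer loop of nested cross-validation; the score reported during the final tuning search is optimistically biased and should not be used for this purpose.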
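The steps listed above for configuring a final model could be sketched as follows. This assumes the algorithm has already been chosen (a random forest here), and the dataset size, hyperparameter grid, and variable names are illustrative assumptions.

```python
# sketch: configure and fit a final model by applying the inner procedure to all data
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# synthetic dataset (sizes are assumptions for illustration)
X, y = make_classification(n_samples=300, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# the same inner procedure that would be used inside nested cross-validation
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# assumed search space
space = {'n_estimators': [10, 50], 'max_features': [2, 4, 6]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space, scoring='accuracy', cv=cv_inner, refit=True)
# apply the search to the entire dataset
search.fit(X, y)
# the best configuration, refit on the entire dataset, is the final model
final_model = search.best_estimator_
print(search.best_params_)
# make predictions (the first five rows stand in for new data)
yhat = final_model.predict(X[:5])
print(yhat)
```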

Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.

Nested Cross-Validation With Scikit-Learn

The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.

The class is configured with the number of folds (splits), then the split() method is called, passing in the dataset. Iterating over the result of split() gives the row indexes for the train and test sets for each fold.

For example:
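A minimal sketch of this usage is given below. The tiny array, the number of splits, and the variable names are assumptions made for illustration.

```python
# sketch: enumerate the train and test row indexes produced by KFold
from numpy import arange
from sklearn.model_selection import KFold

# a tiny dataset with 10 rows and 2 columns (illustrative only)
X = arange(20).reshape(10, 2)
# configure the cross-validation procedure
cv = KFold(n_splits=5, shuffle=True, random_state=1)
# collect the splits so they can be inspected
splits = list(cv.split(X))
for train_ix, test_ix in splits:
    # select rows for the train and test sets of this fold
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    print(train_ix, test_ix)
```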

This class can be used to perform the outer loop of the nested cross-validation procedure.

The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The procedure is configured by creating the class and specifying the model, hyperparameters to search, and cross-validation procedure, then calling fit() with the dataset.

For example:
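A minimal sketch of a grid search is given below. The synthetic dataset, the choice of random forest, and the hyperparameter values are assumptions made for illustration.

```python
# sketch: grid search of random forest hyperparameters with k-fold cross-validation
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# synthetic dataset (sizes are assumptions for illustration)
X, y = make_classification(n_samples=200, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define the search space (assumed values)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}
# define the cross-validation procedure used to evaluate each combination
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define and execute the search
search = GridSearchCV(model, space, scoring='accuracy', cv=cv)
result = search.fit(X, y)
# report the best configuration and its cross-validated score
print(result.best_params_, result.best_score_)
```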

These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.

We can tie these elements together and implement the nested cross-validation procedure.

Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.

This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.

Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.

We will keep things simple and tune just two hyperparameters with three values each, i.e. (3 * 3) or 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.

The complete example is listed below.
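The listing below is a sketch of the procedure as described: random forest, two hyperparameters with three values each, 10 outer folds, and 3 inner folds. The dataset size and the specific hyperparameter names and values are assumptions, kept small so the example runs quickly, and your scores will depend on them.

```python
# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

# create dataset (sizes are assumptions)
X, y = make_classification(n_samples=400, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the inner cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    # define the model
    model = RandomForestClassifier(random_state=1)
    # define search space: two hyperparameters, three values each (assumed values)
    space = dict()
    space['n_estimators'] = [10, 25, 50]
    space['max_features'] = [2, 4, 6]
    # define search, refitting the best configuration on the whole train set
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # execute search on the train set from the outer loop only
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole train set
    best_model = result.best_estimator_
    # evaluate the model on the holdout set
    yhat = best_model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    # store the result
    outer_results.append(acc)
    # report progress: holdout accuracy, inner estimate, chosen configuration
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))
```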

Running the example evaluates random forest using nested cross-validation on a synthetic classification dataset.

You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.

Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.

This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.

A final mean classification accuracy is then reported.

A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. The configured GridSearchCV can be passed to it directly; on each outer fold the search is fit on the train portion, and the refit best-performing model is automatically used to make predictions on the test set from the outer loop.

This greatly reduces the amount of code required to perform the nested cross-validation.

The complete example is listed below.
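The listing below is a sketch of this shorter version. As before, the dataset size and the hyperparameter names and values are assumptions chosen to keep the run time short.

```python
# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# create dataset (sizes are assumptions)
X, y = make_classification(n_samples=400, n_features=20, n_informative=10, n_redundant=10, random_state=1)
# configure the inner cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space: two hyperparameters, three values each (assumed values)
space = dict()
space['n_estimators'] = [10, 25, 50]
space['max_features'] = [2, 4, 6]
# define search (inner loop)
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation: the search is re-run on each outer train set
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```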

Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Tutorials

Papers

APIs

Summary

In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.

Specifically, you learned:

  • Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
  • Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
  • How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

25 Responses to Nested Cross-Validation for Machine Learning with Python

  1. Viktor July 29, 2020 at 3:54 pm #

    Good afternoon, thanks for your article.
    Could you please tell a newbie. The picture “A final mean classification accuracy is then reported” shows the “best parameters” for each outer fold.
    1) what parameters from this list should be used in the final model?
    2) how can the resulting trained model be called in your example?
    thank you in advance.

    • Jason Brownlee July 30, 2020 at 6:18 am #

      Think of it this way: We are searching for a modeling process, not a specific model and configuration.

      The modeling process that achieves the best result is the one we will use.

      So we could compare the best grid search model from multiple different algorithms and use the process that creates the best model as the process to create the final model.

      Does that make sense?

  2. Abhishek V July 31, 2020 at 3:56 am #

    Double cross-validation is no doubt a better approach, at least on the surface. We will see how it works with our own problem.

  3. Áttila July 31, 2020 at 6:11 am #

    Hi Jason, I follow your posts and I learn a lot! Great stuff!

    I do not know if I understood the procedure well, but I think it is overfitting again when a model is selected because it has a better mean metric on the holdout groups.

    I view nested cross-validation as a technique that gives k best models, and it is not possible to assess which is better. To do this I would need another holdout dataset, but there is none left.

    The question is, what am I going to do with k best models? I do not know whether I can ensemble them. Or I could discard some using some theoretical analysis, for example.

    Cheers

    • Jason Brownlee July 31, 2020 at 6:26 am #

      Thanks!

      No, it selects a modeling pipeline that results in the best model. E.g. the grid search will choose the best model for you to start using directly and make predictions.

      • Leo August 10, 2020 at 6:02 am #

        Hi Jason!! Now I get it! You don't end up with k models. What you get from cross_val_score is an UNBIASED estimate for the generalization error. Now let's say you're tasked with making a new prediction using a fresh set of predictors X.
        What you should do is run GridSearchCV on your entire dataset, then use best_model to make your prediction.

        Great post!!!

  4. Shahid Islam August 2, 2020 at 12:10 am #

    Hi:
    I work with a very large dataset (more than a billion rows). I would like to see a book where these models and methods are described using PySpark in a Spark setting. Nowadays large datasets are more common, so a book like that would be very useful for data scientists.

  5. Anthony The Koala August 12, 2020 at 10:14 am #

    Dear Dr Jason,
    This is about the lines of code in the nested model, particularly

    After some experimentation, I discovered that you could say the same thing as:

    The reason is that search contains the best model for the given parameters such that we can fit X_train, y_train for the best model and predict using X_test with the best model.
    Definition: best model = search, the best model for the GridSearch’s parameters.

    Thank you,
    Anthony of Sydney

    • Jason Brownlee August 12, 2020 at 1:35 pm #

      Yes, that is how you would use the technique as a final model without the outer loop for model evaluation.

      • Anthony The Koala August 12, 2020 at 3:00 pm #

        Dear Dr Jason,
        Thank you for the reply. The question I asked related to the nested example achieving the same score with less code.

        My question is on the last model.

        If the last model has less code and no loops, but achieves the same results, why not implement the last example that achieves the same outcome of removing the noise from the scores?

        Thank you again,
        Anthony of Sydney

        • Jason Brownlee August 13, 2020 at 6:04 am #

          It is doing something different.

          The example in the tutorial is evaluating the modeling pipeline with k-fold cross-validation (outer loop).

          Your example fits the pipeline once and uses it to make a prediction.

  6. Anthony The Koala August 12, 2020 at 5:25 pm #

    Dear Dr Jason,
    Thank you.
    In addition to the above question.
    In the nested example, you get X_train and y_train from the outer splits and were able to get yhat

    In the above code, in order to get X_train and y_train, there were 10 iterations of the loop based on kfolds=10. We had 10 different train_ix. Thus we could estimate yhat based on 10 versions of X_train and y_train.

    BUT in the last example there are no iterations and no ability to estimate yhat. So we cannot predict yhat because we don’t have 10 iterations of train_ix as in the nested example?

    Put it another way, how can we make a prediction in the last example?

    Thank you,
    Anthony of Sydney

    How do you make a prediction for yhat in the second example

    • Jason Brownlee August 13, 2020 at 6:08 am #

      The last example does the same thing as the first example in fewer lines of code.

      You can make a prediction using the code example you provided, e.g. fit the pipeline then call predict().

      • Anthony The Koala August 13, 2020 at 7:50 am #

        Dear Dr Jason,
        Thank you.
        But in the last example, how do I get the indices for train_ix when there is no iteration as in the first example?
        I need the train_ix indices in order to get X_train and y_train. I cannot see how that could be obtained in the last example.

        In the first example, you get the train_ix from the cv_outer.split(X).

        In the last example there is no way to get the train_ix in order to get the X_train and the y_train because you need the indices for train_ix BUT you cannot get the train_ix indices in the second example.

        So how do I get the train_ix in the last example.

        Thank you and always appreciate your tutorials.
        Anthony of Sydney

        • Jason Brownlee August 13, 2020 at 10:54 am #

          In the last example, the cross-validation of the modeling pipeline (outer) is controlled via the call to cross_val_score(). The cross-validation of each hyperparameter combination is controlled automatically via the GridSearchCV (inner).

          It does the same thing as the first example in fewer lines of code.

        • Anthony The Koala August 13, 2020 at 11:11 am #

          Dear Dr Jason,
          I tried to predict yhat from the last example.
          In the last example, in order to get y_train, I had to do the following by appending this code to the end of the last example

          This is certainly different from either the first or second example which produced
          Accuracy: 0.927 (0.019)
          That is mine produced
          0.926 with std dev (0.0037)

          It is strange that my std_dev was less using 3 sets of scores.

          Thank you,
          Anthony of Sydney

          • Anthony The Koala August 13, 2020 at 11:56 am #

            Dear Dr Jason,
            I did the same thing by appending the following code. This time I used the cv_outer, which is 10-fold, and attached it to the end of your last example:

            The result is exactly like your first and last code.

            Summary of findings:
            For the second example, in order to get the test and train sets each for X and y you have to have a loop in order to extract the indices for test and train. The indices for test and train are dependent on the number of folds.

            Thank you,
            Anthony of Sydney

          • Jason Brownlee August 13, 2020 at 1:26 pm #

            Sorry, I don’t understand what you’re trying to achieve.

            Why would you want to make a prediction within cross-validation?

  7. Anthony The Koala August 13, 2020 at 4:06 pm #

    Dear Dr Jason
    Thank you for your reply.
    This was the original question and maybe I did not make it clear.
    (1) My original question was how to make predictions with the second example labelled above the heading “Further Reading”.

    (2) You did make a prediction with the cross validation in the first example located at the section “Nested Cross-Validation With Scikit-Learn”

    (3) To answer your question “…what I am trying to achieve…:” is to be able to predict yhat in the second example above the heading “Further Reading” . I was able to predict yhat and achieved the same results.

    This required me adding a few extra lines to the second example.

    I finally worked it out getting EXACTLY the SAME mean(scores) and std(scores) as your example.

    I added the following code to the end of your second example LOCATED above the section “Further Reading”

    Conclusion (1) , you can retrieve the means to produce X_train, y_train, X_test, y_test, scores, mean(scores) and std(scores) as your examples.

    Conclusion (2), you can make predictions with code and get the same results.

    It works.

    Thank you again for your patience,
    Anthony of Sydney

    It was because you used the prediction

    • Jason Brownlee August 14, 2020 at 5:57 am #

      You provided code to make a prediction with the final example already and I pointed out as much, here it is again:

      • Anthony The Koala August 14, 2020 at 8:48 am #

        Dear Dr Jason,
        Thank you.
        It works, and I thank you for that.
        Again, thank you for your patience which helps in understanding.
        It is appreciated.
        Anthony of Sydney

        • Jason Brownlee August 14, 2020 at 1:15 pm #

          You’re very welcome Anthony, I’m here to help if I can.
