# Nested Cross-Validation for Machine Learning with Python

Last Updated on January 7, 2021

The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.

In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.

After completing this tutorial, you will know:

• Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
• Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
• How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.


Let’s get started.

• Updated Jan/2021: Added section on pipeline thinking and a link to a related tutorial.

[Image: Nested Cross-Validation for Machine Learning with Python. Photo by Andrew Bone, some rights reserved.]

## Tutorial Overview

This tutorial is divided into three parts; they are:

1. Combined Hyperparameter Tuning and Model Selection
2. What Is Nested Cross-Validation
    1. What Is the Cost of Nested Cross-Validation?
    2. How Do You Set k?
    3. How Do You Configure the Final Model?
    4. What Configuration Was Chosen by Inner Loop?
3. Nested Cross-Validation With Scikit-Learn

## Combined Hyperparameter Tuning and Model Selection

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.


The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training. It is less biased than some other techniques, such as a single train-test split, for small- to modestly-sized datasets. Common values for k are k=3, k=5, and k=10.

Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics on how to configure the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that performs well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.

This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.

The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.

Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset, specifically an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.
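To see why selecting on a noisy score is optimistic, here is a small simulation. It is only a sketch under stated assumptions: every configuration is given the same hypothetical true accuracy, and each cross-validation score is modeled as that accuracy plus Gaussian noise. The values of true_accuracy, n_configs, and noise_std are invented for illustration.

```python
# simulate why the best of many noisy cross-validation scores is optimistic
import numpy as np

rng = np.random.default_rng(1)
true_accuracy = 0.80  # assumption: every configuration is equally good
n_configs = 100       # assumption: number of configurations searched
noise_std = 0.02      # assumption: noise in each cross-validation estimate

# one noisy cross-validation score per configuration
cv_scores = true_accuracy + rng.normal(0.0, noise_std, size=n_configs)

# the search reports the maximum, which sits above the true accuracy
best_score = cv_scores.max()
print('True accuracy: %.3f' % true_accuracy)
print('Mean CV score: %.3f' % cv_scores.mean())
print('Best CV score: %.3f' % best_score)
```

The best score is higher than the true accuracy even though no configuration is actually better than any other, which is the bias that a separate outer evaluation is meant to expose.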

A procedure is required that allows us both to select well-performing hyperparameters for each model and to select among a collection of well-configured models on the same dataset.

One approach to this problem is called nested cross-validation.

## What Is Nested Cross-Validation

Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.

In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.

The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.

As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”

Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”

Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.

This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.

Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.

In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.
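The separation between the two loops can be checked directly on row indexes. The sketch below assumes a toy array of 50 rows and uses scikit-learn's KFold for both loops; the variable names are illustrative. It counts how many outer test rows are ever visible to the inner split.

```python
# check that the inner loop never sees rows from the outer test fold
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(50, 2)  # assumption: 50 toy rows, 2 columns
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

leaks = 0
for train_ix, test_ix in cv_outer.split(X):
    # the inner split is given only the outer training rows
    for inner_train_pos, inner_val_pos in cv_inner.split(X[train_ix]):
        # map positions within the outer train set back to original row indexes
        inner_rows = np.concatenate([train_ix[inner_train_pos], train_ix[inner_val_pos]])
        leaks += np.intersect1d(inner_rows, test_ix).size
print('Outer test rows seen by the inner loop: %d' % leaks)
```

Because the inner folds are carved out of the outer training rows only, the count stays at zero.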

### What Is the Cost of Nested Cross-Validation?

A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.

If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to k * n * k as the procedure is then performed k more times for each fold in the outer loop of nested cross-validation.

To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 5,000 models. A 10x increase in this case.
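The arithmetic can be written as a tiny helper. The function name count_model_fits is hypothetical, and the count ignores any extra fit performed when the best configuration is refit on each training set.

```python
# count model fits for a flat search versus a nested search
def count_model_fits(n_configs, k_inner, k_outer=1):
    # each configuration is fit once per inner fold, repeated per outer fold
    return k_outer * n_configs * k_inner

flat = count_model_fits(n_configs=100, k_inner=5)
nested = count_model_fits(n_configs=100, k_inner=5, k_outer=10)
print('Flat search: %d fits' % flat)
print('Nested search: %d fits' % nested)
print('Increase: %dx' % (nested // flat))
```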

### How Do You Set k?

The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.

You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against the bias of the model performance estimate.

It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.


### How Do You Configure the Final Model?

The final model is configured and fit using the same procedure that was applied in each pass of the outer loop, i.e. the inner hyperparameter search, but this time applied to the entire dataset.

As follows:

1. An algorithm is selected based on its performance on the outer loop of nested cross-validation.
2. Then the inner procedure is applied to the entire dataset.
3. The hyperparameters found during this final search are then used to configure a final model.
4. The final model is fit on the entire dataset.

This model can then be used to make predictions on new data. We know how well it will perform on average based on the score reported by the outer loop of nested cross-validation, which evaluated this same tuning procedure.
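The final-model steps above can be sketched as follows. This is an illustration rather than a prescribed recipe: the synthetic dataset, the small grid, and the random seeds are assumptions, and a random forest stands in for whichever algorithm was selected by the outer loop.

```python
# fit a final model by running the inner search once on the entire dataset
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# assumption: a small synthetic dataset stands in for your data
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)
# the same search that was used inside the outer loop
model = RandomForestClassifier(random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}  # illustrative grid
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# apply the search to all rows; refit=True then fits the best config on all rows
search.fit(X, y)
final_model = search.best_estimator_
# make a prediction for one row, standing in for new data
yhat = final_model.predict(X[:1])
print('Chosen configuration: %s' % search.best_params_)
print('Predicted class: %d' % yhat[0])
```

The search's own best score on the full dataset is not the figure to report for this model; the performance estimate comes from the outer loop of the nested procedure.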

### What Configuration Was Chosen by Inner Loop?

It doesn’t matter, that’s the whole idea.

An automatic configuration procedure was used instead of a specific configuration. There is a single final model, but the best configuration for that final model is found via the chosen search procedure, on the final run.

You let go of the need to dive into the specific model configuration chosen, just like at the next level down you let go of the specific model coefficients found in each cross-validation fold.

This requires a shift in thinking and can be challenging, e.g. a shift from “I configured my model like this…” to “I used an automatic model configuration procedure with these constraints…”.

This tutorial has more on the topic of “pipeline thinking” and may help:

Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.

## Nested Cross-Validation With Scikit-Learn

The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.

The class is configured with the number of folds (splits), then the split() function is called, passing in the dataset. The results of the split() function are enumerated to give the row indexes for the train and test sets for each fold.

For example:
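A minimal sketch of this usage, assuming a small synthetic dataset from make_classification and arbitrary random seeds:

```python
# enumerate k-fold splits and select the rows for each fold
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold

# assumption: 100 synthetic rows with 20 features
X, y = make_classification(n_samples=100, n_features=20, random_state=1)
# configure the cross-validation procedure
cv = KFold(n_splits=10, shuffle=True, random_state=1)
fold_sizes = []
# perform the cross-validation procedure
for train_ix, test_ix in cv.split(X):
    # split data using the row indexes for this fold
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    fold_sizes.append((len(train_ix), len(test_ix)))
    # a model would be fit on the train rows and evaluated on the test rows here
print(fold_sizes[0])
```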

This class can be used to perform the outer loop of the nested cross-validation procedure.

The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The procedure is configured by creating the class and specifying the model, hyperparameters to search, and cross-validation procedure; the dataset is then provided when calling fit().

For example:
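A minimal sketch of a grid search, assuming a small synthetic dataset and an illustrative two-by-two grid for a random forest:

```python
# configure and run a cross-validated grid search
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# assumption: 200 synthetic rows with 20 features
X, y = make_classification(n_samples=200, n_features=20, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define the search space (illustrative values)
space = dict()
space['n_estimators'] = [10, 50]
space['max_features'] = [2, 4]
# configure the cross-validation procedure used to score each combination
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define the search
search = GridSearchCV(model, space, scoring='accuracy', cv=cv)
# execute the search on the dataset
result = search.fit(X, y)
print('Best config: %s' % result.best_params_)
print('Best score: %.3f' % result.best_score_)
```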

These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.

We can tie these elements together and implement the nested cross-validation procedure.

Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.

This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.
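Here is a sketch of a single pass of the outer loop, assuming a small synthetic dataset and an illustrative grid. It also checks that calling predict() on the fitted search gives the same predictions as calling it on best_estimator_, since the search delegates to the refit model.

```python
# one outer fold: tune on the train rows, evaluate the refit model on the holdout rows
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold

# assumption: 200 synthetic rows with 20 features
X, y = make_classification(n_samples=200, n_features=20, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
# take only the first outer split for this illustration
train_ix, test_ix = next(cv_outer.split(X))
X_train, X_test = X[train_ix], X[test_ix]
y_train, y_test = y[train_ix], y[test_ix]
# inner search, refit on the whole outer train set with the best config
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}  # illustrative grid
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)
result = search.fit(X_train, y_train)
# retrieve the refit model and evaluate it on the holdout rows
best_model = result.best_estimator_
yhat = best_model.predict(X_test)
acc = accuracy_score(y_test, yhat)
# the fitted search predicts with the same refit model
same = np.array_equal(yhat, search.predict(X_test))
print('Holdout accuracy: %.3f' % acc)
print('search.predict matches best_estimator_.predict: %s' % same)
```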

Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.

We will keep things simple and tune just two hyperparameters with three values each, e.g. (3 * 3) 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.

The complete example is listed below.
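What follows is a sketch of such a complete example and not necessarily the exact original listing. The dataset parameters, the random seeds, and the specific grid values for n_estimators and max_features are assumptions; the grid values are kept small so the run finishes quickly.

```python
# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# create dataset (assumed parameters)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)
# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate outer splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the inner cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    # define the model
    model = RandomForestClassifier(random_state=1)
    # define the search space: two hyperparameters, three values each (assumed values)
    space = dict()
    space['n_estimators'] = [10, 50, 100]
    space['max_features'] = [2, 4, 6]
    # define and execute the search on the outer train set only
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    result = search.fit(X_train, y_train)
    # get the best model, refit on the whole outer train set
    best_model = result.best_estimator_
    # evaluate the model on the holdout dataset
    yhat = best_model.predict(X_test)
    acc = accuracy_score(y_test, yhat)
    # store the result and report progress
    outer_results.append(acc)
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the tuned model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))
```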

Running the example evaluates random forest using nested cross-validation on a synthetic classification dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.

Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.

This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.

A final mean classification accuracy is then reported.

A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. The configured GridSearchCV can be passed to it directly; on each outer fold the search is fit on the training folds, and the refit best-performing model is automatically used to make predictions on the test fold.

This greatly reduces the amount of code required to perform the nested cross-validation.

The complete example is listed below.
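As before, this is a sketch and not necessarily the exact original listing; the dataset parameters, seeds, and grid values are assumptions kept small for speed.

```python
# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# create dataset (assumed parameters)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)
# configure the inner cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define the search space (assumed values)
space = dict()
space['n_estimators'] = [10, 50, 100]
space['max_features'] = [2, 4, 6]
# define the search; it is treated as the model to be evaluated
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# configure the outer cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```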

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

## Summary

In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.

Specifically, you learned:

• Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
• Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
• How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Do you have any questions?


### 115 Responses to Nested Cross-Validation for Machine Learning with Python

1. Viktor July 29, 2020 at 3:54 pm #

Good afternoon, thanks for your article.
Could you please tell a newbie. The picture “A final mean classification accuracy is then reported” shows the “best parameters” for each outer fold.
1) what parameters from this list should be used in the final model?
2) how can the resulting trained model be called in your example?

• Jason Brownlee July 30, 2020 at 6:18 am #

Think of it this way: We are searching for a modeling process, not a specific model and configuration.

The modeling process that achieves the best result is the one we will use.

So we could compare the best grid search model from multiple different algorithms and use the process that creates the best model as the process to create the final model.

Does that make sense?

2. Abhishek V July 31, 2020 at 3:56 am #

Double cross-validation is no doubt a better approach, as it looks on the surface; we will see how it works with our own problem.

3. Áttila July 31, 2020 at 6:11 am #

Hi Jason, I follow your posts and I learn a lot! Great stuff!

I do not know if I understood well the procedure but I think it is an overfitting again when it is selected a model with a better mean metric in the holdout groups.

I view the nested cross validation as a technique that gives k best models and it is not possible to assess the better. To do this I’ll need another holdout data but there is none anymore.

The question is what I’m gonna do with k best models? I do not know if I can ensemble them. Or I can discard some using some theoretical analysis, for example.

Cheers

• Jason Brownlee July 31, 2020 at 6:26 am #

Thanks!

No, it selects a modeling pipeline that results in the best model. E.g. the grid search will choose the best model for you to start using directly and make predictions.

• Leo August 10, 2020 at 6:02 am #

Hi Jason!! Now I get it! You don’t end up with k models. What you get from cross_val_score is an UNBIASED estimate for the generalization error. Now let’s say you’re tasked with making a new prediction using a fresh set of predictors X.
What you should do is you should run GridSearchCV on your entire dataset. Then use best_model to make your prediction.

Great post!!!

• Jason Brownlee August 10, 2020 at 6:52 am #

Exactly!

• Felipe Araya January 28, 2021 at 12:32 am #

Hi Jason,

0. Create Outer and Inner loops (train and test set)
1. Run the GridSearchCV on your train set
2. Get your “best_model” and put it against the outer loop
3. Once you get your best generalised model (after you ran the outer loop) get the hyperparameters you found work best for generalization
4. Use those hyperparameters to retrain your model now using the whole data
5. You can now use the model trained using the whole data and the hyperparameters found during the NestedCV, to make predictions
6. If you get a new fresh set for your predictors X (meaning more data to add to your previous set), repeat steps 1-4.

Isn’t it like that? Or am I just reading the question asked wrongly?

• Jason Brownlee January 28, 2021 at 5:59 am #

No, not with nested cv. See the section on training a final model.

Of course, you can do anything you like as long as you justify it.

4. Shahid Islam August 2, 2020 at 12:10 am #

Hi:
I work with very large data sets (more than a billion rows). I would like to see a book where these models and methods are described using pyspark in a spark setting. Nowadays large data sets are more common, so a book like that will be very useful for data scientists.

5. Anthony The Koala August 12, 2020 at 10:14 am #

Dear Dr Jason,
This is about the lines of code in the nested model, particularly

After some experimentation, I discovered that you could say the same thing as:

The reason is that search contains the best model for the given parameters such that we can fit X_train, y_train for the best model and predict using X_test with the best model.
Definition: best model = search, the best model for the GridSearch’s parameters.

Thank you,
Anthony of Sydney

• Jason Brownlee August 12, 2020 at 1:35 pm #

Yes, that is how you would use the technique as a final model without the outer loop for model evaluation.

• Anthony The Koala August 12, 2020 at 3:00 pm #

Dear Dr Jason,
Thank you for the reply. The question I asked related to the nested example achieving the same score with less code.

My question is on the last model.

If the last model has less code and no loops, but achieves the same results, why not implement the last example that achieves the same outcome of removing the noise from the scores?

Thank you again,
Anthony of Sydney

• Jason Brownlee August 13, 2020 at 6:04 am #

It is doing something different.

The example in the tutorial is evaluating the modeling pipeline with k-fold cross-validation (outer loop).

Your example fits the pipeline once and uses it to make a prediction.

• Janine S March 18, 2021 at 10:28 pm #

Hey Jason,

I think that Anthony is right. Actually, if we could not replace

yhat = best_model.predict(X_test) by yhat = search.predict(X_test),

it would make no sense that the shorter code version at the very end of your article would work, right?
Because what cross_val_score does is creating train and test folds, then trains the model given as the first argument (in our case, search), and then evaluates it on the test set using .predict(X_test). So in cross_val_score, we can’t tell the function to work with search.best_estimator_, which requires that in each iteration of the inner loop,

search.best_estimator_ == search itself.

One can also check that by printing some boolean statements within the inner for loop.

If I’m wrong though, would you please explain me why there is a difference and why cross_val_score does work, though?

Thank you!

• Jason Brownlee March 19, 2021 at 6:21 am #

Not sure I follow.

search.predict() uses the best model directly – they are equivalent:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.predict

6. Anthony The Koala August 12, 2020 at 5:25 pm #

Dear Dr Jason,
Thank you.
In addition to the above question.
In the nested example, you get X_train and y_train from the outer splits and were able to get yhat

In the above code, in order to get X_train and y_train, there were 10 iterations of the loop based on kfolds=10. We had 10 different train_ix. Thus we could estimate yhat based on 10 versions of X_train and y_train.

BUT in the last example there are no iterations and not the ability to estimate yhat. So we cannot predict yhat because we don’t have 10 iterations of train_ix as in the nested example?

Put it another way, how can we make a prediction in the last example?

Thank you,
Anthony of Sydney

How do you make a prediction for yhat in the second example

• Jason Brownlee August 13, 2020 at 6:08 am #

The last example does the same thing as the first example in fewer lines of code.

You can make a prediction using the code example you provided, e.g. fit the pipeline then call predict().

• Anthony The Koala August 13, 2020 at 7:50 am #

Dear Dr Jason,
Thank you.
But in the last example, how do I get the indices for train_ix when there is no iteration as in the first example?
I need the train_ix indices in order to get X_train and y_train. I cannot see how that could be obtained in the last example.

In the first example, you get the train_ix from the cv_outer.split(X).

In the last example there is no way to get the train_ix in order to get the X_train and the y_train because you need the indices for train_ix BUT you cannot get the train_ix indices in the second example.

So how do I get the train_ix in the last example?

Thank you and always appreciate your tutorials.
Anthony of Sydney

• Jason Brownlee August 13, 2020 at 10:54 am #

In the last example, the cross-validation of the modeling pipeline (outer) is controlled via the call to cross_val_score(). The cross-validation of each hyperparameter combination is controlled automatically via the GridSearchCV (inner).

It does the same thing as the first example in fewer lines of code.

• Anthony The Koala August 13, 2020 at 11:11 am #

Dear Dr Jason,
I tried to predict yhat from the last example.
In the last example, in order to get y_train, I had to do the following by appending this code to the end of the last example

This is certainly different from either the first or second example which produced
Accuracy: 0.927 (0.019)
That is mine produced
0.926 with std dev (0.0037)

It is strange that my std_dev was less using 3 sets of scores.

Thank you,
Anthony of Sydney

• Anthony The Koala August 13, 2020 at 11:56 am #

Dear Dr Jason,
I did the same thing by appending the following code. This time I used the cv_outer, which is 10-fold, and attached it to the end of your last example:

The result is exactly like your first and last code.

Summary of findings:
For the second example, in order to get the test and train sets each for X and y you have to have a loop in order to extract the indices for test and train. The indices for test and train are dependent on the number of folds.

Thank you,
Anthony of Sydney

• Jason Brownlee August 13, 2020 at 1:26 pm #

Sorry, I don’t understand what you’re trying to achieve.

Why would you want to make a prediction within cross-validation?

7. Anthony The Koala August 13, 2020 at 4:06 pm #

Dear Dr Jason
This was the original question and maybe I did not make it clear.
(1) My original question was how to make predictions with the second example labelled above the heading “Further Reading”.

(2) You did make a prediction with the cross validation in the first example located at the section “Nested Cross-Validation With Scikit-Learn”

(3) To answer your question “…what I am trying to achieve…:” is to be able to predict yhat in the second example above the heading “Further Reading” . I was able to predict yhat and achieved the same results.

This required me adding a few extra lines to the second example.

I finally worked it out getting EXACTLY the SAME mean(scores) and std(scores) as your example.

I added the following code to the end of your second example LOCATED above the section “Further Reading”

Conclusion (1) , you can retrieve the means to produce X_train, y_train, X_test, y_test, scores, mean(scores) and std(scores) as your examples.

Conclusion (2), you can make predictions with code and get the same results.

It works.

Thank you again for your patience,
Anthony of Sydney

It was because you used the prediction

• Jason Brownlee August 14, 2020 at 5:57 am #

You provided code to make a prediction with the final example already and I pointed out as much, here it is again:

• Anthony The Koala August 14, 2020 at 8:48 am #

Dear Dr Jason,
Thank you.
It works, and I thank you for that.
Again, thank you for your patience which helps in understanding.
It is appreciated.
Anthony of Sydney

• Jason Brownlee August 14, 2020 at 1:15 pm #

You’re very welcome Anthony, I’m here to help if I can.

8. John Sang August 15, 2020 at 8:46 pm #

Hi I show your result in this topic as below
>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}

Accuracy: 0.927 (0.019)

I read a lot of below comments about how to choose best parameter for building the final model, but I am not really clear.

Can I apply the performance result straightforwardly by selecting the parameter set with the best accuracy? In this case >acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}

Can I choose {'max_features': 4, 'n_estimators': 500}?

if not , what is your recommendation.

• Jason Brownlee August 16, 2020 at 5:51 am #

You can.

But, the idea of nested CV, is you don’t choose the hyperparameters, the grid search does and it will use whatever is “best” according to your data and metric.

It takes config selection out of your hands and leaves you with a “process” rather than a model and a config. The outer CV is evaluating that process.

9. John Sang August 19, 2020 at 5:01 pm #

10. MS August 25, 2020 at 2:27 am #

How to obtain the best hyperparameter combination for the simpler cross_val_score approach?

• Jason Brownlee August 25, 2020 at 6:43 am #

Does the above example not help?

What problem are you having exactly?

• MS August 26, 2020 at 11:38 pm #

Using the longer approach, I obtained the best hyperparameter combination, which I might choose as my final model. How do I get it using the cross_val_score method? In the cross_val_score approach, after the GridSearchCV, how do the best parameters get automatically selected for the outer CV process?

• Jason Brownlee August 27, 2020 at 6:15 am #

You don’t need to know the best hyperparameters as the pipeline will find the best model for you so you can start using it.

Nevertheless, you can keep a reference to the object if you like and then print the “best_params_” attribute.

• MS August 28, 2020 at 2:15 am #

I have to present a final model (as per client requirement). The cross_val_score approach is a process for tuning and observing how my tuned learner will perform on unseen data (a proxy for the generalization error). Please correct me if I’m wrong.

As you said, the pipeline will find the best model (that I can understand), but how do I extract the best model so that my client can use it on his own data for prediction?

As you have suggested to print the best_params_ attribute to solve my problem should I use nested-cv package from pypi?

• Jason Brownlee August 28, 2020 at 6:53 am #

You can use the best parameters to define the final model, then fit it on all available training data.

11. MS August 30, 2020 at 2:09 am #

thanks Jason

12. O'Brien September 4, 2020 at 3:23 pm #

Hello,

I understand that after nested CV we are left with a metric (say, accuracy) that evaluates the modelling pipeline, not any individual configuration trained in the inner loop.

What does this accuracy mean exactly as it relates to the final configuration produced by this pipeline? If the outer CV loop says the pipeline is X% accurate, is it also correct to claim that the final configuration produced by the pipeline is accurate X% of the time?

If not, how can we make a claim about the accuracy of the final configuration itself?

• Jason Brownlee September 5, 2020 at 6:39 am #

Good question.

It is the accuracy of the pipeline that included a grid search to find the best config within the pipeline. E.g. the best model found.

• O'Brien September 5, 2020 at 12:55 pm #

Just to clarify, if a client expects a quote of accuracy from a developer, i.e. they want to know how often the algorithm will make a correct prediction, is it correct to provide the mean accuracy calculated from the outer CV loop?

For instance, if the outer CV loop outputs 90% mean accuracy, then is it correct to tell a client that, on average, 90% of the predictions from the final configuration (algorithm + learned parameters + hyperparameters) are true?

• Jason Brownlee September 6, 2020 at 6:03 am #

Yes, the mean and standard deviation – e.g. the distribution of accuracy scores.

Even better would be to give the bootstrap confidence interval for the final model:
https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

• David Verbiest March 6, 2021 at 9:43 am #

Thank you for your great article. Two questions.

1)
Concerning the generation of confidence intervals: how would you go about doing that, though? I went through the material you provided, and get the principle. Would you then drop the outer cross validation? I thought you could use bootstrapping to estimate a parameter, in our case the accuracy. Or would there be another step after the outer cross validation for generating the confidence interval?

2)
Somewhere above you mention the following: “You can use the best parameters to define the final model, then fit it on all available training data.”
Instead of selecting the best set of hyperparameters you find from cross validation on all the data, could you not use all the best models resulting from the inner cross validation and take an ensemble of them? In the end these different sets of hyperparameters and their respective performances are used to get an estimate of the accuracy? The model with the best set of hyperparameters could still be a fluke.

• Jason Brownlee March 6, 2021 at 2:09 pm #

You’re welcome!

I would recommend the bootstrap method for calculating confidence intervals for a chosen model generally:
https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

Not quite, the model is chosen by the outer loop based on averaged performance not one time performance. A fluke is less likely. Some variance is expected though.
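
As a rough illustration of a bootstrap confidence interval for a chosen configuration, the sketch below resamples the rows with replacement, fits on each sample, and scores on the rows left out. The dataset, the fixed `max_features=4`, and the 30 iterations are assumptions, and this is only one common variant of the bootstrap, not necessarily the exact procedure in the linked tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)
indices = np.arange(len(X))

n_iterations = 30  # kept small so the sketch runs quickly
scores = []
for i in range(n_iterations):
    # draw row indices with replacement; rows never drawn form the test set
    train_ix = resample(indices, replace=True, n_samples=len(indices), random_state=i)
    test_ix = np.setdiff1d(indices, train_ix)
    # the chosen configuration (hypothetical hyperparameter values)
    model = RandomForestClassifier(n_estimators=20, max_features=4, random_state=1)
    model.fit(X[train_ix], y[train_ix])
    scores.append(accuracy_score(y[test_ix], model.predict(X[test_ix])))

# 95% percentile interval over the bootstrap scores
lower, upper = np.percentile(scores, [2.5, 97.5])
print('95%% interval: %.3f to %.3f' % (lower, upper))
```

In practice many more iterations (hundreds or thousands) would normally be used for a stable interval.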

13. Lada September 13, 2020 at 8:29 pm #

Hi Jason, thanks a lot for this great tutorial! But I’d like to ask whether there is, in principle, a need for cross validation for such models as random forest that use bagging. E.g., from James et al.: An Introduction to Statistical Learning with Applications in R (p. 317-318):

“It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.

The resulting OOB (out-of-bag) error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.”

What do you think, can random forest be trained and hyper parameters tuned without CV?

• Jason Brownlee September 14, 2020 at 6:47 am #

Yes, you can use OOB for the model if it is standalone.

For comparing models, a systematic and consistent test harness is required.
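
For a standalone random forest, the out-of-bag estimate is available directly in scikit-learn via `oob_score=True`. A minimal sketch, assuming a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# each sample is scored only by the trees that did not see it during fitting
model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
model.fit(X, y)
print('OOB accuracy: %.3f' % model.oob_score_)
```

This estimate is specific to bagged models, which is why a shared cross-validation harness is still needed when comparing against other algorithms.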

14. Edvin September 28, 2020 at 1:58 pm #

How do you select the best model in the outer_cv? I do not see the point of averaging the accuracy of models with different hyperparameters. Also, the best_estimator_ only gives the best model for the specific inner fold since basically we are instantiating a new search for each inner fold.

15. Chahine October 5, 2020 at 9:45 pm #

Hi Jason,

Great Article !!

Are we running the inner loop on the entire dataset split into train|test here?

Shouldn’t we hide 10% of the data from the inner loop and use the entire dataset in the outer loop only ?

• Jason Brownlee October 6, 2020 at 6:50 am #

The outer loop operates on all data, inner loop operates on the training set of one pass from the outer loop.

Why hold back data?

16. Virgilio October 6, 2020 at 10:08 pm #

Dear Dr. Brownlee,

Thank you for this very detailed and interesting article.

I have just one question. When you describe the procedure to obtain the final model, you mention: “We know how well it will perform on average based on the score provided during the final model tuning procedure”.

However, I guess we cannot use this score, since it would be optimistically biased (hyperparameters were tuned). In order to estimate the final model performance, it seems more appropriate to use the unbiased score previously obtained with the nested cross-validation.

Does that make sense?

Thank you again and keep up the great work. Best regards.

• Jason Brownlee October 7, 2020 at 6:46 am #

We can, as the estimate was based on the outer loop.

This is the whole point, that the model tuning is now considered part of the model itself – e.g. the modeling pipeline.

17. Chahine October 7, 2020 at 3:49 am #

Hi Jason,

I am trying to apply your steps on my data. I have a pipeline containing tf-idf, LDA and a LogisticRegression model. My input data is a list of documents, X = [‘text’, ‘text’, …], y = [0, 1, …], and I am trying to optimize different parameters of the pipeline components using nested CV.

how can I use your code here to split the data, assuming that the data needs to be a numpy array.

for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]

Thank you
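
One possible way to handle a plain Python list of documents is to wrap it in a one-dimensional NumPy array and index it with a single index array, since `X[train_ix, :]` assumes a two-dimensional array. The toy documents and labels below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# toy documents and labels (hypothetical data)
X = ['first document', 'second document', 'third document',
     'fourth document', 'fifth document', 'sixth document']
y = [0, 1, 0, 1, 0, 1]

# a 1D object array of strings supports indexing with an array of row indices
X_arr = np.array(X, dtype=object)
y_arr = np.array(y)

cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)
fold_sizes = []
for train_ix, test_ix in cv_outer.split(X_arr):
    # note: no second ", :" index because the array is one-dimensional
    X_train, X_test = X_arr[train_ix], X_arr[test_ix]
    y_train, y_test = y_arr[train_ix], y_arr[test_ix]
    fold_sizes.append((len(X_train), len(X_test)))
print(fold_sizes)
```

An alternative that keeps the data as lists is a comprehension such as `[X[i] for i in train_ix]`.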

18. Karthik Mamudur October 31, 2020 at 9:26 am #

Hello Jason,

If I have to compare which algorithm (RandomForest,SomeOthermodel-1,SomeOthermodel-2) is the best, do I just repeat the nested cv with each of the above algorithms and take a decision based on the various nested cv accuracies and standard deviations?

Thank you,
Karthik

• Jason Brownlee October 31, 2020 at 9:35 am #

Yes, compare the mean performance from each technique.
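
A sketch of what that comparison could look like: the same outer folds are reused for every algorithm, and each algorithm gets its own inner search. The two candidate models, their search spaces, and the dataset are assumptions for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# same outer folds for every candidate so the comparison is like-for-like
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# hypothetical candidates: (model, search space)
candidates = {
    'random_forest': (RandomForestClassifier(n_estimators=20, random_state=1),
                      {'max_features': [2, 4, 6]}),
    'logistic_regression': (LogisticRegression(max_iter=1000),
                            {'C': [0.1, 1.0, 10.0]}),
}

results = {}
for name, (model, space) in candidates.items():
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
    results[name] = (mean(scores), std(scores))
    print('%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))
```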

• Karthik October 31, 2020 at 3:34 pm #

Thank you!

19. Payal Goyal November 17, 2020 at 4:11 am #

Hi Jason Brownlee,

Thanks for the great explanation. But I have 1 doubt: Shouldn’t we use 1 more for loop in inner iterations as we did for outer loop?

• Jason Brownlee November 17, 2020 at 6:34 am #

We do in the final example, it is just handled for us by the gridsearch.

20. Emin November 23, 2020 at 9:07 am #

Dear Jason,
Thank you for the post. I have two questions.
1) First one is about pre-processing techniques (scaling, imputation, etc.) and nested cross validation. Can nested cross validation be used to decide on a pre-processing technique, say to make a decision about the imputation techniques mean vs. median vs regression?
2) Is it possible to generalize nested cross validation to other resampling techniques? For example, “nested repeated train-test splits” where both outer and inner validations use repeated random train-test splits. If so, do you know any academic references to the topic? If I understand them correctly, the most commonly used nested resampling technique in scientific publications is single train-test split for the outer validation and a regular k-fold cross-validation for the inner validation.
Best,
Emin

21. Emin November 23, 2020 at 12:14 pm #

Thank you

22. Ankush Jamthikar November 25, 2020 at 5:08 pm #

Hi, I am new to machine learning. I was reading this post and I must say it is very well written and clearly explained. I just have one question. I want to deploy my ML model. For this, I need to save the best model. In nested cross-validation, which model should we deploy? Because in each fold, we are getting a new model with a new set of hyper parameters.

• Jason Brownlee November 26, 2020 at 6:29 am #

Thanks!

One solution would be to use the grid search cv model and drop the outer loop.

Another solution would be to inspect the result from the grid search cv model and use the specific configuration chosen.

• Ankush Jamthikar November 26, 2020 at 7:53 pm #

Hi Dr. Brownlee,

1) Nested cross-validation does not provide a final model. It is just a way to check (or evaluate) the performance of an ML algorithm on different independent training and testing datasets. The averaged performance metrics that come out of cross-validation describe the overall behavior of the model, which is what we report to the client.

2) After nested cross-validation, if we want to give the best model to someone (say client), then we need to (a) use the whole dataset, (b) run hyperparameter optimization, (c) get the best parameter, and (d) retrain the fresh ML algorithm with the best parameters to save the ML model as the “Final model.”

Is this correct?

• Jason Brownlee November 27, 2020 at 6:37 am #

Yes, that approach is valid and what I would do.
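
Steps (a) to (d) can be sketched as a single grid search on the whole dataset, since `refit=True` retrains the best configuration on all the data passed to `fit()`. The dataset and search space are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# (a) the whole dataset (synthetic here, an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# (b) hyperparameter optimization with the same inner CV used in nested CV
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=20, random_state=1)
space = {'max_features': [2, 4, 6]}  # hypothetical search space
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
result = search.fit(X, y)

# (c) the best parameters found
print(result.best_params_)

# (d) refit=True has already retrained this model on all of X, y
final_model = result.best_estimator_
yhat = final_model.predict(X[:5])
```

The performance to report for this final model is the one from the earlier nested cross-validation, not `result.best_score_`.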

• Ankush Jamthikar November 29, 2020 at 12:19 am #

Dear Dr. Brownlee,

Thank you very much for validating my thought process and clarifying the doubts. I will be troubling you once again after reading all your articles/blogs. I hope you will clarify my future doubts as well.

Thank you once again,
Ankush

• Jason Brownlee November 29, 2020 at 8:14 am #

You’re welcome!

• JLH December 14, 2020 at 3:19 am #

Don’t you run the risk here again of having the hyperparam search overfit again? Seems the nested CV shows that it’s possible to have a good model, but how do you guarantee (or at least maximize the chance) of actually getting a good single model at the end that you can use? Thanks!

23. Edward MB December 30, 2020 at 12:18 am #

Hi, thank you very much for this article – it has been very useful. How though, please, can I expand on this by including recursive feature elimination so that not only will the process find the best hyperparameters but also the most optimal features (i.e. the best hyperparameter and feature combination)?

• Jason Brownlee December 30, 2020 at 6:39 am #

A RFECV will automatically configure itself. You can use it in a pipeline prior to your model.

Or use a pipeline with a grid search on the RFE hyperparameter.

This tutorial will show you how to use RFE in a pipeline:
https://machinelearningmastery.com/rfe-feature-selection-in-python/
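
A minimal sketch of the second option: RFE placed in a pipeline ahead of the model, with the number of selected features tuned by the inner grid search. The estimator inside RFE, the grid values, and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# feature selection is a pipeline step, so it is refit on each training fold
pipeline = Pipeline([
    ('rfe', RFE(estimator=DecisionTreeClassifier(random_state=1))),
    ('model', RandomForestClassifier(n_estimators=20, random_state=1)),
])

# tune the number of features together with a model hyperparameter
space = {
    'rfe__n_features_to_select': [3, 5, 8],
    'model__max_features': [2, 3],
}

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, space, scoring='accuracy', cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```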

• Edward MB January 1, 2021 at 2:18 am #

Thank you for your reply. Using the explicit outer-inner loop above, I think I achieve this using the below.

• Jason Brownlee January 1, 2021 at 5:31 am #

I’m happy to hear you are making progress.

Sorry, I don’t have the capacity to review your code.

• Felipe Araya February 3, 2021 at 1:42 am #

Hello Edward,

This code looks great! I have been looking at something like that in Stackoverflow and similar, but this one seems the neatest to me, particularly when it comes to implementing RFE and GridSearchCV together. One question, why did you use RFE instead of RFECV?

Thanks!

24. Felipe Araya January 28, 2021 at 1:18 am #

Hello!

Awesome tutorial as always!

Just have a couple of questions if you don’t mind:

1. In the section “How Do You Configure the Final Model?”, Step 2 seems identical to Step 4 to me, of course it is probably not, would you mind clarifying that, please?

2. What if I wanted to include data preparation in the code you showed above (such as scaling or feature categorisation)? Would I have to put that within the outer loop, the inner loop, or outside of the loops?

2.1 If I had to put it within one of the loops, how can you do that if you use “cross_val_score”? I guess you can only use the “split()” method, right?

• Jason Brownlee January 28, 2021 at 6:06 am #

Yes, step 4 is not needed if you use gridsearchcv.best_estimator_

You can use a pipeline for the “model” and include any data prep steps you like.

I don’t understand your last question, sorry. Perhaps you can elaborate?

• Felipe Araya January 28, 2021 at 6:56 am #

Hi Dr. Jason,

Thanks for answering my first question.

Sure, the second question has to do with data leakage. For example, it is mentioned that feature manipulation, scaling, or any type of preprocessing should be done for the train and test sets separately to prevent data leakage. Since the outer loop represents the “test set” and the inner loop represents the “training + validation” set, I was wondering where to incorporate the preprocessing here: within the outer loop, within the inner loop, or outside both. Does that make sense?

• Jason Brownlee January 28, 2021 at 7:59 am #

Define a pipeline and let it handle the correct procedure for fitting on the train set and application to train and test set of data preparation methods. This applies all the way down into the iterations of the grid search.
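
A short sketch of that idea, assuming a `StandardScaler` followed by an `SVC` (both chosen only for illustration). Because the scaler is a pipeline step, it is fit on the training portion of every split, both inner and outer, and only applied to the held-out portion.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# data preparation and model treated as one "model"
pipeline = Pipeline([('scaler', StandardScaler()), ('model', SVC())])
space = {'model__C': [0.1, 1.0, 10.0]}  # hypothetical search space

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# the pipeline is what gets tuned, so scaling is repeated inside every fold
search = GridSearchCV(pipeline, space, scoring='accuracy', cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

This also works with `cross_val_score`, since the pipeline is passed in place of the bare model and no manual `split()` loop is needed.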

25. Felipe Araya February 3, 2021 at 1:54 am #

HI Jason,

Quick question: can you use RFECV within a GridSearchCV, or should you use normal RFE? My logic is that within the inner CV that GridSearchCV generates (essentially the train+validation set) there will be another CV happening from RFECV, so I am not sure if this makes much sense or if I am better off using RFE evaluated for range(1, len(features)+1) within the GridSearch.

• Jason Brownlee February 3, 2021 at 6:22 am #

You can use either, it’s up to you.

26. Jens February 22, 2021 at 12:28 am #

Dear Jason Brownlee,

I have a question regarding the final model. Let’s say we performed a nested CV with an inner and outer CV of 5 folds. If we want the final model, we again perform, for example, a random search on the whole (training) data and get the best model from this search.

Should the number of folds for getting the final model be equal to the inner CV or can it be different, say 10 folds? If it should be the same, why is that?

Kind regards,

Jens

• Jason Brownlee February 22, 2021 at 5:02 am #

The number of folds for the inner loop when training the final model should match what was used during model selection – to be consistent with the decision and comparisons made among models.

27. Giovanna February 25, 2021 at 12:24 pm #

would this technique be equivalent to using RepeatedKFold?

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
model_cv_fit = cross_val_score(model, features.values, y.values, cv=cv)

• Jason Brownlee February 26, 2021 at 4:53 am #

That is an example of repeated cv.
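
To illustrate the difference, the sketch below runs repeated k-fold on a fixed model and then reuses the same repeated k-fold as the outer loop of a nested procedure. The smaller fold and repeat counts, the dataset, and the search space are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold, cross_val_score

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)
model = RandomForestClassifier(n_estimators=10, random_state=1)

# repeated k-fold alone: evaluates one fixed configuration, no tuning involved
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
plain_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# nested CV: the tuning procedure itself is evaluated by the outer loop
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(model, {'max_features': [2, 4, 6]}, scoring='accuracy',
                      cv=cv_inner, refit=True)
nested_scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv)

print('Repeated CV: %.3f' % plain_scores.mean())
print('Nested CV:   %.3f' % nested_scores.mean())
```

Repetition controls the variance of the estimate; nesting is what separates tuning from evaluation. The two can be combined, as in the second half of the sketch.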

28. Pablo Reinhardt March 25, 2021 at 8:44 pm #

Dear Jason,
when I do stacking, do I treat the stacking model just as “a model” like in your example, or is there something I have to consider?
Cheers
Reini

• Jason Brownlee March 26, 2021 at 6:23 am #

No, it’s just a model fit on data, it just so happens that the inputs in the training data are outputs of other models.

29. Reiner April 15, 2021 at 8:10 am #

Hi dear Jason,

would it be possible to replace the line:
model = RandomForestClassifier

… with a stacking classifier:
stacking = StackingClassifier(estimators=models) where models is a list of classifiers ?

if so, how can you combine nested CV with a two layer classifier ?

• Jason Brownlee April 16, 2021 at 5:27 am #

Sure.

Ouch, that might require some thinking. I can’t give you a good answer off the cuff, sorry.
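
One possible starting point, offered as a sketch rather than a definitive answer: treat the `StackingClassifier` as the model and tune it through the nested parameter names that scikit-learn exposes (for example `final_estimator__C`). The base models, the grid, and the dataset below are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)

# hypothetical level-0 models and a logistic regression meta-model
models = [
    ('rf', RandomForestClassifier(n_estimators=20, random_state=1)),
    ('knn', KNeighborsClassifier()),
]
stacking = StackingClassifier(estimators=models,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=3)

# tune only the meta-model here; base models could be tuned via e.g. 'rf__max_features'
space = {'final_estimator__C': [0.1, 1.0, 10.0]}

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(stacking, space, scoring='accuracy', cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

Note that the stacking model runs its own internal cross-validation to build the meta-model's training data, so this is effectively three levels of resampling and can become slow on larger problems.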

30. Kwa June 4, 2021 at 1:44 pm #

Can I say the final accuracy is equal to the accuracy of an independent test set?

• Jason Brownlee June 5, 2021 at 5:23 am #

The final accuracy would be the mean over all runs.

31. Tim Stack July 7, 2021 at 12:01 am #

Dear Jason,

At one point in the code you comment the best performing model from Grid Search is subsequently trained (fitted) on the outer loop training data.

    # define search
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)

However, it would appear to me that the best model from the inner loop is selected and then used on the outer loop test set, without any training on the outer loop preceding that step.

In order to train the inner loop model on the outer loop set, shouldn’t a line be added?

    # define search
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model
    best_model = result.best_estimator_
    # fit on the whole training set
    best_model.fit(X_train, y_train)
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)

• Jason Brownlee July 7, 2021 at 5:34 am #

Yes, the best inner loop is evaluated on the outer loop.

If you want to use the best model as determined by the outer loop, that is about “finalizing a model”, see the section on that.
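
On the question of the extra `fit()` call: with `refit=True`, scikit-learn's `GridSearchCV` retrains the best configuration on all the data passed to `search.fit()`, so `best_estimator_` is already fit on the outer-loop training set. The sketch below checks this by comparing against a manual refit; the dataset, the single train/test split, and the search space are assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# synthetic dataset and one outer split (assumptions for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
model = RandomForestClassifier(n_estimators=20, random_state=1)
space = {'max_features': [2, 4, 6]}
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
result = search.fit(X_train, y_train)

# already refit on all of X_train, y_train by the search
best_model = result.best_estimator_

# manual refit of the same configuration for comparison
manual = clone(model).set_params(**result.best_params_)
manual.fit(X_train, y_train)

same = np.array_equal(best_model.predict(X_test), manual.predict(X_test))
print('Predictions identical:', same)
```

Because the random seed is fixed, the two models should make the same predictions, which suggests the additional `best_model.fit(X_train, y_train)` line is redundant rather than harmful.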

32. Vinay July 8, 2021 at 9:59 pm #

Hi Jason,

I really appreciate your work and contribution. I need to clarify just one thing, once we obtain the best parameter, should I fit the final model using cv or without? If yes, should cv = ‘inner cv’ or ‘outer cv’?

• Jason Brownlee July 9, 2021 at 5:08 am #

In nested CV, you would use an inner CV process to find the best hyperparameters for the model and then fit a final model on those hyperparameters. The gridsearchcv does this for you if you want or you can do it manually.

33. Vinay July 12, 2021 at 3:51 pm #

Thanks for your reply Jason. I have one more doubt (in general), while fitting the model with the best parameters how should I decide upon which random_state will give me a more reliable model for prediction with high accuracy. I tried to iterate over 100 random_states out of which 7 accuracy values were found to be unique ranging from 0.5 to 0.76. Should I choose the random_state which gives highest or median accuracy to fit the model?

34. kukushiwo September 21, 2021 at 12:57 pm #

Hi, Jason,
Thank you for the post.
I want to know which data (the training and validation sets of the inner loop, or the training and test sets of the outer loop) should be used to draw a learning curve (loss vs. iterations) when performing nested CV.

• Adrian Tam September 23, 2021 at 2:59 am #

When you talk about learning curve, you are plotting the “score” (whatever you decided to use) against the “number of iterations”. So in the middle of an iteration, before the model is trained, you should be able to find a score. What can that be? If the iteration is about the inner loop, of course that score is from the validation set.

• kukushiwo September 23, 2021 at 4:34 pm #

So, if the iteration is about the outer loop, score is from the test data set, is it right?

But your post about the learning curves, in which you replied to other in the comment as follows,
“Jason Brownlee July 4, 2019 at 7:52 am #
Typically just train and validation sets.”

You can find this at the following link.
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/

Does it mean that we cannot draw a learning curve of outer loop using training data set and test data set?

• Adrian Tam September 24, 2021 at 4:56 am #

I don’t think you should have a learning curve for outer loop because the outer loop is about different folds of cross validation. You can draw a bar chart for that to see whether the different folds produce similar score, however.

• kukushiwo September 24, 2021 at 9:49 am #

But inner loop is also about different folds of cross validation, so if I want to draw learning curve for inner loop, which fold of validation set of the inner loop should I use, and which hyperparameters should I use?

• Adrian Tam September 25, 2021 at 4:12 am #

Each fold is about a particular partitioning of the dataset. For the model, you start from iteration 0, where everything is random, up to iteration N, where it starts to churn out meaningful results. What you want to plot is, for one particular partitioning and one particular model, how the resulting score compares across iterations. So simply, in k-fold, you get k curves.

35. kukushiwo September 27, 2021 at 12:34 pm #

‘Each fold is about a particular partitioning of the dataset. For the model, you start from iteration 0, where everything is random, up to iteration N, where it starts to churn out meaningful results. What you want to plot is, for one particular partitioning and one particular model, how the resulting score compares across iterations. So simply, in k-fold, you get k curves.’

After I got the best hyperparameters by running the inner loop, I should use the best hyperparameters to draw learning curves(k curves) of each fold(k folds in all) of the inner loop, am I right?

• Adrian Tam September 28, 2021 at 8:45 am #

Right. Each learning curve is for one model from its random state to trained state. You shouldn’t mix up different models in the same curve.

• kukushiwo September 28, 2021 at 3:50 pm #

Thank you very much!

We should train on all the data we have, but use the accuracy (or F1, ROC AUC etc.) obtained by running the outer loop to see how good the model is. Which data set should we use to do the feature selection?

XGBoost’s built-in feature importance is calculated based on the training set. And since we provide the customer with the model trained on all the data we have, does it mean we should use all the data to do feature selection?

36. Lamin October 25, 2021 at 4:36 am #

Hello, first of all thanks a lot for your tutorials! They help me a lot. I have got three questions:

What is the difference between hyperparameter tuning and model selection? I thought it meant the same thing.

Can we consider the outcome of the automatic nested cross-validation routine as an estimate for its generalization performance? I noticed, that you ran the nested cross-validation on the complete dataset.

Also, if I needed to scale the data first, would it suffice to scale the whole dataset beforehand or would it be better to somehow build a pipeline with the scaler and the classifier and then feed the pipeline to GridSearchCV?

• Adrian Tam October 27, 2021 at 2:14 am #

1. Model selection comes first, before you have hyperparameters to tune. If you pick kNN, you have the hyperparameter k; but if you pick a neural network, k doesn’t apply.
2. Yes, that’s the point of cross-validation: using a limited dataset to estimate the general performance on unseen data.
3. Building a pipeline so that scaling happens inside cross-validation is better. Given that you should not see the validation part of the data when you fit the model, why should you see it when scaling?