Nested Cross-Validation for Machine Learning with Python

By Jason Brownlee on November 20, 2021 in Python Machine Learning

The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.

This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.

One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.

In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.

After completing this tutorial, you will know:

Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Updated Jan/2021: Added section on pipeline thinking and a link to a related tutorial.

Nested Cross-Validation for Machine Learning with Python
Photo by Andrew Bone, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Combined Hyperparameter Tuning and Model Selection
What Is Nested Cross-Validation
1. What Is the Cost of Nested Cross-Validation?
2. How Do You Set k?
3. How Do You Configure the Final Model?
4. What Configuration Was Chosen by Inner Loop?
Nested Cross-Validation With Scikit-Learn

Combined Hyperparameter Tuning and Model Selection

It is common to evaluate machine learning models on a dataset using k-fold cross-validation.

The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.
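
As a rough illustration of the splitting step, the sketch below partitions row indexes into k non-overlapping folds in plain Python. The helper name `kfold_indices` is hypothetical and performs no shuffling; it is only meant to show the idea, not to replace a library implementation.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, non-overlapping folds."""
    # the first (n_samples % k) folds receive one extra row
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 3)
for i, test_ix in enumerate(folds):
    # all other folds are pooled to form the training set
    train_ix = [ix for j, fold in enumerate(folds) if j != i for ix in fold]
    print('fold %d: train=%s test=%s' % (i, train_ix, test_ix))
```

Each row appears in exactly one test fold, so every row is used for evaluation once and for training k-1 times.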

For more on the k-fold cross-validation procedure, see the tutorial:

A Gentle Introduction to k-fold Cross-Validation

The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training. It is less biased than some other techniques, such as a single train-test split, for small- to modestly-sized datasets. Common values for k are k=3, k=5, and k=10.

Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics on how to configure the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that performs well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.

This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.

The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.

Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset. Specifically, it provides an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.

The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.
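
The effect can be seen in a toy setting. The sketch below is not the cross-validation setup used in this tutorial; it assumes labels that are pure noise and treats each hypothetical "configuration" as a fixed set of random guesses, so that no configuration can truly do better than 50% accuracy. Selecting the best-scoring configuration on the same data used to compare them still yields a score above chance.

```python
import random

rng = random.Random(1)
n_rows, n_configs = 100, 200

# labels are pure noise, so no "configuration" can truly beat 50% accuracy
y = [rng.randint(0, 1) for _ in range(n_rows)]

def accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# each hypothetical "configuration" is just a fixed set of random guesses
configs = [[rng.randint(0, 1) for _ in range(n_rows)] for _ in range(n_configs)]
scores = [accuracy(y, guesses) for guesses in configs]

# select the configuration that looks best on the data used to compare them
best_ix = max(range(n_configs), key=lambda i: scores[i])
best_score = scores[best_ix]

# score the selected configuration against fresh noise labels
y_new = [rng.randint(0, 1) for _ in range(n_rows)]
fresh_score = accuracy(y_new, configs[best_ix])

print('score used for selection: %.3f' % best_score)
print('score on fresh labels: %.3f' % fresh_score)
```

The score used for selection is the maximum of many noisy scores, which is why it tends to overstate how the selected configuration behaves on data it was not selected on.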

A procedure is required that allows both the selection of well-performing hyperparameters for each model and the selection among a collection of well-configured models on a dataset.

One approach to this problem is called nested cross-validation.

What Is Nested Cross-Validation

Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.

In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.

As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”

Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”

Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.

This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.
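
The structure of the two loops can be sketched with row indexes alone. The snippet below is a minimal illustration in plain Python, with a hypothetical helper `kfold_split` standing in for a library splitter; it counts how many outer test rows are ever handed to the inner procedure.

```python
def kfold_split(rows, k):
    """Yield (train, test) lists by dealing the given rows into k folds."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, folds[i]

rows = list(range(12))
leaks = 0
for outer_train, outer_test in kfold_split(rows, 4):
    # the inner procedure only ever receives the outer training rows
    for inner_train, inner_test in kfold_split(outer_train, 3):
        leaks += len(set(outer_test) & set(inner_train + inner_test))

print('outer test rows seen by the inner loop: %d' % leaks)  # 0
```

Because the inner folds are carved out of the outer training rows only, the outer test fold stays untouched until the tuned model is evaluated on it.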

In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.

— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.

What Is the Cost of Nested Cross-Validation?

A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.

If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to k * n * k as the procedure is then performed k more times for each fold in the outer loop of nested cross-validation.

To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 5,000 models. A 10x increase in this case.
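
The arithmetic can be written down as a small sketch. The function names are hypothetical, and the counts ignore any additional refit of the best configuration at the end of each search.

```python
def search_fits(n_configs, k_inner):
    """Model fits in a single cross-validated hyperparameter search."""
    return n_configs * k_inner

def nested_fits(n_configs, k_inner, k_outer):
    """Model fits when that search is repeated once per outer fold."""
    return k_outer * search_fits(n_configs, k_inner)

print(search_fits(100, 5))      # 500
print(nested_fits(100, 5, 10))  # 5000
```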

How Do You Set k?

The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.

You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against the need for an unbiased estimate of model performance.

It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.

For more general help on setting k, see this tutorial:

How to Configure k-Fold Cross-Validation

How Do You Configure the Final Model?

The final model is configured and fit using the procedure applied in one pass of the outer loop, e.g. the inner search procedure applied to the entire dataset.

As follows:

An algorithm is selected based on its performance on the outer loop of nested cross-validation.
Then the inner procedure is applied to the entire dataset.
The hyperparameters found during this final search are then used to configure a final model.
The final model is fit on the entire dataset.

This model can then be used to make predictions on new data. We know how well it will perform on average based on the score provided by the outer loop of nested cross-validation.
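
A minimal sketch of these final-model steps with scikit-learn is given below. The small dataset, the reduced search space, and the variable name `final_model` are assumptions made to keep the example short and fast; they are not the configuration used elsewhere in this tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# small synthetic dataset, assumed here only to keep the sketch fast
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# the same kind of search procedure that would be evaluated in the inner loop
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)

# apply the search to the entire dataset; refit=True then fits the
# chosen configuration on all rows
search.fit(X, y)
final_model = search.best_estimator_

# predict with the final model (the first five rows stand in for new data)
yhat = final_model.predict(X[:5])
print(search.best_params_, yhat)
```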

What Configuration Was Chosen by Inner Loop?

It doesn’t matter, that’s the whole idea.

An automatic configuration procedure was used instead of a specific configuration. There is a single final model, but the best configuration for that final model is found via the chosen search procedure, on the final run.

You let go of the need to dive into the specific model configuration chosen, just like at the next level down you let go of the specific model coefficients found in each cross-validation fold.

This requires a shift in thinking and can be challenging, e.g. a shift from “I configured my model like this…” to “I used an automatic model configuration procedure with these constraints…“.

This tutorial has more on the topic of “pipeline thinking” and may help:

A Gentle Introduction to Machine Learning Modeling Pipelines

Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.

Nested Cross-Validation With Scikit-Learn

The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.

The class is configured with the number of folds (splits), then the split() function is called, passing in the dataset. The results of the split() function are enumerated to give the row indexes for the train and test sets for each fold.

For example:

...
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# perform cross-validation procedure
for train_ix, test_ix in cv_outer.split(X):
	# split data
	X_train, X_test = X[train_ix, :], X[test_ix, :]
	y_train, y_test = y[train_ix], y[test_ix]
	# fit and evaluate a model
	...

This class can be used to perform the outer loop of the nested cross-validation procedure.

The scikit-learn library provides cross-validation random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The procedure is configured by creating the class and specifying the model, hyperparameters to search, and cross-validation procedure; the dataset is then passed to the fit() function.

For example:

...
# configure the cross-validation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define search space
space = dict()
...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)

These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.
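
For completeness, a random search can be configured in much the same way as the grid search. The sketch below is an illustration under assumed settings (a small synthetic dataset, a reduced search space, and `n_iter=4` sampled configurations), not a recommended configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, RandomizedSearchCV

# small synthetic dataset, assumed here only to keep the sketch fast
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# configure the cross-validation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model and the values to sample from
model = RandomForestClassifier(random_state=1)
space = {'n_estimators': [10, 50, 100], 'max_features': [2, 4, 6]}
# define search: evaluate 4 randomly sampled configurations out of the 9 possible
search = RandomizedSearchCV(model, space, n_iter=4, scoring='accuracy',
                            cv=cv, random_state=1)
# execute search
result = search.fit(X, y)
print('best=%.3f, cfg=%s' % (result.best_score_, result.best_params_))
```

Random search trades exhaustive coverage of the space for a fixed budget of evaluations, which can help contain the cost of the nested procedure.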

We can tie these elements together and implement the nested cross-validation procedure.

Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.

...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_

This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.

...
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.

We will keep things simple and tune just two hyperparameters with three values each, e.g. (3 * 3) 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.

The complete example is listed below.

# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
	# split data
	X_train, X_test = X[train_ix, :], X[test_ix, :]
	y_train, y_test = y[train_ix], y[test_ix]
	# configure the cross-validation procedure
	cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
	# define the model
	model = RandomForestClassifier(random_state=1)
	# define search space
	space = dict()
	space['n_estimators'] = [10, 100, 500]
	space['max_features'] = [2, 4, 6]
	# define search
	search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
	# execute search
	result = search.fit(X_train, y_train)
	# get the best performing model fit on the whole training set
	best_model = result.best_estimator_
	# evaluate model on the hold out dataset
	yhat = best_model.predict(X_test)
	# evaluate the model
	acc = accuracy_score(y_test, yhat)
	# store the result
	outer_results.append(acc)
	# report progress
	print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))

Running the example evaluates random forest using nested cross-validation on a synthetic classification dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.

Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.

This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.

A final mean classification accuracy is then reported.

>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}
>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}
>acc=0.900, est=0.937, cfg={'max_features': 4, 'n_estimators': 500}
Accuracy: 0.927 (0.019)
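
The reported lines can be summarized with a few lines of plain Python. The values below are copied from the output above; the variable names are only illustrative.

```python
from collections import Counter

# (acc, est, max_features, n_estimators) copied from the ten lines of output above
results = [
    (0.900, 0.932, 4, 100),
    (0.940, 0.924, 4, 500),
    (0.930, 0.929, 4, 500),
    (0.930, 0.927, 6, 100),
    (0.920, 0.927, 4, 100),
    (0.950, 0.927, 4, 500),
    (0.910, 0.918, 2, 100),
    (0.930, 0.924, 6, 500),
    (0.960, 0.926, 2, 500),
    (0.900, 0.937, 4, 500),
]

# compare the holdout accuracy with the inner-loop estimate, averaged over folds
mean_acc = sum(r[0] for r in results) / len(results)
mean_est = sum(r[1] for r in results) / len(results)
print('mean holdout acc=%.3f, mean inner est=%.3f' % (mean_acc, mean_est))

# count how often each configuration was selected by the inner search
chosen = Counter((r[2], r[3]) for r in results)
print(chosen.most_common(1))  # [((4, 500), 4)]
```

In this particular run the two averages are close, and the most frequently chosen configuration was selected in only four of the ten outer folds.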

A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. The configured GridSearchCV object can be passed to it directly; within each outer fold, the search is fit on the training folds and the refit best-performing model is automatically used to make predictions on the test fold.

This greatly reduces the amount of code required to perform the nested cross-validation.

The complete example is listed below.

# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.

Accuracy: 0.927 (0.019)
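
The shorter version no longer prints the configuration chosen in each outer fold. If that detail is wanted, one option is the cross_validate() function with return_estimator=True, which keeps the fitted search object from each fold. The sketch below assumes a smaller dataset, a reduced search space, and 5 outer folds to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# small synthetic dataset, assumed here only to keep the sketch fast
X, y = make_classification(n_samples=200, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# inner loop: hyperparameter search
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)

# outer loop: evaluate the search procedure and keep each fitted search
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
results = cross_validate(search, X, y, scoring='accuracy', cv=cv_outer,
                         return_estimator=True)

# report the holdout accuracy, inner estimate, and chosen configuration per fold
for acc, fitted in zip(results['test_score'], results['estimator']):
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, fitted.best_score_, fitted.best_params_))
```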

Summary

In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.

Specifically, you learned:

Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

170 Responses to Nested Cross-Validation for Machine Learning with Python

Viktor July 29, 2020 at 3:54 pm #

Good afternoon, thanks for your article.
Could you please tell a newbie. The picture “A final mean classification accuracy is then reported” shows the “best parameters” for each outer fold.
1) what parameters from this list should be used in the final model?
2) how can the resulting trained model be called in your example?
thank you in advance.

- Jason Brownlee July 30, 2020 at 6:18 am #
  
  Think of it this way: We are searching for a modeling process, not a specific model and configuration.
  
  The modeling process that achieves the best result is the one we will use.
  
  So we could compare the best grid search model from multiple different algorithms and use the process that creates the best model as the process to create the final model.
  
  Does that make sense?
  
Abhishek V July 31, 2020 at 3:56 am #

Double cross-validation is no doubt a better approach, as it looks on the surface; we will see how it works with our own problem.

- Jason Brownlee July 31, 2020 at 6:23 am #
  
  Thanks.
  
Áttila July 31, 2020 at 6:11 am #

Hi Jason I follow yours posts and I learn a lot! Great stuff!

I do not know if I understood well the procedure but I think it is an overfitting again when it is selected a model with a better mean metric in the holdout groups.

I view the nested cross validation as a technique that gives k best models and it is not possible to assess the better. To do this I’ll need another holdout data but there is none anymore.

The question is what I’m gonna do with k best models? I do not know I can ensemble then. Or I can discard some using some theoretical analysis for example.

Cheers

- Jason Brownlee July 31, 2020 at 6:26 am #
  
  Thanks!
  
  No, it selects a modeling pipeline that results in the best model. E.g. the grid search will choose the best model for you to start using directly and make predictions.
  
  - Leo August 10, 2020 at 6:02 am #
    
    Hi Jason!! Now I get it! You dont end up with k models. What you get from cross_val_score is an UNBIASED estimate for the generalization error. Now let´s say you´re tasked with making a new prediction using a fresh set of predictors X.
    What you should do is you should run GridSearchCV on your entire dataset. Then use best_model to make your prediction.
    
    Great post!!!
    
    - Jason Brownlee August 10, 2020 at 6:52 am #
      
      Exactly!
      
      - Felipe Araya January 28, 2021 at 12:32 am #
        
        Hi Jason,
        
        I am a bit confused about this. Shouldn’t it be:
        0. Create Outer and Inner loops (train and test set)
        1. Run the GridSearchCV on your train set
        2. Get your “best_model” and put it against the outer loop
        3. Once you get your best generalised model (after you ran the outer loop) get the hyperparameters you found work best for generalization
        4. Use those hyperparameters to retrain your model now using the whole data
        5. You can now use the model trained using the whole data and the hyperparameters found during the NestedCV, to make predictions
        6. If you get a new fresh set for your predictiors X (meaning more data to add to your previous set), repeat steps 1-4.
        
        Isn’t like that? or am I just reading the question asked wrongly?
      - Jason Brownlee January 28, 2021 at 5:59 am #
        
        No, not with nested cv. See the section on training a final model.
        
        Of course, you can do anything you like as long as you justify it.
      - Greg October 24, 2022 at 11:17 am #
        
        Is double cross validation only useful for estimating generalization error then, if we are going to discard all the inner loop best_models and create a new one from scratch using a grid search on the entire dataset?
      - James Carmichael October 25, 2022 at 6:24 am #
        
        Hi Greg…I would recommend Bayesian Optimization for hyperparameter selection and tuning as opposed to grid search.
        
        https://www.vantage-ai.com/en/blog/bayesian-optimization-for-quicker-hyperparameter-tuning
Shahid Islam August 2, 2020 at 12:10 am #

Hi:
I work with very large data set( more than billion rows). I would like to see a book where these models and methods are described using pyspark in spark setting. Now a days large data set is more common, so a book like that will be very useful for data scientists.

- Jason Brownlee August 2, 2020 at 5:46 am #
  
  Thanks for the suggestion!
  
Anthony The Koala August 12, 2020 at 10:14 am #

Dear Dr Jason,
This is about the lines of code in the nested model, particularly

	search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
	# execute search
	result = search.fit(X_train, y_train)
	# get the best performing model fit on the whole training set
	best_model = result.best_estimator_
	# evaluate model on the hold out dataset
	yhat = best_model.predict(X_test)

After some experimentation, I discovered that you could say the same thing as:

	search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
	search.fit(X_train, y_train)
	yhat = search.predict(X_test)

The reason is that search contains the best model for the given parameters such that we can fit X_train, y_train for the best model and predict using X_test with the best model.
Definition: best model = search, the best model for the GridSearch’s parameters.

Thank you,
Anthony of Sydney

Jason Brownlee August 12, 2020 at 1:35 pm #

Yes, that is how you would use the technique as a final model without the outer loop for model evaluation.

- Anthony The Koala August 12, 2020 at 3:00 pm #
  
  Dear Dr Jason,
  Thank you for the reply. The question I asked related to the nested example achieving the same score with less code.
  
  My question is on the last model.
  
  If the last model has less code and no loops, but achieve the same results, why not implement the last example that achieves the same outcome of removing the noise from the scores?
  
  Thank you again,
  Anthony of Sydney
  
  - Jason Brownlee August 13, 2020 at 6:04 am #
    
    It is doing something different.
    
    The example in the tutorial is evaluating the modeling pipeline with k-fold cross-validation (outer loop).
    
    Your example fits the pipeline once and uses it to make a prediction.
    
    - Janine S March 18, 2021 at 10:28 pm #
      
      Hey Jason,
      
      I think that Anthony is right. Actually, if we could not replace
      
      yhat = best_model.predict(X_test) by yhat = search.predict(X_test),
      
      it would make no sense that the shorter code version at the very end of your article would work, right?
      Because what cross_val_score does is create train and test folds, then train the model given as the first argument (in our case, search), and then evaluate it on the test set using .predict(X_test). So in cross_val_score, we can’t tell the function to work with search.best_estimator_, which requires that in each iteration of the inner loop,
      
      search.best_estimator == search itself.
      
      One can also check that by printing some boolean statements within the inner for loop.
      
      If I’m wrong though, would you please explain me why there is a difference and why cross_val_score does work, though?
      
      Thank you!
    - Jason Brownlee March 19, 2021 at 6:21 am #
      
      Not sure I follow.
      
      search.predict() uses the best model directly – they are equivalent:
      https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.predict
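
The equivalence described above can be checked directly. The following is a minimal sketch, assuming a small synthetic dataset and a LogisticRegression grid chosen only for illustration (neither comes from the tutorial); it compares the predictions of a fitted GridSearchCV with those of its best_estimator_.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# small synthetic dataset, illustrative only
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# grid search with refit=True, so the best config is refit on all of X_train
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]},
                      scoring="accuracy", cv=cv_inner, refit=True)
search.fit(X_train, y_train)

# predict through the search object and through the refit best model
yhat_search = search.predict(X_test)
yhat_best = search.best_estimator_.predict(X_test)
same_predictions = bool(np.array_equal(yhat_search, yhat_best))
print(same_predictions)
```

This only holds when refit=True (the default); with refit=False the search object has no best_estimator_ and cannot predict.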

Anthony The Koala August 12, 2020 at 5:25 pm #

Dear Dr Jason,
Thank you.
In addition to the above question.
In the nested example, you get X_train and y_train from the outer splits and were able to get yhat

result = search.fit(X_train, y_train)
best_model = result.best_estimator_
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

In the above code, in order to get X_train and y_train, there were 10 iterations of the loop because there were 10 folds. We had 10 different train_ix, so we could estimate yhat based on 10 versions of X_train and y_train.

BUT in the last example there are no iterations and no ability to estimate yhat. So we cannot predict yhat because we don’t have 10 iterations of train_ix as in the nested example?

Put it another way, how can we make a prediction in the last example?

Thank you,
Anthony of Sydney

How do you make a prediction for yhat in the second example

Jason Brownlee August 13, 2020 at 6:08 am #

The last example does the same thing as the first example in fewer lines of code.

You can make a prediction using the code example you provided, e.g. fit the pipeline then call predict().

Anthony The Koala August 13, 2020 at 7:50 am #

Dear Dr Jason,
Thank you.
But in the last example, how do I get the indices for train_ix when there is no iteration as in the first example?
I need the train_ix indices in order to get X_train and y_train. I cannot see how that could be obtained in the last example.

In the first example, you get the train_ix from the cv_outer.split(X).

for train_ix, test_ix in cv_outer.split(X):

In the last example there is no way to get the train_ix in order to get the X_train and the y_train because you need the indices for train_ix BUT you cannot get the train_ix indices in the second example.

So how do I get the train_ix in the last example.

Thank you and always appreciate your tutorials.
Anthony of Sydney

Jason Brownlee August 13, 2020 at 10:54 am #

In the last example, the cross-validation of the modeling pipeline (outer) is controlled via the call to cross_val_score(). The cross-validation of each hyperparameter combination is controlled automatically via the GridSearchCV (inner).

It does the same thing as the first example in fewer lines of code.
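
To make the correspondence concrete, here is a small sketch that runs both forms side by side and compares the per-fold scores. The dataset size, the DecisionTreeClassifier, and the max_depth grid are assumptions picked to keep the run fast; they are not the tutorial's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(DecisionTreeClassifier(random_state=1), {"max_depth": [2, 4, 6]},
                      scoring="accuracy", cv=cv_inner, refit=True)

# short form: cross_val_score drives the outer loop, GridSearchCV the inner loop
short_scores = cross_val_score(search, X, y, scoring="accuracy", cv=cv_outer)

# explicit form: the same outer folds, enumerated by hand
loop_scores = []
for train_ix, test_ix in cv_outer.split(X):
    search.fit(X[train_ix], y[train_ix])
    yhat = search.predict(X[test_ix])
    loop_scores.append(accuracy_score(y[test_ix], yhat))

print(np.round(short_scores, 3))
print(np.round(loop_scores, 3))
```

Because every source of randomness is seeded, the two lists should match fold for fold; the train_ix and test_ix indices exist in both cases, they are just generated inside cross_val_score in the short form.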

Reply

Anthony The Koala August 13, 2020 at 11:11 am #

Dear Dr Jason,
I tried to predict yhat from the last example.
In the last example, in order to get y_train, I had to do the following by appending this code to the end of the last example

# There were 3 iterations because search.cv contained 3 folds
my_scores = list()
for a,b in search.cv.split(X):
	X_train,y_train= X[a],y[a]
	X_test ,y_test = X[b],y[b]
	search.fit(X_train,y_train)
	yhat = search.predict(X_test)
	my_scores.append(accuracy_score(yhat,y_test))
my_scores
[0.9221556886227545, 0.9309309309309309, 0.924924924924925]
mean(my_scores), std(my_scores)
(0.9260038481595368, 0.0036628109334627327)

This is certainly different from either the first or second example which produced
Accuracy: 0.927 (0.019)
That is, mine produced
0.926 with std dev (0.0037)

It is strange that my std_dev was less using 3 sets of scores.

Thank you,
Anthony of Sydney

Anthony The Koala August 13, 2020 at 11:56 am #

Dear Dr Jason,
I did the same thing by appending the following code to the end of your last example, this time using cv_outer, which is 10-fold:

for a,b in cv_outer.split(X): # There are 10 folds
	X_train,y_train= X[a],y[a]
	X_test ,y_test = X[b],y[b]
	search.fit(X_train,y_train)
	yhat = search.predict(X_test)
	my_scores.append(accuracy_score(yhat,y_test))
my_scores
[0.9, 0.94, 0.93, 0.93, 0.92, 0.95, 0.91, 0.93, 0.96, 0.9]
mean(my_scores), std(my_scores)
(0.9270000000000002, 0.01899999999999998)

The result is exactly like your first and last code.

Summary of findings:
For the second example, in order to get the test and train sets each for X and y you have to have a loop in order to extract the indices for test and train. The indices for test and train are dependent on the number of folds.

When using for a,b in search.cv.split(X): 3 folds and 3 scores, mean(scores) = 0.926 with std dev 0.0037. Why is the std dev smaller, especially with fewer scores, when k = 3?

When using for a,b in cv_outer.split(X): # There are 10 folds and 10 scores
mean(scores) = 0.927 with std dev 0.019

Thank you,
Anthony of Sydney

Jason Brownlee August 13, 2020 at 1:26 pm #

Sorry, I don’t understand what you’re trying to achieve.

Why would you want to make a prediction within cross-validation?

Anthony The Koala August 13, 2020 at 4:06 pm #

Dear Dr Jason
Thank you for your reply.
This was the original question and maybe I did not make it clear.
(1) My original question was how to make predictions with the second example labelled above the heading “Further Reading”.

(2) You did make a prediction with the cross validation in the first example located at the section “Nested Cross-Validation With Scikit-Learn”

# from the 1st example - nested
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

(3) To answer your question “…what I am trying to achieve…:” is to be able to predict yhat in the second example above the heading “Further Reading” . I was able to predict yhat and achieved the same results.

This required me adding a few extra lines to the second example.

I finally worked it out getting EXACTLY the SAME mean(scores) and std(scores) as your example.

I added the following code to the end of your second example LOCATED above the section “Further Reading”

my_scores = list()
for a,b in cv_outer.split(X):
	X_train,y_train= X[a],y[a]
	X_test ,y_test = X[b],y[b]
	search.fit(X_train,y_train)
	yhat = search.predict(X_test)
	my_scores.append(accuracy_score(yhat,y_test))
my_scores
[0.9, 0.94, 0.93, 0.93, 0.92, 0.95, 0.91, 0.93, 0.96, 0.9]
mean(my_scores), std(my_scores)
(0.9270000000000002, 0.01899999999999998); # SAME RESULT AS YOUR EXAMPLES!!

Conclusion (1): you can reproduce X_train, y_train, X_test, y_test, scores, mean(scores) and std(scores) as in your examples.

Conclusion (2), you can make predictions with code and get the same results.

It works.

Thank you again for your patience,
Anthony of Sydney

It was because you used the prediction

Jason Brownlee August 14, 2020 at 5:57 am #

You provided code to make a prediction with the final example already and I pointed out as much, here it is again:

...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# run search and find best model
search.fit(X, y)
# make a prediction
yhat = search.predict(...)

Anthony The Koala August 14, 2020 at 8:48 am #

Dear Dr Jason,
Thank you.
It works, and I thank you for that.
Again, thank you for your patience which helps in understanding.
It is appreciated.
Anthony of Sydney

Reply
- Jason Brownlee August 14, 2020 at 1:15 pm #
  
  You’re very welcome Anthony, I’m here to help if I can.
  
  Reply

John Sang August 15, 2020 at 8:46 pm #

Hi, I show your result from this tutorial below:
>acc=0.900, est=0.932, cfg={‘max_features’: 4, ‘n_estimators’: 100}
>acc=0.940, est=0.924, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.930, est=0.929, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.930, est=0.927, cfg={‘max_features’: 6, ‘n_estimators’: 100}
>acc=0.920, est=0.927, cfg={‘max_features’: 4, ‘n_estimators’: 100}
>acc=0.950, est=0.927, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.910, est=0.918, cfg={‘max_features’: 2, ‘n_estimators’: 100}

Accuracy: 0.927 (0.019)

I read a lot of the comments below about how to choose the best parameters for building the final model, but it is still not clear to me.

Can I apply the performance results directly by selecting the parameter set with the best accuracy? In this case: >acc=0.950, est=0.927, cfg={‘max_features’: 4, ‘n_estimators’: 500}

Can I choose {‘max_features’: 4, ‘n_estimators’: 500}?

If not, what is your recommendation?

Reply
- Jason Brownlee August 16, 2020 at 5:51 am #
  
  You can.
  
  But the idea of nested CV is that you don’t choose the hyperparameters; the grid search does, and it will use whatever is “best” according to your data and metric.
  
  It takes config selection out of your hands and leaves you with a “process” rather than a model and a config. The outer CV is evaluating that process.
  
  Reply
John Sang August 19, 2020 at 5:01 pm #

Thank you, great advice.

Reply
- Jason Brownlee August 20, 2020 at 6:35 am #
  
  You’re welcome.
  
  Reply
MS August 25, 2020 at 2:27 am #

How to obtain the best hyperparameter combination for the simpler cross_val_score approach?

Reply
- Jason Brownlee August 25, 2020 at 6:43 am #
  
  Does the above example not help?
  
  What problem are you having exactly?
  
  Reply
  - MS August 26, 2020 at 11:38 pm #
    
    Using the longer approach, I obtained the best hyperparameter combination, which I might choose as my final model. How do I get it using the cross_val_score method? In the cross_val_score approach, after the GridSearchCV, how do the best parameters get automatically selected for the outer CV process?
    
    Reply
    - Jason Brownlee August 27, 2020 at 6:15 am #
      
      You don’t need to know the best hyperparameters as the pipeline will find the best model for you so you can start using it.
      
      Nevertheless, you can keep a reference to the object if you like and then print the “best_params_” attribute.
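
One way to keep such a reference when the outer loop is hidden inside a helper is sketched below. It swaps cross_val_score for cross_validate with return_estimator=True, which is an assumption on my part rather than something the tutorial uses; the dataset and the small grid are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=1)
cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {"n_estimators": [10, 50], "max_features": [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring="accuracy", cv=cv_inner, refit=True)

# return_estimator=True keeps the fitted GridSearchCV from each outer fold
results = cross_validate(search, X, y, scoring="accuracy", cv=cv_outer,
                         return_estimator=True)

# each kept estimator exposes the configuration it selected on its fold
chosen = [est.best_params_ for est in results["estimator"]]
for score, params in zip(results["test_score"], chosen):
    print(">acc=%.3f, cfg=%s" % (score, params))
```

The configurations can differ from fold to fold, which is expected: the outer loop is scoring the search procedure, not any single configuration.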
      
      Reply
      - MS August 28, 2020 at 2:15 am #
        
        I have to present a final model (as per client requirement). The cross_val_score approach is a process for tuning and observing how my tuned learner will perform on unseen data (a proxy for the generalization error). Please correct me if I’m wrong.
        
        As you said the pipeline will find the best model (that i can understand) but how to extract the best model so that my client can use it on his own data for prediction?
        
        As you have suggested to print the best_params_ attribute to solve my problem should I use nested-cv package from pypi?
      - Jason Brownlee August 28, 2020 at 6:53 am #
        
        You can use the best parameters to define the final model, then fit it on all available training data.
MS August 30, 2020 at 2:09 am #

thanks Jason

Reply
- Jason Brownlee August 30, 2020 at 6:45 am #
  
  You’re welcome.
  
  Reply
O'Brien September 4, 2020 at 3:23 pm #

Hello,

I understand that after nested CV we are left with a metric (say, accuracy) that evaluates the modelling pipeline, not any individual configuration trained in the inner loop.

What does this accuracy mean exactly as it relates to the final configuration produced by this pipeline? If the outer CV loop says the pipeline is X% accurate, is it also correct to claim that the final configuration produced by the pipeline is accurate X% of the time?

If not, how can we make a claim about the accuracy of the final configuration itself?

Reply
- Jason Brownlee September 5, 2020 at 6:39 am #
  
  Good question.
  
  It is the accuracy of the pipeline that included a grid search to find the best config within the pipeline. E.g. the best model found.
  
  Reply
  - O'Brien September 5, 2020 at 12:55 pm #
    
    Thank you for the reply.
    
    Just to clarify, if a client expects a quote of accuracy from a developer, i.e. they want to know how often the algorithm will make a correct prediction, is it correct to provide the mean accuracy calculated from the outer CV loop?
    
    For instance, if the outer CV loop outputs 90% mean accuracy, then is it correct to tell a client that, on average, 90% of the predictions from the final configuration (algorithm + learned parameters + hyperparameters) are true?
    
    Reply
    - Jason Brownlee September 6, 2020 at 6:03 am #
      
      Yes, the mean and standard deviation – e.g. the distribution of accuracy scores.
      
      Even better would be to give the bootstrap confidence interval for the final model:
      https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
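
As a rough illustration of the bootstrap idea, the sketch below refits a model on resampled data and scores it on the rows left out of each resample, then takes percentiles of the scores. The LogisticRegression stands in for whatever final configuration was chosen, and the 100 repeats and 95% level are assumptions; the linked tutorial is the reference for the method itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.utils import resample

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
indices = np.arange(len(X))

scores = []
for i in range(100):
    # draw a bootstrap sample of row indices (with replacement)
    boot_ix = resample(indices, replace=True, n_samples=len(indices), random_state=i)
    # rows never drawn form the out-of-bag evaluation set
    oob_ix = np.setdiff1d(indices, boot_ix)
    model = LogisticRegression(max_iter=1000)  # placeholder for the chosen final config
    model.fit(X[boot_ix], y[boot_ix])
    scores.append(accuracy_score(y[oob_ix], model.predict(X[oob_ix])))

# percentile interval over the bootstrap scores
lower, upper = np.percentile(scores, [2.5, 97.5])
print("95%% interval: %.3f to %.3f" % (lower, upper))
```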
      
      Reply
      - David Verbiest March 6, 2021 at 9:43 am #
        
        Thank you for your great article. Two questions.
        
        1)
        Concerning the generation of confidence intervals: how would you go about doing that? I went through the material you provided and get the principle. Would you then drop the outer cross-validation? I thought you could use bootstrapping to estimate a parameter, in our case the accuracy. Or would there be another step after the outer cross-validation for generating the confidence interval?
        
        2)
        Somewhere above you mention the following: “You can use the best parameters to define the final model, then fit it on all available training data.”
        Instead of selecting the best set of hyperparameters found from cross-validation on all the data, could you not use all the best models resulting from the inner cross-validation and take an ensemble of them? In the end, these different sets of hyperparameters and their respective performances are what is used to estimate the accuracy. The best set of hyperparameters could still be a fluke.
      - Jason Brownlee March 6, 2021 at 2:09 pm #
        
        You’re welcome!
        
        I would recommend the bootstrap method for calculating confidence intervals for a chosen model generally:
        https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
        
        Not quite, the model is chosen by the outer loop based on averaged performance not one time performance. A fluke is less likely. Some variance is expected though.
Lada September 13, 2020 at 8:29 pm #

Hi Jason, thanks a lot for this great tutorial! But I’d like to ask whether there is, in principle, a need for cross validation for such models as random forest that use bagging. E.g., from James et al.: An Introduction to Statistical Learning with Applications in R (p. 317-318):

“It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.
…
The resulting OOB (out-of-bag) error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.”

What do you think, can random forest be trained and hyper parameters tuned without CV?

Reply
- Jason Brownlee September 14, 2020 at 6:47 am #
  
  Yes, you can use OOB for the model if it is standalone.
  
  For comparing models, a systematic and consistent test harness is required.
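
For the standalone case, scikit-learn's random forest can report the out-of-bag estimate directly. This is a minimal sketch on a synthetic dataset; the sizes and number of trees are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_redundant=10, random_state=1)

# oob_score=True scores each row using only the trees that did not see it
model = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True,
                               random_state=1)
model.fit(X, y)
oob_accuracy = model.oob_score_
print("OOB accuracy: %.3f" % oob_accuracy)
```

Note that the OOB estimate exists only for bagged models, so it cannot serve as the shared test harness when a random forest is compared against other algorithms.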
  
  Reply
Edvin September 28, 2020 at 1:58 pm #

How do you select the best model in the outer_cv? I do not see the point of averaging the accuracy of models with different hyperparameters. Also, the best_estimator_ only gives the best model for the specific inner fold, since basically we are instantiating a new search for each inner fold.

Reply
- Jason Brownlee September 28, 2020 at 5:04 pm #
  
  You don’t, the models are discarded. Once you choose a “final” model/pipeline you can fit it on your available data and start making predictions:
  https://machinelearningmastery.com/train-final-machine-learning-model/
  
  Reply
Chahine October 5, 2020 at 9:45 pm #

Hi Jason,

Great Article !!

Are we running the inner loop on the entire dataset split into train|test here?

Shouldn’t we hide 10% of the data from the inner loop and use the entire dataset in the outer loop only?

Reply
- Jason Brownlee October 6, 2020 at 6:50 am #
  
  The outer loop operates on all data, inner loop operates on the training set of one pass from the outer loop.
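
The point that the inner loop never sees the outer test fold can be verified with indices alone. The sketch below uses a toy array and fold counts chosen only for illustration, and counts how many outer test rows show up inside any inner split.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(40).reshape(20, 2)  # 20 toy rows
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

leaks = 0
for train_ix, test_ix in cv_outer.split(X):
    X_train = X[train_ix]
    for inner_train, inner_val in cv_inner.split(X_train):
        # inner indices are positions within X_train; map them back to rows of X
        rows_seen = np.concatenate([train_ix[inner_train], train_ix[inner_val]])
        # count outer test rows that appear anywhere in the inner split
        leaks += len(np.intersect1d(rows_seen, test_ix))
print(leaks)
```

The outer test fold is already held back from the inner loop on every pass, which is why no additional hold-out is needed.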
  
  Why hold back data?
  
  Reply
Virgilio October 6, 2020 at 10:08 pm #

Dear Dr. Brownlee,

Thank you for this very detailed and interesting article.

I have just one question. When you describe the procedure to obtain the final model, you mention: “We know how well it will perform on average based on the score provided during the final model tuning procedure”.

However, I guess we cannot use this score, since it would be optimistically biased (hyperparameters were tuned). In order to estimate the final model performance, it seems more appropriate to use the unbiased score previously obtained with the nested cross-validation.

Does that make sense?

Thank you again and keep up the great work. Best regards.

Reply
- Jason Brownlee October 7, 2020 at 6:46 am #
  
  We can, as the estimate was based on the outer loop.
  
  This is the whole point, that the model tuning is now considered part of the model itself – e.g. the modeling pipeline.
  
  Reply
Chahine October 7, 2020 at 3:49 am #

Hi Jason,

I am trying to apply your steps to my data. I have a pipeline containing tf-idf, LDA and a LogisticRegression model. My input data is a list of documents, X = ['text', 'text', …], y = [0, 1, …], and I am trying to optimize different parameters of the pipeline components using nested CV.

How can I use your code here to split the data, assuming that the data needs to be a numpy array?

for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]

Thank you

Reply
- Jason Brownlee October 7, 2020 at 6:48 am #
  
  Yes, numpy arrays.
  
  Perhaps try using the all in one solution at the end of the tutorial instead?
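
A sketch of the all-in-one approach on text is given below. The twelve documents are made up, the LDA step from the question is left out to keep the example short, and StratifiedKFold is used instead of KFold so that every tiny fold contains both classes; all of these are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# made-up corpus: six sport documents (0) and six finance documents (1)
docs = [
    "the match ended with a late goal", "the striker scored twice in the match",
    "the team won the cup final", "the coach praised the team defence",
    "a late goal won the league match", "the goalkeeper saved the penalty",
    "the bank raised interest rates", "shares fell as the market closed",
    "the market rallied on bank earnings", "investors sold shares after the report",
    "interest rates hit the housing market", "the bank reported record earnings",
]
# a 1-D object array of strings can be indexed as X[train_ix] (no second axis)
X = np.array(docs, dtype=object)
y = np.array([0] * 6 + [1] * 6)

pipe = Pipeline(steps=[("tfidf", TfidfVectorizer()),
                       ("model", LogisticRegression(max_iter=1000))])
space = {"tfidf__ngram_range": [(1, 1), (1, 2)], "model__C": [1.0, 10.0]}
cv_inner = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
cv_outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# the vectorizer is refit on each training fold, so no vocabulary leaks from test folds
search = GridSearchCV(pipe, space, scoring="accuracy", cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring="accuracy", cv=cv_outer)
print("Accuracy: %.3f (%.3f)" % (scores.mean(), scores.std()))
```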
  
  Reply
  - Chahine Badis October 15, 2020 at 12:56 am #
    
    Hi Jason ,
    
    Can you please elaborate on how exactly cross_val_score works? It is hard to understand how the model is going to be fitted. Is there any reference, please? I can’t find an explanation in the sklearn documentation.
    
    Million Thanks
    
    Reply
    - Jason Brownlee October 15, 2020 at 6:13 am #
      
      It evaluates the model using cross validation and reports the results.
      
      You can learn more here:
      https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
      
      If cross-validation is new for you, start here:
      https://machinelearningmastery.com/k-fold-cross-validation/
      
      Reply
Karthik Mamudur October 31, 2020 at 9:26 am #

Hello Jason,

If I have to compare which algorithm (RandomForest,SomeOthermodel-1,SomeOthermodel-2) is the best, do I just repeat the nested cv with each of the above algorithms and take a decision based on the various nested cv accuracies and standard deviations?

Thank you,
Karthik

Reply
- Jason Brownlee October 31, 2020 at 9:35 am #
  
  Yes, compare the mean performance from each technique.
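
A comparison along those lines might look like the sketch below, where each candidate gets its own grid but shares the same outer folds. The two algorithms, their grids, and the dataset are placeholders chosen for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)

# each candidate is a (model, search space) pair; names are hypothetical labels
candidates = {
    "random_forest": (RandomForestClassifier(random_state=1),
                      {"n_estimators": [10, 50], "max_features": [2, 4]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
}

summary = {}
for name, (model, space) in candidates.items():
    search = GridSearchCV(model, space, scoring="accuracy", cv=cv_inner, refit=True)
    # identical outer folds for every candidate keep the comparison like for like
    scores = cross_val_score(search, X, y, scoring="accuracy", cv=cv_outer)
    summary[name] = (mean(scores), std(scores))
    print("%s: %.3f (%.3f)" % (name, mean(scores), std(scores)))
```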
  
  Reply
  - Karthik October 31, 2020 at 3:34 pm #
    
    Thank you!
    
    Reply
    - Jason Brownlee November 1, 2020 at 7:30 am #
      
      You’re welcome.
      
      Reply
Payal Goyal November 17, 2020 at 4:11 am #

Hi Jason Brownlee,

Thanks for the great explanation. But I have 1 doubt: Shouldn’t we use 1 more for loop in inner iterations as we did for outer loop?

Reply
- Jason Brownlee November 17, 2020 at 6:34 am #
  
  We do in the final example, it is just handled for us by the gridsearch.
  
  Reply
Emin November 23, 2020 at 9:07 am #

Dear Jason,
Thank you for the post. I have two questions.
1) First one is about pre-processing techniques (scaling, imputation, etc.) and nested cross validation. Can nested cross validation be used to decide on a pre-processing technique, say to make a decision about the imputation techniques mean vs. median vs regression?
2) Is it possible to generalize nested cross validation to other resampling techniques? For example, “nested repeated train-test splits” where both outer and inner validations use repeated random train-test splits. If so, do you know any academic references to the topic? If I understand them correctly, the most commonly used nested resampling technique in scientific publications is single train-test split for the outer validation and a regular k-fold cross-validation for the inner validation.
Best,
Emin

Reply
- Jason Brownlee November 23, 2020 at 10:53 am #
  
  I don’t think nested cv helps with choosing data preparation, perhaps this will help:
  https://machinelearningmastery.com/grid-search-data-preparation-techniques/
  
  Yes, I don’t see why the approach would not generalize.
  
  I doubt it is discussed in research papers – it seems too elementary. You can search for references on scholar.google.com
  
  Reply
Emin November 23, 2020 at 12:14 pm #

Thank you

Reply
- Jason Brownlee November 23, 2020 at 1:25 pm #
  
  You’re welcome.
  
  Reply
Ankush Jamthikar November 25, 2020 at 5:08 pm #

Hi, I am new to machine learning. I was reading this post and I must say it is very well written and clearly explained. I just have one question. I want to deploy my ML model. For this, I need to save the best model. In nested cross-validation, which model should we deploy? Because in each fold, we are getting a new model with a new set of hyper parameters.

Reply
- Jason Brownlee November 26, 2020 at 6:29 am #
  
  Thanks!
  
  One solution would be to use the grid search cv model and drop the outer loop.
  
  Another solution would be to inspect the result from the grid search cv model and use the specific configuration chosen.
  
  Reply
  - Ankush Jamthikar November 26, 2020 at 7:53 pm #
    
    Hi Dr. Brownlee,
    
    Thank you for your quick reply! I am not sure I fully understood your answer. However, I was reading some of the previous comments and I figured the following points.
    
    1) Nested cross-validation does not provide a final model. It is just a way to check (or evaluate) the performance of an ML algorithm on different independent training and testing datasets. The averaged performance metrics that come out of cross-validation describe the overall behavior of the model, which we can report to the client.
    
    2) After nested cross-validation, if we want to give the best model to someone (say client), then we need to (a) use the whole dataset, (b) run hyperparameter optimization, (c) get the best parameter, and (d) retrain the fresh ML algorithm with the best parameters to save the ML model as the “Final model.”
    
    Is this correct?
    
    Reply
    - Jason Brownlee November 27, 2020 at 6:37 am #
      
      Yes, that approach is valid and what I would do.
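
Steps (a) to (d) above can be sketched as follows. The dataset, model, and grid are illustrative assumptions, and pickle is used for saving only because it ships with Python.

```python
import pickle
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# (a) use the whole dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# (b) run hyperparameter optimization on all of it
model = RandomForestClassifier(random_state=1)
space = {"n_estimators": [10, 50], "max_features": [2, 4]}
cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(model, space, scoring="accuracy", cv=cv, refit=True)
search.fit(X, y)

# (c) get the best parameters
best_params = search.best_params_

# (d) retrain a fresh copy with those parameters and save it as the final model
final_model = clone(model).set_params(**best_params).fit(X, y)
saved = pickle.dumps(final_model)
restored = pickle.loads(saved)
print(best_params, restored.predict(X[:5]))
```

With refit=True, search.best_estimator_ is already a model refit on all the data with the best parameters, so step (d) can be replaced by saving that object.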
      
      Reply
      - Ankush Jamthikar November 29, 2020 at 12:19 am #
        
        Dear Dr. Brownlee,
        
        Thank you very much for validating my thought process and clarifying the doubts. I would be troubling you once again after reading all your articles/blogs. I hope you will clarify my future doubts as well.
        
        Thank you once again,
        Ankush
      - Jason Brownlee November 29, 2020 at 8:14 am #
        
        You’re welcome!
      - JLH December 14, 2020 at 3:19 am #
        
        Don’t you run the risk here again of having the hyperparam search overfit again? Seems the nested CV shows that it’s possible to have a good model, but how do you guarantee (or at least maximize the chance) of actually getting a good single model at the end that you can use? Thanks!

Edward MB December 30, 2020 at 12:18 am #

Hi, thank you very much for this article – it has been very useful. How though, please, can I expand on this by including recursive feature elimination so that not only will the process find the best hyperparameters but also the most optimal features (i.e. the best hyperparameter and feature combination)?

Jason Brownlee December 30, 2020 at 6:39 am #

An RFECV will automatically configure itself. You can use it in a pipeline prior to your model.

Or use a pipeline with a grid search on the RFE hyperparameter.

This tutorial will show you how to use RFE in a pipeline:
https://machinelearningmastery.com/rfe-feature-selection-in-python/
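
As a rough sketch of the first option, the snippet below places RFECV ahead of the model in a Pipeline and evaluates the whole pipeline with an outer cross-validation. The LogisticRegression estimators, fold counts, and dataset are assumptions for illustration.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

# RFECV picks the number of features itself using its own internal CV
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=3, scoring="accuracy")
pipe = Pipeline(steps=[("rfecv", rfecv), ("model", LogisticRegression(max_iter=1000))])

# outer loop: feature selection is redone on each training fold only
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=cv_outer)
print("Accuracy: %.3f (%.3f)" % (mean(scores), std(scores)))
```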

Edward MB January 1, 2021 at 2:18 am #

Thank you for your reply. Using the explicit outer-inner loop above, I think I achieve this using the below.

...
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    #configure pipeline
    model = LogisticRegression()
    rfe = RFE(estimator=LogisticRegression())
    pipe = Pipeline(steps=[('rfe', rfe), ('model', model)])
    # hyperparameter options
    n_features_to_select_options = list(range(1, X.shape[1] + 1))
    c_options = [1, 10, 100, 1000]
    solver_options = ['newton-cg', 'lbfgs', 'liblinear']
    penalty = ['l2']
    #parameter grid
    param_grid = [
        {
        'rfe__n_features_to_select': n_features_to_select_options,
        'model__C': c_options,
        'model__solver': solver_options,
        'model__penalty': penalty
        }
    ]
    # define search
    search = GridSearchCV(pipe, n_jobs=1, param_grid=param_grid, cv=cv_inner, refit=True)    
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)
    # evaluate the model
    accuracy = accuracy_score(y_test, yhat)
    # store the result
    outer_results.append(accuracy)
    # report progress
    print('>accuracy=%.3f, est=%.3f, cfg=%s' % (accuracy, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))
...

Jason Brownlee January 1, 2021 at 5:31 am #

I’m happy to hear you are making progress.

Sorry, I don’t have the capacity to review your code.

Reply
Felipe Araya February 3, 2021 at 1:42 am #

Hello Edward,

This code looks great! I have been looking for something like that on Stack Overflow and similar sites, but this one seems the neatest to me, particularly when it comes to implementing RFE and GridSearchCV together. One question: why did you use RFE instead of RFECV?

Thanks!

Reply

Felipe Araya January 28, 2021 at 1:18 am #

Hello!

Awesome tutorial as always!

Just have a couple of questions if you don’t mind:

1. In the section “How Do You Configure the Final Model?”, Step 2 seems identical to Step 4 to me, of course it is probably not, would you mind clarifying that, please?

2. What if I wanted to include data preparation in the code you showed above (such as scaling or feature categorisation)? Would I have to put that within the outer loop, the inner loop, or outside of the loops?

2.1 If I had to put it within one of the loops, how can I do that if I use “cross_val_score”? I guess I can only use the “split()” method, right?

Reply
- Jason Brownlee January 28, 2021 at 6:06 am #
  
  Yes, step 4 is not needed if you use gridsearchcv.best_estimator_
  
  You can use a pipeline for the “model” and include any data prep steps you like.
  
  I don’t understand your last question, sorry. Perhaps you can elaborate?
  
  Reply
  - Felipe Araya January 28, 2021 at 6:56 am #
    
    Hi Dr, Jason
    
    Thanks for answering my first question.
    
    Sure. The second question has to do with data leakage. For example, it is often said that feature manipulation, scaling, or any other type of preprocessing should be done separately for the train and test sets to prevent data leakage. Since the outer loop represents the “test set” and the inner loop represents the “training + validation” set, I was wondering where to incorporate the preprocessing: within the outer loop, within the inner loop, or outside both. Does that make sense?
    
    Reply
    - Jason Brownlee January 28, 2021 at 7:59 am #
      
      Define a pipeline and let it handle the procedure correctly: the data preparation methods are fit on the training set only and then applied to both the training and test sets. This applies all the way down into the iterations of the grid search.
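
A minimal sketch of the pipeline advice above, assuming a synthetic dataset, a `StandardScaler` step, and a small grid over `C` (none of these come from the tutorial; they are placeholders for illustration). Because the scaler is a step of the pipeline, it is re-fit on the training portion of every inner and outer split, and `cross_val_score` can drive the outer loop without a manual `split()`.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# the scaler is part of the pipeline, so it only ever sees training rows when fitting
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('model', LogisticRegression(max_iter=1000))])

# inner loop tunes the pipeline, outer loop evaluates the tuned pipeline
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# hypothetical grid; pipeline step parameters use the 'step__param' naming
search = GridSearchCV(pipe, {'model__C': [0.1, 1, 10]}, scoring='accuracy',
                      cv=cv_inner, refit=True)

# each outer fold re-runs the whole search (scaling included) on its training part
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
```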
      
      Reply
Felipe Araya February 3, 2021 at 1:54 am #

Hi Jason,

Quick question: can you use RFECV within a GridSearchCV, or should you use normal RFE? My logic is that within the inner CV that GridSearchCV generates (essentially the train+validation set) there will be another CV happening from RFECV, so I am not sure if this makes much sense, or if I am better off using RFE evaluated for range(1, len(features)+1) within the GridSearch.

Reply
- Jason Brownlee February 3, 2021 at 6:22 am #
  
  You can use either, it’s up to you.
  
  Reply
Jens February 22, 2021 at 12:28 am #

Dear Jason Brownlee,

I have a question regarding the final model. Let’s say we performed a nested CV with an inner and outer CV of 5 folds. If we want the final model, we again perform, for example, a random search on the whole (training) data and get the best model from this search.

Should the number of folds for getting the final model be equal to the inner CV or can it be different, say 10 folds? If it should be the same, why is that?

Kind regards,

Jens

Reply
- Jason Brownlee February 22, 2021 at 5:02 am #
  
  The number of folds for the inner loop when training the final model should match what was used during model selection – to be consistent with the decision and comparisons made among models.
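
One way to sketch this, assuming the inner loop used 3 folds and a small random forest grid (the dataset, the grid values, and the name `INNER_SPLITS` are illustrative assumptions): the final search is run on the whole dataset with the same number of folds as the inner loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# same fold count as the inner loop used during nested CV (assumed to be 3 here)
INNER_SPLITS = 3
cv_final = KFold(n_splits=INNER_SPLITS, shuffle=True, random_state=1)

# hypothetical search space, kept small so the sketch runs quickly
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_final, refit=True)

# run the search on all available data; refit=True fits the winner on all of it
search.fit(X, y)
final_model = search.best_estimator_
print(search.best_params_)
```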
  
  Reply
Giovanna February 25, 2021 at 12:24 pm #

would this technique be equivalent to using RepeatedKFold?

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
model_cv_fit = cross_val_score(model, features.values, y.values, cv=cv)

Reply
- Jason Brownlee February 26, 2021 at 4:53 am #
  
  That is an example of repeated cv.
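
To make the difference concrete, here is a small sketch under assumed settings (synthetic data, logistic regression, and a grid over `C` are placeholders). Repeated k-fold evaluates one fixed configuration; the procedure only becomes nested when the object being evaluated is itself a search that tunes on each training split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold, cross_val_score

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)
model = LogisticRegression(max_iter=1000)

# repeated k-fold: 5 folds x 2 repeats, one fixed configuration, no tuning
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=1)
repeated_scores = cross_val_score(model, X, y, cv=cv)

# nested: the estimator passed in is a search, so tuning happens
# inside every training split produced by the outer procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(model, {'C': [0.1, 1, 10]}, cv=cv_inner)
nested_scores = cross_val_score(search, X, y, cv=cv)

print(len(repeated_scores), len(nested_scores))
```

Note that a repeated procedure can still serve as the outer loop, as it does in the second call.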
  
  Reply
Pablo Reinhardt March 25, 2021 at 8:44 pm #

Dear Jason,
when I do stacking, do I treat the stacking model just as “a model” like in your example, or is there something I have to consider?
Thanks for your help, I love your work here!
Cheers
Reini

Reply
- Jason Brownlee March 26, 2021 at 6:23 am #
  
  No, it’s just a model fit on data, it just so happens that the inputs in the training data are outputs of other models.
  
  Reply
Reiner April 15, 2021 at 8:10 am #

Hi dear Jason,

would it be possible to replace the line:
model = RandomForestClassifier

… with a stacking classifier:
stacking = StackingClassifier(estimators=models) where models is a list of classifiers ?

if so, how can you combine nested CV with a two layer classifier ?

Reply
- Jason Brownlee April 16, 2021 at 5:27 am #
  
  Sure.
  
  Ouch, that might require some thinking. I can’t give you a good answer off the cuff, sorry.
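
One possible sketch, not a definitive answer: treat the stacking ensemble as a single estimator and hand it to the inner search like any other model. The base models, the grid, and the fold counts below are assumptions chosen to keep the example small. Note that `StackingClassifier` runs its own internal cross-validation (the `cv` argument) to build the meta-features, so there are effectively three levels of splitting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)

# level-0 models and a level-1 meta-model (illustrative choices)
models = [('knn', KNeighborsClassifier()),
          ('tree', DecisionTreeClassifier(random_state=1))]
stacking = StackingClassifier(estimators=models,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=3)

# hypothetical grid: tune one base-model parameter and one meta-model parameter
space = {'knn__n_neighbors': [3, 7], 'final_estimator__C': [0.1, 1.0]}

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)

# the inner search tunes the ensemble, the outer loop evaluates the tuned ensemble
search = GridSearchCV(stacking, space, scoring='accuracy', cv=cv_inner, refit=True)
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
print(scores.mean())
```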
  
  Reply
Kwa June 4, 2021 at 1:44 pm #

Can I say the final accuracy is equal to the accuracy of an independent test set?

Reply
- Jason Brownlee June 5, 2021 at 5:23 am #
  
  The final accuracy would be the mean over all runs.
  
  Reply
Tim Stack July 7, 2021 at 12:01 am #

Dear Jason,

At one point in the code, your comment says that the best-performing model from the grid search is subsequently trained (fitted) on the outer loop training data.

# define search
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

However, it would appear to me that the best model from the inner loop is selected and then used on the outer loop test set, without any training on the outer loop preceding that step.

In order to train the inner loop model on the outer loop set, shouldn’t a line be added?

# define search
search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model
best_model = result.best_estimator_
# fit on the whole training set
best_model.fit(X_train, y_train)
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)

Reply
- Jason Brownlee July 7, 2021 at 5:34 am #
  
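  Yes, the best configuration found by the inner loop is evaluated on the hold-out fold of the outer loop. Because refit=True, GridSearchCV has already refit that configuration on all of X_train, so no extra call to fit() is needed.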
  
  If you want to use the best model as determined by the outer loop, that is about “finalizing a model”, see the section on that.
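
A small check of what `refit=True` does, sketched under assumed settings (synthetic data, a single train/test split standing in for one outer fold, and a small grid). A copy of the best estimator that is fit again by hand on the same training data gives the same predictions, because `GridSearchCV` has already fit `best_estimator_` on all of `X_train`.

```python
from numpy import array_equal
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# synthetic data and one split standing in for a single outer fold (assumed)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}  # hypothetical grid
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)
result = search.fit(X_train, y_train)

# already fit on the whole of X_train because refit=True
best_model = result.best_estimator_

# an unfitted copy with the same hyperparameters, fit again by hand
manual = clone(best_model).fit(X_train, y_train)

# fixed random_state makes both fits deterministic, so the predictions agree
same = array_equal(best_model.predict(X_test), manual.predict(X_test))
print(same)
```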
  
  Reply
Vinay July 8, 2021 at 9:59 pm #

Hi Jason,

I really appreciate your work and contribution. I need to clarify just one thing, once we obtain the best parameter, should I fit the final model using cv or without? If yes, should cv = ‘inner cv’ or ‘outer cv’?

Thanks in advance.

Reply
- Jason Brownlee July 9, 2021 at 5:08 am #
  
  In nested CV, you would use an inner CV process to find the best hyperparameters for the model and then fit a final model with those hyperparameters. GridSearchCV does this for you if you want, or you can do it manually.
  
  Reply
Vinay July 12, 2021 at 3:51 pm #

Thanks for your reply Jason. I have one more doubt (in general), while fitting the model with the best parameters how should I decide upon which random_state will give me a more reliable model for prediction with high accuracy. I tried to iterate over 100 random_states out of which 7 accuracy values were found to be unique ranging from 0.5 to 0.76. Should I choose the random_state which gives highest or median accuracy to fit the model?

Reply
- Jason Brownlee July 13, 2021 at 5:15 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/what-value-should-i-set-for-the-random-number-seed
  
  Reply
kukushiwo September 21, 2021 at 12:57 pm #

Hi, Jason,
Thank you for the post.
I want to know which data should be used to draw a learning curve (loss vs. iterations) when performing nested CV: the training and validation sets of the inner loop, or the training and test sets of the outer loop?

Reply
- Adrian Tam September 23, 2021 at 2:59 am #
  
  When you talk about learning curve, you are plotting the “score” (whatever you decided to use) against the “number of iterations”. So in the middle of an iteration, before the model is trained, you should be able to find a score. What can that be? If the iteration is about the inner loop, of course that score is from the validation set.
  
  Reply
  - kukushiwo September 23, 2021 at 4:34 pm #
    
    So, if the iteration is about the outer loop, score is from the test data set, is it right?
    
    But your post about the learning curves, in which you replied to other in the comment as follows,
    “Jason Brownlee July 4, 2019 at 7:52 am #
    Typically just train and validation sets.”
    
    You can find this at the following link.
    https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
    
    Does it mean that we cannot draw a learning curve of outer loop using training data set and test data set?
    
    Reply
    - Adrian Tam September 24, 2021 at 4:56 am #
      
      I don’t think you should have a learning curve for outer loop because the outer loop is about different folds of cross validation. You can draw a bar chart for that to see whether the different folds produce similar score, however.
      
      Reply
      - kukushiwo September 24, 2021 at 9:49 am #
        
        But inner loop is also about different folds of cross validation, so if I want to draw learning curve for inner loop, which fold of validation set of the inner loop should I use, and which hyperparameters should I use?
      - Adrian Tam September 25, 2021 at 4:12 am #
        
        Each fold is a particular partitioning of the dataset. For the model, you start from iteration 0, where everything is random, and go up to iteration N, where it starts to churn out meaningful results. What you want to plot is, for one particular partitioning and one particular model, how the score changes with the iterations. So simply, in k-fold, you get k curves.
kukushiwo September 27, 2021 at 12:34 pm #

‘Each fold is a particular partitioning of the dataset. For the model, you start from iteration 0, where everything is random, and go up to iteration N, where it starts to churn out meaningful results. What you want to plot is, for one particular partitioning and one particular model, how the score changes with the iterations. So simply, in k-fold, you get k curves.’

After I got the best hyperparameters by running the inner loop, I should use the best hyperparameters to draw learning curves(k curves) of each fold(k folds in all) of the inner loop, am I right?

Reply
- Adrian Tam September 28, 2021 at 8:45 am #
  
  Right. Each learning curve is for one model from its random state to trained state. You shouldn’t mix up different models in the same curve.
  
  Reply
  - kukushiwo September 28, 2021 at 3:50 pm #
    
    Thank you very much!
    It’s very helpful.
    
    Please let me ask one more question about feature selection when we use Nested CV.
    
    We should train on all the data we have, but use the accuracy (or F1, ROC AUC, etc.) obtained by running the outer loop to see how good the model is. Which data set should we use to do the feature selection?
    
    XGBoost’s built-in feature importance is calculated on the training set. Since we give the customer the model trained on all the data we have, does it mean we should use all the data to do feature selection?
    
    Reply
    - Adrian Tam September 29, 2021 at 11:51 pm #
      
      Please see https://machinelearningmastery.com/training-validation-test-split-and-cross-validation-done-right/ for the concept of what CV is supposed to do.
      
      Reply
Lamin October 25, 2021 at 4:36 am #

Hello, first of all thanks a lot for your tutorials! They help me a lot. I have got three questions:

What is the difference between hyperparameter tuning and model selection? I thought it meant the same thing.

Can we consider the outcome of the automatic nested cross-validation routine as an estimate for its generalization performance? I noticed, that you ran the nested cross-validation on the complete dataset.

Also, if I needed to scale the data first, would it suffice to scale the whole dataset beforehand, or would it be better to somehow build a pipeline with the scaler and the classifier and then feed the pipeline to GridSearchCV?

Reply
- Adrian Tam October 27, 2021 at 2:14 am #
  
  1. Model selection is done first, before you have hyperparameters to tune. If you picked kNN, you have the hyperparameter k; but if you picked a neural network, k doesn’t apply.
  2. Yes, that’s the point of cross-validation: using a limited dataset to estimate the general performance on unseen data.
  3. Building a pipeline so that the scaling is done inside cross-validation is better. You should not see the validation part of the data when you fit the model, so why should you see it when scaling?
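
As a rough sketch of how the two ideas fit together (the two candidate algorithms, their grids, and the synthetic data are assumptions for illustration): each candidate algorithm gets its own inner search for hyperparameter tuning, and the outer-loop mean score of each tuned procedure is what gets compared for model selection.

```python
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# candidate algorithms, each with its own (hypothetical) hyperparameter grid
candidates = {
    'knn': (KNeighborsClassifier(), {'n_neighbors': [3, 5, 9]}),
    'lr': (LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]}),
}

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)

results = {}
for name, (model, space) in candidates.items():
    # inner loop: hyperparameter tuning for this algorithm
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # outer loop: estimate how well "this algorithm plus tuning" generalizes
    scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer)
    results[name] = mean(scores)
    print('%s: %.3f' % (name, results[name]))

# model selection: compare the outer-loop estimates across algorithms
best_name = max(results, key=results.get)
print('selected:', best_name)
```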
  
  Reply
Naka November 5, 2021 at 7:59 am #

Hello,

thank you very much for the tutorial. I would like to ask: if we are looking for hyperparameters of a model using the whole dataset, how to validate the trained model on the holdout set (test set) so, that no leak occurs? Because we have cv_inner=3, where we are tuning the hyperparameters. And then when training the model with cross-validation we have cv_outer=10, so the data points overlap in cv_inner and cv_outer and the model has tuned hyperparameters on the data it shouldn’t have seen. I hope my comment makes sense.

Reply
- Adrian Tam November 7, 2021 at 8:03 am #
  
  The simple answer is to not share data between your training data and the validation (scoring) data. This includes the preprocessing step (e.g., scaling).
  
  Reply
nocibambi November 19, 2021 at 10:04 pm #

> “Specifically, an often noisy model performance score.”

Possibly a word is missing from here?

Reply
- Adrian Tam November 20, 2021 at 1:58 am #
  
  Yes, I corrected that sentence. Thanks for pointing that out.
  
  Reply
IASONAS CHRISTOULAKIS December 21, 2021 at 7:46 am #

Nested Cross Validation is needed much computational effort to outperform for large -scale datasets. Could you suggest practicing another method?

Reply
- James Carmichael February 15, 2022 at 12:47 pm #
  
  Hi Iasonas…I am not following your question as it is currently stated. Please clarify so that I may better assist you.
  
  Reply
Michael January 21, 2022 at 6:17 am #

Jason, thank you very much for all of your amazing articles. I have been trying for over a week now to understand the whole concept behind nested CV. My main problem is that I can in no way understand why the grid search without nested CV causes bias.
For example, if I’m trying to find the optimal value of C in an SVM and I provide GridSearchCV with the possible values [1,10,100], then GridSearchCV will fit the SVM on each of the possible values provided and perform CV. I will get back 3 average scores from CV, one for each value I provided to the grid search. Why are these scores considered biased?
But for argument’s sake, let’s say this is a biased estimate. Again, wouldn’t it be easier to just perform CV on just the winning model this time?

Reply
- James Carmichael February 21, 2022 at 2:05 pm #
  
  Hi Michael…This is a step by step guide to nested CV that may prove beneficial.
  
  https://www.analyticsvidhya.com/blog/2021/03/a-step-by-step-guide-to-nested-cross-validation/
  
  Reply
Rafael Araujo February 19, 2022 at 6:23 am #

Hello James, I have a question:

– In doing the nested CV we are dealing with many more models, we can think of each model with different hyperparameters as a different model, and our GOAL with NESTED CV is to find the best model:

ex:

Lr (c=1), Lr(c=2), Lr(c=3), Rf(n_estimator=4), Rf(n_estimator=53) etc…

Each one can be viewed as a different model in the context of choosing a final one for our predictions.

So with that in mind, in the non-automatic version of the code you displayed the accuracies and the HyperParameters of the RandomForest:

>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}
>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}
>acc=0.900, est=0.937, cfg={'max_features': 4, 'n_estimators': 500}
Accuracy: 0.927 (0.019)

In this context, it makes sense that the final model would be the RF ( ‘max_features’: 4, ‘n_estimators’: 500) since it had the biggest accuracy.

But the automatic one simply shows us the average accuracy. I understand that the accuracy already takes the hyperparameters into consideration, but isn’t the main purpose of doing this that I KNOW the hyperparameters, so that I can apply them when I train the model with ALL the TRAINING data?

I’m doing a classification project in which I must choose between (L.R(), NB(), KNN()), I was only doing CV but now I’m learning hyperparameter tuning and decided to implement it after reading this article… But the automatic code only tells me what I already knew… That KNN() is the best model, it does not tell me which “version” of KNN() is the best when I fit ALL the training data.

Hope you can understand. I love this website; it is my guide for anything ML.

Thanks for the attention!

Reply
- James Carmichael February 19, 2022 at 12:51 pm #
  
  Hi Rafael…I would highly recommend that you investigate optimization techniques since you are considering various model types and each have a variety of parameters that can be tuned.
  
  https://machinelearningmastery.com/why-optimization-is-important-in-machine-learning/
  
  https://machinelearningmastery.com/combined-algorithm-selection-and-hyperparameter-optimization/
  
  Reply
Rafael Araujo February 22, 2022 at 6:05 am #

Thanks for the articles!! But my question still remains, so I’ll rewrite it:

After discovering in nested CV, the BEST HyperParameters and the Best model (let’s suppose it’s L.R(C=2)) my model becomes:

model=LogisticRegression(C=2) ??

I ask this because the automatic code does not display the HP; it just displays that L.R is the best model, so it stays:

model=LogisticRegression()

Yes the nested CV tells me that between all the combinations of HP and Models, L.R is the best…But it should also tell me WHICH set of HP of that model is the BEST so i can use like

model=LogisticRegression(C=2)

The “What Configuration Was Chosen by Inner Loop?” part seems to tell me that the HP found are not important, but if they aren’t, how am I setting the best model with the best HP for the entire dataset?

So what I’ll do in my code is use the automatic version from this article (with reference, of course) and find a way to show not only the model with the best accuracy (in the example, L.R) but also which SET of HP from THAT model is the best (in the example, C=2).

Reply
Martin February 25, 2022 at 12:22 pm #

Dear Sir,

Could you explain how can we save the best model to predict external validation data?

Best regards,

Reply
- James Carmichael February 26, 2022 at 12:34 pm #
  
  Hi Martin…The following may be of interest to you:
  
  https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
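
A minimal sketch of saving and reloading a finalized model with the standard-library `pickle` module. The dataset, the grid, and the file name `final_model.pkl` are assumptions for illustration; the model is finalized here by running the search on all available data.

```python
import os
import pickle
import tempfile
from numpy import array_equal
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)

# finalize the model: run the search on all data, keep the refit winner
cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]},
                      scoring='accuracy', cv=cv, refit=True)
search.fit(X, y)
final_model = search.best_estimator_

# hypothetical file name, written to the system temp directory
path = os.path.join(tempfile.gettempdir(), 'final_model.pkl')
with open(path, 'wb') as f:
    pickle.dump(final_model, f)

# later, possibly in another script: load the model and predict on external data
with open(path, 'rb') as f:
    loaded = pickle.load(f)

# here X stands in for the external validation data
same = array_equal(final_model.predict(X), loaded.predict(X))
print(same)
```

Only unpickle files you trust, and load them with the same scikit-learn version that wrote them where possible.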
  
  Reply
Daniel June 30, 2022 at 6:25 am #

Hi Jason,
great article 🙂
I have a question: You say that “it [nested cv] selects a modeling pipeline that results in the best model”. But: Between which pipelines do we select here? In the shown example, there is just one pipeline which is cross-validated to find the best hyperparameters, and the best model is tested using the outer loop. Thus, there is no other pipeline shown here. Does “selecting a model pipeline” mean that we establish multiple such pipelines (e.g. one with RandomForest, one with NaiveBayes…), run a nested cv on each of them and use the average score of the outer loop to choose among them?

Reply
- James Carmichael June 30, 2022 at 11:58 am #
  
  Hi Daniel…The following resource related to best practices of utilizing pipelines may be of interest to you:
  
  https://machinelearningmastery.com/machine-learning-modeling-pipelines/
  
  Reply
Daniel July 1, 2022 at 3:12 am #

Hum, not really… my question is specifically related to the content of this article (nested Cross validation)…

Reply
Abhishek Kishore July 16, 2022 at 2:32 am #

Please change cv to cv_outer on line 3 of the very first code snippet. Thanks for explaining nested CV

Reply
- James Carmichael July 16, 2022 at 7:07 am #
  
  Thank you for the feedback Abhishek!
  
  Reply
skan July 17, 2022 at 5:24 am #

Hello.

Is there any function in R equivalent to “cross_val_score”?
I think caret, mlr3 and tidymodels don’t have it.
I’ve only found very long pieces of code that try to do it.

I’ve also found a package called nestedcv, I need to find out how to use it, for example for decision trees.

Reply
- James Carmichael July 17, 2022 at 11:22 am #
  
  Hi skan…the following may be of interest to you:
  
  https://www.geeksforgeeks.org/cross-validation-in-r-programming/
  
  Reply
Anthony July 22, 2022 at 7:57 am #

Grammar correction: The downside of not using an outer CV might be that some of the data ends up *not* being used

Reply
- James Carmichael July 22, 2022 at 8:07 am #
  
  Thank you for the feedback Anthony!
  
  Reply
NE November 11, 2022 at 11:32 pm #

Hello.
Do we still have nested cross-validation if inner_CV = outer_CV?

Reply
- James Carmichael November 12, 2022 at 8:54 am #
  
  Hi NE…The following resource may add clarity:
  
  https://www.analyticsvidhya.com/blog/2021/03/a-step-by-step-guide-to-nested-cross-validation/
  
  Reply
Matthew January 5, 2023 at 11:43 am #

I am a little confused on the actual data leakage problem itself. I don’t see clearly what’s wrong with using the same data for tuning and model selection, which is the problem we are trying to solve. The literal process of nested-cv makes sense though.

Reply
Markus March 6, 2023 at 12:33 am #

Hi

Am I right that the rationale behind nested cross-validation is to select the best model, with the best chosen hyperparameters along the way, based on the best accuracy?

If that is the case, why do the examples given here only go through the loop of hyperparameter tuning and not the model selection itself? As far as I can see, only one single model is being used, namely RandomForestClassifier.

Reply
Nasim March 11, 2023 at 5:20 am #

Hi Jason,

Any complete example|code for regression?

Reply
- James Carmichael March 11, 2023 at 7:53 am #
  
  Hi Nasim…The following is a great starting point:
  
  https://machinelearningmastery.com/training-a-multi-target-multilinear-regression-model-in-pytorch/
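
For a regression problem, the same nested structure can be sketched with a regressor and a regression metric. Everything specific below (synthetic data from `make_regression`, `Ridge`, the `alpha` grid, and mean absolute error as the metric) is an assumption for illustration rather than part of the tutorial.

```python
from numpy import mean, std
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# synthetic regression data, assumed for illustration only
X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=1)

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)

# inner loop: tune the regularization strength (hypothetical grid)
search = GridSearchCV(Ridge(), {'alpha': [0.01, 0.1, 1.0, 10.0]},
                      scoring='neg_mean_absolute_error', cv=cv_inner, refit=True)

# outer loop: scikit-learn reports errors as negative scores, so flip the sign
scores = cross_val_score(search, X, y, scoring='neg_mean_absolute_error', cv=cv_outer)
mae = -scores
print('MAE: %.3f (%.3f)' % (mean(mae), std(mae)))
```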
  
  Reply
Murilo April 13, 2023 at 5:56 am #

Hello, I fail to understand what ‘result.best_score_’ is here.

‘acc’ is the score obtained with the best model using the heldout test data from the outer cv, correct?

But what is this ‘result.best_score_’?
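
A short sketch of where the two numbers come from, using assumed data and a single train/test split standing in for one outer fold. In scikit-learn, `best_score_` is the mean cross-validated score of the winning configuration, computed on the inner folds only, while the accuracy printed next to it comes from the held-out outer data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# synthetic data and one split standing in for a single outer fold (assumed)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
space = {'n_estimators': [10, 50], 'max_features': [2, 4]}  # hypothetical grid
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)
result = search.fit(X_train, y_train)

# 'est': mean inner-fold score of the best configuration
est = result.best_score_
cv_mean = result.cv_results_['mean_test_score'][result.best_index_]

# 'acc': score of the refit best model on data the search never saw
acc = accuracy_score(y_test, result.best_estimator_.predict(X_test))

print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, est, result.best_params_))
print(est == cv_mean)
```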

Reply
Rahil May 4, 2023 at 10:30 pm #

Hello,

Thank you for the very useful blog.
I have a question regarding finding the optimal parameters when GridSearchCV is used inside cross_val_predict in nested k-fold cross validation. How can we get/see those optimal parameters in this case?

Reply
- James Carmichael May 6, 2023 at 10:19 am #
  
  Hi Rahil…The following resource provides application of the results from GridSearchCV:
  
  https://www.analyticsvidhya.com/blog/2021/06/tune-hyperparameters-with-gridsearchcv/
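
`cross_val_predict` returns only predictions, so the fitted searches are not available afterwards. One option, sketched here with assumed data and an assumed grid, is `cross_validate` with `return_estimator=True`, which keeps the fitted `GridSearchCV` object from each outer fold so that its `best_params_` can be read.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate

# synthetic data, assumed for illustration only
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1, 10]},
                      scoring='accuracy', cv=cv_inner, refit=True)

# keep the fitted search object from every outer fold
out = cross_validate(search, X, y, scoring='accuracy', cv=cv_outer,
                     return_estimator=True)

# one winning configuration per outer fold, alongside its outer-fold score
chosen = [est.best_params_ for est in out['estimator']]
for score, params in zip(out['test_score'], chosen):
    print('>acc=%.3f, cfg=%s' % (score, params))
```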
  
  Reply
Dmitry June 2, 2023 at 1:25 pm #

Sorry, read this tutorial, but still can’t get one thing about nested cv. You say
” The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. ”

Yet in the end we perform exactly this k-fold cv on the whole dataset to find best hyper parameters for the model. So what is the point of whole nested cv process if in the end we can overfit the whole dataset as you mention in this quote?

Reply
- James Carmichael June 3, 2023 at 11:35 am #
  
  Hi Dmitry…Your understanding is correct. In some cases overfitting is still possible. The following may be of interest:
  
  https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
  
  Reply
Marco June 7, 2023 at 4:32 am #

Hi, thanks for your useful blog.
I have a question regarding the number of folds used for the final model selection. If I use 7-fold cv for the outer loop and 6-fold cv for the inner loop, should I use 7 or 6 folds for the final model selection on the full dataset? Or could i use another number of folds like 5 or 10?
In case I need to use repeated cross validation for the final model selection would the whole process still be valid? And what’s again the right choice of k-fold in this case?

Reply
- James Carmichael June 7, 2023 at 1:13 pm #
  
  Hi Marco…You are very welcome! The following resource may be of interest to you:
  
  https://machinelearningmastery.com/how-to-configure-k-fold-cross-validation/
  
  Reply
  - Marco June 7, 2023 at 5:17 pm #
    
    Hi James, thanks for your answer, but my question is not about the general choice of k, but if this number must be equal between outer / inner / final cross-validation.
    
    Reply
Hiroro July 18, 2023 at 2:37 am #

Hi James,

Thanks for the great article! I want to clarify one thing: why is the overall accuracy of the model based on the average accuracy score of different models with different hyperparameters? To me, it makes more sense to first do hyperparameter tuning using nested CV and then, using the best parameters, train on the dataset and do cross-validation (for example, run the particular model with the best parameters 5 times using the 5-fold CV method) and get the average accuracy.

Reply
- James Carmichael July 18, 2023 at 7:38 am #
  
  Hi Hiroro…The following resource may be of interest to you:
  
  https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
  
  Reply
Joaquín February 7, 2024 at 8:58 am #

Hi, Jason:

Congratulations on your tutorials, I think you share a lot of valuable knowledge!

In theory, we use nested Cross-Validation for estimating with less bias the model performance on unseen data when we combine hyperparameter optimization with k-fold cross-validation, since the hyperparameter optimization may lead to an overfitting scenario.

My question is: Does this also occur when we provide the fit method of Keras with both training and validation data and monitor only the improvements in validation error using ModelCheckpoint class?

Thank you!

Reply
- James Carmichael February 7, 2024 at 10:41 am #
  
  Hi Joaquin…The following location is a great resource for better understanding model_checkpoint in Keras.
  
  https://keras.io/api/callbacks/model_checkpoint/
  
  Reply
George March 17, 2024 at 6:30 pm #

Hi Jason,

Highly appreciate your paramount effort in explaining all these complicated issues.

I have a question. Suppose I want to use Random Forest and I don’t care to check the performance of other algorithms.

Q: Can I use nested cross-validation for hyperparameter tuning?

In other words, to pick the hyperparameters that yield the highest accuracy? For example, in your data above that would be the last but one configuration:

>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}

Then use these hyperparameters for fitting a model to the entire dataset (X,y) and call for prediction in unseen data. Or, this is not correct?

Thanks!

Reply
- James Carmichael March 18, 2024 at 6:35 am #
  
  Hi George…Yes, you can use nested cross-validation for hyperparameter tuning, and your understanding of its application is correct. Nested cross-validation provides a robust way to select the best hyperparameters for your model while also giving an unbiased estimate of its performance on unseen data. Here’s how it works and why it fits your scenario:
  
  ### What is Nested Cross-Validation?
  
  Nested cross-validation consists of two layers of cross-validation:
  – The **outer loop** is used to assess the model’s performance (accuracy, in your case) and ensure that the evaluation of the model is unbiased.
  – The **inner loop** is used for hyperparameter tuning, selecting the best hyperparameters like max_features and n_estimators for the Random Forest model.
  
  ### Process
  
  1. **Inner Loop (Hyperparameter Tuning)**: In the inner loop, you perform hyperparameter tuning (e.g., using grid search or random search) to find the best hyperparameters. This step is performed on the training set that is split from the original dataset during each iteration of the outer loop.
  
  2. **Outer Loop (Performance Evaluation)**: The outer loop is used to evaluate the model’s performance with the best hyperparameters obtained from the inner loop. This provides an unbiased estimate of the model performance on unseen data.
  
  ### Your Scenario
  
  – You are interested in using Random Forest and want to select the hyperparameters that yield the highest accuracy.
  – By using nested cross-validation, you correctly aim to tune the hyperparameters (max_features and n_estimators in your case) within the inner loop.
  – After finding the best configuration (e.g., {'max_features': 2, 'n_estimators': 500} with acc=0.960 and est=0.926), you can indeed use these hyperparameters to fit a model to the entire dataset (X, y).
  – Finally, you use this model for predictions on unseen data.
  
  ### Advantages
  
  – **Unbiased Evaluation**: Nested cross-validation provides an unbiased estimate of the model performance, as the hyperparameter tuning is done independently in each iteration of the outer loop.
  – **Robust Hyperparameter Selection**: It helps in selecting the hyperparameters that generalize well across different subsets of the data, rather than fitting to a specific train-test split.
  
  ### Conclusion
  
  Your approach to using nested cross-validation for selecting hyperparameters and then fitting a model to the entire dataset for future predictions is correct and methodologically sound. This technique is particularly useful when you’re focused on a single model type and wish to optimize its configuration before finalizing the model.
  
  Reply
  - George March 18, 2024 at 7:23 pm #
    
    Hi James. Thank you so much for the detailed and to the point response!!!
    
    Reply
Luis Talavera June 23, 2024 at 3:06 pm #

Maybe I’m a bit confused, but I was wondering if nested cross-validation is necessary when the hyperparameter tuning strategy is Bayesian optimization. In that case, will a single cross-validation be enough?

Reply
- James Carmichael June 24, 2024 at 2:03 am #
  
  Hi Luis…When using Bayesian optimization for hyperparameter tuning, the need for nested cross-validation depends on your specific goals and the robustness you require in your model evaluation. Let’s break down the concepts to clarify:
  
  ### Cross-Validation and Nested Cross-Validation
  
  1. **Cross-Validation (CV)**:
     – **Purpose**: Evaluate the model’s performance on unseen data and ensure that it generalizes well.
     – **Procedure**: The data is split into k folds. The model is trained on k-1 folds and tested on the remaining fold. This process is repeated k times, and the performance metrics are averaged.
  
  2. **Nested Cross-Validation**:
     – **Purpose**: Provides an unbiased estimate of model performance, especially when hyperparameter tuning is involved. It helps avoid overfitting and ensures that the hyperparameter tuning process does not leak information from the test set.
     – **Procedure**: Two loops of cross-validation:
       – **Outer Loop**: Splits the data into k folds for model evaluation.
       – **Inner Loop**: Within each outer fold, another cross-validation is performed to tune hyperparameters.
  
  ### Bayesian Optimization
  
  Bayesian optimization is a method for hyperparameter tuning that builds a probabilistic model of the objective function and uses it to select the most promising hyperparameters to evaluate in the true objective function.
  
  ### When to Use Nested Cross-Validation
  
  **If your goal is to evaluate the true performance of a model with hyperparameter tuning**:
  
  – **Nested CV is necessary**: This ensures that the performance metrics are unbiased and reflective of how the model would perform on completely unseen data.
  – **Outer loop**: Used to evaluate the performance.
  – **Inner loop**: Used to perform Bayesian optimization for hyperparameter tuning.
  
  **If your goal is only to find the best hyperparameters**:
  
  – **Single CV can be enough**: Use a single cross-validation to perform Bayesian optimization. However, be aware that the performance estimates might be slightly optimistic because the same data is used for both tuning and evaluation.
  
  ### Practical Recommendation
  
  1. **For model evaluation and comparison**:
  – Use nested cross-validation to get a robust estimate of model performance. This is crucial if you are comparing different models or approaches.
  
  2. **For hyperparameter tuning**:
  – You can use a single cross-validation with Bayesian optimization if you are focused on finding the best hyperparameters efficiently.
  – After finding the best hyperparameters, evaluate the tuned model once on a held-out test set that played no part in the tuning. Re-running cross-validation on the same data used for tuning would repeat the optimistic bias.
  
  ### Example Workflow
  
  1. **Nested Cross-Validation**:
  – Outer CV: 5 folds
  – Inner CV: 3 folds with Bayesian optimization
  
  2. **Single Cross-Validation**:
  – CV: 5 or 10 folds with Bayesian optimization
  – After tuning, evaluate on a held-out test set that was not used during tuning.
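
As a rough sketch of the nested workflow above (5 outer folds, 3 inner folds): scikit-learn does not ship a Bayesian optimizer, so this example uses `RandomizedSearchCV` as a stand-in for the inner search. The synthetic dataset, the `SVC` model, and the search ranges are assumptions made for illustration. Any search object that follows the scikit-learn estimator interface (for example `BayesSearchCV` from the separate scikit-optimize package) could take its place in the inner loop.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption); replace with your own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=1)

# Inner loop: 3-fold search over hyperparameters.
# RandomizedSearchCV stands in for a Bayesian optimizer here.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = RandomizedSearchCV(
    SVC(),
    param_distributions={
        "C": loguniform(1e-2, 1e2),
        "gamma": loguniform(1e-4, 1e0),
    },
    n_iter=10,
    scoring="accuracy",
    cv=inner_cv,
    random_state=1,
)

# Outer loop: 5-fold evaluation of the whole "tune, then fit" procedure.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
nested_scores = cross_val_score(search, X, y, scoring="accuracy", cv=outer_cv)

# Single (non-nested) CV: the score that was used to pick the configuration.
search.fit(X, y)
single_cv_score = search.best_score_

print("Nested CV accuracy: %.3f (+/- %.3f)" % (nested_scores.mean(), nested_scores.std()))
print("Single CV best score: %.3f" % single_cv_score)
```

The nested score estimates how the tuning procedure as a whole generalizes. The single CV score is the one the search maximized, so it should be read as a selection criterion rather than a performance estimate.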
  
  ### Conclusion
  
  While Bayesian optimization is powerful for hyperparameter tuning, using nested cross-validation is important when you need an unbiased estimate of your model’s performance. If the primary goal is to tune hyperparameters efficiently, a single cross-validation may suffice, but be cautious about the potential for optimistic performance estimates.
  
  - Luis Talavera July 1, 2024 at 12:45 pm #
    
    Amazing!
    My problem was that I am evaluating different models, and my assessor asked how I could make sure there was no overfitting in my training process. Thank you very much for your answer!
    
Puck June 24, 2024 at 5:05 pm #

Hi Jason,

First of all, thank you very much for writing the blog and sharing your knowledge. I have a question about the section ‘How Do You Configure the Final Model?’. I am working with a small dataset (around 500 samples) and perform the following steps:

1. Outer cross-validation over the entire dataset (10-fold)
2. With the training part of the outer cross-validation, an inner cross-validation is performed (5-fold)
3. This inner cross-validation is used for hyperparameter optimization and model selection
4. From the five folds of the inner CV, one model including settings is selected as the best
5. This model is evaluated on the test set of the outer cross-validation fold
6. This results in 10 different models and AUCs
7. The average of these is taken
8. Then the entire dataset is used to run the inner cross-validation again
9. The best model is selected
10. This model is evaluated one more time by splitting the entire dataset into test and train sets

My question is about the performance of the model. Is it correct that the estimated performance of the model is the AUC generated in step 7 (the average of the 10 AUCs)? And that the AUC generated in step 10 is merely a form of control to ensure it does not deviate too much? I ask this because the test set from step 10 is also used to configure the final model in step 8 and 9, and if I only use this AUC to estimate the performance, there is a high chance of overfitting.

Thank you very much in advance for your answer!

- James Carmichael June 25, 2024 at 8:18 am #
  
  Hi Puck…You’re welcome! Your question highlights an important aspect of model evaluation and selection, especially when dealing with a small dataset. Let’s break down your process and clarify the estimated performance of the model.
  
  ### Your Steps Summarized
  1. **Outer Cross-Validation (10-fold)**: Splits the entire dataset into 10 parts. Each part is used once as a test set while the remaining 9 parts are used for training.
  2. **Inner Cross-Validation (5-fold) within the training set of the outer CV**: This is used for hyperparameter tuning and model selection.
  3. **Model Selection and Evaluation**: The best model from the inner CV is evaluated on the test set of the outer CV fold.
  4. **Average AUC**: Calculate the average AUC from the 10 outer CV folds.
  5. **Final Model Configuration**: The inner cross-validation is run again, this time on the entire dataset, and the best model is selected.
  6. **Final Model Evaluation**: The model is evaluated by splitting the entire dataset into train and test sets once more.
  
  ### Estimated Performance
  The key points to address are the estimated performance of the model and the use of the final dataset split:
  
  1. **Estimated Performance (Step 7)**:
  – The estimated performance of your model is indeed the average AUC generated from the outer cross-validation (step 7). This is because each fold in the outer CV provides an unbiased evaluation of the model’s performance on unseen data.
  – Averaging the AUCs from the outer CV gives you a robust estimate of how the model is expected to perform on new, unseen data.
  
  2. **Final Model Evaluation (Steps 8-10)**:
  – The evaluation in step 10 should not be used to estimate the model’s performance. The test split in step 10 was already part of the data used to select the configuration in steps 8 and 9, so the resulting AUC is optimistically biased.
  – Steps 8 and 9 are what matter here: they choose the configuration and produce the model you would deploy. Step 10 is at most a sanity check.
  
  ### Recommendations
  1. **Performance Estimation**:
  – Use the average AUC from the outer cross-validation (step 7) as the performance estimate of your model. This is the most reliable estimate since it reflects how your model performs on unseen data through multiple folds.
  
  2. **Final Model Training**:
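  – Configure the final model by running the inner cross-validation on the entire dataset, as in your steps 8 and 9, and fit the selected configuration on all of the data. The outer folds may each pick different hyperparameters, so the nested CV does not hand you a single best configuration to carry over. Do not report this model’s evaluation on a split of the entire dataset as its performance.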
  
  3. **Avoid Overfitting**:
  – Be cautious about using the entire dataset for both hyperparameter tuning and final model evaluation, as this can lead to overfitting and an overly optimistic performance estimate.
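
The procedure discussed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the data is synthetic (about 500 samples), and the logistic regression model, the `param_grid` values, and the helper `make_search` are hypothetical choices for the example, not something taken from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for a small dataset of about 500 samples.
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Hypothetical search space, chosen only for illustration.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}


def make_search():
    # Inner 5-fold CV used for hyperparameter optimization (steps 2-4).
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
    model = LogisticRegression(max_iter=1000)
    return GridSearchCV(model, param_grid, scoring="roc_auc", cv=inner_cv)


# Outer 10-fold CV over the entire dataset (step 1).
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
outer_aucs = []
for train_ix, test_ix in outer_cv.split(X, y):
    search = make_search()
    # Tuning sees only the training part of the outer fold.
    search.fit(X[train_ix], y[train_ix])
    # Step 5: evaluate the refit best model on the held-out outer fold.
    probs = search.predict_proba(X[test_ix])[:, 1]
    outer_aucs.append(roc_auc_score(y[test_ix], probs))

# Steps 6-7: this average is the performance estimate to report.
print("Estimated AUC: %.3f (+/- %.3f)" % (np.mean(outer_aucs), np.std(outer_aucs)))

# Steps 8-9: rerun the same inner procedure on all data to configure the final model.
final_search = make_search()
final_search.fit(X, y)
final_model = final_search.best_estimator_
print("Final configuration:", final_search.best_params_)
```

Note that `final_model` is never scored on a split of the data it was tuned on; the reported number comes only from `outer_aucs`.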
  
  ### Summary
  – The correct estimated performance of your model is the average AUC from the outer cross-validation (step 7).
  – The AUC from step 10 should not be reported as the performance estimate. The model configured in steps 8 and 9 is the one to use for deployment.
  – Ensuring the separation of model evaluation and hyperparameter tuning processes is crucial to avoid overfitting and obtain an unbiased performance estimate.
  
  By adhering to these principles, you’ll ensure that your model evaluation is robust and reliable, even with a small dataset.
  
PJ June 4, 2025 at 10:29 am #

Hi, great article!!! Congratulations!!

When comparing algorithms and using Nested Cross-Validation (NCV), I don’t need to split the data into training and test sets (e.g., 80/20), right? I can use NCV on the entire dataset, as it already provides an unbiased estimate of generalization through the outer folds, right? So I can simply report the mean ± standard deviation of the evaluation metric across these outer folds as the final generalization estimate and then compare the algorithms. Is that correct?

- James Carmichael June 5, 2025 at 12:15 am #
  
  Yes, you’re correct. Your understanding of Nested Cross-Validation (NCV) is spot on. Here’s a straightforward explanation to confirm and clarify the key points:
  
  You don’t need a separate 80/20 train-test split.
  Why? NCV already splits the data into outer and inner folds:
  
  * Outer folds are used to estimate generalization performance, similar to a test set.
  * Inner folds are used for model selection and hyperparameter tuning.
  
  You can use the entire dataset in NCV.
  NCV ensures that all the data is utilized across different folds, just not all at once. This provides a robust performance estimate while preventing data leakage.
  
  The unbiased performance estimate comes from the outer folds.
  The outer test folds are completely unseen during model training and tuning, so the results you get from them are a reliable measure of how your model would perform on truly unseen data.
  
  Reporting the mean and standard deviation of the outer scores is appropriate.
  Compute your chosen metric, such as accuracy, RMSE, or AUC, on each outer fold. Then report the average score along with the standard deviation. This is a standard and accepted approach when comparing algorithms.
  
  Algorithm comparisons should be based on these outer fold scores.
  If you want to go further, you can apply statistical tests like the paired t-test or the Wilcoxon signed-rank test on these scores to see if the differences between algorithms are statistically significant.
  
  In summary, you’re right:
  
  * No need for a separate train/test split
  * You can and should use all the data in NCV
  * Outer folds give an unbiased estimate of generalization
  * Report mean ± standard deviation across outer folds
  * These results are suitable for comparing algorithms fairly
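
A minimal sketch of such a comparison is shown below. It assumes synthetic data and two arbitrarily chosen candidates (logistic regression and k-nearest neighbors) with small illustrative grids; none of these choices come from the question. Both candidates are scored on the same outer folds so that the scores are paired. Because the outer training sets overlap, the fold scores are not independent, and the p-value should be treated as approximate.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption); replace with your own X, y.
X, y = make_classification(n_samples=400, n_features=15, random_state=7)

# Fixed seeds so that every candidate is scored on identical outer folds.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# Each candidate is a tuned model: the grid search is the inner loop.
candidates = {
    "logreg": GridSearchCV(
        LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=inner_cv
    ),
    "knn": GridSearchCV(
        KNeighborsClassifier(), {"n_neighbors": [3, 7, 15]}, cv=inner_cv
    ),
}

# Outer loop: one score per outer fold for each candidate.
scores = {}
for name, search in candidates.items():
    scores[name] = cross_val_score(search, X, y, scoring="accuracy", cv=outer_cv)
    print("%s: %.3f +/- %.3f" % (name, scores[name].mean(), scores[name].std()))

# Paired test on the per-fold differences (skipped if there are none).
diff = scores["logreg"] - scores["knn"]
if np.any(diff != 0):
    stat, p_value = wilcoxon(scores["logreg"], scores["knn"])
    print("Wilcoxon signed-rank p-value: %.3f" % p_value)
else:
    print("Scores are identical on every fold; no test needed.")
```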
  
  - PJ June 5, 2025 at 3:18 am #
    
    Thank you! So, I’m going the right way in my research. I’m comparing many algorithms and approaches, and I think the best way to do this is through the NCV method.
    
    What would be the best paper or book about NCV for citation? Would you recommend any for further reading? I need to support my decision to use NCV for algorithm comparison based on the outer-fold scores.
    
    Thank you again!
    
EMPJ July 16, 2025 at 10:55 am #

Hi, I’m trying to understand the best way to apply early stopping in the context of nested cross-validation (NCV). Is it possible or advisable to use early stopping in both inner and outer loops?

In the inner loop, I think it makes a lot of sense to apply early stopping during training, since I can monitor the performance on the validation fold (e.g., F1-score), and stop if it doesn’t improve for x epochs.

However, in the outer loop, I don’t have a separate validation set — just training and test folds. So how would early stopping work in the outer loop?

Would it make more sense to fix the number of epochs in training for outer (not using early stopping) and then simply select the model checkpoint with the lowest training loss?

Alternatively, I’ve considered using the average number of epochs (based on the best F1 from the inner folds) to train in the outer loop. What do you think about this approach?

- James Carmichael July 17, 2025 at 7:42 am #
  
  Hi EMPJ…Your suggestions are definitely reasonable. I would encourage you to try them and share your results or any issues you encounter.
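
One way to try the "average number of epochs" idea is sketched below. It rests on several assumptions: the data is synthetic, boosting rounds of `GradientBoostingClassifier` stand in for neural-network epochs, and the helper `best_round` and the constant `MAX_ROUNDS` are hypothetical names. The inner loop picks the best round in hindsight from the validation F1 curve, which is a simplification of patience-based early stopping.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (an assumption); replace with your own X, y.
X, y = make_classification(n_samples=300, n_features=10, random_state=3)
MAX_ROUNDS = 50  # upper bound on training iterations (stand-in for epochs)


def best_round(X_tr, y_tr, X_val, y_val):
    """Train for MAX_ROUNDS and return the round with the best validation F1."""
    model = GradientBoostingClassifier(n_estimators=MAX_ROUNDS, random_state=3)
    model.fit(X_tr, y_tr)
    f1_per_round = [f1_score(y_val, pred) for pred in model.staged_predict(X_val)]
    return int(np.argmax(f1_per_round)) + 1


outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=3)
outer_f1 = []
for train_ix, test_ix in outer_cv.split(X, y):
    X_train, y_train = X[train_ix], y[train_ix]
    # Inner loop: find the best stopping point on each validation fold.
    rounds = [
        best_round(X_train[tr], y_train[tr], X_train[val], y_train[val])
        for tr, val in inner_cv.split(X_train, y_train)
    ]
    n_rounds = max(1, int(round(np.mean(rounds))))
    # Outer loop: no early stopping; train for the averaged number of rounds.
    model = GradientBoostingClassifier(n_estimators=n_rounds, random_state=3)
    model.fit(X_train, y_train)
    # The outer test fold is used only for this final score.
    outer_f1.append(f1_score(y[test_ix], model.predict(X[test_ix])))

print("Outer F1: %.3f (+/- %.3f)" % (np.mean(outer_f1), np.std(outer_f1)))
```

The outer test fold never influences the stopping point here, which keeps the outer score usable as a performance estimate.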
  

