Last Updated on January 7, 2021
The k-fold cross-validation procedure is used to estimate the performance of machine learning models when making predictions on data not used during training.
This procedure can be used both when optimizing the hyperparameters of a model on a dataset, and when comparing and selecting a model for the dataset. When the same cross-validation procedure and dataset are used to both tune and select a model, it is likely to lead to an optimistically biased evaluation of the model performance.
One approach to overcoming this bias is to nest the hyperparameter optimization procedure under the model selection procedure. This is called double cross-validation or nested cross-validation and is the preferred way to evaluate and compare tuned machine learning models.
In this tutorial, you will discover nested cross-validation for evaluating tuned machine learning models.
After completing this tutorial, you will know:
- Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
- Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
- How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.
Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Jan/2021: Added section on pipeline thinking and a link to a related tutorial.

Nested Cross-Validation for Machine Learning with Python
Photo by Andrew Bone, some rights reserved.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Combined Hyperparameter Tuning and Model Selection
- What Is Nested Cross-Validation
  - What Is the Cost of Nested Cross-Validation?
  - How Do You Set k?
  - How Do You Configure the Final Model?
  - What Configuration Was Chosen by Inner Loop?
- Nested Cross-Validation With Scikit-Learn
Combined Hyperparameter Tuning and Model Selection
It is common to evaluate machine learning models on a dataset using k-fold cross-validation.
The k-fold cross-validation procedure divides a limited dataset into k non-overlapping folds. Each of the k folds is given an opportunity to be used as a held back test set whilst all other folds collectively are used as a training dataset. A total of k models are fit and evaluated on the k holdout test sets and the mean performance is reported.
For more on the k-fold cross-validation procedure, see the tutorial:
The procedure provides an estimate of the model performance on the dataset when making a prediction on data not used during training. For small- to modestly-sized datasets, it is less biased than some other techniques, such as a single train-test split. Common values for k are k=3, k=5, and k=10.
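To make that evaluation step concrete, a minimal sketch is shown below; the synthetic dataset and the logistic regression model are illustrative assumptions only.

# minimal sketch: evaluate a single model with k-fold cross-validation
# (synthetic dataset and logistic regression chosen only for illustration)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
# create an illustrative dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=1)
# define the model to evaluate
model = LogisticRegression(max_iter=1000)
# define the k=10 fold cross-validation procedure
cv = KFold(n_splits=10, shuffle=True, random_state=1)
# fit and evaluate k models, one per fold, and report the mean score
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
print('Mean accuracy: %.3f' % mean(scores))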
Each machine learning algorithm includes one or more hyperparameters that allow the algorithm behavior to be tailored to a specific dataset. The trouble is, there are rarely, if ever, good heuristics for configuring the model hyperparameters for a dataset. Instead, an optimization procedure is used to discover a set of hyperparameters that perform well or best on the dataset. Common examples of optimization algorithms include grid search and random search, and each distinct set of model hyperparameters is typically evaluated using k-fold cross-validation.
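As a hedged sketch of that optimization step, the snippet below grid-searches one hyperparameter and scores each candidate value with k-fold cross-validation; the dataset, model, and search space are assumptions made for illustration.

# minimal sketch: each hyperparameter combination is scored with k-fold CV
# (dataset, model, and grid are illustrative assumptions)
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create an illustrative dataset
X, y = make_classification(n_samples=100, n_features=10, random_state=1)
# define the model and a small search space
model = RandomForestClassifier(random_state=1)
space = {'n_estimators': [10, 100]}
# each candidate configuration is evaluated with 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(model, space, scoring='accuracy', cv=cv)
result = search.fit(X, y)
print(result.best_params_, result.best_score_)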
This highlights that the k-fold cross-validation procedure is used both in the selection of model hyperparameters to configure each model and in the selection of configured models.
The k-fold cross-validation procedure is an effective approach for estimating the performance of a model. Nevertheless, a limitation of the procedure is that if it is used multiple times with the same algorithm, it can lead to overfitting.
Each time a model with different model hyperparameters is evaluated on a dataset, it provides information about the dataset. Specifically, an often noisy model performance score. This knowledge about the model on the dataset can be exploited in the model configuration procedure to find the best performing configuration for the dataset. The k-fold cross-validation procedure attempts to reduce this effect, yet it cannot be removed completely, and some form of hill-climbing or overfitting of the model hyperparameters to the dataset will be performed. This is the normal case for hyperparameter optimization.
The problem is that if this score alone is used to then select a model, or the same dataset is used to evaluate the tuned models, then the selection process will be biased by this inadvertent overfitting. The result is an overly optimistic estimate of model performance that does not generalize to new data.
A procedure is required that allows us both to select well-performing hyperparameters for each model on the dataset and to select among a collection of well-configured models.
One approach to this problem is called nested cross-validation.
What Is Nested Cross-Validation
Nested cross-validation is an approach to model hyperparameter optimization and model selection that attempts to overcome the problem of overfitting the training dataset.
In order to overcome the bias in performance evaluation, model selection should be viewed as an integral part of the model fitting procedure, and should be conducted independently in each trial in order to prevent selection bias and because it reflects best practice in operational use.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
The procedure involves treating model hyperparameter optimization as part of the model itself and evaluating it within the broader k-fold cross-validation procedure for evaluating models for comparison and selection.
As such, the k-fold cross-validation procedure for model hyperparameter optimization is nested inside the k-fold cross-validation procedure for model selection. The use of two cross-validation loops also leads the procedure to be called “double cross-validation.”
Typically, the k-fold cross-validation procedure involves fitting a model on all folds but one and evaluating the fit model on the holdout fold. Let’s refer to the aggregate of folds used to train the model as the “train dataset” and the held-out fold as the “test dataset.”
Each training dataset is then provided to a hyperparameter optimization procedure, such as grid search or random search, that finds an optimal set of hyperparameters for the model. The evaluation of each set of hyperparameters is performed using k-fold cross-validation that splits up the provided train dataset into k folds, not the original dataset.
This is termed the “internal” protocol as the model selection process is performed independently within each fold of the resampling procedure.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
Under this procedure, hyperparameter search does not have an opportunity to overfit the dataset as it is only exposed to a subset of the dataset provided by the outer cross-validation procedure. This reduces, if not eliminates, the risk of the search procedure overfitting the original dataset and should provide a less biased estimate of a tuned model’s performance on the dataset.
In this way, the performance estimate includes a component properly accounting for the error introduced by overfitting the model selection criterion.
— On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
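A rough structural sketch of the nesting described above is given below; the dataset, model, and grid are placeholders chosen only to make the loop structure concrete (the full worked example appears later in the tutorial).

# structural sketch of nested cross-validation (illustrative placeholders)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
outer_scores = list()
# outer loop: estimates the performance of the tuned-model procedure
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in cv_outer.split(X):
    X_train, X_test = X[train_ix], X[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # inner loop: the hyperparameter search sees only the outer training folds
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.1, 1.0, 10.0]},
                          scoring='accuracy', cv=cv_inner, refit=True)
    best_model = search.fit(X_train, y_train).best_estimator_
    # the tuned model is scored on the held-out outer fold
    outer_scores.append(accuracy_score(y_test, best_model.predict(X_test)))
print('Estimated accuracy: %.3f' % mean(outer_scores))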
What Is the Cost of Nested Cross-Validation?
A downside of nested cross-validation is the dramatic increase in the number of model evaluations performed.
If n * k models are fit and evaluated as part of a traditional cross-validation hyperparameter search for a given model, then this is increased to k * n * k, as the search is repeated once for each of the k folds in the outer loop of nested cross-validation.
To make this concrete, you might use k=5 for the hyperparameter search and test 100 combinations of model hyperparameters. A traditional hyperparameter search would, therefore, fit and evaluate 5 * 100 or 500 models. Nested cross-validation with k=10 folds in the outer loop would fit and evaluate 5,000 models. A 10x increase in this case.
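A quick check of that arithmetic, under the same assumed settings (100 configurations, k=5 for the search, k=10 for the outer loop):

# model fits for traditional vs. nested cross-validation hyperparameter search
n_configs = 100   # hyperparameter combinations tested
k_inner = 5       # folds used by the hyperparameter search
k_outer = 10      # folds in the outer loop of nested cross-validation
print('traditional search:', n_configs * k_inner)                 # 500 fits
print('nested cross-validation:', k_outer * n_configs * k_inner)  # 5,000 fits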
How Do You Set k?
The k value for the inner loop and the outer loop should be set as you would set the k-value for a single k-fold cross-validation procedure.
You must choose a k-value for your dataset that balances the computational cost of the evaluation procedure (not too many model evaluations) against the need for an unbiased estimate of model performance.
It is common to use k=10 for the outer loop and a smaller value of k for the inner loop, such as k=3 or k=5.
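A minimal sketch of that common configuration, using the specific values suggested above:

# common configuration: larger k for the outer loop, smaller k for the inner loop
from sklearn.model_selection import KFold
# outer loop: model evaluation
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# inner loop: hyperparameter tuning
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)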
For more general help on setting k, see this tutorial:
How Do You Configure the Final Model?
The final model is configured and fit using the procedure applied within one pass of the outer loop, i.e. the inner search procedure applied to the entire dataset.
As follows:
- An algorithm is selected based on its performance on the outer loop of nested cross-validation.
- Then the inner procedure is applied to the entire dataset.
- The hyperparameters found during this final search are then used to configure a final model.
- The final model is fit on the entire dataset.
This model can then be used to make predictions on new data. We know how well it will perform on average based on the score provided during the final model tuning procedure.
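A hedged sketch of this final-model procedure is shown below: the inner search is run once on the entire dataset and the refit best estimator is kept. The dataset, model, and grid mirror the worked example later in the tutorial and are otherwise illustrative.

# sketch: configure and fit the final model by applying the inner search
# to the entire dataset (dataset, model, and grid are illustrative)
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
space = {'n_estimators': [10, 100, 500], 'max_features': [2, 4, 6]}
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(RandomForestClassifier(random_state=1), space,
                      scoring='accuracy', cv=cv_inner, refit=True)
# best_estimator_ is refit on the entire dataset and becomes the final model
final_model = search.fit(X, y).best_estimator_
# final_model.predict(...) can now be used on new data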
What Configuration Was Chosen by Inner Loop?
It doesn’t matter, that’s the whole idea.
An automatic configuration procedure was used instead of a specific configuration. There is a single final model, but the best configuration for that final model is found via the chosen search procedure, on the final run.
You let go of the need to dive into the specific model configuration chosen, just as at the next level down you let go of the specific model coefficients found in each cross-validation fold.
This requires a shift in thinking and can be challenging, e.g. a shift from “I configured my model like this…” to “I used an automatic model configuration procedure with these constraints…“.
This tutorial has more on the topic of “pipeline thinking” and may help:
Now that we are familiar with nested cross-validation, let’s review how we can implement it in practice.
Nested Cross-Validation With Scikit-Learn
The k-fold cross-validation procedure is available in the scikit-learn Python machine learning library via the KFold class.
The class is configured with the number of folds (splits), then the split() function is called, passing in the dataset. The results of the split() function are enumerated to give the row indexes for the train and test sets for each fold.
For example:
...
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# perform cross-validation procedure
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit and evaluate a model
    ...
This class can be used to perform the outer loop of the nested cross-validation procedure.
The scikit-learn library provides cross-validated random search and grid search hyperparameter optimization via the RandomizedSearchCV and GridSearchCV classes respectively. The procedure is configured by creating the class and specifying the model, dataset, hyperparameters to search, and cross-validation procedure.
For example:
...
# configure the cross-validation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# define search space
space = dict()
...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv)
# execute search
result = search.fit(X, y)
These classes can be used for the inner loop of nested cross-validation where the train dataset defined by the outer loop is used as the dataset for the inner loop.
We can tie these elements together and implement the nested cross-validation procedure.
Importantly, we can configure the hyperparameter search to refit a final model with the entire training dataset using the best hyperparameters found during the search. This can be achieved by setting the “refit” argument to True, then retrieving the model via the “best_estimator_” attribute on the search result.
...
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=-1, cv=cv_inner, refit=True)
# execute search
result = search.fit(X_train, y_train)
# get the best performing model fit on the whole training set
best_model = result.best_estimator_
This model can then be used to make predictions on the holdout data from the outer loop and estimate the performance of the model.
...
# evaluate model on the hold out dataset
yhat = best_model.predict(X_test)
Tying all of this together, we can demonstrate nested cross-validation for the RandomForestClassifier on a synthetic classification dataset.
We will keep things simple and tune just two hyperparameters with three values each, i.e. (3 * 3) or 9 combinations. We will use 10 folds in the outer cross-validation and three folds for the inner cross-validation, resulting in (10 * 9 * 3) or 270 model evaluations.
The complete example is listed below.
# manual nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# enumerate splits
outer_results = list()
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
    # configure the cross-validation procedure
    cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
    # define the model
    model = RandomForestClassifier(random_state=1)
    # define search space
    space = dict()
    space['n_estimators'] = [10, 100, 500]
    space['max_features'] = [2, 4, 6]
    # define search
    search = GridSearchCV(model, space, scoring='accuracy', cv=cv_inner, refit=True)
    # execute search
    result = search.fit(X_train, y_train)
    # get the best performing model fit on the whole training set
    best_model = result.best_estimator_
    # evaluate model on the hold out dataset
    yhat = best_model.predict(X_test)
    # evaluate the model
    acc = accuracy_score(y_test, yhat)
    # store the result
    outer_results.append(acc)
    # report progress
    print('>acc=%.3f, est=%.3f, cfg=%s' % (acc, result.best_score_, result.best_params_))
# summarize the estimated performance of the model
print('Accuracy: %.3f (%.3f)' % (mean(outer_results), std(outer_results)))
Running the example evaluates a random forest using nested cross-validation on a synthetic classification dataset.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
You can use the example as a starting point and adapt it to evaluate different algorithm hyperparameters, different algorithms, or a different dataset.
Each iteration of the outer cross-validation procedure reports the estimated performance of the best performing model (using 3-fold cross-validation) and the hyperparameters found to perform the best, as well as the accuracy on the holdout dataset.
This is insightful as we can see that the actual and estimated accuracies are different, but in this case, similar. We can also see that different hyperparameters are found on each iteration, showing that good hyperparameters on this dataset are dependent on the specifics of the dataset.
A final mean classification accuracy is then reported.
>acc=0.900, est=0.932, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.940, est=0.924, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.929, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.930, est=0.927, cfg={'max_features': 6, 'n_estimators': 100}
>acc=0.920, est=0.927, cfg={'max_features': 4, 'n_estimators': 100}
>acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
>acc=0.910, est=0.918, cfg={'max_features': 2, 'n_estimators': 100}
>acc=0.930, est=0.924, cfg={'max_features': 6, 'n_estimators': 500}
>acc=0.960, est=0.926, cfg={'max_features': 2, 'n_estimators': 500}
>acc=0.900, est=0.937, cfg={'max_features': 4, 'n_estimators': 500}
Accuracy: 0.927 (0.019)
A simpler way to perform the same procedure is to use the cross_val_score() function to execute the outer cross-validation procedure. It can be applied to the configured GridSearchCV object directly, which will automatically refit the best performing model on each outer training set and evaluate it on the corresponding test set from the outer loop.
This greatly reduces the amount of code required to perform the nested cross-validation.
The complete example is listed below.
# automatic nested cross-validation for random forest on a classification dataset
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# create dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=1, n_informative=10, n_redundant=10)
# configure the cross-validation procedure
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
# define the model
model = RandomForestClassifier(random_state=1)
# define search space
space = dict()
space['n_estimators'] = [10, 100, 500]
space['max_features'] = [2, 4, 6]
# define search
search = GridSearchCV(model, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# configure the cross-validation procedure
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
# execute the nested cross-validation
scores = cross_val_score(search, X, y, scoring='accuracy', cv=cv_outer, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example performs the nested cross-validation on the random forest algorithm, achieving a mean accuracy that matches our manual procedure.
Accuracy: 0.927 (0.019) |
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Tutorials
- A Gentle Introduction to k-fold Cross-Validation
- How to Configure k-Fold Cross-Validation
- A Gentle Introduction to Machine Learning Modeling Pipelines
Papers
- Cross-validatory choice and assessment of statistical predictions, 1974.
- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010.
- Cross-validation pitfalls when selecting and assessing regression and classification models, 2014.
- Nested cross-validation when selecting classifiers is overzealous for most practical applications, 2018.
APIs
- Cross-validation: evaluating estimator performance, scikit-learn.
- Nested versus non-nested cross-validation, scikit-learn example.
- sklearn.model_selection.KFold API.
- sklearn.model_selection.GridSearchCV API.
- sklearn.ensemble.RandomForestClassifier API.
- sklearn.model_selection.cross_val_score API.
Summary
In this tutorial, you discovered nested cross-validation for evaluating tuned machine learning models.
Specifically, you learned:
- Hyperparameter optimization can overfit a dataset and provide an optimistic evaluation of a model that should not be used for model selection.
- Nested cross-validation provides a way to reduce the bias in combined hyperparameter tuning and model selection.
- How to implement nested cross-validation for evaluating tuned machine learning algorithms in scikit-learn.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Good afternoon, thanks for your article.
Could you please tell a newbie. The picture “A final mean classification accuracy is then reported” shows the “best parameters” for each outer fold.
1) what parameters from this list should be used in the final model?
2) how can the resulting trained model be called in your example?
thank you in advance.
Think of it this way: We are searching for a modeling process, not a specific model and configuration.
The modeling process that achieves the best result is the one we will use.
So we could compare the best grid search model from multiple different algorithms and use the process that creates the best model as the process to create the final model.
Does that make sense?
Double cross-validation no doubt is a better approach, as it looks on the surface; we will see how it works with our own problem.
Thanks.
Hi Jason, I follow your posts and I learn a lot! Great stuff!
I do not know if I understood the procedure well, but I think it is overfitting again when a model is selected with a better mean metric on the holdout groups.
I view nested cross-validation as a technique that gives k best models, and it is not possible to assess which is better. To do this I’d need another holdout set, but there is none anymore.
The question is, what am I going to do with the k best models? I do not know; perhaps I can ensemble them, or discard some using some theoretical analysis, for example.
Cheers
Thanks!
No, it selects a modeling pipeline that results in the best model. E.g. the grid search will choose the best model for you to start using directly and make predictions.
Hi Jason!! Now I get it! You don’t end up with k models. What you get from cross_val_score is an UNBIASED estimate for the generalization error. Now let’s say you’re tasked with making a new prediction using a fresh set of predictors X.
What you should do is you should run GridSearchCV on your entire dataset. Then use best_model to make your prediction.
Great post!!!
Exactly!
Hi:
I work with very large data sets (more than a billion rows). I would like to see a book where these models and methods are described using PySpark in a Spark setting. Nowadays large data sets are more common, so a book like that would be very useful for data scientists.
Thanks for the suggestion!
Dear Dr Jason,
This is about the lines of code in the nested model, particularly
After some experimentation, I discovered that you could say the same thing as:
The reason is that search contains the best model for the given parameters such that we can fit X_train, y_train for the best model and predict using X_test with the best model.
Definition: best model = search, the best model for the GridSearch’s parameters.
Thank you,
Anthony of Sydney
Yes, that is how you would use the technique as a final model without the outer loop for model evaluation.
Dear Dr Jason,
Thank you for the reply. The question I asked related to the nested example achieving the same score with less code.
My question is on the last model.
If the last model has less code and no loops but achieves the same results, why not implement the last example, which achieves the same outcome of removing the noise from the scores?
Thank you again,
Anthony of Sydney
It is doing something different.
The example in the tutorial is evaluating the modeling pipeline with k-fold cross-validation (outer loop).
Your example fits the pipeline once and uses it to make a prediction.
Dear Dr Jason,
Thank you.
In addition to the above question.
In the nested example, you get X_train and y_train from the outer splits and were able to get yhat.
In the above code, in order to get X_train and y_train, there were 10 iterations of the loop based on kfolds=10. We had 10 different train_ix values, so we could estimate yhat based on 10 versions of X_train and y_train.
BUT in the last example there are no iterations and no ability to estimate yhat. So we cannot predict yhat because we don’t have 10 iterations of train_ix as in the nested example?
Put it another way, how can we make a prediction in the last example?
Thank you,
Anthony of Sydney
How do you make a prediction for yhat in the second example?
The last example does the same thing as the first example in fewer lines of code.
You can make a prediction using the code example you provided, e.g. fit the pipeline then call predict().
Dear Dr Jason,
Thank you.
But in the last example, how do I get the indices for train_ix when there is no iteration as in the first example?
I need the train_ix indices in order to get X_train and y_train. I cannot see how that could be obtained in the last example.
In the first example, you get the train_ix from the cv_outer.split(X).
In the last example there is no way to get train_ix in order to get X_train and y_train, because you need the train_ix indices but you cannot get them in the second example.
So how do I get the train_ix in the last example?
Thank you and always appreciate your tutorials.
Anthony of Sydney
In the last example, the cross-validation of the modeling pipeline (outer) is controlled via the call to cross_val_score(). The cross-validation of each hyperparameter combination is controlled automatically via the GridSearchCV (inner).
It does the same thing as the first example in fewer lines of code.
Dear Dr Jason,
I tried to predict yhat from the last example.
In the last example, in order to get y_train, I had to do the following by appending this code to the end of the last example
This is certainly different from either the first or second example which produced
Accuracy: 0.927 (0.019)
That is mine produced
0.926 with std dev (0.0037)
It is strange that my std_dev was less using 3 sets of scores.
Thank you,
Anthony of Sydney
Dear Dr Jason,
I did the same thing, by appending the following code, this time using
This time I used the cv_outer which is 10 fold and attached to the end of your last example:
The result is exactly like your first and last code.
Summary of findings:
For the second example, in order to get the test and train sets each for X and y you have to have a loop in order to extract the indices for test and train. The indices for test and train are dependent on the number of folds.
Thank you,
Anthony of Sydney
Sorry, I don’t understand what you’re trying to achieve.
Why would you want to make a prediction within cross-validation?
Dear Dr Jason
Thank you for your reply.
This was the original question and maybe I did not make it clear.
(1) My original question was how to make predictions with the second example labelled above the heading “Further Reading”.
(2) You did make a prediction with the cross validation in the first example located at the section “Nested Cross-Validation With Scikit-Learn”
(3) To answer your question “…what I am trying to achieve…:” is to be able to predict yhat in the second example above the heading “Further Reading” . I was able to predict yhat and achieved the same results.
This required me adding a few extra lines to the second example.
I finally worked it out getting EXACTLY the SAME mean(scores) and std(scores) as your example.
I added the following code to the end of your second example LOCATED above the section “Further Reading”
Conclusion (1) , you can retrieve the means to produce X_train, y_train, X_test, y_test, scores, mean(scores) and std(scores) as your examples.
Conclusion (2), you can make predictions with code and get the same results.
It works.
Thank you again for your patience,
Anthony of Sydney
It was because you used the prediction
You provided code to make a prediction with the final example already and I pointed out as much, here it is again:
Dear Dr Jason,
Thank you.
It works, and I thank you for that.
Again, thank you for your patience which helps in understanding.
It is appreciated.
Anthony of Sydney
You’re very welcome Anthony, I’m here to help if I can.
Hi, I show your result from this topic below:
>acc=0.900, est=0.932, cfg={‘max_features’: 4, ‘n_estimators’: 100}
>acc=0.940, est=0.924, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.930, est=0.929, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.930, est=0.927, cfg={‘max_features’: 6, ‘n_estimators’: 100}
>acc=0.920, est=0.927, cfg={‘max_features’: 4, ‘n_estimators’: 100}
>acc=0.950, est=0.927, cfg={‘max_features’: 4, ‘n_estimators’: 500}
>acc=0.910, est=0.918, cfg={‘max_features’: 2, ‘n_estimators’: 100}
Accuracy: 0.927 (0.019)
I read a lot of the comments below about how to choose the best parameters for building the final model, but I am not really clear.
Can I apply the performance result straightforwardly by selecting the parameter set with the best accuracy? In this case, >acc=0.950, est=0.927, cfg={'max_features': 4, 'n_estimators': 500}
Can I choose {'max_features': 4, 'n_estimators': 500}?
If not, what is your recommendation?
You can.
But, the idea of nested CV, is you don’t choose the hyperparameters, the grid search does and it will use whatever is “best” according to your data and metric.
It takes config selection out of your hands and leaves you with a “process” rather than a model and a config. The outer CV is evaluating that process.
Thank you, great advice.
You’re welcome.
How to obtain the best hyperparameter combination for the simpler cross_val_score approach?
Does the above example not help?
What problem are you having exactly?
Using the longer approach, I obtained the best hyperparameter combination, which I might choose as my final model; how do I get it using the cross_val_score method? In the cross_val_score approach, after the GridSearchCV, how do the best parameters get automatically selected for the outer CV process?
You don’t need to know the best hyperparameters as the pipeline will find the best model for you so you can start using it.
Nevertheless, you can keep a reference to the object if you like and then print the “best_params_” attribute.
I have to present a final model (as per client requirement). The cross_val_score approach is a process for tuning and observing how my tuned learner will perform on unseen data (a proxy for the generalization error). Please correct me if I’m wrong.
As you said the pipeline will find the best model (that i can understand) but how to extract the best model so that my client can use it on his own data for prediction?
As you have suggested printing the best_params_ attribute to solve my problem, should I use the nested-cv package from PyPI?
You can use the best parameters to define the final model, then fit it on all available training data.
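A minimal, hedged sketch of that suggestion (the dataset and search space below are illustrative, not part of the original answer): inspect best_params_, then refit a fresh model with those values on all available data.

# sketch: read the chosen configuration and refit a final model on all data
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
space = {'n_estimators': [10, 100], 'max_features': [2, 4]}
search = GridSearchCV(RandomForestClassifier(random_state=1), space, scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_)
# define the final model using the best parameters and fit it on all data
final_model = RandomForestClassifier(random_state=1, **search.best_params_)
final_model.fit(X, y)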
thanks Jason
You’re welcome.
Hello,
I understand that after nested CV we are left with a metric (say, accuracy) that evaluates the modelling pipeline, not any individual configuration trained in the inner loop.
What does this accuracy mean exactly as it relates to the final configuration produced by this pipeline? If the outer CV loop says the pipeline is X% accurate, is it also correct to claim that the final configuration produced by the pipeline is accurate X% of the time?
If not, how can we make a claim about the accuracy of the final configuration itself?
Good question.
It is the accuracy of the pipeline that included a grid search to find the best config within the pipeline. E.g. the best model found.
Thank you for the reply.
Just to clarify, if a client expects a quote of accuracy from a developer, i.e. they want to know how often the algorithm will make a correct prediction, is it correct to provide the mean accuracy calculated from the outer CV loop?
For instance, if the outer CV loop outputs 90% mean accuracy, then is it correct to tell a client that, on average, 90% of the predictions from the final configuration (algorithm + learned parameters + hyperparameters) are true?
Yes, the mean and standard deviation – e.g. the distribution of accuracy scores.
Even better would be to give the bootstrap confidence interval for the final model:
https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
Hi Jason, thanks a lot for this great tutorial! But I’d like to ask whether there is, in principle, a need for cross validation for such models as random forest that use bagging. E.g., from James et al.: An Introduction to Statistical Learning with Applications in R (p. 317-318):
“It turns out that there is a very straightforward way to estimate the test error of a bagged model, without the need to perform cross-validation or the validation set approach.
…
The resulting OOB (out-of-bag) error is a valid estimate of the test error for the bagged model, since the response for each observation is predicted using only the trees that were not fit using that observation.”
What do you think, can random forest be trained and hyper parameters tuned without CV?
Yes, you can use OOB for the model if it is standalone.
For comparing models, a systematic and consistent test harness is required.
How do you select the best model in the outer_cv? I do not see the point of averaging the accuracy of models with different hyperparameters. Also, the best_estimator_ only gives the best model for the specific inner fold, since basically we are instantiating a new search for each inner fold.
You don’t, the models are discarded. Once you choose a “final” model/pipeline you can fit it on your available data and start making predictions:
https://machinelearningmastery.com/train-final-machine-learning-model/
Hi Jason,
Great Article !!
Are we running the inner loop on the entire dataset split into train/test here?
Shouldn’t we hide 10% of the data from the inner loop and use the entire dataset in the outer loop only?
The outer loop operates on all data, inner loop operates on the training set of one pass from the outer loop.
Why hold back data?
Dear Dr. Brownlee,
Thank you for this very detailed and interesting article.
I have just one question. When you describe the procedure to obtain the final model, you mention: “We know how well it will perform on average based on the score provided during the final model tuning procedure”.
However, I guess we cannot use this score, since it would be optimistically biased (hyperparameters were tuned). In order to estimate the final model performance, it seems more appropriate to use the unbiased score previously obtained with the nested cross-validation.
Does that make sense?
Thank you again and keep up the great work. Best regards.
We can, as the estimate was based on the outer loop.
This is the whole point, that the model tuning is now considered part of the model itself – e.g. the modeling pipeline.
Hi Jason,
I am trying to apply your steps to my data. I have a pipeline containing TF-IDF, LDA, and a LogisticRegression model. My input data is a list of documents X = ['text', 'text', ...], y = [0, 1, ...], and I am trying to optimize different parameters of the pipeline components using nested CV.
How can I use your code here to split the data, assuming that the data needs to be a numpy array?
for train_ix, test_ix in cv_outer.split(X):
    # split data
    X_train, X_test = X[train_ix, :], X[test_ix, :]
    y_train, y_test = y[train_ix], y[test_ix]
Thank you
Yes, numpy arrays.
Perhaps try using the all in one solution at the end of the tutorial instead?
Hi Jason ,
Can you please elaborate on how exactly cross_val_score works? It is hard to understand how the model is going to be fitted. Is there any reference, please? I can’t find an explanation in the sklearn documentation.
Million Thanks
It evaluates the model using cross validation and reports the results.
You can learn more here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
If cross-validation is new for you, start here:
https://machinelearningmastery.com/k-fold-cross-validation/
Hello Jason,
If I have to compare which algorithm (RandomForest, SomeOtherModel-1, SomeOtherModel-2) is the best, do I just repeat the nested CV with each of the above algorithms and make a decision based on the various nested CV accuracies and standard deviations?
Thank you,
Karthik
Yes, compare the mean performance from each technique.
Thank you!
You’re welcome.
Hi Jason Brownlee,
Thanks for the great explanation. But I have one doubt: shouldn’t we use one more for loop for the inner iterations, as we did for the outer loop?
We do in the final example; it is just handled for us by the grid search.
Dear Jason,
Thank you for the post. I have two questions.
1) The first one is about pre-processing techniques (scaling, imputation, etc.) and nested cross-validation. Can nested cross-validation be used to decide on a pre-processing technique, say to make a decision about the imputation techniques mean vs. median vs. regression?
2) Is it possible to generalize nested cross validation to other resampling techniques? For example, “nested repeated train-test splits” where both outer and inner validations use repeated random train-test splits. If so, do you know any academic references to the topic? If I understand them correctly, the most commonly used nested resampling technique in scientific publications is single train-test split for the outer validation and a regular k-fold cross-validation for the inner validation.
Best,
Emin
I don’t think nested CV helps with choosing data preparation; perhaps this will help:
https://machinelearningmastery.com/grid-search-data-preparation-techniques/
Yes, I don’t see why the approach would not generalize.
I doubt it is discussed in research papers – it seems too elementary. You can search for references on scholar.google.com
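On the first question, a hedged sketch of the grid-search-over-preparation idea in the linked tutorial: the imputation strategy is treated as one more pipeline hyperparameter (the dataset and choices below are illustrative assumptions).

# sketch: grid search an imputation strategy as part of a modeling pipeline
from numpy import nan
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
X[::10, 0] = nan  # inject some missing values for illustration
pipe = Pipeline([('impute', SimpleImputer()), ('model', LogisticRegression(max_iter=1000))])
space = {'impute__strategy': ['mean', 'median']}
search = GridSearchCV(pipe, space, scoring='accuracy', cv=3)
print(search.fit(X, y).best_params_)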
Thank you
You’re welcome.
Hi, I am new to machine learning. I was reading this post and I must say it is very well written and clearly explained. I just have one question. I want to deploy my ML model. For this, I need to save the best model. In nested cross-validation, which model should we deploy? Because in each fold, we are getting a new model with a new set of hyperparameters.
Thanks!
One solution would be to use the grid search cv model and drop the outer loop.
Another solution would be to inspect the result from the grid search cv model and use the specific configuration chosen.
Hi Dr. Brownlee,
Thank you for your quick reply! I am not sure I fully understood your answer. However, I was reading some of the previous comments and I figured the following points.
1) Nested cross-validation does not provide a final model. It is just a way to check (or evaluate) the performance of an ML algorithm on different independent training and testing datasets. The averaged performance metrics that come out of cross-validation tell us the overall behavior of the model, which we can report to the client.
2) After nested cross-validation, if we want to give the best model to someone (say client), then we need to (a) use the whole dataset, (b) run hyperparameter optimization, (c) get the best parameter, and (d) retrain the fresh ML algorithm with the best parameters to save the ML model as the “Final model.”
Is this correct?
Yes, that approach is valid and what I would do.
Dear Dr. Brownlee,
Thank you very much for validating my thought process and clarifying the doubts. I would be troubling you once again after reading all your articles/blogs. I hope you will clarify my future doubts as well.
Thank you once again,
Ankush
You’re welcome!
Don’t you run the risk here of having the hyperparameter search overfit again? Seems the nested CV shows that it’s possible to have a good model, but how do you guarantee (or at least maximize the chance) of actually getting a good single model at the end that you can use? Thanks!
Hi, thank you very much for this article – it has been very useful. How though, please, can I expand on this by including recursive feature elimination so that not only will the process find the best hyperparameters but also the most optimal features (i.e. the best hyperparameter and feature combination)?
A RFECV will automatically configure itself. You can use it in a pipeline prior to your model.
Or use a pipeline with a grid search on the RFE hyperparameter.
This tutorial will show you how to use RFE in a pipeline:
https://machinelearningmastery.com/rfe-feature-selection-in-python/
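For reference, a hedged sketch of RFE inside a pipeline along the lines of the linked tutorial, with the number of selected features searched alongside a model hyperparameter (all choices below are illustrative).

# sketch: RFE as a pipeline step whose n_features_to_select is grid searched
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
X, y = make_classification(n_samples=200, n_features=10, n_informative=5, random_state=1)
pipe = Pipeline([
    ('rfe', RFE(estimator=RandomForestClassifier(random_state=1))),
    ('model', RandomForestClassifier(random_state=1)),
])
space = {'rfe__n_features_to_select': [3, 5, 7], 'model__n_estimators': [10, 100]}
search = GridSearchCV(pipe, space, scoring='accuracy', cv=3)
print(search.fit(X, y).best_params_)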
Thank you for your reply. Using the explicit outer-inner loop above, I think I achieve this using the below.
I’m happy to hear you are making progress.
Sorry, I don’t have the capacity to review your code.