How to Use Out-of-Fold Predictions in Machine Learning

Machine learning algorithms are typically evaluated using resampling techniques such as k-fold cross-validation.

During the k-fold cross-validation process, predictions are made on test sets composed of data not used to train the model. These predictions are referred to as out-of-fold predictions, a type of out-of-sample prediction.

Out-of-fold predictions play an important role in machine learning, both in estimating the performance of a model when making predictions on new data in the future, the so-called generalization performance of the model, and in the development of ensemble models.

In this tutorial, you will discover a gentle introduction to out-of-fold predictions in machine learning.

After completing this tutorial, you will know:

  • Out-of-fold predictions are a type of out-of-sample prediction made on data not used to train a model.
  • Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
  • Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.

Let’s get started.

  • Update Jan/2020: Updated for changes in scikit-learn v0.22 API.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. What Are Out-of-Fold Predictions?
  2. Out-of-Fold Predictions for Evaluation
  3. Out-of-Fold Predictions for Ensembles

What Are Out-of-Fold Predictions?

It is common to evaluate the performance of a machine learning algorithm on a dataset using a resampling technique such as k-fold cross-validation.

The k-fold cross-validation procedure involves splitting a training dataset into k groups, then using each of the k groups of examples in turn as a test set while the remaining examples are used as a training set.

This means that k different models are trained and evaluated. The performance of the model is estimated using the predictions made by the models across all k folds.

This procedure can be summarized as follows:

  • 1. Shuffle the dataset randomly.
  • 2. Split the dataset into k groups.
  • 3. For each unique group:
    • a. Take the group as a holdout or test data set.
    • b. Take the remaining groups as a training data set.
    • c. Fit a model on the training set and evaluate it on the test set.
    • d. Retain the evaluation score and discard the model.
  • 4. Summarize the skill of the model using the sample of model evaluation scores.

Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the holdout set 1 time and used to train the model k-1 times.
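As a minimal sketch of this property, the snippet below counts how often each example lands in the holdout and training sets. The tiny dataset of 20 rows, k=5, and the fixed random seed are all illustrative assumptions.

```python
# count how many times each example is used for testing and for training
import numpy as np
from sklearn.model_selection import KFold

n_examples, k = 20, 5  # hypothetical sizes, chosen only for illustration
X_demo = np.zeros((n_examples, 1))  # placeholder data; only the row count matters
test_counts = np.zeros(n_examples, dtype=int)
train_counts = np.zeros(n_examples, dtype=int)
kfold = KFold(n_splits=k, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X_demo):
    test_counts[test_ix] += 1
    train_counts[train_ix] += 1
# every example is held out once and used for training k-1 times
print(test_counts)
print(train_counts)
```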

An out-of-fold prediction is a prediction by the model during the k-fold cross-validation procedure.

That is, out-of-fold predictions are those predictions made on the holdout datasets during the resampling procedure. If performed correctly, there will be one prediction for each example in the training dataset.

Sometimes, out-of-fold is summarized with the acronym OOF.

  • Out-of-Fold Predictions: Predictions made by models during the k-fold cross-validation procedure on the holdout examples.
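One convenient way to obtain out-of-fold predictions in scikit-learn is the cross_val_predict() function, which returns one prediction per example, each made by a model that did not see that example during training. The sketch below assumes a small synthetic dataset and a k-nearest neighbors model purely for illustration.

```python
# collect one out-of-fold prediction per example with cross_val_predict
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# hypothetical small dataset, used only for this illustration
X_small, y_small = make_blobs(n_samples=200, centers=2, n_features=10,
                              cluster_std=20, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
oof = cross_val_predict(KNeighborsClassifier(), X_small, y_small, cv=kfold)
# there is exactly one out-of-fold prediction for each example
print(oof.shape, y_small.shape)
```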

The notion of out-of-fold predictions is directly related to the idea of out-of-sample predictions, as the predictions in both cases are made on examples that were not used during the training of the model and can be used to estimate the performance of the model when used to make predictions on new data.

As such, out-of-fold predictions are a type of out-of-sample prediction, although described in the context of a model evaluated using k-fold cross-validation.

  • Out-of-Sample Predictions: Predictions made by a model on data not used during the training of the model.

Out-of-sample predictions may also be referred to as holdout predictions.

There are two main uses for out-of-fold predictions; they are:

  • Estimate the performance of the model on unseen data.
  • Fit an ensemble model.

Let’s take a closer look at these two cases.

Out-of-Fold Predictions for Evaluation

The most common use for out-of-fold predictions is to estimate the performance of the model.

That is, predictions on data that were not used to train the model can be made and evaluated using a scoring metric such as error or accuracy. This metric provides an estimate of the performance of the model when used to make predictions on new data, such as when the model will be used in practice to make predictions.

Generally, predictions made on data not used to train a model provide insight into how the model will generalize to new situations. As such, scores that evaluate these predictions are referred to as the generalization performance of a machine learning model.

There are two main approaches to using these predictions to estimate the performance of the model.

The first is to score the model on the predictions made during each fold, then calculate the average of those scores. For example, if we are evaluating a classification model, then classification accuracy can be calculated on each group of out-of-fold predictions, then the mean accuracy can be reported.

  • Approach 1: Estimate performance as the mean score estimated on each group of out-of-fold predictions.

The second approach relies on the fact that each example appears in exactly one test set. That is, each example in the training dataset has a single prediction made during the k-fold cross-validation process. As such, we can collect all predictions, compare them to their expected outcomes, and calculate a score directly across the entire training dataset.

  • Approach 2: Estimate performance using the aggregate of all out-of-fold predictions.

Both are reasonable approaches and the scores that result from each procedure should be approximately equivalent.

Calculating the mean from each group of out-of-fold predictions may be the most common approach, as the variability of the estimate can also be summarized using the standard deviation or standard error of the fold scores.

The k resampled estimates of performance are summarized (usually with the mean and standard error) …

— Page 70, Applied Predictive Modeling, 2013.

We can demonstrate the difference between these two approaches to evaluating models using out-of-fold predictions with a small worked example.

We will use the make_blobs() scikit-learn function to create a test binary classification problem with 1,000 examples, two classes, and 100 input features.

The example below prepares a data sample and summarizes the shape of the input and output elements of the dataset.
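A minimal sketch of this step is shown here. The cluster_std value and the fixed random_state are assumptions, chosen so that the classes overlap and the run is repeatable.

```python
# create a synthetic binary classification dataset and summarize its shape
from sklearn.datasets import make_blobs

# cluster_std=20 and random_state=1 are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
print(X.shape, y.shape)
```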

Running the example prints the shape of the input data showing 1,000 rows of data with 100 columns or input features and the corresponding classification labels.

Next, we can use k-fold cross-validation to evaluate a KNeighborsClassifier model.

We will use k=10 for the KFold object, a sensible and widely used value (note that it is not the scikit-learn default for n_splits, which is 5 as of v0.22), fit a model on each training dataset, and evaluate it on each holdout fold.

Accuracy scores will be stored in a list across each model evaluation, and we will report the mean and standard deviation of these scores.

The complete example is listed below.
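A minimal sketch of this evaluation might look as follows. The dataset parameters and the fixed random seeds are assumptions, and the model uses the default KNeighborsClassifier configuration.

```python
# evaluate a model by scoring each holdout fold, then averaging the scores
from numpy import mean, std
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# dataset settings and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
scores = []
for train_ix, test_ix in kfold.split(X):
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # fit a fresh model on the training folds
    model = KNeighborsClassifier()
    model.fit(train_X, train_y)
    # score the out-of-fold predictions for this fold
    yhat = model.predict(test_X)
    acc = accuracy_score(test_y, yhat)
    scores.append(acc)
    print('> %.3f' % acc)
# summarize the sample of fold scores
print('Mean: %.3f, Standard Deviation: %.3f' % (mean(scores), std(scores)))
```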

Running the example reports the model classification accuracy on the holdout fold for each iteration.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

At the end of the run, the mean and standard deviation of the accuracy scores are reported.

We can contrast this with the alternate approach that evaluates all predictions as a single group.

Instead of evaluating the model on each holdout fold, predictions are made and stored in a list. Then, at the end of the run, the predictions are compared to the expected values for each holdout test set and a single accuracy score is reported.

The complete example is listed below.
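A minimal sketch of this aggregate approach is given here, again assuming the same dataset parameters and fixed random seeds as an illustration.

```python
# evaluate a model by pooling all out-of-fold predictions into one score
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# dataset settings and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
data_y, data_yhat = [], []
for train_ix, test_ix in kfold.split(X):
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    model = KNeighborsClassifier()
    model.fit(train_X, train_y)
    # store expected and predicted labels for the holdout fold
    yhat = model.predict(test_X)
    data_y.extend(test_y)
    data_yhat.extend(yhat)
# one accuracy score across every out-of-fold prediction
acc = accuracy_score(data_y, data_yhat)
print('Accuracy: %.3f' % acc)
```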

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example collects all of the expected and predicted values for each holdout dataset and reports a single accuracy score at the end of the run.

Again, both approaches are comparable and it may be a matter of taste as to the method you use on your own predictive modeling problem.

Out-of-Fold Predictions for Ensembles

Another common use for out-of-fold predictions is to use them in the development of an ensemble model.

An ensemble is a machine learning model that combines the predictions from two or more models prepared on the same training dataset.

This is a very common procedure to use when working on a machine learning competition.

The out-of-fold predictions in aggregate provide information about how the model performs on each example in the training dataset when not used to train the model. This information can be used to train a model to correct or improve upon those predictions.

First, the k-fold cross-validation procedure is performed on each base model of interest, and all of the out-of-fold predictions are collected. Importantly, the same split of the training data into k folds is used for each model. Now we have one aggregated group of out-of-sample predictions for each model, i.e. one prediction for each example in the training dataset.

  • Base-Models: Models evaluated using k-fold cross-validation on the training dataset, with all out-of-fold predictions retained.

Next, a second higher-order model, called a meta-model, is trained on the predictions made by the other models. This meta-model may or may not also take the input data for each example as input when making predictions. The job of this model is to learn how to best combine and correct the predictions made by the other models using their out-of-fold predictions.

  • Meta-Model: Model that takes the out-of-fold predictions made by one or more models as input and learns how to best combine and correct the predictions.

For example, we may have a two-class classification predictive modeling problem and train a decision tree and a k-nearest neighbor model as the base models. Each model predicts a 0 or 1 for each example in the training dataset via out-of-fold predictions. These predictions, along with the input data, can then form a new input to the meta-model.

  • Meta-Model Input: Input portion of a given sample concatenated with the predictions made by each base model.
  • Meta-Model Output: Output portion of a given sample.

Why use the out-of-fold predictions to train the meta-model?

We could train each base model on the entire training dataset, then make a prediction for each example in the training dataset and use the predictions as input to the meta-model. The problem is that the predictions will be optimistic because the samples were used in the training of each base model. This optimistic bias means that the predictions will be better than they would be on unseen data, and the meta-model will likely not learn what is required to combine and correct the predictions from the base models.

By using out-of-fold predictions from the base model to train the meta-model, the meta-model can see and harness the expected behavior of each base model when operating on unseen data, as will be the case when the ensemble is used in practice to make predictions on new data.
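The optimistic bias can be illustrated with a short sketch that compares in-sample predictions with out-of-fold predictions for an unpruned decision tree. The dataset parameters and seeds are assumptions; an unpruned tree will typically reproduce its own training labels exactly, whereas its out-of-fold predictions reflect behavior on unseen data.

```python
# compare in-sample accuracy with out-of-fold accuracy for a decision tree
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# dataset settings and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
# in-sample: predict the same examples the model was trained on
model = DecisionTreeClassifier(random_state=1)
model.fit(X, y)
in_sample_acc = accuracy_score(y, model.predict(X))
# out-of-fold: each prediction comes from a model that never saw the example
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
oof_yhat = cross_val_predict(DecisionTreeClassifier(random_state=1), X, y, cv=kfold)
oof_acc = accuracy_score(y, oof_yhat)
print('In-sample: %.3f, Out-of-fold: %.3f' % (in_sample_acc, oof_acc))
```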

Finally, each of the base models is trained on the entire training dataset, and these final models and the meta-model can be used to make predictions on new data. The performance of this ensemble can be evaluated on a separate holdout test dataset not used during training.

This procedure can be summarized as follows:

  • 1. For each base model:
    • a. Use k-fold cross-validation and collect out-of-fold predictions.
  • 2. Train the meta-model on the out-of-fold predictions from all models.
  • 3. Train each base model on the entire training dataset.

This procedure is called stacked generalization, or stacking for short. Because it is common to use a linear weighted sum as the meta-model, this procedure is sometimes called blending.
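For reference, scikit-learn (v0.22 and later) provides a StackingClassifier class that follows the same general recipe: it fits the final estimator on cross-validated predictions from the base estimators, then refits the base estimators on the full training data. The sketch below is an illustration under assumed settings, not the manual procedure developed in this tutorial. Note that an integer cv uses stratified folds for classifiers, and passthrough=True also passes the raw input features to the final estimator.

```python
# stacking with the scikit-learn StackingClassifier (illustrative settings)
from sklearn.datasets import make_blobs
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# dataset settings, split size, and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
stack = StackingClassifier(
    estimators=[('cart', DecisionTreeClassifier()), ('knn', KNeighborsClassifier())],
    final_estimator=LogisticRegression(solver='liblinear'),
    cv=10,                         # folds used to produce out-of-fold predictions
    stack_method='predict_proba',  # feed probabilities to the meta-model
    passthrough=True,              # also give the meta-model the input features
)
stack.fit(X_train, y_train)
stack_acc = stack.score(X_val, y_val)
print('Stacking Accuracy: %.3f' % stack_acc)
```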

We can make this procedure concrete with a worked example using the same dataset used in the previous section.

First, we will split the data into training and validation datasets. The training dataset will be used to fit the submodels and meta-model, and the validation dataset will be held back from training and used at the end to evaluate the meta-model and submodels.

In this example, we will use k-fold cross-validation to fit a DecisionTreeClassifier and a KNeighborsClassifier model on each cross-validation fold, and use the fit models to make out-of-fold predictions.

The models will make predictions of probabilities instead of class labels in an attempt to provide more useful input features for the meta-model. This is a good practice.

We will also keep track of the input data (100 features) and output data (expected label) for the out-of-fold data.
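A minimal sketch of this collection step is shown below. The split size, dataset parameters, and seeds are assumptions, and the probability of the first class (column 0) is kept for each model.

```python
# collect out-of-fold inputs, labels, and predicted probabilities for two models
from sklearn.datasets import make_blobs
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# dataset settings, split size, and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
X, X_val, y, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
data_x, data_y, cart_yhat, knn_yhat = [], [], [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    # keep the holdout inputs and expected labels
    data_x.extend(test_X)
    data_y.extend(test_y)
    # decision tree: out-of-fold probability of the first class
    model1 = DecisionTreeClassifier()
    model1.fit(train_X, train_y)
    cart_yhat.extend(model1.predict_proba(test_X)[:, 0])
    # k-nearest neighbors: out-of-fold probability of the first class
    model2 = KNeighborsClassifier()
    model2.fit(train_X, train_y)
    knn_yhat.extend(model2.predict_proba(test_X)[:, 0])
print(len(data_x), len(data_y), len(cart_yhat), len(knn_yhat))
```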

At the end of the run, we can then construct a dataset for a meta classifier comprised of 100 input features for the input data and the two columns of predicted probabilities from the kNN and decision tree models.

The create_meta_dataset() function below implements this, taking the out-of-fold data and predictions across the folds as input and constructing the input dataset for the meta-model.
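A sketch of such a function, assuming NumPy's array() and hstack() are used to build the columns, is shown here together with a tiny made-up input to show the resulting shape.

```python
# build the meta-model input: original features plus one column per base model
from numpy import array, hstack

def create_meta_dataset(data_x, yhat1, yhat2):
    # convert each list of predictions into a single column
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    # stack the input features and both prediction columns side by side
    meta_X = hstack((data_x, yhat1, yhat2))
    return meta_X

# hypothetical values: two examples with three features each
demo_x = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
demo_meta_X = create_meta_dataset(demo_x, [0.9, 0.2], [0.8, 0.4])
print(demo_meta_X.shape)  # (2, 5)
```

The order of the two prediction arguments must match the order used later when making predictions with the stacked model, otherwise the meta-model will see the columns swapped.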

We can then call this function to prepare data for the meta-model.

We can then fit each of the submodels on the entire training dataset ready for making predictions on the validation dataset.

We can then fit the meta-model on the prepared dataset, in this case, a LogisticRegression model.

Finally, we can use the meta-model to make predictions on the holdout dataset.

This requires that the data first pass through the submodels, that their outputs be used to construct a dataset for the meta-model, and that the meta-model then be used to make a prediction. We will wrap all of this up into a function named stack_prediction() that takes the models and the data for which the prediction will be made.
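A sketch of this function is given below, followed by a small hypothetical demonstration. The demonstration uses cross_val_predict() as a compact way to obtain out-of-fold probabilities; the dataset sizes, fold count, and seeds are assumptions.

```python
# make a prediction with a stacked model: base models first, then the meta-model
from numpy import array, hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def create_meta_dataset(data_x, yhat1, yhat2):
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    return hstack((data_x, yhat1, yhat2))

def stack_prediction(model1, model2, meta_model, X):
    # base model probabilities for the first class
    yhat1 = model1.predict_proba(X)[:, 0]
    yhat2 = model2.predict_proba(X)[:, 0]
    # same column order as used when fitting the meta-model
    meta_X = create_meta_dataset(X, yhat1, yhat2)
    return meta_model.predict(meta_X)

# hypothetical demo data: 200 rows for training, 100 rows treated as new data
X_all, y_all = make_blobs(n_samples=300, centers=2, n_features=10,
                          cluster_std=20, random_state=1)
X_train, y_train, X_new = X_all[:200], y_all[:200], X_all[200:]
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
model1, model2 = DecisionTreeClassifier(), KNeighborsClassifier()
oof1 = cross_val_predict(model1, X_train, y_train, cv=kfold, method='predict_proba')[:, 0]
oof2 = cross_val_predict(model2, X_train, y_train, cv=kfold, method='predict_proba')[:, 0]
# meta-model learns from out-of-fold predictions
meta_model = LogisticRegression(solver='liblinear')
meta_model.fit(create_meta_dataset(X_train, oof1, oof2), y_train)
# final base models are fit on all training data
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
yhat_new = stack_prediction(model1, model2, meta_model, X_new)
print(yhat_new.shape)
```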

We can then evaluate the submodels on the holdout dataset for reference, then use the meta-model to make a prediction on the holdout dataset and evaluate it.

We expect that the meta-model would achieve performance on the holdout dataset that is as good as or better than any single submodel. If this is not the case, alternate submodels or meta-models could be used on the problem instead.

Tying this all together, the complete example is listed below.
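A minimal end-to-end sketch is shown here. The dataset parameters, the 33 percent validation split, the liblinear solver, and the fixed seeds are assumptions, and the decision tree column is always placed before the kNN column.

```python
# stacking with out-of-fold predictions: complete sketch
from numpy import array, hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def create_meta_dataset(data_x, yhat1, yhat2):
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    return hstack((data_x, yhat1, yhat2))

def stack_prediction(model1, model2, meta_model, X):
    yhat1 = model1.predict_proba(X)[:, 0]
    yhat2 = model2.predict_proba(X)[:, 0]
    return meta_model.predict(create_meta_dataset(X, yhat1, yhat2))

# dataset settings, split size, and seeds are assumptions made for this sketch
X, y = make_blobs(n_samples=1000, centers=2, n_features=100,
                  cluster_std=20, random_state=1)
X, X_val, y, y_val = train_test_split(X, y, test_size=0.33, random_state=1)
# collect out-of-fold data and predictions
data_x, data_y, cart_yhat, knn_yhat = [], [], [], []
kfold = KFold(n_splits=10, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    train_X, test_X = X[train_ix], X[test_ix]
    train_y, test_y = y[train_ix], y[test_ix]
    data_x.extend(test_X)
    data_y.extend(test_y)
    model1 = DecisionTreeClassifier()
    model1.fit(train_X, train_y)
    cart_yhat.extend(model1.predict_proba(test_X)[:, 0])
    model2 = KNeighborsClassifier()
    model2.fit(train_X, train_y)
    knn_yhat.extend(model2.predict_proba(test_X)[:, 0])
# construct the meta dataset from the out-of-fold data
meta_X = create_meta_dataset(data_x, cart_yhat, knn_yhat)
# fit the final submodels on the entire training dataset
model1 = DecisionTreeClassifier()
model1.fit(X, y)
model2 = KNeighborsClassifier()
model2.fit(X, y)
# fit the meta-model on the out-of-fold predictions
meta_model = LogisticRegression(solver='liblinear')
meta_model.fit(meta_X, data_y)
# evaluate the submodels and the stacked model on the validation dataset
acc1 = accuracy_score(y_val, model1.predict(X_val))
acc2 = accuracy_score(y_val, model2.predict(X_val))
print('Model1 Accuracy: %.3f, Model2 Accuracy: %.3f' % (acc1, acc2))
yhat = stack_prediction(model1, model2, meta_model, X_val)
acc = accuracy_score(y_val, yhat)
print('Meta Model Accuracy: %.3f' % acc)
```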

Running the example first reports the accuracy of the decision tree and kNN model, then the performance of the meta-model on the holdout dataset, not seen during training.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the meta-model has out-performed both submodels.

It might be interesting to try an ablation study, re-running the example with just model1, just model2, and neither model1 nor model2 as input to the meta-model, to confirm that the predictions from the submodels are actually adding value to the meta-model.

Summary

In this tutorial, you discovered out-of-fold predictions in machine learning.

Specifically, you learned:

  • Out-of-fold predictions are a type of out-of-sample prediction made on data not used to train a model.
  • Out-of-fold predictions are most commonly used to estimate the performance of a model when making predictions on unseen data.
  • Out-of-fold predictions can be used to construct an ensemble model called a stacked generalization or stacking ensemble.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

18 Responses to How to Use Out-of-Fold Predictions in Machine Learning

  1. Markus December 7, 2019 at 11:35 pm

    This is just another AWESOME blog post of yours, THANKS!

    NIT:
    I just noticed that you make use of the numpy array indexing [:, 0] to reduce the dimensions from [LENGTH, 1], to [LENGTH,] and then later in create_meta_dataset function you reshape it back to the dimension [LENGTH, 1].

    I removed all the [:, 0] indexing by the complete example as well as the following lines:

    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))

    And the example still works the same.

    • Jason Brownlee December 8, 2019 at 6:11 am

      Thanks, I’m happy it’s fun/helpful.

      Very nice! Thanks for sharing.

  2. Ismalia December 19, 2019 at 2:10 am

    Thanks for your amazing tutorials. i am interested to implement something similar to this but i get the error
    TypeError: array() argument 1 must be a unicode character, not list
    when i run the code below , How to fix this

    # create a meta dataset
    import numpy as np
    from array import array
    # create a meta dataset
    def create_meta_dataset(data_x, yhat1, yhat2):
    # convert to columns
    yhat1 = array(yhat1).reshape((len(yhat1), 1))
    yhat2 = array(yhat2).reshape((len(yhat2), 1))
    # stack as separate columns
    meta_X = hstack((data_x, yhat1, yhat2))
    return meta_X

    ##Here is where i call the function and it gives that error
    # construct meta dataset
    meta_X = create_meta_dataset(data_x, knn_yhat, cart_yhat)

    • Jason Brownlee December 19, 2019 at 6:33 am

      Sorry, I am not familiar with this error, perhaps try posting to stackoverflow?

      • Ismalia December 19, 2019 at 9:42 pm

        Thank you, i solved the problem , there was a need to add np.array and np.hstack. make sure you import numpy array. Other readers can benefit from it. Its a basic error but can be a bit frustrating

  3. Ehsan March 23, 2020 at 7:20 pm

    Thanks for this tutorial.
    Since what I know, when we put “fit method” inside a loop, previous results discard and replace with the new ones after each iteration. It means that the model fit only on the last fold of training data. Is it true? I would grateful if you make it clear for me.

    Thanks in advance

  4. Dan May 27, 2021 at 6:45 am

    Thank you very much!!!!!!

  5. TLM August 2, 2021 at 4:39 am

    Hi Jason,

    Love this site, I have learned so much over the past 18 months reading your articles.

    I followed your suggestion for further study and modified the example to not include the predictions from the submodels in the metamodel.

    After 100 runs each:

    No meta features results:
    Model1 Accuracy: 0.731, Model2 Accuracy: 0.929
    Meta Model Accuracy: 0.955

    With meta features results:
    Model1 Accuracy: 0.733, Model2 Accuracy: 0.926
    Meta Model Accuracy: 0.955

    Looking at these results, it seems unlikely to me that these submodel predictions are adding any value at all. The more likely scenario to me is that Logistic Regression is simply a better model for this problem compared to Decision Trees or K-Neighbors.

  6. Braden August 28, 2021 at 6:52 am

    Hi Jason,

    Great tutorial, especially the conceptual generalization! I do have a question about evaluating performance with a stacked model.

    Given computational limitations aren’t an issue, would it be reasonable to evaluate a stacked model’s performance using cross validation across the entire training process? That is, keeping a “meta” out-of-fold set of data, training the base models using using the all the “meta” in-fold data broken down with further cross-validation described above, then evaluating the meta model on the meta-out-of-fold, and repeating?

    • Adrian Tam August 28, 2021 at 9:44 am

      Yes, that sounds reasonable.

  7. Alexander Adamov October 6, 2021 at 10:23 pm

    Thank you for a clear and insightful tutorial!

    • Adrian Tam October 7, 2021 at 3:51 am

      Thank you. Glad you like it.
