The machine learning model that we use to make predictions on new data is called the final model.

There can be confusion in applied machine learning about how to train a final model.

This confusion is common among beginners to the field, who ask questions such as:

- *How do I predict with cross validation?*
- *Which model do I choose from cross-validation?*
- *Do I use the model after preparing it on the training dataset?*

This post will clear up the confusion.

In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.

Let’s get started.

## What is a Final Model?

A final machine learning model is a model that you use to make predictions on new data.

That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value).

For example, whether the photo is a picture of a *dog* or a *cat*, or the estimated number of sales for tomorrow.

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

- **Data**: the historical data that you have available.
- **Time**: the time you have to spend on the project.
- **Procedure**: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.

The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.

## The Purpose of Train/Test Sets

Why do we use train and test sets?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.

The training dataset is used to prepare a model, to train it.

We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.

### Let’s unpack this further

When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).

The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
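
To make the idea of evaluating a whole procedure concrete, here is a minimal sketch using only the Python standard library. The synthetic two-class dataset, the 67/33 split, and the helper name `fit_procedure` are all assumptions made for illustration; in practice you would likely lean on a machine learning library instead.

```python
import random

def fit_procedure(rows, labels, k=3):
    """Fit the whole procedure: min-max scaling followed by kNN with k neighbors."""
    n_features = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n_features)]
    hi = [max(r[j] for r in rows) for j in range(n_features)]

    def scale(row):
        # The scaling statistics come from the training rows only.
        return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]

    train = [(scale(r), y) for r, y in zip(rows, labels)]

    def predict(row):
        q = scale(row)
        nearest = sorted(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], q)))[:k]
        votes = [y for _, y in nearest]
        return max(set(votes), key=votes.count)  # majority vote

    return predict

# Synthetic two-class data (an assumption for this sketch): 50 rows per class.
rng = random.Random(1)
X = [[rng.gauss(c * 3.0, 1.0), rng.gauss(c * 3.0, 1.0)] for c in (0, 1) for _ in range(50)]
y = [c for c in (0, 1) for _ in range(50)]

# Shuffle the row indices and hold back about a third as the test set.
indices = list(range(len(X)))
rng.shuffle(indices)
cut = int(0.67 * len(indices))
train_idx, test_idx = indices[:cut], indices[cut:]

# Fit on the training rows, then compare predictions to the withheld test outputs.
model = fit_procedure([X[i] for i in train_idx], [y[i] for i in train_idx], k=3)
correct = sum(model(X[i]) == y[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(f"Estimated accuracy on unseen data: {accuracy:.3f}")
```

Note that the score belongs to the scaling, the algorithm, and the value of k together, not to the algorithm alone.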

We generalize the performance measure from:

- “*the skill of the procedure on the **test set***”

to

- “*the skill of the procedure on **unseen data***”.

This is quite a leap and requires that:

- The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
- The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
- The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.
- The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).

A lot rides on the estimated skill of the whole procedure on the test set.

In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.

The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
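
One way to see this variance for yourself is to repeat the split with different random seeds and look at the spread of scores. The sketch below is only an illustration: the small synthetic dataset, the 40/20 split, and the nearest-centroid classifier are assumptions chosen to keep the code short.

```python
import random
import statistics

def nearest_centroid_accuracy(train, test):
    """Fit a nearest-centroid classifier on train and return its accuracy on test."""
    centroids = {}
    for label in {y for _, y in train}:
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    hits = sum(min(centroids, key=lambda c: abs(x - centroids[c])) == y for x, y in test)
    return hits / len(test)

# Small synthetic dataset (an assumption): one input feature, 30 rows per class.
data_rng = random.Random(7)
data = [(data_rng.gauss(label * 1.5, 1.0), label) for label in (0, 1) for _ in range(30)]

# Repeat the same train/test evaluation with ten different shuffles.
scores = []
for seed in range(10):
    rows = data[:]
    random.Random(seed).shuffle(rows)
    train, test = rows[:40], rows[40:]
    scores.append(nearest_centroid_accuracy(train, test))

print("scores:", [round(s, 2) for s in scores])
print("min=%.2f max=%.2f stdev=%.2f" % (min(scores), max(scores), statistics.stdev(scores)))
```

The procedure and the data are identical in every run; only the split changes.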

Often, time permitting, we prefer to use k-fold cross-validation instead.

## The Purpose of k-fold Cross Validation

Why do we use k-fold cross validation?

Like a train-test split, cross-validation is a method for estimating the skill of a procedure on unseen data.

Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.

This, in turn, provides a population of performance measures.

- We can calculate the mean of these measures to get an idea of how well the procedure performs on average.
- We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.
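
The two calculations above can be sketched in a few lines of standard-library Python. The helper names (`k_fold_indices`, `fit_line`, `mean_abs_error`), the synthetic regression data, and the choice of k=5 are assumptions for illustration only.

```python
import random
import statistics

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def mean_abs_error(model, points):
    a, b = model
    return statistics.fmean(abs(y - (a * x + b)) for x, y in points)

# Synthetic regression data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(3)
xs = [rng.uniform(0, 10) for _ in range(50)]
data = [(x, 2.0 * x + 1.0 + rng.gauss(0, 1.0)) for x in xs]

# Each fold takes one turn as the test set; a new model is fit on the rest.
folds = k_fold_indices(len(data), k=5)
scores = []
for i, test_idx in enumerate(folds):
    train = [data[j] for f, fold in enumerate(folds) if f != i for j in fold]
    test = [data[j] for j in test_idx]
    scores.append(mean_abs_error(fit_line(train), test))

print("MAE: mean=%.3f, stdev=%.3f" % (statistics.mean(scores), statistics.stdev(scores)))
```

Five models are created and scored here, and none of them is kept; only the scores matter.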

This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.

Also, this information is invaluable, as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
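
As a rough sketch of that idea, the snippet below turns a list of scores into an approximate 95% interval for the mean. The scores are made-up values for illustration, not real results. The calculation assumes the scores are roughly Gaussian and independent, which is only approximately true for cross-validation folds, and with so few scores a t-quantile would give a somewhat wider interval.

```python
import statistics
from statistics import NormalDist

# Hypothetical accuracy scores from a 10-fold cross-validation (illustrative values only).
scores = [0.81, 0.79, 0.84, 0.80, 0.77, 0.83, 0.82, 0.78, 0.80, 0.86]

mean = statistics.mean(scores)
stdev = statistics.stdev(scores)             # sample standard deviation of the scores
z = NormalDist().inv_cdf(0.975)              # about 1.96 for a two-sided 95% interval
half_width = z * stdev / len(scores) ** 0.5  # z times the standard error of the mean

lower, upper = mean - half_width, mean + half_width
print(f"mean accuracy {mean:.3f}, approximate 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the scores are clearly not Gaussian, reporting the min, max, and median is a safer summary.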

Both train-test splits and k-fold cross validation are examples of resampling methods.

## Why do we use Resampling Methods?

The problem with applied machine learning is that we are trying to model the unknown.

On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.

We don’t have new data, so we have to pretend with statistical tricks.

The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.

In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.

Once we have the estimated skill, we are finished with the resampling method.

- If you are using a train-test split, that means you can discard the split datasets and the trained model.
- If you are using k-fold cross-validation, that means you can throw away all of the trained models.

They have served their purpose and are no longer needed.

You are now ready to finalize your model.

## How to Finalize a Model?

You finalize a model by applying the chosen machine learning procedure on all of your data.

That’s it.

With the finalized model, you can:

- Save the model for later or operational use.
- Make predictions on new data.
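
Here is a minimal sketch of those steps using only the standard library. The toy dataset, the least-squares line standing in for "the chosen procedure", and the file name `final_model.pkl` are all assumptions for illustration.

```python
import os
import pickle
import statistics
import tempfile

def fit(points):
    """The chosen procedure (hypothetical): a least-squares line y = slope*x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return {"slope": slope, "intercept": my - slope * mx}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

# All available historical data (toy values): nothing is held back at this stage.
all_data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 10.8)]
final_model = fit(all_data)

# Save the final model for later or operational use.
path = os.path.join(tempfile.gettempdir(), "final_model.pkl")
with open(path, "wb") as f:
    pickle.dump(final_model, f)

# Later: load the saved model and make a prediction on a new input.
with open(path, "rb") as f:
    loaded = pickle.load(f)
new_x = 6.0
prediction = predict(loaded, new_x)
print(f"prediction for x={new_x}: {prediction:.2f}")
```

No score is calculated here. The estimate of skill came earlier, from the resampling method.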

What about the cross-validation models or the train-test datasets?

They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.

## Common Questions

This section lists some common questions you might have.

### Why not keep the model trained on the training dataset?

and

### Why not keep the best model from the cross-validation?

You can if you like.

You may save time and effort by reusing one of the models trained during skill estimation.

This can be a big deal if it takes days, weeks, or months to train a model.

However, your model will likely perform better when trained on all of the available data than on just the subset used to estimate its performance.

This is why we prefer to train the final model on all available data.

### Won’t the performance of the model trained on all of the data be different?

I think this question drives most of the misunderstanding around model finalization.

Put another way:

- If you train a model on all of the available data, then how do you know how well the model will perform?

You have already answered this question using the resampling procedure.

If well designed, the performance measures you calculate using train-test or k-fold cross validation suitably describe how well the finalized model trained on all available historical data will perform in general.

If you used k-fold cross validation, you will have an estimate of how “wrong” (or conversely, how “right”) the model will be on average, and the expected spread of that wrongness or rightness.

This is why the careful design of your test harness is so absolutely critical in applied machine learning. A more robust test harness will allow you to lean on the estimated performance all the more.

### Each time I train the model, I get a different performance score; should I pick the model with the best score?

Many machine learning algorithms are stochastic, so getting a different performance score each time you train on the same data is to be expected.

Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method.

If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
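
A small sketch of that approach is shown below. The stochastic learner (`fit_sgd`, a line fit by stochastic gradient descent with a random start and random row order), the toy data, and the choice of five models are assumptions for illustration.

```python
import random
import statistics

def fit_sgd(points, seed, epochs=20, lr=0.01):
    """Hypothetical stochastic learner: a line fit by SGD with random init and row order."""
    rng = random.Random(seed)
    slope, intercept = rng.uniform(-1, 1), rng.uniform(-1, 1)
    rows = list(points)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            error = (slope * x + intercept) - y
            slope -= lr * error * x
            intercept -= lr * error
    return slope, intercept

# All available data (toy values); every final model is trained on all of it.
all_data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 10.8)]
final_models = [fit_sgd(all_data, seed) for seed in range(5)]

# Predict with each final model, then take the mean of the predictions.
new_x = 6.0
predictions = [slope * new_x + intercept for slope, intercept in final_models]
ensemble_prediction = statistics.mean(predictions)

print("individual:", [round(p, 2) for p in predictions])
print("ensemble mean: %.2f" % ensemble_prediction)
```

The individual models differ only in their random seed, and the mean prediction smooths out those differences.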


## Summary

In this post, you discovered how to train a final machine learning model for operational use.

You have overcome obstacles to finalizing your model, such as:

- Understanding the goal of resampling procedures such as train-test splits and k-fold cross validation.
- Model finalization as training a new model on all available data.
- Separating the concern of estimating performance from finalizing the model.

Do you have another question or concern about finalizing your model that I have not addressed?

Ask in the comments and I will do my best to help.

Hi Jason,

Thank you for this very informative post. I have a question regarding the train-test split for classification problems: Can we perform a train/test split in a stratified way for classification, or does this introduce what is called data snooping (a biased estimate of test error)?

Thanks

Elie

The key is to ensure that fitting your model does not use any information about the test dataset, including min/max values if you are scaling.

“Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance on a machine learning procedure in practice.”

I have to assume a normal distribution for that, right? But is this always the case? Or should I normalize my data in a preprocessing step, so that it would then be correct to assume that? Thanks

Hi Dan, great question!

Yes, we are assuming results are Gaussian to report results using mean and standard deviation.

Repeating experiments and gathering info on the min, max and central tendency (median, percentiles) regardless of the distribution of results is a valuable exercise in reporting on model performance.

Great post… my little experience has taught me that:

a) for classification you can use your final trained model with no risk

b) for regression, you have to rerun your model against all data (using the parameters tuned during training)

c) specifically for time series regression, you can’t use normal cross validation – it should respect the chronology of the data (always from old to new), and you have to rerun your model against all data (using the parameters tuned during training) as well, as the latest data are the crucial ones for the model to learn.

Cheers!

Thanks for the tips Kleyn.