How to Train a Final Machine Learning Model

The machine learning model that we use to make predictions on new data is called the final model.

There can be confusion in applied machine learning about how to train a final model.

This confusion is seen in beginners to the field who ask questions such as:

  • How do I predict with cross validation?
  • Which model do I choose from cross-validation?
  • Do I use the model after preparing it on the training dataset?

This post will clear up the confusion.

In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.

Let’s get started.

How to Train a Final Machine Learning Model. Photo by Camera Eye Photography, some rights reserved.

What is a Final Model?

A final machine learning model is a model that you use to make predictions on new data.

That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value).

For example, whether the photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

  • Data: the historical data that you have available.
  • Time: the time you have to spend on the project.
  • Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.

The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.

The Purpose of Train/Test Sets

Why do we use train and test sets?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.

The training dataset is used to prepare a model, to train it.

We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
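
As a concrete illustration, here is a minimal sketch of a train/test evaluation using scikit-learn. The dataset and the choice of kNN with k=3 are placeholders for your own data and chosen procedure, not part of the original example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Stand-in for your historical data.
X, y = load_iris(return_X_y=True)

# Hold back a third of the rows as a pretend "unseen" test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Train the chosen procedure (here kNN with k=3) on the training set only.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Compare predictions to the withheld outputs to estimate skill on unseen data.
predictions = model.predict(X_test)
print('Estimated accuracy: %.3f' % accuracy_score(y_test, predictions))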

Let’s unpack this further.

When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).

The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.

We generalize the performance measure from:

  • the skill of the procedure on the test set

to

  • the skill of the procedure on unseen data.

This is quite a leap and requires that:

  • The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
  • The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
  • The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values (see the sketch after this list).
  • The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).
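
For instance, the data preparation point above might look like the following minimal sketch, where a MinMaxScaler stands in for whatever preparation you chose; the values are purely illustrative.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.0], [5.0], [10.0]])  # historical inputs (illustrative)
X_new = np.array([[7.5]])                   # a new, unseen input

# The scaling coefficients (min/max) are learned from the training data only...
scaler = MinMaxScaler()
scaler.fit(X_train)

# ...so the preparation is repeatable on new data and reversible afterwards.
X_new_scaled = scaler.transform(X_new)
X_new_restored = scaler.inverse_transform(X_new_scaled)
print(X_new_scaled, X_new_restored)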

A lot rides on the estimated skill of the whole procedure on the test set.

In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.

The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.

Often, time permitting, we prefer to use k-fold cross-validation instead.

The Purpose of k-fold Cross Validation

Why do we use k-fold cross validation?

Cross-validation is another method for estimating the skill of a procedure on unseen data, just like a train-test split.

Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.

This, in turn, provides a population of performance measures.

  • We can calculate the mean of these measures to get an idea of how well the procedure performs on average.
  • We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.

This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.

Also, this information is invaluable, as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
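
As a rough sketch, again with an illustrative dataset and model, estimating the mean and spread of a procedure's skill with 10-fold cross-validation might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

# Evaluate the whole procedure on 10 different train/test subsets of the data.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv)

# The mean estimates average skill; the standard deviation estimates its spread.
print('Accuracy: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))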

Both train-test splits and k-fold cross validation are examples of resampling methods.

Why do we use Resampling Methods?

The problem with applied machine learning is that we are trying to model the unknown.

On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.

We don’t have new data, so we have to pretend with statistical tricks.

The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.

In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.

Once we have the estimated skill, we are finished with the resampling method.

  • If you are using a train-test split, that means you can discard the split datasets and the trained model.
  • If you are using k-fold cross-validation, that means you can throw away all of the trained models.

They have served their purpose and are no longer needed.

You are now ready to finalize your model.

How to Finalize a Model?

You finalize a model by applying the chosen machine learning procedure on all of your data.

That’s it.

With the finalized model, you can:

  • Save the model for later or operational use.
  • Make predictions on new data.
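
A minimal sketch of finalizing, saving, and reusing a model is shown below; it assumes scikit-learn and joblib, and the file name and new input row are placeholders.

from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Finalize: fit the chosen procedure and configuration on ALL available data.
X, y = load_iris(return_X_y=True)
final_model = KNeighborsClassifier(n_neighbors=3)
final_model.fit(X, y)

# Save the model for later or operational use.
dump(final_model, 'final_model.joblib')

# Later, or in another process: load the model and predict on new data.
loaded_model = load('final_model.joblib')
print(loaded_model.predict([[5.1, 3.5, 1.4, 0.2]]))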

What about the cross-validation models or the train-test datasets?

They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.

Common Questions

This section lists some common questions you might have.

Why not keep the model trained on the training dataset?

and

Why not keep the best model from the cross-validation?

You can if you like.

You may save time and effort by reusing one of the models trained during skill estimation.

This can be a big deal if it takes days, weeks, or months to train a model.

Your model will likely perform better when trained on all of the available data than just the subset used to estimate the performance of the model.

This is why we prefer to train the final model on all available data.

Won’t the performance of the model trained on all of the data be different?

I think this question drives most of the misunderstanding around model finalization.

Put another way:

  • If you train a model on all of the available data, then how do you know how well the model will perform?

You have already answered this question using the resampling procedure.

If well designed, the performance measures you calculate using train-test or k-fold cross validation suitably describe how well the finalized model trained on all available historical data will perform in general.

If you used k-fold cross validation, you will have an estimate of how “wrong” (or conversely, how “right”) the model will be on average, and the expected spread of that wrongness or rightness.

This is why the careful design of your test harness is so absolutely critical in applied machine learning. A more robust test harness will allow you to lean on the estimated performance all the more.

Each time I train the model, I get a different performance score; should I pick the model with the best score?

Many machine learning algorithms are stochastic, so this behavior of different performance on the same data is to be expected.

Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method.

If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
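
For example, one way to do this, sketched here with a random forest standing in for any stochastic learner, is to fit several final models on all of the data that differ only in their random seed, then average their predicted probabilities:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_new = [[5.1, 3.5, 1.4, 0.2]]  # illustrative new input

# Fit several final models on all of the data, varying only the random seed.
members = [RandomForestClassifier(n_estimators=100, random_state=seed).fit(X, y)
           for seed in range(5)]

# Average the predicted probabilities across the ensemble to reduce variance.
probabilities = np.mean([m.predict_proba(X_new) for m in members], axis=0)
print('Ensemble class prediction:', np.argmax(probabilities, axis=1))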

I talk more about this in a separate post.

Summary

In this post, you discovered how to train a final machine learning model for operational use.

Specifically, you learned:

  • The goal of resampling procedures such as train-test splits and k-fold cross-validation.
  • That finalizing a model means training a new model on all available data.
  • That estimating performance is a separate concern from finalizing the model.

Do you have another question or concern about finalizing your model that I have not addressed?
Ask in the comments and I will do my best to help.

299 Responses to How to Train a Final Machine Learning Model

  1. Avatar
    Elie Kawerk March 17, 2017 at 7:06 am #

    Hi Jason,

    Thank you for this very informative post. I have a question regarding the train-test split for classification problems: Can we perform a train/test split in a stratified way for classification, or does this introduce what is called data snooping (a biased estimate of test error)?

    Thanks
    Elie

    • Avatar
      Jason Brownlee March 17, 2017 at 8:33 am #

      The key is to ensure that fitting your model does not use any information about the test dataset, including min/max values if you are scaling.

    • Avatar
      Indunil July 4, 2018 at 3:03 am #

      How to save a final model in TensorFlow and use it in TensorFlow.js?

  2. Avatar
    Dan March 18, 2017 at 5:59 am #

    “Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance on a machine learning procedure in practice.”

    I have to assume a normal distribution for that, right? But is this always the case? Or should I normalize my data in a preprocessing step, and then would it be correct to assume that? Thanks

    • Avatar
      Jason Brownlee March 18, 2017 at 7:53 am #

      Hi Dan, great question!

      Yes, we are assuming results are Gaussian to report results using mean and standard deviation.

      Repeating experiments and gathering info on the min, max and central tendency (median, percentiles) regardless of the distribution of results is a valuable exercise in reporting on model performance.

  3. Avatar
    Kleyn Guerreiro March 20, 2017 at 10:36 pm #

    Great post….my little experience taught me that:
    a) for classification you can use your final trained model with no risk
    b) for regression, you have to rerun your model against all data (using the parameters tuned during training)
    c) specifically for time series regression, you can’t use normal cross validation – it should respect the chronology of the data (from old to new always) and you have to rerun your model against all data (using the parameters tuned during training) as well, as the latest data are the crucial ones for the model to learn.
    Cheers!

    • Avatar
      Jason Brownlee March 21, 2017 at 8:40 am #

      Thanks for the tips Kleyn.

      • Avatar
        jiggy August 5, 2019 at 9:17 pm #

        Hi Jason
        So, my question is how to predict the time series data not only for the test set but also for further future forecasts?
        As I cannot use the saved model to predict the time series data.

        Thank you

        • Avatar
          Jason Brownlee August 6, 2019 at 6:36 am #

          Fit a final model on all available data, save it, load it and use it to make predictions.

          If new data becomes available, you must decide whether you want to refit the model or use it as is.

  4. Avatar
    Hank May 12, 2017 at 5:16 am #

    Great post! I really learned a lot from your post and applied it to my academic project. However, there are few questions still in my mind. In our project, we want to compare different machine algorithms with and without 10-fold cv, including logistics regression, SVM, random forest, and ANN. We can get the cv score of each model with 10-fold cross validation, but the problem is how can we get the final model with 10-fold? Does the cross-validation function as finding best parameter of the different model? (such determine k in kNN?) I am still a little bit confused about the purpose of cross-validation. Thanks

    • Avatar
      Jason Brownlee May 12, 2017 at 7:49 am #

      Hi Hank, the above directly answers this question.

      Cross-validation is a tool to help you estimate the skill of models. We calculate these estimates so we can compare models and configs.

      After we have chosen a model and it’s config, we throw away all of the CV models. We’re done estimating.

      We can now fit the “final model” on all available data and use it to make predictions.

      Does that make sense?
      Please ask more questions if this is not clear. This is really important to understand and I thought I answered all of this in the post.

      • Avatar
        Hank May 12, 2017 at 3:20 pm #

        Hi Jason,

        Thank you so much! Does that mean cross-validation is just a tool to help us compare different models based on cross-validation score?
        After we are done with evaluation, we would apply the original model to the whole dataset and make predictions. I read a paper where the author compares AUC, true positive rate, true negative rate, false positive rate and false negative rate between those models with and without cross-validation. It turns out that logistic regression with 10-fold performs best. So I thought we would apply logistic regression with 10-fold to the test data. Is my understanding incorrect? Thanks!

        • Avatar
          Jason Brownlee May 13, 2017 at 6:12 am #

          Yes, CV is just a tool to compare configs for a model or compare models.

          • Avatar
            Noushan Farooqi June 12, 2020 at 7:26 am #

            Hi Jason,

            Still trying to wrap my head around the CV concept. So essentially, let’s say we have a data set and we want to use an xgboost model. So all CV really does is create 10 different variations of train-test splits of same data set, test them on 10 different model of xgboost, give you a mean performance score and then discard all the 10 models and variations of data sets that were created. After this, you sort of like start fresh and fit your model on the entire data set where the model has no past memory of it having being test on data historically. Is my understanding correct?

          • Avatar
            Jason Brownlee June 12, 2020 at 11:12 am #

            Correct. But we will now have an estimate of how well we expect the model or modeling pipeline to perform on average.

      • Avatar
        Josefine Wilms January 20, 2021 at 7:15 pm #

        Hi Jason

        I was under the impression that, during k-fold cross validation, one of the folds are used to tune the hyperparameters. The performance on this fold is therefore not equivalent to performance on unseen data since the unseen data is used to improve the model.

        I would recommend that you use k-folding on your training set to not have a validation set that might not be representative of the entire data. But, this is not a real indication of your model performance, this is only a manner in which to obtain the best hyperparameters.

        Once you have obtained these optimal hyperparameters, you retrain the model on the entire training data (the hold-out test set is still kept safely locked away). Then you report the performance on this hold-out test set.

        If you fear that the test set is maybe not representative of the training set, you can do one of two things:
        1) Compare the distribution of your test set values to that of your training set.
        2) Repeat the entire training and testing procedure, described above, multiple times. Each time randomly selecting a training and testing set and report the average performance on these test sets.

        Would you agree with the above?

        • Avatar
          Jason Brownlee January 21, 2021 at 6:47 am #

          You can tune the model within cv if you like, it is called nested cv and you can learn more here:
          https://machinelearningmastery.com/nested-cross-validation-for-machine-learning-with-python/

          Nested cv is not typical, it is a more advanced technique. A typical approach might be to first evaluate and compare algorithms, choose one, then tune it.

          You can use any procedure you want – that gives you confidence you are choosing an effective model for your problem.

          • Avatar
            Sandip Khaire February 25, 2021 at 9:14 pm #

            Hi Jason ,
            This means,
            Let’s say for my data i will do 10-fold CV for multiple algos say logistics regression,Random Forest,Xgb,SVM model etc .Then by comparing CV scores i will choose algo which gives me better expected performance .
            Say in this case Random Forest is doing better in CV evaluation
            Then i will do a again 10-fold cv for tuning hyperparameters for Random forest and then select the best hyperparameters basis best cv scores.

            and finally then i will train my random forest model with tuned hyperparameters on entire data and then will use it for prediction on unseen data …
            Is this right procedure to come up for final model? Correct me if i am wrong..

          • Avatar
            Jason Brownlee February 26, 2021 at 4:57 am #

            Sounds good.

          • Avatar
            Anil October 19, 2021 at 2:29 pm #

            Hi Jason,
            Thanks for very informative and critically important post.
            I am still having a query.
            As Sandip Khaire, also mentioned we tuned our hyperparameters on the basis of cross validation scores, now reporting the estimating skill or model performance on the same validation won’t be biased? I got the idea of final model but for reporting the model performance, which data set we should use. Is it ok to report performance of cross validation itself although parameters were tuned on it or do we need to keep a unseen test separate from training and cross validation data for this purpose?

          • Avatar
            Adrian Tam October 20, 2021 at 10:19 am #

            I believe this post explained the correct procedure: https://machinelearningmastery.com/training-validation-test-split-and-cross-validation-done-right/

  5. Avatar
    Petros May 12, 2017 at 10:55 am #

    Hi Jason,

    Great post.

    It took me awhile to get this but when the penny dropped about 18 months ago it was liberating. I liken cross validation to experimenting a process which you want to emulate against all your train data. One idea though.

    When you cross validate you might say 10 folds of 3 repeats for each combination of parameters. Now say with whatever measure you are taking for accuracy you typically taken the mean from these 30. Is it sensible to bootstrap with replacement, particularly if it is not Gaussian, from this sample of 30 say 1000 times and from their calculate the median and 2.5/97.5 percentiles?

    What does everyone else think!

    PK

    • Avatar
      Jason Brownlee May 13, 2017 at 6:09 am #

      Yes, I like to use the bootstrap + empirical confidence intervals to report final model skill.

      I have a post showing how to do this scheduled for later in the month.

    • Avatar
      Noushan Farooqi June 12, 2020 at 6:36 pm #

      Awesome, thanks. Could you please clarify two more things-

      1) How is cross validation score different from let’s say other performance metrics (confusion matrix/MSE)? What is the cross validation score based on as in what is it actually – the error? some type of unique metric?

      2) In one of your articles that I later came across yesterday, you mentioned that during CV for lets say xgboost, if k=10, then 10 variations of the original dataset is tested on the same xgboost model. However, in my previous comment, my understanding was that 10 variations are tested on 10 models of xgboost and you did not refute that comment. So I’m just trying to understand is the same model exposed to 10 variations of dataset or are 10 different models exposed to each of the unique variation from K fold CV.

      Thanks a lot,

  6. Avatar
    Warren van Niekerk May 12, 2017 at 2:56 pm #

    Thanks for the very informative post. Just one question: When you train the final model, are you learning a completely new model or is some or all of the value of the previously learned models somehow retained?

    • Avatar
      Jason Brownlee May 13, 2017 at 6:11 am #

      Yes, generally, you are training an entirely new model. All the CV models are discarded.

  7. Avatar
    Muralidhar SJ May 12, 2017 at 6:53 pm #

    Thanks Jason. Very Useful info & insight , helping lot to take right approach .

  8. Avatar
    Imene May 14, 2017 at 6:57 am #

    Thank you very much Jason. I found in this post answers to many questions.

  9. Avatar
    issam May 14, 2017 at 8:42 am #

    Hi Jason
    I want to thank you for this informative post. I am working on a project on “emotion recognition in images” and I want to know how I can create my model and train it.

    thanks in advance

  10. Avatar
    EN MO May 14, 2017 at 5:37 pm #

    Very informative, thanks alot, am also trying to see if this will be useful in a project I would like to do, and how it can be applied in biometrics and pattern recognition

  11. Avatar
    Ras May 17, 2017 at 10:38 pm #

    Thanks for the article. What about the parameters? You will likely do tuning on a development set or via cross-validation. The optimum parameter set you find is the best for that particular split or fold. Wouldn’t it be left to chance whether our optimized parameters are also optimal on the whole training data?

    • Avatar
      Jason Brownlee May 18, 2017 at 8:37 am #

      Hi Ras,

      k-fold cross-validation is generally the best practice for using the training dataset to find a “good” configuration of a model.

      Does that help? Is that clearer?

  12. Avatar
    lalneirem May 23, 2017 at 7:28 pm #

    thanks for this post
    i know this may be useful but i don’t know what we do in training phase using KNN
    if u can write the details step that is done during training phase
    i will be so grateful

  13. Avatar
    aquaq June 1, 2017 at 6:29 pm #

    Thanks for this post, it has given a clear explanation for most of my questions. However, I still have one question: if I have used undersampling during CV, how should I apply it to my whole data? To be clearer:
    – I have a training set of around 1 million positive (+) and 130 thousand negative (-) examples. I also have an independent test data set with a hundred thousand (+) and 4000 (-) examples.
    – I have estimated performance with 10-fold CV and applied undersampling (I have used the R glmnet package, logit regression with LASSO, training for AUC). It gave me super results for the CV.

    And now I’m lost a bit. Training for all data would mean to randomly select 130 thousand (+) from the 1 million and only use this ~260 thousand examples? Should I evaluate my model after training on my test data set?

    Thank you for your help!

    • Avatar
      Jason Brownlee June 2, 2017 at 12:56 pm #

      If you can, I would suggest evaluating the model on all data and see if skill improves.

      In fact, it is a good idea to understand the data set size and model skill relationship to find the point of diminishing returns.

  14. Avatar
    Chayanika Mudiar June 19, 2017 at 10:19 pm #

    I have a question. In the training process using Gaussian naive Bayes, can you say what steps need to be taken to train the model?

  15. Avatar
    Tyrone July 7, 2017 at 5:40 pm #

    Hi Jason. Thanks for a great article!

    When you say that “You finalize a model by applying the chosen machine learning procedure on all of your data”, does this mean that before deploying the model you should train a completely new model with the best hyperparameters from the validation phase, but now using training data + validation data + testing data, i.e. including the completely unseen testing data that you had never touched before?

    This is how I interpret it, and it makes sense to me given that the whole point of validation is to estimate the performance of a method of generating a model, rather than the performance of the model itself. Some people may argue, though, that because you’re now training on previously unseen data, it is impossible to know how the new trained model is actually performing and whether or not the new, real-world results will be in line with those estimated during validation and testing.

    If I am interpreting this correctly, is there a good technical description anywhere for why this works in theory, or a good explanation for convincing people that this is the correct approach?

    • Avatar
      Jason Brownlee July 9, 2017 at 10:37 am #

      Yes. Correct.

      Yes. The prior results are estimates of the performance of the final model in practice.

      • Avatar
        Tyrone July 10, 2017 at 11:19 pm #

        Thanks Jason. It’s great to have confirmation of that. Do you know of any published papers or sources out there that spell this out explicitly or go into the theory as to why this is theoretically sound?

  16. Avatar
    aquaq July 25, 2017 at 10:56 pm #

    Thanks Jason for this explanation. I would like to ask how to deal with test sets when I would like to compare the performance of my model to existing models. Do I have to hold out a test set, train my model on the remaining data and compare all models using my test set?
    After that, can I merge this held out set to my original training set and use all data for training a final model?
    What other solutions can be used?

    • Avatar
      Jason Brownlee July 26, 2017 at 7:54 am #

      Yes. Choose models based on skill on the test set. Then re-fit the model on all available data (if this makes sense for your chosen model and data).

      Does that make sense?

  17. Avatar
    Paul August 3, 2017 at 10:21 am #

    Thank you for the great post Jason.
    I have a question about forecasting unseen data in RNN with LSTM.
    I’ve built complete model using RNN with LSTM by using the post(https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras).
    How can we forecast unseen data(like ahead of current) from complete model?
    I mean we don’t have any base data except time though.

    I already saw some comments that you replied “You can make predictions on new data by calling Y = model.predict(X)” on that post. However, I couldn’t understand.. :'(

    • Avatar
      Paul August 3, 2017 at 10:30 am #

      I mean in real-time. 🙂

      Thanks in advance.

      Best,
      Paul

      • Avatar
        Jason Brownlee August 4, 2017 at 6:47 am #

        In real time, the same applies, but you can decide whether you re-train a new model, update the model or do nothing and just make predictions.

    • Avatar
      Jason Brownlee August 4, 2017 at 6:46 am #

      You can predict the next step beyond the available data by training the model on all current data, then calling predict with whatever input your model takes taken from the end of the training data.

      Does that help?

      Which part is confusing?

  18. Avatar
    S H September 8, 2017 at 11:45 pm #

    Hi Jason.

    Thanks a lot for this great and informative post. I have 2 questions I would be thankful if you can help me with them:

    1- Is that possible to refresh (update) a model without retraining it in full? To elaborate more, I have a model built using 9 weeks of data (weekly snapshots). As the size of the dataset is very large, when I want to update the model on a weekly basis, it takes a lot of time. Is that possible to update the model with the new snapshot (say for week 10), without retraining the model on the whole 10 weeks (9 old snapshots + 1 new snapshot)?

    2- When I train my model and evaluate it using cross-validation, I get errors (or alternatively, I get AUCs) which are consistently better than what I get when I score serving data and test the real performance of the model. Why is that so, and how can I treat it? To elaborate more, taking the 9 snapshots explained in the first question, I use snapshot_date column as cross-validation fold column. Therefore, in each round, the algorithm uses 8 weeks of data for training, and test the model on the remaining unseen week. Therefore, I would end up with 9 different models and 9 different AUCs on the validation frame. All the AUCs are between 0.83 to 0.91. So I would expect that the real performance of the model built using whole data should be at minimum AUC 0.83. However, when I score the serving data, and the next week I assess the performance of the model, I see no better than AUC 0.78. I have experienced it for 3 weeks (3 times), so I don’t think it’s just random variation. Additionally, I am quite sure there is no data leakage and there is no future variable in my data. Also, I tune the model quite well and there is no overfitting.

    Your help is highly appreciated.

    • Avatar
      Jason Brownlee September 9, 2017 at 11:57 am #

      You can update a model. The amount of updating depends on your data and domain. I have some posts on this and more in my LSTM book.

      Model evaluation on test data is generally biased and optimistic. You may want to further refine your test harness to make the reported scores less biased for your specific dataset (e.g. fewer folds, more folds, more data, etc.; it depends on your dataset).

  19. Avatar
    Kenny October 12, 2017 at 1:05 am #

    Hello Jason, very interesting this post! what do you get

    I finished my model (with a score of 91%), but how do I do or how to evaluate this model with the new dataset?

    I have saved the model in model.pkl but in my new data (for example iris.csv) How to predict the field “species”? (in my datasets do I need to put this field blank?) how is this step?

    Thks for your help because I’m confused

    • Avatar
      Jason Brownlee October 12, 2017 at 5:33 am #

      Load the model and call model.predict(X) where X is your new input data.
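
      For example, a minimal sketch, assuming the model was saved with pickle as model.pkl (per the question); the input file name is hypothetical:

      import pickle
      import pandas as pd

      # Load the previously saved final model.
      with open('model.pkl', 'rb') as f:
          model = pickle.load(f)

      # New data needs the same input columns used in training,
      # without the "species" column (that is what the model predicts).
      X_new = pd.read_csv('new_iris_inputs.csv')
      print(model.predict(X_new))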

  20. Avatar
    Prasshanth VP January 16, 2018 at 10:05 am #

    Hi Jason – Great post. This cleared things for me in settling with a final model.

    fitControl <- trainControl(
      method = "repeatedcv",
      number = 10,
      savePredictions = 'final',
      verboseIter = T,
      summaryFunction = twoClassSummary,
      classProbs = T)

    glm_fit <- caret::train(dv ~ ., data = dataset,
      method = "glm", family = binomial,
      trControl = fitControl, metric = "ROC")

    It says that glm_fit now becomes the final model, as it runs 10-fold CV based on trControl and finally trains the model using the entire data. Setting verboseIter = T gives me a summary during the run and a message at the end – "Fitting final model on full training set". So can I use this as the final model?

  21. Avatar
    Martin Main January 18, 2018 at 2:38 am #

    Hi there,
    This article makes a lot of sense, but one thing I am surprised was not addressed was the problem of over-fitting. If there is no test/validation data used in the final model generation, and if the model being used has been seen to over-fit the data in testing, then we need to know when to stop training without over-fitting. A simple approach would be to guess from the ‘correct’ training times from the previous tests, but of course the final model with all data will naturally need longer training times. Is there a statistical approach we could use to determine the best time to stop training without using a validation set?

    • Avatar
      Jason Brownlee January 18, 2018 at 10:12 am #

      Concerns of overfitting are addressed prior to finalizing the model as part of model selection.

      • Avatar
        Megat Haziq March 12, 2018 at 7:06 pm #

        If I was using early stopping during k-fold cross-validation, is it correct to average the number of epoch and apply it to the finalized model? Since there is no validation set in the finalized model for early stopping, so I thought of using average number of epoch to train the final model. Please help me 🙂

  22. Avatar
    Tata March 7, 2018 at 5:13 am #

    Hi Jason! Thank you so much for this informative post.

    A little question though. If we don’t have the luxury to acquire another dataset (because it’s only for a little college project, for example), how do you apply k-fold cross validation (or test-training split) to evaluate models then?

    My understanding is that once you apply, let’s say, k-fold cross validation for choosing which model to use and then tuning the parameters to suit your need, you will run your model on another different dataset hoping the model you have built and tuned will give you your expected result.

    • Avatar
      Jason Brownlee March 7, 2018 at 6:17 am #

      You can split your original dataset prior to using CV.

  23. Avatar
    Rose March 21, 2018 at 8:46 am #

    Hi Jason,
    Glad to have found your tutorials, as these are among the best for teaching deep learning with Keras.
    I have already read the notes which people asked you questions about using k-fold cv for training a final deep model but as I am a naive in working with deep learning models I could not understand some things.
    I wanna train (or finalized) CNN,LSTM & RNN for text dataset (it is a sentiment analysis). In fact my teacher told me to apply k-fold cross validation for training=finalizing model to be able to predict the proability of belonging unseen data to each class (binary class).
    my question is this:[ is it wrong to apply k-fold cross validation to train a final deep model?]
    as I wrote commands 15 epoches run in each fold. is there any thing wrong with it?
    I am so sorry for my naive question as i am not a english native to understand perfect the above comments U all wrote about it.
    my written code is like this:
    from sklearn.model_selection import KFold
    kf = KFold(10)
    f1_lstm_kfld = []
    oos_y_lstm = []
    oos_pred_lstm = []
    fold = 0
    for train, test in kf.split(x_train):
        fold += 1
        print("Fold #{}".format(fold))
        print('train', train)
        print('test', test)
        x_train1 = x_train[train]
        y_train1 = y_train[train]
        x_test1 = x_train[test]
        y_test1 = y_train[test]
        print(x_train1.shape, y_train1.shape, x_test1.shape, y_test1.shape)

        print('Build model...')
        model_lstm = Sequential()
        model_lstm.add(Embedding(vocab_dic_size, 128))
        model_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
        model_lstm.add(Dense(1, activation='sigmoid'))

        model_lstm.compile(loss='binary_crossentropy',
                           optimizer='adam',
                           metrics=['accuracy'])

        print('Train...')
        model_lstm.fit(x_train1, y_train1,
                       batch_size=32,
                       epochs=15,
                       validation_data=(x_test1, y_test1))
        score_lstm, acc_lstm = model_lstm.evaluate(x_test1, y_test1,
                                                   batch_size=32)

    sum_f1_lstm_kfld = 0
    for i in f1_lstm_kfld:
        sum_f1_lstm_kfld = sum_f1_lstm_kfld + i
    print('sum_f1_lstm_kfld', sum_f1_lstm_kfld)
    mean_f1_lstm_kfld = sum_f1_lstm_kfld / 10
    print('mean_f1_lstm_kfld', mean_f1_lstm_kfld)
    Please guide me as i get confused.
    Thank U in advanced.
    Rose

    • Avatar
      Jason Brownlee March 21, 2018 at 3:06 pm #

      You cannot train a final model via CV.

      I recommend re-reading the above post as to why.

      • Avatar
        Rose March 23, 2018 at 7:29 am #

        Hi Jason,
        I am afraid to ask about this issue again but when I re-read the above post , I saw this sentence: Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method. If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
        What do U mean by this sentence: “create multiple final models” do you mean applying k-fold cross-validation to achieve multiple models??
        and also about this one: take the mean from an ensemble of predictions, you mean can we use “this take the mean” in finalizing model?
        I wanna train a model=finalizing model.
        You mentioned: “Why not keep the best model from the cross-validation?
        You can if you like.You may save time and effort by reusing one of the models trained during skill estimation.Your model will likely perform better when trained on all of the available data than just the subset used to estimate the performance of the model.
        what do u mean by saying above three sentences especially this one:”when trained on all of the available data than just the subset used to estimate the performance of the model?
        you mean by “training on all of the available data” the procedure which we do not use k-fold cross validation?? and you mean applying k-fold cross validation from this sentence:”just the subset used to estimate the performance of the model”?
        If I want to ask my question clearly I shold say in this way: [ I wanna train a CNN ,LSTM and RNN deep model to define a deep model inorder to estimating the proability of unseen data, what should I do? applying splitting data set into train and test or any other procedures?

        Any guidance will be appreciate.

        • Avatar
          Jason Brownlee March 23, 2018 at 8:30 am #

          I mean, if there is a lot of variance in the model predictions (e.g. due to randomness in the model itself), you can train multiple final models on all training data and use them in an ensemble.

          Sometimes training a model can take days or weeks. You may not want to retrain a model, hence, reuse a model from the time when you estimated model skill.

          Does that help?

  24. Avatar
    moses April 6, 2018 at 4:07 pm #

    can u provide a sample code for prediction

    • Avatar
      Jason Brownlee April 7, 2018 at 6:08 am #

      I have many examples on my blog for different platforms.

      Try a search and let me know if you don’t find what you’re looking for.

  25. Avatar
    Mariano April 26, 2018 at 12:44 pm #

    Do you happen to have an example with code where you train with all your data and then predict unknown future data?

  26. Avatar
    Casey May 17, 2018 at 4:46 am #

    Thanks for this post Jason, and two additional questions:

    1) Is there a peer-reviewed article that can be cited to demonstrate the validity of this approach?

    2) Do I understand correctly that if the uncertainty in the relationship derived for the training data is correctly propagated to the test data set, the “best” model can be selected based solely on cross-validation statistics? That is, goodness of fit measure for the training relationship don’t really matter?

    Thanks!

    • Avatar
      Jason Brownlee May 17, 2018 at 6:39 am #

      Of finalizing a model? There may be, I don’t know sorry. It might be tacit knowledge.

      Yes, skill estimated using a well configured k-fold cross-validation may be sufficient, but if the score is reviewed too often (e.g. to tune hyperparams), you can still overfit.

  27. Avatar
    AKBAR HIDAYATULOH May 23, 2018 at 2:41 pm #

    this post is very helpful for my final project to get better understanding,

    i have question, after done with train/test split, and next step is training with all available dataset. Do i have to use all of the dataset for data train no need to split again? and using the best hyperparameter or configurations from train/test split or cross validation before?

    Thank you very much

    • Avatar
      Jason Brownlee May 23, 2018 at 2:43 pm #

      I’m glad to hear that.

      Correct, you would use all available data with hyperparameters chosen via testing on your train/test/validation sets.

  28. Avatar
    Debasish Ghosh June 10, 2018 at 1:25 am #

    Thanks Jason for the great post. I have one question though ..

    During training I pre-process data e.g. scaling, feature reengineering etc. And then I train the model using train / validation /test set. Now I have the final model which I would like to use for prediction.

    Now my prediction system is different (written using Java and TF) and there I import the trained model – incidentally all my training code are in Keras and Python. But in my prediction system I get the data points one at a time and I have to do prediction.

    My question is how can I do the data pre-processing during prediction ? Pre-processing like scaling and feature extraction do not make sense on a single data point. With my use case prediction looks good if I accumulate all the data that I receive (unseen before), do similar pre-processing as in training, once I have quite a bit of them and then submit to the trained model for scoring. Otherwise I get very different and inaccurate results.

    Would love to hear some suggestions on how to tackle this issue.

    • Avatar
      Jason Brownlee June 10, 2018 at 6:05 am #

      Excellent question!

      The single data point must be prepared using the same methods used to prepare the training data.

      Specifically, the coefficients for scaling (min/max or mean/stdev) are calculated on the training dataset, used to scale the training dataset, then used going forward to scale any points that you are predicting.

      Does that help?

  29. Avatar
    Maria June 25, 2018 at 11:26 pm #

    Hi Jason, Thank you for the awesome tutorial.
    As I see, you emphasize on training a neural network on the entire data set without taking apart a sub-set of the whole dataset as a Test data set in order to train and finalizing a data set.
    I have already trained cnn_model on the entire data set (( I mean I did not separate some samples for test set)) but I separate 20% of whole data set as the validation set via this statement:

    ‘model_cnn.fit(x_datasetpad, y_datasetpad, validation_split=0.2, epochs=5, batch_size=32)’

    I think I made a mistake about putting ((validation_split=0.2)) in fitting network process.
    Do I remove validation set to finalizing the cnn network??
    Should I train a network on the entire data set { I mean Should I delete validation_split=0.2???}

    • Avatar
      Jason Brownlee June 26, 2018 at 6:38 am #

      Yes, remove the validation split for the final model.

      • Avatar
        Maria June 26, 2018 at 7:54 am #

        Hi Jason,
        I am so grateful for the quick answer.

  30. Avatar
    K.D.I. Madhuwantha July 4, 2018 at 10:02 pm #

    How to save a final model in TensorFlow and use it in TensorFlow.js?

    • Avatar
      Jason Brownlee July 5, 2018 at 7:43 am #

      Sorry, I don’t have tensorflow or tensorflow.js examples.

  31. Avatar
    Vaddi Ajay Kumar July 4, 2018 at 11:54 pm #

    I Read Post, all Questions & answers.So finally want to summarise and approve from you 🙂

    Ex: I have Training data 100k values, test data: 50k values.

    1. We try various models like linear regr, decision tree, random forest, neural net on K fold validation with 150k values and see what model gives better performance measure(mean error etc…). We now decided what algorithm/procedure works best on data .

    Ex: Decision Tree.

    2. Now let us run K fold validation with 150k values on Decision tree with different hyperparamter values and check what value gives better performance measure.

    3. we know what model and what hyperparameter “generally” works across the data.

    4. Let us use all the data 150K values and train final Decision tree (FDT) with hyperparameter that we selected(which worked best) previously.

    As model and hyperparameters are checked previously , the above post believes they will and should works best on the unseen data.

    My thoughts: I might take a safer approach at the end by double checking, which means rather than train on all the data that i have i will keep 5% for testing (Unseen Data) and 95% for training.

    Thank You for the Great Post . I thought this might help people who are concerned about hyper parameter tuning post model/ML procedure selection.

    • Avatar
      Jason Brownlee July 5, 2018 at 7:47 am #

      All good except the final check is redundant and could be misleading. What if skill on the 5% is poor, what do you do and why?

      • Avatar
        Vaddi Ajay Kumar July 5, 2018 at 10:02 pm #

        I thought to keep 5% as a double check but after your question i began to ponder what if skill is poor – I have two things to say.

        1. This 5% is a sample that is not representative of data . i.e.. Occurred by chance. So i should have other approach to test on representative of the data.

        2. Model is not good enough or over-fitted – Even this time i cannot come to conclusion as 5% sample may not be representative of data.

        Understood finally that Cross Fold validation is solution for above 2 points which we already did on the whole data prior and so ” final check is redundant”.

        Thank You So much Jason Brownlee.

        • Avatar
          Jason Brownlee July 6, 2018 at 6:42 am #

          Nice reasoning!

          Keep an open mind and adapt methods for your specific problem. There are no “rules”, it is an empirical discipline.

          • Avatar
            Vaddi Ajay Kumar July 6, 2018 at 9:58 pm #

            Thank You. Understood.

        • Avatar
          Sandip Khaire February 25, 2021 at 9:39 pm #

          This one is good. Now I really understand. Thanks Jason and Vaddi for such a lucid explanation.

      • Avatar
        tranquil.coder October 8, 2019 at 7:14 pm #

        In Vaddi Ajay Kumar’s step 1 and step 2 with cross validation, different algorithms (step 1) and different hyper-parameters (step 2) are evaluated with different data. Questions are:

        1.Why not just use train/test split method?at least ensure use the same train data and test data. What is the advantage of cross validation against train/test method?

        2. I know K should not be too small or too big. Some books recommend 10-fold (with no mention of how many optional values), so how do I choose train/test in 10-fold cross validation (given each part named 0,1,2…9) if the optional hyper-parameter has only 5 values according to the prior? If that is not right, how do I choose K if the optional hyper-parameter has only 5 values according to the prior? And 20 values?

        • Avatar
          Jason Brownlee October 9, 2019 at 8:09 am #

          K for k-CV is unrelated to the hyperparameters.

          CV or repeated CV give a less biased estimate of model skill than a single train/test split. K=10 has been found to be a good trade-off (less optimistic) when tested by many people on many differently sized dataset.

    • Avatar
      Luca October 10, 2019 at 10:11 am #

      @Vaddi Ajay Kumar, why did you split your data in train and test set if you always use 150k that is the sum of the two to do your computation? For what do you use the test set? Thanks.

  32. Avatar
    Ed O July 9, 2018 at 10:06 pm #

    Thank you Jason. I am trying to get probabilities of whether an employee is going to leave or stay with the company. I have 1,500 records of individuals that left the company and 500 that are currently with us. I need to get probabilities for all 500 associates that are still with us.

    The issue is that the model is technically seeing all the data that it is training on in order to get probabilities for the entire data set. I don’t have “new” data I can apply the trained model to since we know all the employees that are currently with us. How do I get probabilities for the 500 current associates without overfitting? Is it as simple as making the predictors more generic? Thank you for your advice in advance.

    • Avatar
      Jason Brownlee July 10, 2018 at 6:47 am #

      You can fit models on some of your data and evaluate it on the rest.

      Once you find a model that works, you can train it on all of your data and use it to make predictions on new data.

      I assume you have historical records for people that have stayed or left, you train on that. Then you have people now that you want to know if they will stay or leave, this is new data for which you want to generate a prediction.

  33. Avatar
    Thusitha Deepal July 15, 2018 at 2:59 am #

    I have a problem. I am trying to predict the currency exchange rate using historical data: I’m trying to predict tomorrow’s exchange rate using yesterday’s rate. I am a little bit confused. What artificial neural network should I use? And I would like to use k-fold cross validation for sampling. I would like to know your ideas.

  34. Avatar
    Luv Suneja July 26, 2018 at 1:12 am #

    Hi Jason,

    This is a fantastic article. Cleared a lot of things.

    I used to think that number of folds for k-fold is another hyperparameter to find the best model. For example we test two models with k=[3,5,7,9]. But after reading your post I guess that is pretty unnecessary as k fold validation does not choose a final model anyway.

    So, do I just pick a single value of say, k=10 and run with it?

    Thanks
    Luv

  35. Avatar
    sarah August 5, 2018 at 3:43 am #

    Hi Jason,
    I have become more knowledgeable via your tutorial.
    I wanna write a paper so i have to mention a valid resource for this point you mentioned below:
    “You finalize a model by applying the chosen machine learning procedure on all of your data.”
    in which paper or resource you have seen that we should apply whole data set for training model in order to finalizing it??
    please gimme the paper or book as a resource.
    waiting for the reply.
    Best
    Sarah

    • Avatar
      Jason Brownlee August 5, 2018 at 5:36 am #

      It is tacit knowledge, not written down in a paper.

  36. Avatar
    KALYAN August 11, 2018 at 5:20 am #

    Hello Jason,

    How to Re-Train a model with new data which is already trained with another data?

    Thanks,
    KALYAN.

  37. Avatar
    Albert Tu August 13, 2018 at 6:25 pm #

    Hi Jason,

    Great post! Thanks!
    A quick question here.
    If I have a training dataset of n=500 instances.
    Then I use 10-fold cross-validation and feature selection to identify an optimal machine learning algorithms based on this n=500 training dataset.

    What would be a reasonable number of instances in the independent testing dataset that I can use to evaluate/test a performance of this machine learning algorithm on the un-seen dataset?

    Thanks,
    Albert Tu

    • Avatar
      Jason Brownlee August 14, 2018 at 6:15 am #

      It really depends on the problem.

      Perhaps try some different sized datasets and evaluate the stability of their assessment?

  38. Avatar
    Joni August 15, 2018 at 3:15 am #

    Hello Jason,

    thank you very much for your deep and crystal clear explanations.

    I have one question about two modelling Setups:

    Setup I:

    – Split the entire data into a Training-Set (80%) and Test-Set (20%)
    – Make a 10-Fold Cross Validation on the Training-Set to find the optimal parameter configuration
    – Train the model with the determined parameter configuration on the Trainings-Set
    – Final Evaluation of the model on the Testset (“Holdout”, untouched data)

    Setup II:

    – Split the entire data into a Training-Set (80%) and Test-Set (20%)
    – Make a 10-Fold Cross Validation on the entire data set to find the optimal parameter configuration
    – Train the model with the determined parameter configuration on the Trainings-Set
    – Final Evaluation of the model on the Testset (“Holdout”, untouched data)

    The only difference between setup I and II is that I make in setup I the CV on the Training-Set and in setup II I’m doing it on the entire dataset.

    Which setup do you think is better, or do you think both approaches are valid?

    Thanks in advance!
    Kind regards from Germany
    Jonathan

    • Avatar
      Jason Brownlee August 15, 2018 at 6:11 am #

      There is no notion of better. Use an approach that gives you confidence in the findings of your experiment, enough that you can make decisions.

  39. Avatar
    vamsi August 29, 2018 at 4:25 pm #

    I would like to know how to measure the skill of the model using cross validation. Since we have k different models, how do we measure the performance? Do we get any aggregate score of performance across all models? Please explain. To be specific, I use H2O to train my model.

  40. Avatar
    Steve Tmat September 4, 2018 at 5:40 am #

    Thank you for this very informative post.

    • Avatar
      Jason Brownlee September 4, 2018 at 6:12 am #

      You’re welcome Steve. I’m happy that it helped.

  41. Avatar
    keras_tf September 7, 2018 at 12:19 pm #

    Why do you use the test data as the validation data in almost all of your examples? Are we not supposed to have two different test and validation datasets?

  42. Avatar
    John Din September 26, 2018 at 11:53 am #

    Would it help to strengthen the ability of classifier to be trained on various data sets (such as from Kaggle, like stock data, car accidents, crime data etc.)…. While we do so, we may tune underlying algorithms or math construct to deal with different issues such as over-fit, low accuracy, etc.

    • Avatar
      Jason Brownlee September 26, 2018 at 2:24 pm #

      This approach could be used to learn generic features for a class of problem. E.g. like unsupervised pre-training or an autoencoder.

      I think this approach is the future for applied ML. I have a post scheduled on this approach being used in time series forecasting with an LSTM autoencoder. Very exciting stuff.

  43. Avatar
    Harshali October 5, 2018 at 9:56 pm #

    Hey Jason,

    Very Very nice article. Archived it for my next project. Thanks for sharing such an informative articles.

  44. Avatar
    Xu Zhang October 10, 2018 at 9:53 am #

    Thank you so much for your great article.

    I understood that we should use all the data which we have to train our final model. However, when should I stop training when I train my final model with dataset including train+validation+test? Especially for a deep learning model. Let me explain it with an example:

    I have a CNN model with 100,000 examples. I will do the following procedure:
    1. I split this dataset into training data 80,000, validation data 10,000 and test data 10,000.
    2. I used my validation dataset to guide my training and hyperparameter tuning. Here I used early stopping to prevent overfitting.
    3. Then I got my best performance and hyperparameters. From early stopping setting, I got that when I trained my model 37 epochs, the losses were low and performance using test data to evaluate was good.
    4 I will finalize my model, train my final model with all my 100,000 data.

    Here is a problem. Without validation dataset, how can I know when I should stop training, that is how many epochs I should choose when I train my final models. Will I use the same epochs which are used before finalizing the model? or should I match the loss which I got before?

    I think for the machine learning models without early stopping training, they are no problems. But like deep learning models, when to stop training is a critical issue. Any advice? Thanks.

    • Avatar
      Jason Brownlee October 10, 2018 at 3:00 pm #

      Great question. It is really a design decision.

      You can re-fit on all data without early stopping, perhaps after performing a sensitivity analysis on how many epochs are required on average.

      Alternatively, you could sacrifice some data as a new validation set and refit a new final model on the remaining data, using early stopping.

      There is no single answer, find an approach that makes the most sense for your project and what you know about your model performance and variance.
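
      For example, a minimal sketch of that first option, assuming Keras/TensorFlow (the tiny model, the synthetic data, and the figure of roughly 37 epochs below are placeholders), might look like:

      # Refit a final deep learning model on all data for a fixed number of epochs
      # chosen from cross-validation, instead of relying on early stopping.
      import numpy as np
      from tensorflow import keras

      def build_model(n_features):
          # Placeholder architecture; reuse the architecture/config found during CV.
          model = keras.Sequential([
              keras.layers.Dense(32, activation="relu", input_shape=(n_features,)),
              keras.layers.Dense(1, activation="sigmoid"),
          ])
          model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
          return model

      # Synthetic stand-in for the full dataset (train + validation + test combined).
      X_all = np.random.rand(1000, 20)
      y_all = (np.random.rand(1000) > 0.5).astype("float32")

      # Suppose early stopping during CV converged after roughly 37 epochs on average.
      final_model = build_model(X_all.shape[1])
      final_model.fit(X_all, y_all, epochs=37, batch_size=32, verbose=0)
      final_model.save("final_model.h5")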

    • Avatar
      Jason Brownlee October 12, 2018 at 11:29 am #

      At the end of the day it’s a design decision that you must make based on the specifics of your problem.

      • Avatar
        Xu Zhang October 13, 2018 at 3:50 am #

        You are surely right! Thank you again.

  45. Avatar
    Tayyab October 16, 2018 at 5:21 am #

    Hi Jason Brownlee,

    I am reading your tutorials and writing the code to understand the nuts and bolts of it. I do this for two reasons: to understand applied machine learning and to learn how scikit-learn works. I had an extensive class on machine learning 1.5 years ago, and getting back to my notes I feel like I understand most of the algorithms (I need to work on statistics and probability a bit). My calculus and linear algebra are fine. I am working as a student data scientist. My pandas and NumPy knowledge is about that of a beginner, but I believe I would do well using them as needed. What would you recommend in such a situation; how should I proceed with data science? I have my own rough route, but I would love to hear your comments.

    • Avatar
      Jason Brownlee October 16, 2018 at 6:40 am #

      Work through small projects and build up your skills.

  46. Avatar
    Taylor November 28, 2018 at 11:58 am #

    Hi Jason, thank you for the wealth of info you’re sharing here! Such an awesome resource! One thing I’m having trouble with is selecting the most important variables to use in a final Random Forest model. I ran 5-fold cross-validation and was able to get feature importance values from each of the 5 models. Is it acceptable to then take the average of the 5 importance values for each feature and use that average to determine the top N features? If I then train a model on those top features and do another 5-fold cross-validation on that model, have I introduced leakage and risked overfitting? Or is it better to just take the top N features using the feature importance of one model from the initial cross-validation? Thanks for any input!

    • Avatar
      Jason Brownlee November 28, 2018 at 2:53 pm #

      Generally, the random forest will perform the feature selection for you as part of building the model, I would not expect a lift from using feature importance to select features prior to modeling with RF.

    • Avatar
      Shu July 23, 2020 at 3:34 pm #

      Hi Taylor, I know it is quite late now. But I think in your case, assuming you do not use RF (SVM for example), your approach to determining the top N features introduces leakage. You should not do CV again.
      My question here is: without doing CV again, can we just take those top N features as a feature selection method, skip CV, and jump right into the test phase? Is that valid?

  47. Avatar
    Njoud December 13, 2018 at 7:13 am #

    Thank you, Dr. Jason, for the great website, I read many of your articles and learn a lot.

    Regarding this article, you said: “Why not keep the best model from the cross-validation?
    You can if you like.”

    My questions are:

    1- In Python, how can I save one model, or the best model, returned by cross-validation? As far as I know, the function cross_validation.cross_val_score (from the scikit-learn package) doesn’t return trained models, it just returns the scores. Is there another package or function in Python that returns the 10 trained models so that I could save one of them?

    2- What about R: does cross-validation return the models so that I can save any one of them? If yes, then how?

    I read your articles related to cross-validation, I read books and I searched too, but I did not find answers yet.

    some of the links that I find about this issue:
    https://stackoverflow.com/questions/32700797/saving-a-cross-validation-trained-model-in-scikit
    https://stats.stackexchange.com/questions/52274/how-to-choose-a-predictive-model-after-k-fold-cross-validation

    appreciate your help

  48. Avatar
    Paul January 4, 2019 at 4:00 am #

    Hi Jason,
    Could you explain how nested CV comes into play here? And where in the process does hyperparameter tuning come into play as well?
    Thanks!

    • Avatar
      Jason Brownlee January 4, 2019 at 6:34 am #

      All of the models created for evaluation during CV (including nested CV and hyperparameter tuning) are discarded.

      These steps occur prior to fitting a final model.

      • Avatar
        Marzi January 15, 2020 at 5:50 am #

        Hi Jason,

        This post was the best that I found on the Internet, thank you very much.

        However, I have the same problem. Why do some people use nested (double) CV then?

        Thank you very much for your informative posts!

        • Avatar
          Jason Brownlee January 15, 2020 at 8:30 am #

          Thanks.

          CV and nested CV are both used to find a model and set of configurations.

          After that, you can fit a final model using the chosen model and config.
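
          For concreteness, a minimal nested cross-validation sketch in scikit-learn (the estimator, grid, and synthetic data below are placeholders) might look like:

          # Nested CV: the inner loop (GridSearchCV) tunes the config, the outer loop
          # (cross_val_score) estimates the skill of the whole tuning procedure.
          from sklearn.datasets import make_classification
          from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
          from sklearn.svm import SVC

          X, y = make_classification(n_samples=200, n_features=10, random_state=1)

          inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
          outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
          search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=inner_cv)

          # Estimate of the skill of the tuning procedure on unseen data.
          scores = cross_val_score(search, X, y, cv=outer_cv)
          print("Estimated skill: %.3f" % scores.mean())

          # Final model: run the same tuning procedure on all data (refit=True by
          # default), then use the best estimator to make predictions on new data.
          search.fit(X, y)
          final_model = search.best_estimator_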

  49. Avatar
    Magnus January 23, 2019 at 2:14 am #

    Hi Jason,

    Good post. So far I have been using the train/validation/test split, using the validation set to select hyperparameters and avoid overfitting, and finally evaluating the model on the test set. Based on this post, I guess I can do the same here and train the final model on all the data, or at least on the training and validation sets together? Otherwise I waste a lot of data that is never used for training.

    In k-fold CV you mention that overfitting is addressed prior to finalizing the model. Can you elaborate on this? Because there is no validation set to stop the training. I assume you still use a validation set to stop training for each fold.

    Do you have a post where you use both k-fold CV and the train/validation/test split and compare the results and derive a final model? If not, it would be very interesting.

    • Avatar
      Jason Brownlee January 23, 2019 at 8:50 am #

      Yes. Re-fit using same parameters on all data.

      Not sure what you’re referring to. CV is for estimating model performance only. If you use early stopping, a validation dataset must be used, even with the final model.

      Great suggestion, thanks.

  50. Avatar
    marta January 24, 2019 at 10:12 am #

    Dear Jason,
    first of all, thank you for helping!
    I wonder if you can provide us with some references regarding what you explained in this post.
    It would be great!
    Bye

    • Avatar
      Jason Brownlee January 24, 2019 at 1:24 pm #

      It’s too practical to be written up in a paper, if that is what you mean. Academics don’t talk about a final model for use in operations.

  51. Avatar
    Hussam April 8, 2019 at 7:07 pm #

    Hello Jason,

    Thanks for your great blog, it’s very helpful.

    I have a question about this part: “If you train a model on all of the available data, then how do you know how well the model will perform? You have already answered this question using the resampling procedure.”

    1. So is it a correct procedure to train the model on all of the available data and just use a resampling method like k-fold to estimate the accuracy of our model?

    2. If we still split the dataset into train/test sets, measure model accuracy on the test set, and also use k-fold on the train set, is there any possibility that we get a big difference between the mean k-fold accuracy and the test-set accuracy?

    Thanks!

    • Avatar
      Jason Brownlee April 9, 2019 at 6:22 am #

      Yes.

      There could be, it depends on the model and the data. Perhaps test how sensitive your model is to changes in the size of your dataset.

  52. Avatar
    zzl April 14, 2019 at 11:21 am #

    hi, Jason,

    I still wonder about k-fold cross-validation. Suppose I have a dataset D. When I do k-fold CV, should I split D into a training dataset and a test dataset first, and then split the training dataset into k folds? Or should I just split the whole dataset D into k folds?

    Another question is how to report the final confusion matrix when using k-fold CV, because I will get k confusion matrices.

  53. Avatar
    PC April 16, 2019 at 11:37 am #

    Hi Jason,
    Thanks for a great post.

    I need a clarification with the following code.

    ensemble = VotingClassifier(estimators=[
    ('m1', model1), ('m2', model2), ('m3', model3)], voting='hard')

    ensemble = ensemble.fit(X_train, Y_train)

    predictions = ensemble.predict(X_validation)

    As you have said in this post, do I have to discard the X_train and Y_train subsets created using 10-fold cross-validation when fitting the ensemble model for making predictions, or is this code correct? Do I need to use the entire dataset to fit the ensemble model for making predictions?

    Kindly help me.

    • Avatar
      Jason Brownlee April 16, 2019 at 2:25 pm #

      The code appears to define a voting ensemble using 3 models.

      It is often a good idea to use different datasets to fit the ensemble vs the submodels.

      One approach involves using the out of sample data during cross validation.
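
      For example, once the ensemble’s configuration has been evaluated, a minimal sketch of finalizing a voting ensemble on all available data (assuming scikit-learn; the member models and synthetic data are placeholders) could be:

      # Estimate the ensemble's skill with CV, then fit the final ensemble on all data.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier, VotingClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.tree import DecisionTreeClassifier

      X, y = make_classification(n_samples=300, n_features=10, random_state=1)

      ensemble = VotingClassifier(estimators=[("lr", LogisticRegression(max_iter=1000)),
                                              ("cart", DecisionTreeClassifier()),
                                              ("rf", RandomForestClassifier())],
                                  voting="hard")

      # Estimate the skill of the whole procedure first...
      print(cross_val_score(ensemble, X, y, cv=10).mean())

      # ...then discard the CV models and fit the final ensemble on all available data.
      ensemble.fit(X, y)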

  54. Avatar
    Ahmad May 15, 2019 at 1:25 am #

    Great post. It answered all the questions I couldn’t find answers to on any other website. Keep going!

  55. Avatar
    S May 27, 2019 at 6:16 pm #

    Hey Jason, nice article! Do you maybe know about more research on this topic? Any literature written on this, specifically for time series data?

    • Avatar
      Jason Brownlee May 28, 2019 at 8:11 am #

      No, it is an engineering consideration. E.g. how to use a model.

  56. Avatar
    Arno Grigorian June 6, 2019 at 9:45 am #

    Hi Jason,

    Great post. I’m fairly new to machine learning and Python and still learning. I have built a few ML models, but once I retrain a model, how do I deploy it?

    I’m not sure how to proceed in terms of saving the models, retraining the ML model on the entire training set, and deploying it in code on new/future test data. Greatly appreciated.

    • Avatar
      Jason Brownlee June 6, 2019 at 2:17 pm #

      If you are using sklearn, perhaps this will help:
      https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/

      • Avatar
        Arno Grigorian June 7, 2019 at 7:16 am #

        Jason,
        Yes, that partially helped. The part where I’m really stuck is what comes after I train/test my model, cross-validate it, and decide which model to use.

        How do I go back and essentially retrain my model on the entire dataset (all of the data, instead of just the train/test cut of the data)?

        • Avatar
          Jason Brownlee June 7, 2019 at 8:12 am #

          Collect all of the data into a single dataset and call model.fit().

          Perhaps I don’t understand the difficulty you are having?
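
          A minimal sketch of that step, assuming scikit-learn and joblib (the random arrays below stand in for your existing splits, and LogisticRegression for your chosen config), might be:

          # Combine the existing splits into one dataset, fit the final model, save it.
          import joblib
          import numpy as np
          from sklearn.linear_model import LogisticRegression

          X_train, X_test = np.random.rand(80, 5), np.random.rand(20, 5)
          y_train, y_test = np.random.randint(0, 2, 80), np.random.randint(0, 2, 20)

          X_all = np.concatenate([X_train, X_test])
          y_all = np.concatenate([y_train, y_test])

          final_model = LogisticRegression()           # the chosen algorithm and config
          final_model.fit(X_all, y_all)                # train on all available data
          joblib.dump(final_model, "final_model.pkl")  # load later and call predict()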

  57. Avatar
    Sofia June 24, 2019 at 6:18 pm #

    Thanks for the great tutorial. I have a question. I have some pictures with labels. I make 3 copies by adding Gaussian white noise to the pictures and 3 copies by making non-aligned pictures (the initial dataset was centered on our image), then I shuffle them and split them into test and train sets. The accuracy is around one. My question is: could the good accuracy be due to overlap between the train and test datasets? Should I do any refinement, or is it OK?

    • Avatar
      Jason Brownlee June 25, 2019 at 6:15 am #

      You cannot have copies of the same image in train and test, it would be an invalid evaluation of the model.

      Data augmentation is only used on the training set.

  58. Avatar
    Wu Xie June 27, 2019 at 9:54 am #

    Hi Jason,
    Recently I have been reading your excellent posts since I am a new ML learner. These posts are clear enough to convey useful information. In recent days, I have been confused and struggling to understand how cross-validation works.
    I have 500 datasets (which may be too few), and below is the procedure by which I employ ML algorithms. Please correct me if anything is wrong. It is a binary classification question.
    1. Collect data to form 500 datasets.
    2. Split the 500 datasets into training (80%, 400) and test (20%, 100) datasets.
    3. Use 10-fold cross-validation to check machine learning algorithms (Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, and AdaBoost) on the whole 500 datasets. I find AdaBoost has the highest predictive accuracy from step 3. This means that AdaBoost has better generalization capacity when applied to my question.
    4. Then I use GridSearchCV to find the best hyperparameters for AdaBoost on the training (80%, 400) datasets, and use the test (20%, 100) datasets to test the AdaBoost model with the best hyperparameters. I can get the accuracy score, confusion matrix, AUC-ROC, TPR, FPR, ROC curve, and PR curve.
    My questions:
    (1) For step 3, should it be done on the whole 500 datasets or on the training (80%, 400) datasets?
    (2) For step 4, do I need to finalize the model using the whole 500 datasets? It is for an academic question, not an industrial one.

  59. Avatar
    Krishna July 8, 2019 at 4:13 am #

    Dear Jason

    Request your opinion.

    Assume we build an initial model using, say, 1000 features. The variable importance output lists only 800 of these 1000 features.

    As I understand it, while building the final model we would use the same parameters as earlier (e.g. the optimal nrounds, etc.), but with the complete data. I also use the same random seed as was used to build the initial model.

    But what about the features: should we supply 1000 or 800 features (as obtained above) while building the final model?

    Thanks

    • Avatar
      Jason Brownlee July 8, 2019 at 8:45 am #

      You would use the same framing of the problem (inputs and outputs) as was used during your experiment.

  60. Avatar
    Smitha Rajagopal July 8, 2019 at 5:38 pm #

    Hi. There is a publicly available dataset with pre-determined train (1,75,341) and test (82,332) splits. The performance of the model is always good with this experimental approach, and I got 98% accuracy. However, in order to get a reliable assessment of the model, I combined train+test (2,57,673), then applied an 80:20 split with cross-validation and performed classification by stacking classifiers, which resulted in an accuracy of 86%. Can I infer that although the accuracy is lower, this method yields more reliable predictions than the first method? Please clarify.

    • Avatar
      Jason Brownlee July 9, 2019 at 8:06 am #

      Generally, it is a good idea to estimate the performance of a model across multiple train/test splits, ideally k-fold cross validation.

  61. Avatar
    Salomon July 19, 2019 at 7:26 am #

    Thanks Jason, very useful information on how to get your final model.
    I am working on a churn prediction model. I already have my final model done. The only thing that is a little confusing is, when I use my model to predict with new data, how do I know in what time period those predictions are likely to happen? If I use a year’s accumulation of data to train the model, and 6 months’ worth of new data to make my predictions, should I expect my predictions to happen in the next 6 months?
    NOTE: all of my predictor variables are averaged per month.
    Thanks in advance,

    • Avatar
      Jason Brownlee July 19, 2019 at 9:27 am #

      Not sure I follow, sorry. Can you elaborate?

      • Avatar
        Salomon July 20, 2019 at 12:33 am #

        For example, in my churn project I am trying to predict clients that are likely to cancel. Let’s say the model generated an output of 100 clients that are likely to cancel. Approximately when should I expect those clients to leave the company? Is it possible to know?
        Thanks for your attention

  62. Avatar
    sandipan sarkar July 24, 2019 at 4:25 am #

    Hello Jason,
    I went through all 150+ comments concerning this particular article. Based on them I can easily understand that cross-validation is a very important topic.
    As far as my understanding goes, cross-validation relates all the independent variables to each other to check their importance for model building; I hope I am correct. But this in turn also touches a little bit on multicollinearity, doesn’t it?
    Can you please clarify?
    Thanks.
    Best Regards
    Sandipan Sarkar

  63. Avatar
    Bob July 27, 2019 at 9:44 am #

    Jason,

    According to this exchange, it is claimed that you can’t report performance metrics for a model trained on the full dataset using averages over the k folds. Wouldn’t it be more appropriate to train on the full dataset and then keep a test dataset to report the model performance?
    https://stats.stackexchange.com/questions/184095/should-final-production-ready-model-be-trained-on-complete-data-or-just-on-tra

    • Avatar
      Jason Brownlee July 28, 2019 at 6:37 am #

      Your summary goes against many years of findings, and I’m not interested in debates.

      My best advice is to prepare a test harness that gives you and the project stakeholders confidence in the estimate of model performance on new data.

  64. Avatar
    James MA July 30, 2019 at 1:32 am #

    Thank you for this great post, it’s very informative.
    However, I can’t see how the K trained models can be applied for prediction.

    I have also followed this post: https://machinelearningmastery.com/train-final-machine-learning-model/

    It has mentioned that, “You finalize a model by applying the chosen machine learning procedure on all of your data.”

    But I’m sorry, I still have no idea how to finalize the model. Do I just pick the best of the K trained models? Combine the results to build a better model? Or something else?

    Can you give a practical example of finalizing a model based on k-fold cross-validation? Thanks.

    • Avatar
      Jason Brownlee July 30, 2019 at 6:17 am #

      The simple approach is this: once you have an algorithm and config that is reliable on your test harness/procedure, you fit a model with it on all available data, then use that model to start making predictions on new data.

      If you see high variance in the final model (as seen on your test harness), you can reduce this variance by fitting K models on all training data and using them together as an ensemble when making a prediction on new data.

      Does that help?
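
      A minimal sketch of that second, variance-reducing option (assuming scikit-learn; the MLP and synthetic data are stand-ins for the chosen config) could be:

      # Fit K copies of the chosen config on all training data (different seeds),
      # then average their predicted probabilities to reduce variance.
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.neural_network import MLPClassifier

      X, y = make_classification(n_samples=300, n_features=10, random_state=1)
      X_new = np.random.rand(5, 10)  # stand-in for new data

      K = 10
      members = [MLPClassifier(random_state=i, max_iter=1000).fit(X, y) for i in range(K)]

      # Ensemble prediction: average the predicted probabilities, then take the argmax.
      probs = np.mean([m.predict_proba(X_new) for m in members], axis=0)
      yhat = np.argmax(probs, axis=1)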

      • Avatar
        James MA July 30, 2019 at 8:55 pm #

        Thanks for the explanation.
        I just started self-studying machine learning with Python a few days ago; it seems I have mixed something up about k-fold cross-validation.
        Before I hit the topic of k-fold cross-validation, I learned to split the known data into training and testing data, and that we can estimate performance by comparing the predicted values with the testing data.
        So I guessed that k-fold cross-validation is used to find the best model by using different training and testing data groups. That seems to be wrong.
        Can I say that k-fold cross-validation is used to compare different algorithms/configs, and then, based on the scores, we choose the best algorithm and build the final model using that algorithm?

        Thanks a lot.

  65. Avatar
    skyrim4ever August 1, 2019 at 2:58 pm #

    I’m a little confused about k-fold CV in general.

    Initially I thought its purpose was evaluation, to validate the created model: take one particular ML model and repeat its evaluation multiple times to get multiple different scores in order to compute an average score. This is good since it avoids a “lucky” result, which can happen if model training and evaluation are done only once.

    But then here k-fold CV is used for finding hyperparameters (such as the best ML technique to train the model) before doing the actual training or evaluation.

    Also, in some sources k-fold CV is applied to the whole dataset (training set + test set), while in other sources k-fold CV is used only on the training set.

    I am quite confused… I would appreciate it if you could clarify these things about k-fold CV.

  66. Avatar
    Kenechukwu August 23, 2019 at 12:59 am #

    Hi,
    Thanks for this great post.

    I have followed the whole process of building a machine learning project. This includes standardization, feature engineering, and even saving the model in a pickle file. However, when I want to use my model to make predictions in the real world, I need to prepare the new data in the same manner as the training data. Knowing full well that my standardization was based on the values in the training data, how do I standardize my new data according to those training values?

    Does the standardization save its rules for new data? How do I access them? And can I also save some extra data into the model file? I need to do some rounding of numbers, and those numbers were used in the training set.

    How do I get all of these? Does the model save them?

    • Avatar
      Jason Brownlee August 23, 2019 at 6:33 am #

      You must use the same coefficients used to prepare the training data or the same objects if that is easier. You may need to pickle the object/coefficients as well.
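
      For instance, a minimal sketch assuming scikit-learn and joblib (the file names, data, and models are placeholders):

      # Save the fitted data-preparation object together with the model, and reuse
      # both when preparing new data for prediction.
      import joblib
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.preprocessing import StandardScaler

      X_train = np.random.rand(100, 4)
      y_train = np.random.randint(0, 2, 100)

      scaler = StandardScaler().fit(X_train)  # coefficients learned on the training data
      model = LogisticRegression().fit(scaler.transform(X_train), y_train)
      joblib.dump(scaler, "scaler.pkl")
      joblib.dump(model, "model.pkl")

      # Later, in production: load both and apply the SAME scaler to the new data.
      scaler = joblib.load("scaler.pkl")
      model = joblib.load("model.pkl")
      X_new = np.random.rand(1, 4)
      yhat = model.predict(scaler.transform(X_new))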

  67. Avatar
    Coralie August 23, 2019 at 4:59 am #

    Hi Jason,

    Thank you for your post.
    I am confused about the step where we deploy the model. Are we keeping the ‘old’ coefficients that we got from the training data and applying them to the new data? Or do we re-fit the model (that we concluded is ‘best’ through the validation steps) on the new data?

    • Avatar
      Jason Brownlee August 23, 2019 at 6:35 am #

      The coefficients prepared on the training data are the model.

      The model is used in a read-only manner to make predictions on new data.

      Does that help?

  68. Avatar
    Coralie August 23, 2019 at 11:28 pm #

    Thank you! Makes sense.

  69. Avatar
    Lalit August 28, 2019 at 11:22 pm #

    Hi Jason,

    I have trained a model and saved it, but the problem is that when I access it from my test code, the test code does not get the training model’s class names. Is there any trick by which I could solve this problem?

  70. Avatar
    Mike September 12, 2019 at 3:02 am #

    Hi Jason:
    This series on cross-validation and training a final model have been very enlightening. I found both the information and your consistency in comment responses very helpful in my development of understanding this concept. I think my biggest hang-up was in hearing statements like “once you have a model that you like…” As I now understand CV from your explanation, we are not trying to optimize the result of model training during CV (model parameters, scaling factors, features, etc. ), but instead we want to optimize the process that we apply to each fold that gives us the most consistent “score” across all folds. For example, we might try a method of feature selection based each feature’s correlation to the predicted result. Applying that to each fold may in-fact result in different features per fold based on that fold’s training data. But in the end, we are not interested in the best list of features, but the best method by which we programmatically choose those features based on how it affects the variance of results across all folds. Once we have assembled all of the methods we “like” we then apply those against the entire dataset to produce a trained and deployable model. Am I understanding this correctly?

  71. Avatar
    patrick boulay November 10, 2019 at 4:56 am #

    Jason — I emailed you a long question yesterday after not finding the answer among your many posts. Then I suddenly found the topic elsewhere, especially in this post on the “final model”, so there is no need to reply to my post of yesterday. I now understand that the sampling through splitting and cross-validation yields both a configuration and an algorithm that prove effective for the problem at hand. I am sure the application to a new dataset presents opportunities for further refinement. So, when I asked, “where does the intelligence live,” I can suggest this answer: “it lives in the refined algorithm and configuration proven through experimentation.” Thanks for providing the excellent resources that you do.

    • Avatar
      Jason Brownlee November 10, 2019 at 8:27 am #

      Thanks Patrick.

      • Avatar
        Egor December 23, 2019 at 9:36 am #

        Hi Jason. Thanks for the article.

        I recently had the following problem. I have two classifiers – Random Forest and GBoost . I use CV to find the optimal parameters. Initially I used as the final model the models trained only on the training set and I used the accuracy metrics from the test set to assess them (accuracy is around 95%).

        Recently after reading your post, I decided to train the final models on the full data set. After doing that I estimate the accuracy of the final models on the full dataset, and while GBoost accuracy is around 95%, I get 100% accuracy for the Random Forest model. In other words, the final RF model classifies all the points in the full data set correctly.

        How is this possible? Have you seen something like that?

        Thanks,
        Egor

        • Avatar
          Jason Brownlee December 24, 2019 at 6:35 am #

          You cannot estimate the accuracy of the model on data used to train it.

          You estimate accuracy using cross validation, then fit a final model. No need to evaluate it again as you already have the estimate.

  72. Avatar
    John December 23, 2019 at 12:34 am #

    Is there any reason not to just ensemble the models trained during cross-validation to produce a final model?

    Surely 10 models, each trained with a different 90% of the data, have to produce better results than 1 model trained on all of it?

    The data I work with is noisy and non-stationary, so I presume lots of models will help reduce the variance in my results.

    • Avatar
      Jason Brownlee December 23, 2019 at 6:53 am #

      Off hand: one model is simpler.

      An ensemble is good only if there is ROI – e.g. a sufficient lift in skill. This is not always the case; the lift can be neutral or even negative.

      Reduction in variance is an excellent reason also.

      Try it and see!

  73. Avatar
    MAK January 6, 2020 at 9:27 pm #

    Hello Jason Brownlee,

    I have developed my own robust regression model and have written some code for it. Now I want to validate the proposed model using cross-validation, to find the optimal values for the tuning and hyperparameters from a defined grid. Could you point me to any tutorial or code related to cross-validation that we can run manually for our own models, rather than for off-the-shelf models?
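
    As a starting point, a minimal sketch of a manual k-fold cross-validation loop built with scikit-learn’s KFold is shown below; HuberRegressor and the synthetic data are only stand-ins, and you would swap in your own robust regression model:

    # Manual k-fold cross-validation loop around a custom model.
    import numpy as np
    from sklearn.linear_model import HuberRegressor  # stand-in for your own model
    from sklearn.model_selection import KFold

    X = np.random.rand(100, 3)
    y = np.random.rand(100)

    scores = []
    kfold = KFold(n_splits=10, shuffle=True, random_state=1)
    for train_ix, test_ix in kfold.split(X):
        X_train, X_test = X[train_ix], X[test_ix]
        y_train, y_test = y[train_ix], y[test_ix]
        model = HuberRegressor()                 # swap in your own robust regression
        model.fit(X_train, y_train)
        yhat = model.predict(X_test)
        scores.append(np.sqrt(np.mean((y_test - yhat) ** 2)))  # e.g. RMSE per fold
    print("RMSE: %.3f (%.3f)" % (np.mean(scores), np.std(scores)))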

  74. Avatar
    Amir February 11, 2020 at 9:49 pm #

    Hi Jason, Thank you for this great article!

    I am still a bit confused about the final model. here is my example:

    I have 40 observations.
    1. I apply k-fold cross-validation with K=10, repeat it 50 times, and at the end I have 500 models.
    2. I calculate the mean of error metrics such as MAE, RMSE, and the standard deviation over the 500 models.
    3. I choose my best configuration based on cross-validation.
    4. I train on all 40 observations to get the final model parameters.

    So here is my question: how can I test my model to see how well it predicts? I do not have any observations left for prediction. Do I need to keep some observations for prediction, say train my final model on 30 values and then evaluate its predictions on the 10 remaining observations? If yes, how should these 10 values be selected from my dataset to be representative?

    Regards,
    Amir

    • Avatar
      Jason Brownlee February 12, 2020 at 5:46 am #

      No need to test the final model, you tested that config using cross validation already. That measure tells you how well the model is expected to perform on average.
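
      For reference, a minimal sketch of that kind of repeated k-fold estimate followed by a final fit (assuming scikit-learn; Ridge and the synthetic data are placeholders) might be:

      # Estimate skill with repeated k-fold CV, then fit the final model on all data.
      import numpy as np
      from sklearn.linear_model import Ridge
      from sklearn.model_selection import RepeatedKFold, cross_val_score

      X = np.random.rand(40, 6)   # stand-in for the 40 observations
      y = np.random.rand(40)

      cv = RepeatedKFold(n_splits=10, n_repeats=50, random_state=1)
      scores = cross_val_score(Ridge(), X, y, scoring="neg_root_mean_squared_error", cv=cv)
      print("RMSE: %.3f (%.3f)" % (-scores.mean(), scores.std()))

      # No further testing needed; the estimate above is the expected performance.
      final_model = Ridge().fit(X, y)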

  75. Avatar
    kiki April 7, 2020 at 1:55 pm #

    Hi Jason, I have already saved and loaded the model, but I have a problem making a single prediction. Do you have an idea of how to proceed with a single prediction? Sorry for asking; I’m just a beginner at machine learning. Thank you Jason

  76. Avatar
    kiki April 8, 2020 at 12:30 pm #

    Thank you Jason. This was really helpful for me. 🙂

  77. Avatar
    ASD April 14, 2020 at 7:12 pm #

    Hey Jason,

    Thanks for your post. Can you answer a really important question for me?

    I am working on my capstone project right now, which does text classification on text scraped from the internet. In order to do so, I had to perform feature engineering to train an SVM (SVC) classifier. The features include: the average TF-IDF value of the entire sentence, presence of a root word, and presence of a NER.

    Now that I have trained and saved the model, I face a dilemma. I want to use this pre-trained model to generate labels for unknown data that will be scraped at a later time. But the problem is, I had to create the feature for the average TF-IDF value (sum of all TF-IDF values of the tokens in a sentence / total tokens in the sentence). So if I have to generate these features for the unknown data later, the words and values will be vastly different. As such, the model may or may not perform well.

    How should I avoid this and ensure that the model works fine? Is there a way to get standard values for it?

    • Avatar
      Jason Brownlee April 15, 2020 at 7:57 am #

      Good question, perhaps you need to store the data prep objects used on the training data along with the model, and reuse them to prepare new data.
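
      One common pattern, sketched below with scikit-learn (it does not reproduce the exact hand-crafted features described above), is to bundle the fitted text preparation and the classifier into a single Pipeline so both are saved and reused together:

      # Persist the TF-IDF preparation and the classifier as one artifact so new,
      # unseen text is transformed with the SAME vocabulary and IDF values.
      import joblib
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import Pipeline
      from sklearn.svm import SVC

      train_texts = ["an example sentence", "another training sentence",
                     "more text here", "and one more"]
      train_labels = [0, 1, 0, 1]

      clf = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])
      clf.fit(train_texts, train_labels)
      joblib.dump(clf, "text_model.pkl")

      # Later: load the pipeline and predict on newly scraped text.
      clf = joblib.load("text_model.pkl")
      print(clf.predict(["some new scraped text"]))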

  78. Avatar
    Volkan Yurtseven May 4, 2020 at 1:06 am #

    Great post! Very informative..
    I have a question though.
    As far as I know, we don’t have to do k-fold CV if we have loads of data; we can go with train/test datasets. But when we don’t have enough data, we should prefer k-fold (or other CV methods).
    Given this, do we really need to train our final model on all of the data (say it consists of billions of observations), including data we didn’t even use during training?

    • Avatar
      Jason Brownlee May 4, 2020 at 6:24 am #

      Thanks.

      Yes, generally.

      Perhaps not if your training dataset is sufficiently large and representative.

  79. Avatar
    Dr ST May 6, 2020 at 5:58 pm #

    Hi Jason!

    I am really grateful and amazed by how you explain very complicated concepts in a simple and understandable way.

    I wanted to thank you and wish you the best.

    Bless you!

  80. Avatar
    Si Wu May 11, 2020 at 4:26 pm #

    Hi Jason,

    Really nice post answering a common confusion! Though I still have one question: I noticed that people sometimes use the entire dataset to do CV, and sometimes they first split the entire dataset into training and test sets and only use the training set to do CV. Which one is correct, or do they have different purposes?

    If I understand your post correctly, we should first apply the second approach (split the data into training and test sets, and only use the training set to do CV); by doing this we compare different ML methods and choose the best one. Then we take the best one and use the entire dataset to train the model to get the final model. Is that correct?

    Many thanks and look forward to your answers!

    • Avatar
      Jason Brownlee May 12, 2020 at 6:37 am #

      Thanks.

      Both are correct, and neither. You must choose what makes sense for your specific project and goals.

      No, this post is saying that after you choose a model/config, throw all the models away, train a final model on all data and start using it to make predictions on new data.

  81. Avatar
    GAYATRI June 3, 2020 at 2:56 pm #

    I am a student trying to complete my research in classification with rough set theory and I want a few basic answers:
    how to find a reduct,
    how to build a test model, as mine is an algorithmic approach (I am using the tool RSES),
    and how to check the accuracy of the model.

    • Avatar
      Jason Brownlee June 4, 2020 at 6:10 am #

      Sorry, I don’t have tutorials on rough set theory, I cannot give you advice on the topic.

  82. Avatar
    GAYATRI June 3, 2020 at 3:00 pm #

    Hi,
    I am continuing the above post.
    I will be grateful and look forward to your answer.
    Thanks.

  83. Avatar
    jigya June 15, 2020 at 10:04 pm #

    hey,
    I wanted to know: if I want to re-train the ML model with some new and updated data, how do I do it?
    And if I want to remove some of the training data from the existing model, e.g. old data which is not so important now, is it possible to omit it while re-training the ML model?

    Could you provide some information or suitable links?

    Best Regards,

  84. Avatar
    Rimi June 19, 2020 at 4:32 am #

    Hi Jason,
    Thanks for the great post. I think this is one of the very few online posts that I’ve found useful re: cross validation and model building. I’m fairly new to this topic of cross-validation and still have some questions; wonder if you might share your expert opinion on this.
    I’m trying to build a multinomial logistic regression model for prediction purposes and was asked to do cross-validation. A set of candidate independent variables (IVs) was given to me.
    I used all the IVs and did multinomial logistic regression with k-fold cross-validation; I see that the model performs poorly: the pseudo R2 values are very low and only a few IVs are significant. So I need to find a better model with a reduced number of predictors. My question is:
    • Each time I try a new model (with a different set of IVs), do I do k-fold cross-validation, then identify the best model and fit it on the entire dataset?
    • Or do I first do these procedures (i.e. trying different models with different sets of predictors) on a particular subset of the entire sample, identify the most relevant/useful model, and then use k-fold cross-validation on the remaining sample only for this model?
    • Or do I do those steps of finding the final model on the entire sample and then do k-fold cross-validation only for the final model?
    The concern I have regarding options #1 and #3 above is that data used to develop the model should not be used for testing/validating the model, as it will give overly “optimistic” results. Options #1 and #3 would include a portion of the dataset that might overlap between the model ‘development’ and ‘validation’ samples…
    Would appreciate your opinion. Thanks!

    • Avatar
      Jason Brownlee June 19, 2020 at 6:20 am #

      Yes, each model should be evaluated in an identical manner (same folds).

      • Avatar
        Rimi June 22, 2020 at 11:28 pm #

        Hi Jason,
        Thank you so much for your prompt answer! I really appreciate it! Just wanted to confirm that I understood what you suggested: “each model should be evaluated in an identical manner (same folds)” using k-fold cross validation and then the final model be fit on the entire dataset…? Thanks again!
        Sincerely
        Rimi

  85. Avatar
    babak June 20, 2020 at 9:41 pm #

    Dear Dr. Jason
    thanks again for your information.
    What I understand from your page is that when somebody like me uses
    sklearn.model_selection import train_test_split
    1) they do not need to use k-fold?
    2) Also, I use GBRT for classification and regression and I hold out 20% as test data.

    Is it possible that overfitting or underfitting takes place for my data?

  86. Avatar
    babak June 22, 2020 at 10:11 pm #

    Thank you for your guidance.
    Based on your information, I arrived at these hyperparameters for GBRT to overcome over- and underfitting:
    params = {'n_estimators': 1000,
              'max_depth': 4, 'subsample': 0.25,
              'min_samples_split': 5,
              'learning_rate': 0.004,
              'loss': 'ls', 'min_samples_leaf': 5,
              'random_state': 7}
    Is this learning rate reasonable? Most of the literature talks about a minimum learning rate of 0.01.

  87. Avatar
    babak June 24, 2020 at 4:22 pm #

    Yes, I tested it. I just want to know: there is no minimum learning rate, is that right?
    My training deviance curve fluctuates a lot because of the 0.004 learning rate, in contrast to the test curve, which moves much more smoothly.

  88. Avatar
    Tom Leung June 29, 2020 at 4:19 pm #

    Thanks for your great article!

    If I want to select the best one from various types of model as well as the best combination of hyperparameters, should I first perform CV for each type of model to get the optimal set of hyperparameters, then perform another CV to select the best model with their optimal hyperparameters?

    • Avatar
      Jason Brownlee June 30, 2020 at 6:14 am #

      One approach is to use tuning within each cv of each model. This is called nested cross-validation.

      Another is to quickly test a suite of models, then tune the top performers to get the most out of them. Ideally, the first and second steps would use different data, but that may not always be possible.

  89. Avatar
    Tom July 10, 2020 at 7:50 pm #

    Dear Jason,

    Thank you very much for your precious contributions!

    Could you help me on several questions?
    I found that you said, in a reply, that we have to decide whether to refit the model or use it as is when new data becomes available.

    1. If we decide to refit, do we need to redo all the work of choosing the best procedure and then finalize the model from the very beginning?
    2. If so, after how much new data becomes available should we redo that work?
    3. If we don’t redo it, does that mean we assume the pattern in the data doesn’t change over time?

    Regards,

    Tom

    • Avatar
      Jason Brownlee July 11, 2020 at 6:10 am #

      It is up to you how much work you want to do or not do.

      The simplest approach is to assume the procedure remains viable and refit it on the new data.

  90. Avatar
    AJ July 14, 2020 at 10:45 pm #

    thanks for this really useful post:

    A question about implementing the final model after cross-validation: I want to retrain the model now, after CV, on the entirety of the data.

    Presumably now there is no holdout data (validation or independent testing data).

    Is there a strategy for choosing the number of epochs to train to prevent over/under fitting? During 10x cross_val, the average epochs taken to converge (with early stopping) is 25-30. Is taking the average epochs taken to converge over the 10 folds also a viable strategy for training on the entire dataset?

    so could the final model.fit call look like:

    model.fit(allcountdata, allonehotlabels, epochs=25)

    many thanks again for this incredibly useful post!

    • Avatar
      Jason Brownlee July 15, 2020 at 8:24 am #

      One approach is to use your test harness to find the number of epochs.

      Another approach is to hold out a validation set and use early stopping to train the final model.

  91. Avatar
    arijit August 5, 2020 at 3:32 am #

    Thanks for your tutorial. However, I have a silly problem. I am training my model in Keras using many datasets of the same dimension. It runs very nicely, and finally I want to save the model. However, my question is: for every dataset the model is updated (correct me if I am wrong), so which model should I save to the .h5 and .json files? Is it the model obtained after running the program over all the datasets?

  92. Avatar
    msynquintyn September 11, 2020 at 12:01 am #

    Hi, and thanks for this really good post!

    I have a question about categorical data for the final model:
    – I split the data into train/test sets
    – I fitted a sklearn OneHotEncoder on the train data and transformed the train/test data
    – I did cross-validation to select my model

    Now I fit my model on all the data to finalize it, but… do I need to ‘refit’ my OneHotEncoder on ALL the data before transforming, or must I reuse the one fitted on the train data?

  93. Avatar
    Valentin Mayr September 18, 2020 at 7:04 pm #

    Hi Jason,
    a great post, again! Thank you!
    I hope this is neither too naive nor too off-topic:

    If I finalize my model (LSTM time series prediction) as done here and then run model.predict() on all the data again (to see to what extent the predictions would have been correct in the past), is that supposed to give me a nicely aligned pair of y and yhat curves? Or will the predictions be off, as the model is not in the right state to predict an early part of the time series again? (I tried, and it was off, so I am worried I built a poor model…) – Thanks.

    • Avatar
      Jason Brownlee September 19, 2020 at 6:53 am #

      I think you’re asking will the model make good predictions on the training set. If so, the answer is it should, but they won’t be perfect – there’s always some error when we try to generalize.

      Does that help?

      To choose a good model, select a metric and optimize it, and it is a great idea to review both the predictions made by the model vs expected values in a hold out set (test set) as well as a plot of the errors. It can give you ideas of the range of error and how you might model the problem better.

      • Avatar
        Valentin September 21, 2020 at 4:16 pm #

        Hi Jason,
        thank you. That certainly helps. The prediction was far off, really … What threw me off, was that I achieved an excellent RMSE of around 2% during the train and test procedure. I will rethink the model I used and hope to find the wrong turn. However, I still wonder how the graphed RMSE for train and test can look so promising, while the final prediction is so far off.
        I will dig deeper into my model 😉

  94. Avatar
    Babak September 22, 2020 at 5:38 am #

    Great post. But I have a question: how do we train a neural network on all the data?
    We need a validation set to stop the training of the NN to prevent overfitting. Even if we split the final data into train/validation sets, with 10% validation used to stop the training of the network, the question becomes how to pick that validation set.
    Different train/validation splits may result in different accuracies for the trained model. Should we try another 10-fold CV without any test data and take the best one as the final model?

    • Avatar
      Jason Brownlee September 22, 2020 at 6:55 am #

      Thanks.

      If you need data for early stopping, then hold some data back for early stopping.

      Yes, fit n models and ensemble their predictions if you want to counter the high variance.

      • Avatar
        Babak September 22, 2020 at 8:51 pm #

        Yes, an ensemble model is always possible. But what if we need only one model to present as the final solution, not an ensemble?
        Then how do we choose the validation set (for early stopping), and in what ratio? 10%? 15%? …

        If I fit n final models, is it wise to take the best one or the worst one as the final product?

        • Avatar
          Jason Brownlee September 23, 2020 at 6:37 am #

          A random sample. Then perhaps do a sensitivity analysis to determine the appropriate ratio.

  95. Avatar
    Erfan October 21, 2020 at 6:04 am #

    Thanks for your invaluable and truly practical material.

    They have helped me a lot in both implementation of different algorithms and also understanding the underlying concepts.

    Please keep it up.

    Thanks for your efforts.

    Erfan

  96. Avatar
    Erfan October 21, 2020 at 6:50 am #

    I have compared and tuned 3 different models using 5-fold CV.

    I tuned the hyper parameters of the models by looking at the averaged 5f_CV scores on validation set (not a training set). And I found model#2 is the best model having the highest averaged 5f-CV score on validation sets.

    Now I am going to take the tuned model #2 and fit it on all of my data to get my final model.

    Am I right?

    Thanks in advance for your response.

    • Avatar
      Jason Brownlee October 21, 2020 at 7:49 am #

      It is a good idea to choose a model based on their mean performance relative to the performance of other models tested on the same test harness.
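
      For example, a minimal sketch of comparing candidate models on an identical test harness (same folds) with scikit-learn, then finalizing the winner:

      # Compare candidate models on the same folds, pick the best mean score,
      # then fit the chosen model on all available data.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import KFold, cross_val_score
      from sklearn.svm import SVC

      X, y = make_classification(n_samples=300, n_features=10, random_state=1)
      cv = KFold(n_splits=5, shuffle=True, random_state=1)  # identical folds for all models

      models = {"lr": LogisticRegression(max_iter=1000), "svm": SVC(), "rf": RandomForestClassifier()}
      for name, model in models.items():
          scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
          print("%s: %.3f (%.3f)" % (name, scores.mean(), scores.std()))

      # Suppose 'rf' wins: the final model is fit on all available data.
      final_model = RandomForestClassifier().fit(X, y)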

  97. Avatar
    Omar November 25, 2020 at 8:31 pm #

    Hi, Jason:

    I have an imbalanced dataset. After applying stratified 10-fold CV (with a 3NN classifier and undersampling in every fold) I obtain an average f1_score of 83%. Finally, I train my model on all of my available data and apply it to new unseen data, getting an f1_score of 67%. Is this happening because of the class imbalance? Should I undersample all of my data and then apply the 3NN? Or is it better to keep the model I used on the last fold?

    Thanks in advance!

  98. Avatar
    bill February 5, 2021 at 1:22 pm #

    Hi, Jason:
    Thanks for your post, it’s great. I’m a beginner, and I’m confused: for a single fold of k-fold CV, the test set is unseen data for the model trained on the train data, so if we don’t attempt to use k-fold CV to choose the best parameters, can I think of k-fold CV as a combination of K final models?

    • Avatar
      Jason Brownlee February 6, 2021 at 5:43 am #

      Sorry, I’m not sure I follow your question sorry.

      Perhaps you can rephrase it?

  99. Avatar
    Tomas March 11, 2021 at 12:24 am #

    Hi, I’m starting to learn about this interesting world of machine learning and I would like to ask: once my model is trained, how can I make it keep learning from new data over time?
    You mentioned that once the model is finalized, the training dataset will not be needed anymore. How is the model’s learning stored?

    Thanks!

    • Avatar
      Jason Brownlee March 11, 2021 at 5:12 am #

      You save the model to file, load it, and start making predictions on new data.

      Everything learned by the model from the training dataset is in the model.

      Does that help?

  100. Avatar
    Yao March 19, 2021 at 7:24 am #

    Hi Jason, I’m a big fan! Thank you for all you do to educate us. Can we use the return_train_score vs the mean_test_score from CV in Sklearn to determine model stability and overfitting? If yes, should we use the minimum score difference between the test and train as a basis for our selection?

  101. Avatar
    Emil March 28, 2021 at 8:51 am #

    Hi Jason,

    Thanks for the article.

    I am currently working on a classification problem, predicting service provider churn in the sharing economy. I will use an example of Uber drivers to explain my issue.

    Let’s say I have a dataset of 500 drivers that I use to create a model with, I get a decent accuracy on the test set, and then use all the data to train a final model. Is there anyway to predict on the drivers that we currently employ (the ones used in the model) – could I for example use the model to predict on the same drivers 1 month later, when a lot of the values has changed that are used to make a model with? An example of variables that will be changed could be: “Orders handled (drives)”, “Total money earned”, “Customer Reviews”. What is your opinion on this – does it need to be NEW data, or can it be the “same” data, that is now progressed into different data?

    Thank you in advance and stay safe.

    • Avatar
      Jason Brownlee March 29, 2021 at 6:06 am #

      If you have new or different variables, generally the model will become invalid.

  102. Avatar
    felice March 31, 2021 at 11:26 pm #

    Hi Jason
    Congratulations on your web pages. I developed a Random Forest model able to predict a numerical response using 20 predictors, with a bias of ~90% and a variance of ~15% on the training/test sets. Some weeks after I deployed the model in production, I checked the model performance. To my big surprise, I discovered that the bias and variance were close to zero (R2 → 0). I checked the 20 predictor distributions and saw that some of them had a slight mean shift, even though the distributions were still inside the same range of values. To verify that this small difference could be the reason for the bad model performance in production, I put all the data together and randomly selected some instances to use later to check the deployed model. This time the distributions of the predictors used to build the model and the unseen ones were exactly the same, and the model performed with the same accuracy obtained on the test set. So how is it possible that a slight difference in mean can make such a big difference in model performance? What can be done to make the model less sensitive? In semiconductor production, where I work, a slight shift in the mean of a distribution over time is highly probable. What would be the better strategy to use in this case?

    Thanks in advance

    Felix

  103. Avatar
    Felice April 4, 2021 at 5:12 am #

    Jason, thanks for your message.
    In your experience, if I have a table like this one:

    Date X1 X2 ….. Y
    1/1/2020 0.1 3 6.2
    1/2/2020 0.3 5 7
    1/3/2020 0.2 10 5.9

    where the X predictors are sensor data coming from a production machine (semiconductor field) and Y is an electrical measurement taken at the end of the production line, is it a multivariate time series problem? I want to predict the response Y from the sensor data X. Should the training/validation/test split be taken randomly, or by respecting the time axis (past/present/future)?
    In the first case, for example, a random forest returns a very good result, while in the second case the same model doesn’t work at all.

    Thanks for your help.

    Felice

  104. Avatar
    Doyle Kalumbi June 7, 2021 at 12:12 pm #

    Jason and all folks:

    I am on Windows 10. I have imported and initialized H2O. I am trying to run XGBoost and have imported H2OXGBoostEstimator. When I run my model, I get an error: “Error Post/3/Model Builders/xgboost not found”. Any help please? Thanks

    • Avatar
      Jason Brownlee June 8, 2021 at 7:11 am #

      I recommend using the xgboost library and API directly.

  105. Avatar
    GT June 30, 2021 at 4:36 am #

    Hi Jason!

    How do you decide when to use nested cross-validation? Some say that it should be done whenever the sample size is small (~2000), but others say that it should be done only when you want to compare algorithms (e.g., elastic net vs SVM). What is your opinion?

    Also, you mentioned that we should discard all of our cross-validation models when training the final model on all of our training data. Does this mean that we do not use the top features selected in the best cross-validation model to train the final model? But is it OK to use the best tuned hyperparameters to train the final model?

    Thank you!

    • Avatar
      Jason Brownlee June 30, 2021 at 5:24 am #

      No strict rules. If you prefer it, then go for it.

      In nested cv, we can discard the outer loop for the final model and use the “tuned” model returned from the inner loop as the final model – I think sklearn fits it on all data for us. Otherwise, we can use the config and fit a final model ourselves.

  106. Avatar
    GT June 30, 2021 at 9:05 am #

    Ohh, how do we determine which inner model to choose then, since each outer loop will have its own inner model? Do we just select the “best” performing outer loop, in terms of R2 or RMSE, and then choose the inner model for that particular outer loop to train the final model?

    Thanks a lot 🙂

  107. Avatar
    Nagesh Hampapura April 11, 2022 at 12:10 am #

    Hi Jason,
    Thanks for clarifying that –

    we do train / test split or kfold cross validation just to check

    1) whether the given data is a representative sample of the production data &
    2) the type of model we build will help in getting accurate predictions

    Once we are sure about the above two aspects, we can use the entire data to build the final model.

    Regards,
    Nagesh.

    • Avatar
      James Carmichael April 14, 2022 at 3:43 am #

      Thank you for the feedback Nagesh!

  108. Avatar
    Simon Yelisey May 12, 2022 at 12:31 am #

    Great article!

    James, could you explain how to choose the final number of estimators when we use early stopping rounds in gradient boosting algorithms such as CatBoost and XGBoost with cross-validation?

    Thank you!
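
    One common pattern, sketched below under the assumption that the xgboost Python package is used (the data and parameters are placeholders), is to let xgb.cv find the number of boosting rounds via early stopping, then train the final model on all data with that fixed number of rounds and no early stopping:

    # Find the number of rounds with CV + early stopping, then refit on all data.
    import numpy as np
    import xgboost as xgb

    X = np.random.rand(500, 10)
    y = np.random.randint(0, 2, 500)
    dall = xgb.DMatrix(X, label=y)

    params = {"objective": "binary:logistic", "eta": 0.1, "max_depth": 4}
    cv_results = xgb.cv(params, dall, num_boost_round=1000, nfold=5,
                        metrics="logloss", early_stopping_rounds=20, seed=7)

    best_rounds = len(cv_results)  # xgb.cv truncates the history at the best iteration
    final_model = xgb.train(params, dall, num_boost_round=best_rounds)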

  109. Avatar
    A. May 21, 2022 at 7:28 pm #

    Hello James, great article! My question may be silly, but what should we do in the case of a pipeline with some preprocessing steps?

    A simple example:

    # A pipeline with preprocessors

    log_clf = Pipeline([('preprocessor', preprocessor),  # OHE, scalers, imputers etc.
                        ('smote', SMOTE(random_state=42)),  # as I deal with an unbalanced dataset
                        ('clf', LogisticRegression(**log_clf_best_params))])  # params from hyperparameter tuning

    log_clf.fit(X_train, y_train)

    OR

    log_clf.fit(X, y) ? 🙂

    # joblib dump

    model_name = 'log_clf_model.pkl'

    joblib.dump(value=log_clf, filename=model_name)

    Should I simply fit the pipeline on the training data, predict on the test set and – if the results are fine – fit it again on the whole dataset and export?

    best!

  110. Avatar
    betty December 11, 2022 at 11:53 pm #

    Thank you!

    So, when we want to choose the final model to make predictions after using cross-validation, should we retrain the model on all available data? (By “available data”, do you mean the training set, e.g. 70%?) Or how should I do that?

  111. Avatar
    pongthorn January 14, 2023 at 8:18 pm #

    Hi
    I would like to know about building the final model.

    I read this article: https://machinelearningmastery.com/train-final-machine-learning-model/
    Suppose I have 100 items. I split off training data (90%) and the remaining 10% is test data. I normalize the data using MinMaxScaler() like below.
    scaler = MinMaxScaler()
    scaler.fit(train_values)
    train = scaler.transform(train_values)
    test = scaler.transform(test_values)

    After that, I trained and tuned it to get the best model.

    Finally, I will build the final model by fitting the model on the entire data (100 items) for running in production.

    The question is:
    Do I need to recreate the scaler and fit it on all 100 items to get the final model, like below?
    final_scaler = MinMaxScaler()
    final_scaler.fit(all_values)
    all_scaled_values = final_scaler.transform(all_values)

    final_model.fit(x=all_x, y=all_y, epochs=1)

    Or can I use the scaler from training to run in production?

    • Avatar
      James Carmichael January 15, 2023 at 8:41 am #

      Hi Pongthorn…You should try your idea on a validation dataset (data never seen by the model) prior to putting into production.
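
      One way to keep the scaling consistent, sketched below with scikit-learn (LogisticRegression is only a placeholder for the tuned model), is to put the scaler and the model in a single Pipeline, so that fitting the final pipeline on all 100 items refits the scaler and the model together:

      # Bundle scaling and the model so refitting on all data refits both together.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import MinMaxScaler

      X_all = np.random.rand(100, 5)         # stand-in for all 100 items
      y_all = np.random.randint(0, 2, 100)

      pipeline = Pipeline([("scaler", MinMaxScaler()),
                           ("model", LogisticRegression())])  # placeholder model

      # Evaluate the pipeline on your train/test split or CV first, then finalize:
      pipeline.fit(X_all, y_all)             # MinMaxScaler is refit on all data here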

  112. Avatar
    Murilo April 11, 2023 at 8:58 pm #

    If I am using EarlyStopping during my k-fold cross-validation, should I also use it when I train my final model on the whole dataset?

  113. Avatar
    bhagath May 16, 2023 at 7:31 pm #

    Do we need to fit the scaler on all available data as part of preparing the final model?

  114. Avatar
    Dennis Pawlowski September 11, 2023 at 6:18 am #

    Hello, first of all I would like to thank you for the really useful posts! However, I still have one question that remains open for me.

    I have a dataset split into train, val and test dataset.

    If I now want to use k-fold cross-validation for evaluations, then consequently I would have to merge the train and val subset to form N folds from it later. If I now find that the model works well (after cross-validation), then the final training is done with the data from train and val. This means that all the data is used for training and any intermediate validation during training is omitted (there is no data for it). However, at the end, the model is evaluated with the test data set that was not used during the cross-validation. Did I understand this correctly? Is it even necessary to evaluate the final model if it has already proven to be good before? If not, then I could also use the data from the test data set to train the final model.

  115. Avatar
    Mirognal Baye March 15, 2024 at 9:57 pm #

    I am Mirognal and I am an MSc student in an Engineering Hydrology program. My research is on evaluating the effectiveness of land use/land cover and soil and water conservation practices and their impact on soil loss and sediment yield using machine learning. Therefore I need support from all of you on how to start my research using machine learning. If you volunteer, please post a step-by-step procedure, from start to end, of what I should do.
    Thank you for what you have done and what you will do for us.

    • Avatar
      James Carmichael March 16, 2024 at 9:15 am #

      Hi Mirognal…I have some research degrees and I loved research then and love it now. But I don’t think I’m good at it.

      In fact, I like reading and figuring out what others have learned perhaps more than devising my own research program. Perhaps I’m more engineer and scholar than academic researcher.

      As such, I don’t feel qualified to give you advice on your research.

      I recommend talking things over with your research advisor. After all, this is exactly their job, you chose them, and they chose you.

      Also, very clever people have written up their advice.

      I recommend reading:

      Principles of Effective Research, Michael Nielsen, 2004.
      An Opinionated Guide to ML Research, John Schulman, 2017.
      Also these classics:

      You and Your Research, Richard Hamming, 1986.
      Pathological Science, Irving Langmuir, 1953.
      Cargo Cult Science, Richard Feynman, 1974.
      Why Most Published Research Findings Are False, John Ioannidis, 2005.
