How to Train a Final Machine Learning Model

The machine learning model that we use to make predictions on new data is called the final model.

There can be confusion in applied machine learning about how to train a final model.

This confusion is often seen with beginners to the field, who ask questions such as:

  • How do I predict with cross validation?
  • Which model do I choose from cross-validation?
  • Do I use the model after preparing it on the training dataset?

This post will clear up the confusion.

In this post, you will discover how to finalize your machine learning model in order to make predictions on new data.

Let’s get started.

Photo by Camera Eye Photography, some rights reserved.

What is a Final Model?

A final machine learning model is a model that you use to make predictions on new data.

That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value).

For example, whether a photo is a picture of a dog or a cat, or the estimated number of sales for tomorrow.

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

  • Data: the historical data that you have available.
  • Time: the time you have to spend on the project.
  • Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, algorithm to use, and how to configure it.

The final model is the pinnacle of this process, the end you seek in order to start actually making predictions.

The Purpose of Train/Test Sets

Why do we use train and test sets?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem.

The training dataset is used to prepare a model, to train it.

We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
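
To make this concrete, here is a minimal sketch of a train/test evaluation using scikit-learn. The synthetic dataset, the choice of kNN, and the split size are placeholders rather than recommendations; substitute your own data and procedure.

# a minimal train/test evaluation (sketch, using placeholder data and model)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# stand-in for your historical data
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# hold back a test set whose outputs are withheld from the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# train the model on the training set only
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# predict the test inputs and compare the predictions to the withheld outputs
predictions = model.predict(X_test)
print("Estimated skill: %.3f" % accuracy_score(y_test, predictions))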

Let’s unpack this further.

When we evaluate an algorithm, we are in fact evaluating all steps in the procedure, including how the training data was prepared (e.g. scaling), the choice of algorithm (e.g. kNN), and how the chosen algorithm was configured (e.g. k=3).

The performance measure calculated on the predictions is an estimate of the skill of the whole procedure.
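
One practical way to keep the whole procedure together is to wrap the data preparation and the configured algorithm in a single pipeline, so that what you evaluate is exactly what you would later finalize. Below is a sketch along those lines, again with placeholder data and the scaling plus kNN (k=3) example from above.

# evaluating the whole procedure (scaling + kNN with k=3) as one unit (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# stand-in for your historical data
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# the procedure: prepare the data, then apply the configured algorithm
procedure = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# scaling parameters are learned on the training data only, then reused on the test data
procedure.fit(X_train, y_train)
print("Skill of the whole procedure: %.3f" % procedure.score(X_test, y_test))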

We generalize the performance measure from:

  • the skill of the procedure on the test set

to

  • the skill of the procedure on unseen data.

This is quite a leap and requires that:

  • The procedure is sufficiently robust that the estimate of skill is close to what we actually expect on unseen data.
  • The choice of performance measure accurately captures what we are interested in measuring in predictions on unseen data.
  • The choice of data preparation is well understood and repeatable on new data, and reversible if predictions need to be returned to their original scale or related to the original input values.
  • The choice of algorithm makes sense for its intended use and operational environment (e.g. complexity or chosen programming language).

A lot rides on the estimated skill of the whole procedure on the test set.

In fact, using the train/test method of estimating the skill of the procedure on unseen data often has a high variance (unless we have a heck of a lot of data to split). This means that when it is repeated, it gives different results, often very different results.

The outcome is that we may be quite uncertain about how well the procedure actually performs on unseen data and how one procedure compares to another.
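
You can see this variance for yourself by repeating the split with different random seeds and comparing the scores. A small sketch, using a deliberately small synthetic dataset where the effect is easy to observe:

# repeating the train/test split shows the variance of the estimate (sketch)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# a small stand-in dataset, where the variance is easy to see
X, y = make_classification(n_samples=200, n_features=20, random_state=1)

scores = []
for seed in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

# the spread of these scores is the variance in question
print(["%.3f" % s for s in scores])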

Often, time permitting, we prefer to use k-fold cross-validation instead.

The Purpose of k-fold Cross Validation

Why do we use k-fold cross validation?

Cross-validation is another method for estimating the skill of a procedure on unseen data, just like using a train-test split.

Cross-validation systematically creates and evaluates multiple models on multiple subsets of the dataset.

This, in turn, provides a population of performance measures.

  • We can calculate the mean of these measures to get an idea of how well the procedure performs on average.
  • We can calculate the standard deviation of these measures to get an idea of how much the skill of the procedure is expected to vary in practice.

This is also helpful for providing a more nuanced comparison of one procedure to another when you are trying to choose which algorithm and data preparation procedures to use.

Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance of a machine learning procedure in practice.
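
For example, here is a sketch of a 10-fold cross-validation that summarizes the resulting population of scores. The data and model are placeholders, and the interval shown assumes the scores are roughly Gaussian.

# k-fold cross-validation produces a population of scores we can summarize (sketch)
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# stand-in for your historical data
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, scoring="accuracy", cv=cv)

print("Mean skill: %.3f" % mean(scores))
print("Standard deviation: %.3f" % std(scores))
# assuming roughly Gaussian scores, mean +/- 2 standard deviations gives a rough 95% interval
print("Rough 95%% interval: %.3f to %.3f" % (mean(scores) - 2 * std(scores), mean(scores) + 2 * std(scores)))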

Both train-test splits and k-fold cross validation are examples of resampling methods.

Why do we use Resampling Methods?

The problem with applied machine learning is that we are trying to model the unknown.

On a given predictive modeling problem, the ideal model is one that performs the best when making predictions on new data.

We don’t have new data, so we have to pretend with statistical tricks.

The train-test split and k-fold cross validation are called resampling methods. Resampling methods are statistical procedures for sampling a dataset and estimating an unknown quantity.

In the case of applied machine learning, we are interested in estimating the skill of a machine learning procedure on unseen data. More specifically, the skill of the predictions made by a machine learning procedure.

Once we have the estimated skill, we are finished with the resampling method.

  • If you are using a train-test split, that means you can discard the split datasets and the trained model.
  • If you are using k-fold cross-validation, that means you can throw away all of the trained models.

They have served their purpose and are no longer needed.

You are now ready to finalize your model.

How to Finalize a Model?

You finalize a model by applying the chosen machine learning procedure on all of your data.

That’s it.

With the finalized model, you can:

  • Save the model for later or operational use.
  • Make predictions on new data.
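
As a sketch of what that can look like in scikit-learn: fit the chosen procedure on all of the data, save it, and later load it to make predictions on genuinely new inputs. The file name and the "new data" row below are hypothetical placeholders.

# finalizing a model: fit the chosen procedure on ALL available data, then save it (sketch)
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

# stand-in for all of your historical data
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

final_model = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])
final_model.fit(X, y)  # no hold-out set, no folds: train on everything

dump(final_model, "final_model.joblib")  # save for later or operational use

# later, in operational use
loaded = load("final_model.joblib")
X_new = X[:1]  # stands in for a genuinely new example
print("Prediction for new data:", loaded.predict(X_new))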

What about the cross-validation models or the train-test datasets?

They’ve been discarded. They are no longer needed. They have served their purpose to help you choose a procedure to finalize.

Common Questions

This section lists some common questions you might have.

Why not keep the model trained on the training dataset?

and

Why not keep the best model from the cross-validation?

You can if you like.

You may save time and effort by reusing one of the models trained during skill estimation.

This can be a big deal if it takes days, weeks, or months to train a model.

However, your model will likely perform better when trained on all of the available data than on just the subset used to estimate the performance of the model.

This is why we prefer to train the final model on all available data.

Won’t the performance of the model trained on all of the data be different?

I think this question drives most of the misunderstanding around model finalization.

Put another way:

  • If you train a model on all of the available data, then how do you know how well the model will perform?

You have already answered this question using the resampling procedure.

If well designed, the performance measures you calculate using train-test or k-fold cross validation suitably describe how well the finalized model trained on all available historical data will perform in general.

If you used k-fold cross validation, you will have an estimate of how “wrong” (or conversely, how “right”) the model will be on average, and the expected spread of that wrongness or rightness.

This is why the careful design of your test harness is so absolutely critical in applied machine learning. A more robust test harness will allow you to lean on the estimated performance all the more.

Each time I train the model, I get a different performance score; should I pick the model with the best score?

Machine learning algorithms are stochastic, so this behavior of differing performance on the same data is to be expected.

Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method.

If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
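
Here is a sketch of that idea: train several final models that differ only in their random seed on all of the data, then average their predicted probabilities. The stochastic model (a small neural network) and the data are placeholders.

# reducing variance by ensembling several final models trained on all of the data (sketch)
from numpy import mean
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# stand-in for all of your historical data
X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

# multiple final models that differ only in their random seed
members = [MLPClassifier(max_iter=500, random_state=seed).fit(X, y) for seed in range(5)]

X_new = X[:3]  # stands in for new data
# average the predicted probabilities across the ensemble, then take the most likely class
avg_proba = mean([m.predict_proba(X_new) for m in members], axis=0)
print("Ensembled predictions:", avg_proba.argmax(axis=1))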

I talk more about this in the post:

Summary

In this post, you discovered how to train a final machine learning model for operational use.

You have overcome obstacles to finalizing your model, such as:

  • Understanding the goal of resampling procedures such as train-test splits and k-fold cross validation.
  • Model finalization as training a new model on all available data.
  • Separating the concern of estimating performance from finalizing the model.

Do you have another question or concern about finalizing your model that I have not addressed?
Ask in the comments and I will do my best to help.

86 Responses to How to Train a Final Machine Learning Model

  1. Elie Kawerk March 17, 2017 at 7:06 am #

    Hi Jason,

    Thank you for this very informative post. I have a question regarding the train-test split for classification problems: Can we perform a train/test split in a stratified way for classification, or does this introduce what is called data snooping (a biased estimate of test error)?

    Thanks
    Elie

  2. Dan March 18, 2017 at 5:59 am #

    “Also, this information is invaluable as you can use the mean and spread to give a confidence interval on the expected performance on a machine learning procedure in practice.”

    I have to assume a normal distribution for that, right? But is this always the case? Or should I normalize my data in a preprocessing step, and then it would be correct to assume that? Thanks

    • Jason Brownlee March 18, 2017 at 7:53 am #

      Hi Dan, great question!

      Yes, we are assuming results are Gaussian to report results using mean and standard deviation.

      Repeating experiments and gathering info on the min, max and central tendency (median, percentiles) regardless of the distribution of results is a valuable exercise in reporting on model performance.

  3. Kleyn Guerreiro March 20, 2017 at 10:36 pm #

    Great post… my little experience taught me that:
    a) for classification you can use your final trained model with no risk
    b) for regression, you have to rerun your model against all data (using the parameters tuned during training)
    c) specifically for time series regression, you can’t use normal cross validation – it should respect the chronology of the data (from old to new always) and you have to rerun your model against all data (using the parameters tuned during training) as well, as the latest data are the crucial ones for the model to learn.
    Cheers!

  4. Hank May 12, 2017 at 5:16 am #

    Great post! I really learned a lot from your post and applied it to my academic project. However, there are a few questions still in my mind. In our project, we want to compare different machine learning algorithms with and without 10-fold CV, including logistic regression, SVM, random forest, and ANN. We can get the CV score of each model with 10-fold cross-validation, but the problem is how can we get the final model with 10-fold? Does cross-validation function as finding the best parameters of the different models (such as determining k in kNN)? I am still a little bit confused about the purpose of cross-validation. Thanks

    • Jason Brownlee May 12, 2017 at 7:49 am #

      Hi Hank, the above directly answers this question.

      Cross-validation is a tool to help you estimate the skill of models. We calculate these estimates so we can compare models and configs.

      After we have chosen a model and its config, we throw away all of the CV models. We’re done estimating.

      We can now fit the “final model” on all available data and use it to make predictions.

      Does that make sense?
      Please ask more questions if this is not clear. This is really important to understand and I thought I answered all of this in the post.

      • Hank May 12, 2017 at 3:20 pm #

        Hi Jason,

        Thank you so much! Does that mean cross-validation is just a tool to help us compare different models based on cross-validation score?
        After we are done with evaluation, we would apply original model to whole dataset and make predictions. Since I read a paper where the author compare auc, true positive rate, true negative rate, false positive rate and false negative rate between those models with and without cross-validation. It turns out that logistic regression with 10fold perform best. So I though we will apply logistics regression with 10-fold to test data. Is my understanding incorrectly? Thanks!

        • Jason Brownlee May 13, 2017 at 6:12 am #

          Yes, CV is just a tool to compare configs for a model or compare models.

  5. Petros May 12, 2017 at 10:55 am #

    Hi Jason,

    Great post.

    It took me a while to get this but when the penny dropped about 18 months ago it was liberating. I liken cross-validation to experimenting with a process which you want to emulate against all your train data. One idea though.

    When you cross-validate you might say 10 folds of 3 repeats for each combination of parameters. Now, with whatever measure you are taking for accuracy, you typically take the mean of these 30. Is it sensible to bootstrap with replacement, particularly if it is not Gaussian, from this sample of 30, say 1000 times, and from there calculate the median and 2.5/97.5 percentiles?

    What does everyone else think!

    PK

    • Jason Brownlee May 13, 2017 at 6:09 am #

      Yes, I like to use the bootstrap + empirical confidence intervals to report final model skill.

      I have a post showing how to do this scheduled for later in the month.

  6. Warren van Niekerk May 12, 2017 at 2:56 pm #

    Thanks for the very informative post. Just one question: When you train the final model, are you learning a completely new model or is some or all of the value of the previously learned models somehow retained?

    • Jason Brownlee May 13, 2017 at 6:11 am #

      Yes, generally, you are training an entirely new model. All the CV models are discarded.

  7. Muralidhar SJ May 12, 2017 at 6:53 pm #

    Thanks Jason. Very Useful info & insight , helping lot to take right approach .

  8. Imene May 14, 2017 at 6:57 am #

    Thank you very much Jason. I found in this post answers to many questions.

  9. issam May 14, 2017 at 8:42 am #

    Hi Jason
    I want to thank you for this informative post. I am working on a project on “emotion recognition on images” and I want to know how I can create my model and train it.

    thanks in advance

  10. EN MO May 14, 2017 at 5:37 pm #

    Very informative, thanks a lot. I am also trying to see if this will be useful in a project I would like to do, and how it can be applied in biometrics and pattern recognition.

  11. Ras May 17, 2017 at 10:38 pm #

    Thanks for the article. What about the parameters? You will likely do tuning on a development set or via cross-validation. The optimum parameter set you find is the best for that particular split or fold. Wouldn’t it be left to chance for our optimized parameters to be the optimum with the whole training data as well?

    • Jason Brownlee May 18, 2017 at 8:37 am #

      Hi Ras,

      k-fold cross-validation is generally the best practice for using the training dataset to find a “good” configuration of a model.

      Does that help? Is that clearer?

  12. lalneirem May 23, 2017 at 7:28 pm #

    Thanks for this post.
    I know this may be useful, but I don’t know what we do in the training phase using kNN.
    If you can write the detailed steps that are done during the training phase,
    I will be so grateful.

  13. aquaq June 1, 2017 at 6:29 pm #

    Thanks for this post, it has given a clear explanation for most of my questions. However, I still have one question: if I have used undersampling during CV, how should I apply it to my whole data? To be clearer:
    – I have a training set of around 1 million positive (+) and 130 thousand negative (-) examples. I also have an independent test data set with a hundred thousand (+) and 4000 (-) examples.
    – I have estimated performance with 10-fold CV and applied undersampling (I have used the R glmnet package, logit regression with LASSO, training for AUC). It gave me super results for the CV.

    And now I’m a bit lost. Would training on all data mean randomly selecting 130 thousand (+) from the 1 million and only using these ~260 thousand examples? Should I evaluate my model on my test data set after training?

    Thank you for your help!

    • Jason Brownlee June 2, 2017 at 12:56 pm #

      If you can, I would suggest evaluating the model on all data and see if skill improves.

      In fact, it is a good idea to understand the data set size and model skill relationship to find the point of diminishing returns.

  14. Chayanika Mudiar June 19, 2017 at 10:19 pm #

    I have a question. In the training process using Gaussian Naive Bayes, can you say what steps should be taken to train the model?

  15. Tyrone July 7, 2017 at 5:40 pm #

    Hi Jason. Thanks for a great article!

    When you say that “You finalize a model by applying the chosen machine learning procedure on all of your data”, does this mean that before deploying the model you should train a completely new model with the best hyperparameters from the validation phase, but now using training data + validation data + testing data, i.e. including the completely unseen testing data that you had never touched before?

    This is how I interpret it, and it makes sense to me given that the whole point of validation is to estimate the performance of a method of generating a model, rather than the performance of the model itself. Some people may argue, though, that because you’re now training on previously unseen data, it is impossible to know how the new trained model is actually performing and whether or not the new, real-world results will be in line with those estimated during validation and testing.

    If I am interpreting this correctly, is there a good technical description anywhere for why this works in theory, or a good explanation for convincing people that this is the correct approach?

    • Jason Brownlee July 9, 2017 at 10:37 am #

      Yes. Correct.

      Yes. The prior results are estimates of the performance of the final model in practice.

      • Tyrone July 10, 2017 at 11:19 pm #

        Thanks Jason. It’s great to have confirmation of that. Do you know of any published papers or sources out there that spell this out explicitly or go into the theory as to why this is theoretically sound?

  16. aquaq July 25, 2017 at 10:56 pm #

    Thanks Jason for this explanation. I would like to ask how to deal with test sets when I would like to compare the performance of my model to existing models. Do I have to hold out a test set, train my model on the remaining data and compare all models using my test set?
    After that, can I merge this held out set to my original training set and use all data for training a final model?
    What other solutions can be used?

    • Jason Brownlee July 26, 2017 at 7:54 am #

      Yes. Choose models based on skill on the test set. Then re-fit the model on all available data (if this makes sense for your chosen model and data).

      Does that make sense?

  17. Paul August 3, 2017 at 10:21 am #

    Thank you for the great post Jason.
    I have a question about forecasting unseen data in RNN with LSTM.
    I’ve built complete model using RNN with LSTM by using the post(http://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras).
    How can we forecast unseen data(like ahead of current) from complete model?
    I mean we don’t have any base data except time though.

    I already saw some comments that you replied “You can make predictions on new data by calling Y = model.predict(X)” on that post. However, I couldn’t understand.. :'(

    • Paul August 3, 2017 at 10:30 am #

      I mean in real-time. 🙂

      Thanks in advance.

      Best,
      Paul

      • Jason Brownlee August 4, 2017 at 6:47 am #

        In real time, the same applies, but you can decide whether you re-train a new model, update the model or do nothing and just make predictions.

    • Jason Brownlee August 4, 2017 at 6:46 am #

      You can predict the next step beyond the available data by training the model on all current data, then calling predict with whatever input your model takes taken from the end of the training data.

      Does that help?

      Which part is confusing?

  18. S H September 8, 2017 at 11:45 pm #

    Hi Jason.

    Thanks a lot for this great and informative post. I have 2 questions I would be thankful if you can help me with them:

    1- Is that possible to refresh (update) a model without retraining it in full? To elaborate more, I have a model built using 9 weeks of data (weekly snapshots). As the size of the dataset is very large, when I want to update the model on a weekly basis, it takes a lot of time. Is that possible to update the model with the new snapshot (say for week 10), without retraining the model on the whole 10 weeks (9 old snapshots + 1 new snapshot)?

    2- When I train my model and evaluate it using cross-validation, I get errors (or alternatively, I get AUCs) which are consistently better than what I get when I score serving data and test the real performance of the model. Why is that so, and how can I treat it? To elaborate more, taking the 9 snapshots explained in the first question, I use snapshot_date column as cross-validation fold column. Therefore, in each round, the algorithm uses 8 weeks of data for training, and test the model on the remaining unseen week. Therefore, I would end up with 9 different models and 9 different AUCs on the validation frame. All the AUCs are between 0.83 to 0.91. So I would expect that the real performance of the model built using whole data should be at minimum AUC 0.83. However, when I score the serving data, and the next week I assess the performance of the model, I see no better than AUC 0.78. I have experienced it for 3 weeks (3 times), so I don’t think it’s just random variation. Additionally, I am quite sure there is no data leakage and there is no future variable in my data. Also, I tune the model quite well and there is no overfitting.

    Your help is highly appreciated.

    • Jason Brownlee September 9, 2017 at 11:57 am #

      You can update a model. The amount of updating depends on your data and domain. I have some posts on this and more in my LSTM book.

      Model evaluation on test data is generally biased and optimistic. You may want to further refine your test harness to make the reported scores less biased for your specific dataset (e.g. fewer folds, more folds, more data, etc., depending on your dataset).

  19. Kenny October 12, 2017 at 1:05 am #

    Hello Jason, very interesting this post! what do you get

    I finished my model (with a score of 91%), but how do I evaluate this model on a new dataset?

    I have saved the model in model.pkl, but for my new data (for example iris.csv), how do I predict the field “species”? (In my dataset, do I need to leave this field blank?) How is this step done?

    Thanks for your help because I’m confused.

    • Jason Brownlee October 12, 2017 at 5:33 am #

      Load the model and call model.predict(X) where X is your new input data.

  20. Prasshanth VP January 16, 2018 at 10:05 am #

    Hi Jason – Great post. This cleared things for me in settling with a final model.

    fitControl <- trainControl(
    method = "repeatedcv",
    number = 10,
    savePredictions = 'final',
    verboseIter = T,
    summaryFunction = twoClassSummary,
    classProbs = T)

    glm_fit <- caret::train(dv ~. , data = dataset
    ,method = "glm", family=binomial, trControl = fitControl, metric = "ROC")

    It says that glm_fit now becomes the final model, as it runs 10-fold based on trControl and finally trains the model using the entire data. Setting verboseIter = T gives me a message at the end of the run – "Fitting final model on full training set". So can I use this as a final model?

  21. Martin Main January 18, 2018 at 2:38 am #

    Hi there,
    This article makes a lot of sense, but one thing I am surprised was not addressed was the problem of over-fitting. If there is no test/validation data used in the final model generation, and if the model being used has been seen to over-fit the data in testing, then we need to know when to stop training without over-fitting. A simple approach would be to guess from the ‘correct’ training times from the previous tests, but of course the final model with all data will naturally need longer training times. Is there a statistical approach we could use to determine the best time to stop training without using a validation set?

    • Jason Brownlee January 18, 2018 at 10:12 am #

      Concerns of overfitting are addressed prior to finalizing the model as part of model selection.

      • Megat Haziq March 12, 2018 at 7:06 pm #

        If I was using early stopping during k-fold cross-validation, is it correct to average the number of epochs and apply it to the finalized model? Since there is no validation set for early stopping when training the finalized model, I thought of using the average number of epochs to train the final model. Please help me 🙂

  22. Tata March 7, 2018 at 5:13 am #

    Hi Jason! Thank you so much for this informative post.

    A little question though. If we don’t have the luxury to acquire another dataset (because it’s only for a little college project, for example), how do you apply k-fold cross validation (or test-training split) to evaluate models then?

    My understanding is that once you apply, let’s say, k-fold cross validation for choosing which model to use and then tuning the parameters to suit your need, you will run your model on another different dataset hoping the model you have built and tuned will give you your expected result.

    • Jason Brownlee March 7, 2018 at 6:17 am #

      You can split your original dataset prior to using CV.

  23. Rose March 21, 2018 at 8:46 am #

    Hi Jason,
    Glad to have found your tutorials as these are some of the best for teaching deep learning with Keras.
    I have already read the notes where people asked you questions about using k-fold CV for training a final deep model, but as I am naive in working with deep learning models I could not understand some things.
    I want to train (or finalize) CNN, LSTM & RNN models for a text dataset (it is sentiment analysis). In fact my teacher told me to apply k-fold cross-validation for training=finalizing the model to be able to predict the probability of unseen data belonging to each class (binary class).
    My question is this: [is it wrong to apply k-fold cross-validation to train a final deep model?]
    As I wrote the commands, 15 epochs run in each fold. Is there anything wrong with that?
    I am so sorry for my naive question as I am not a native English speaker and could not perfectly understand the above comments you all wrote about it.
    my written code is like this:
    from sklearn.model_selection import KFold

    kf = KFold(10)
    f1_lstm_kfld = []
    oos_y_lstm = []
    oos_pred_lstm = []
    fold = 0
    for train, test in kf.split(x_train):
        fold += 1
        print("Fold #{}".format(fold))
        print('train', train)
        print('test', test)
        x_train1 = x_train[train]
        y_train1 = y_train[train]
        x_test1 = x_train[test]
        y_test1 = y_train[test]
        print(x_train1.shape, y_train1.shape, x_test1.shape, y_test1.shape)

        print('Build model...')
        model_lstm = Sequential()
        model_lstm.add(Embedding(vocab_dic_size, 128))
        model_lstm.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
        model_lstm.add(Dense(1, activation='sigmoid'))

        model_lstm.compile(loss='binary_crossentropy',
                           optimizer='adam',
                           metrics=['accuracy'])

        print('Train...')
        model_lstm.fit(x_train1, y_train1,
                       batch_size=32,
                       epochs=15,
                       validation_data=(x_test1, y_test1))
        score_lstm, acc_lstm = model_lstm.evaluate(x_test1, y_test1,
                                                   batch_size=32)

    sum_f1_lstm_kfld = 0
    for i in f1_lstm_kfld:
        sum_f1_lstm_kfld = sum_f1_lstm_kfld + i
    print('sum_f1_lstm_kfld', sum_f1_lstm_kfld)
    mean_f1_lstm_kfld = sum_f1_lstm_kfld / 10
    print('mean_f1_lstm_kfld', mean_f1_lstm_kfld)
    Please guide me as I am confused.
    Thank you in advance.
    Rose

    • Jason Brownlee March 21, 2018 at 3:06 pm #

      You cannot train a final model via CV.

      I recommend re-reading the above post as to why.

      • Rose March 23, 2018 at 7:29 am #

        Hi Jason,
        I am afraid to ask about this issue again, but when I re-read the above post, I saw this sentence: Resampling methods like repeated train/test or repeated k-fold cross-validation will help to get a handle on how much variance there is in the method. If it is a real concern, you can create multiple final models and take the mean from an ensemble of predictions in order to reduce the variance.
        What do you mean by this sentence: “create multiple final models”? Do you mean applying k-fold cross-validation to achieve multiple models?
        And also about this one: “take the mean from an ensemble of predictions”, do you mean we can use this “take the mean” in finalizing a model?
        I want to train a model = finalize a model.
        You mentioned: “Why not keep the best model from the cross-validation?
        You can if you like. You may save time and effort by reusing one of the models trained during skill estimation. Your model will likely perform better when trained on all of the available data than just the subset used to estimate the performance of the model.”
        What do you mean by the above three sentences, especially this one: “when trained on all of the available data than just the subset used to estimate the performance of the model”?
        By “training on all of the available data” do you mean the procedure where we do not use k-fold cross-validation? And by “just the subset used to estimate the performance of the model” do you mean applying k-fold cross-validation?
        If I want to ask my question clearly, I should say it this way: [I want to train a CNN, LSTM and RNN deep model in order to estimate the probability of unseen data; what should I do? Splitting the data set into train and test, or some other procedure?]

        Any guidance will be appreciated.

        • Jason Brownlee March 23, 2018 at 8:30 am #

          I mean, if there is a lot of variance in the model predictions (e.g. due to randomness in the model itself), you can train multiple final models on all training data and use them in an ensemble.

          Sometimes training a model can take days or weeks. You may not want to retrain a model, hence, reuse a model from the time when you estimated model skill.

          Does that help?

  24. moses April 6, 2018 at 4:07 pm #

    Can you provide sample code for prediction?

    • Jason Brownlee April 7, 2018 at 6:08 am #

      I have many examples on my blog for different platforms.

      Try a search and let me know if you don’t find what you’re looking for.

  25. Mariano April 26, 2018 at 12:44 pm #

    Do you happen to have an example with code where you train with all your data and then predict unknown future data?

  26. Casey May 17, 2018 at 4:46 am #

    Thanks for this post Jason, and two additional questions:

    1) Is there a peer-reviewed article that can be cited to demonstrate the validity of this approach?

    2) Do I understand correctly that if the uncertainty in the relationship derived for the training data is correctly propagated to the test data set, the “best” model can be selected based solely on cross-validation statistics? That is, goodness of fit measure for the training relationship don’t really matter?

    Thanks!

    • Jason Brownlee May 17, 2018 at 6:39 am #

      Of finalizing a model? There may be, I don’t know sorry. It might be tacit knowledge.

      Yes, skill estimated using a well configured k-fold cross-validation may be sufficient, but if the score is reviewed too often (e.g. to tune hyperparams), you can still overfit.

  27. AKBAR HIDAYATULOH May 23, 2018 at 2:41 pm #

    this post is very helpful for my final project to get better understanding,

    I have a question: after being done with the train/test split, the next step is training with all of the available data. Do I have to use all of the dataset for training with no need to split again, and use the best hyperparameters or configurations from the train/test split or cross-validation before?

    Thank you very much

    • Jason Brownlee May 23, 2018 at 2:43 pm #

      I’m glad to hear that.

      Correct, you would use all available data with hyperparameters chosen via testing on your train/test/validation sets.

  28. Debasish Ghosh June 10, 2018 at 1:25 am #

    Thanks Jason for the great post. I have one question though ..

    During training I pre-process data e.g. scaling, feature reengineering etc. And then I train the model using train / validation /test set. Now I have the final model which I would like to use for prediction.

    Now my prediction system is different (written using Java and TF) and there I import the trained model – incidentally all my training code are in Keras and Python. But in my prediction system I get the data points one at a time and I have to do prediction.

    My question is how can I do the data pre-processing during prediction ? Pre-processing like scaling and feature extraction do not make sense on a single data point. With my use case prediction looks good if I accumulate all the data that I receive (unseen before), do similar pre-processing as in training, once I have quite a bit of them and then submit to the trained model for scoring. Otherwise I get very different and inaccurate results.

    Would love to hear some suggestions on how to tackle this issue.

    • Jason Brownlee June 10, 2018 at 6:05 am #

      Excellent question!

      The single data point must be prepared using the same methods used to prepare the training data.

      Specifically, the coefficients for scaling (min/max or mean/stdev) are calculated on the training dataset, used to scale the training dataset, then used going forward to scale any points that you are predicting.

      Does that help?

  29. Maria June 25, 2018 at 11:26 pm #

    Hi Jason, Thank you for the awesome tutorial.
    As I see it, you emphasize training a neural network on the entire data set, without setting apart a subset of the whole dataset as a test set, in order to finalize the model.
    I have already trained cnn_model on the entire data set (I mean I did not separate some samples for a test set), but I separated 20% of the whole data set as the validation set via this statement:

    ‘model_cnn.fit(x_datasetpad, y_datasetpad, validation_split=0.2, epochs=5, batch_size=32)’

    I think I made a mistake by putting validation_split=0.2 in the network fitting process.
    Do I remove the validation set to finalize the CNN network?
    Should I train the network on the entire data set (I mean, should I delete validation_split=0.2)?

    • Jason Brownlee June 26, 2018 at 6:38 am #

      Yes, remove the validation split for the final model.

      • Maria June 26, 2018 at 7:54 am #

        Hi Jason,
        I am so grateful for the quick answer.

  30. K.D.I. Madhuwantha July 4, 2018 at 10:02 pm #

    How do I save a final model in TensorFlow and use it in TensorFlow.js?

    • Jason Brownlee July 5, 2018 at 7:43 am #

      Sorry, I don’t have tensorflow or tensorflow.js examples.

  31. Vaddi Ajay Kumar July 4, 2018 at 11:54 pm #

    I read the post and all questions & answers. So finally I want to summarise and get your approval 🙂

    Ex: I have Training data 100k values, test data: 50k values.

    1. We try various models like linear regr, decision tree, random forest, neural net on K fold validation with 150k values and see what model gives better performance measure(mean error etc…). We now decided what algorithm/procedure works best on data .

    Ex: Decision Tree.

    2. Now let us run k-fold validation with 150k values on the Decision Tree with different hyperparameter values and check what value gives a better performance measure.

    3. we know what model and what hyperparameter “generally” works across the data.

    4. Let us use all the data 150K values and train final Decision tree (FDT) with hyperparameter that we selected(which worked best) previously.

    As model and hyperparameters are checked previously , the above post believes they will and should works best on the unseen data.

    My thoughts: I might take a safer approach at the end by double checking, which means rather than train on all the data that i have i will keep 5% for testing (Unseen Data) and 95% for training.

    Thank You for the Great Post . I thought this might help people who are concerned about hyper parameter tuning post model/ML procedure selection.

    • Jason Brownlee July 5, 2018 at 7:47 am #

      All good except the final check is redundant and could be misleading. What if skill on the 5% is poor, what do you do and why?

      • Vaddi Ajay Kumar July 5, 2018 at 10:02 pm #

        I thought to keep 5% as a double check but after your question i began to ponder what if skill is poor – I have two things to say.

        1. This 5% is a sample that is not representative of data . i.e.. Occurred by chance. So i should have other approach to test on representative of the data.

        2. Model is not good enough or over-fitted – Even this time i cannot come to conclusion as 5% sample may not be representative of data.

        Understood finally that Cross Fold validation is solution for above 2 points which we already did on the whole data prior and so ” final check is redundant”.

        Thank You So much Jason Brownlee.

        • Jason Brownlee July 6, 2018 at 6:42 am #

          Nice reasoning!

          Keep an open mind and adapt methods for your specific problem. There are no “rules”, it is an empirical discipline.

          • Vaddi Ajay Kumar July 6, 2018 at 9:58 pm #

            Thank You. Understood.

  32. Ed O July 9, 2018 at 10:06 pm #

    Thank you Jason. I am trying to get probabilities of whether an employee is going to leave or stay with the company. I have 1,500 records of individuals that left the company and 500 that are currently with us. I need to get probabilities for all 500 associates that are still with us.

    The issue is that the model is technically seeing all the data that it is training on in order to get probabilities for the entire data set. I don’t have “new” data I can apply the trained model to since we know all the employees that are currently with us. How do I get probabilities for the 500 current associates without overfitting? Is it as simple as making the predictors more generic? Thank you for your advice in advance.

    • Jason Brownlee July 10, 2018 at 6:47 am #

      You can fit models on some of your data and evaluate it on the rest.

      Once you find a model that works, you can train it on all of your data and use it to make predictions on new data.

      I assume you have historical records for people that have stayed or left, you train on that. Then you have people now that you want to know if they will stay or leave, this is new data for which you want to generate a prediction.

  33. Thusitha Deepal July 15, 2018 at 2:59 am #

    I have a problem. I am trying to predict currency exchange rates using historical data. I’m trying to predict tomorrow’s exchange rate using yesterday’s rate. I am a little bit confused. What artificial neural network should I use? And I’d like to use k-fold cross-validation for sampling. I’d like to know your ideas.
