How to Check-Point Deep Learning Models in Keras

Deep learning models can take hours, days or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library.

Let’s get started.

  • Update Mar/2017: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
  • Update March/2018: Added alternate link to download the dataset.
  • Update Sep/2019: Updated for Keras 2.2.5 API.
  • Update Oct/2019: Updated for Keras 2.3.0 API.

Checkpointing Neural Network Models

Application checkpointing is a fault tolerance technique for long running processes.

It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.

When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

The Keras library provides a checkpointing capability through its callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named, and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename that you use to store the weights can include variables like the epoch number or metric.
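To make the filename variables concrete, here is a small sketch in plain Python. Keras fills the checkpoint filename template with Python's str.format, passing the epoch number and the logged metrics as keyword arguments. The metric values below are made up for illustration.

```python
# Template in the style used by this post: epoch number and validation accuracy.
template = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Hypothetical values, standing in for what Keras logs at the end of an epoch.
logs = {"loss": 0.6132, "val_accuracy": 0.6732}

# {epoch:02d} -> zero-padded integer, {val_accuracy:.2f} -> two decimal places.
filename = template.format(epoch=3, **logs)
print(filename)  # weights-improvement-03-0.67.hdf5
```

Any metric that appears in the training logs can be used in the template in the same way.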

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.

Note, you may need to install the h5py library to output network weights in HDF5 format.

Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the pima-indians-diabetes.csv file is in your working directory.

You can download the dataset from here:

The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_accuracy' and mode='max'). The weights are stored in a file that includes the epoch number and the score in the filename (weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5).
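The snippet below is a minimal sketch of this strategy, not the original listing from this post. It assumes TensorFlow's bundled Keras, uses randomly generated data as a stand-in for the diabetes CSV, and writes to a temporary directory. As far as I know, recent Keras releases require weights-only checkpoints to end in .weights.h5, so that suffix is used here instead of .hdf5.

```python
import glob
import os
import tempfile

import numpy as np
from tensorflow import keras

# Synthetic stand-in data: 8 inputs, 1 binary output (not the real dataset).
rng = np.random.default_rng(7)
X = rng.random((200, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

# Small fully connected network for binary classification.
model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# One file per improvement; the epoch and score are filled into the name.
out_dir = tempfile.mkdtemp()
filepath = os.path.join(
    out_dir, "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.weights.h5")
checkpoint = keras.callbacks.ModelCheckpoint(
    filepath, monitor="val_accuracy", mode="max",
    save_best_only=True, save_weights_only=True, verbose=1)

# Hold back 33% of the data for validation.
model.fit(X, y, validation_split=0.33, epochs=10, batch_size=10,
          callbacks=[checkpoint], verbose=0)

# List the checkpoint files that were written.
saved = sorted(glob.glob(os.path.join(out_dir, "*.weights.h5")))
print(saved)
```

The first epoch always counts as an improvement, so at least one file is written. How many more appear depends on how the validation accuracy moves during the run.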

Running the example produces the following output (truncated for brevity):

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:

This is a very simple checkpointing strategy. It may create a lot of unnecessary check-point files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure that you have a snapshot of the best model discovered during your run.

Checkpoint Best Neural Network Model Only

A simpler check-point strategy is to save the model weights to the same file, if and only if the validation accuracy improves.

This can be done easily using the same code from above and changing the output filename to be fixed (not include score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.
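A minimal sketch of the fixed-filename variant follows, under the same assumptions as before: TensorFlow's bundled Keras, synthetic stand-in data, a temporary directory, and a .weights.h5 suffix in place of the post's weights.best.hdf5 so that it also runs on recent Keras releases.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Synthetic stand-in data: 8 inputs, 1 binary output (not the real dataset).
rng = np.random.default_rng(7)
X = rng.random((200, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Fixed filename: each improvement overwrites the previous best weights.
filepath = os.path.join(tempfile.mkdtemp(), "weights.best.weights.h5")
checkpoint = keras.callbacks.ModelCheckpoint(
    filepath, monitor="val_accuracy", mode="max",
    save_best_only=True, save_weights_only=True, verbose=1)

model.fit(X, y, validation_split=0.33, epochs=10, batch_size=10,
          callbacks=[checkpoint], verbose=0)
print(os.path.exists(filepath))
```

The only change from the per-improvement strategy is the filename, which no longer contains any template variables.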

Running this example provides the following output (truncated for brevity):

You should see the weight file in your local directory.

This is a handy checkpoint strategy to always use during your experiments. It will ensure that your best model is saved for the run for you to use later if you wish. It avoids you needing to include code to manually keep track and serialize the best model when training.

Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.

The checkpoint only includes the model weights. It assumes you know the network structure. The structure can also be serialized to a file in JSON or YAML format.

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then used to make predictions on the entire dataset.
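The sketch below illustrates the load step under stated assumptions; it is not the original listing. It uses TensorFlow's bundled Keras and synthetic data, and it first writes a weights file itself (standing in for the checkpoint from the previous experiment) so that it is self-contained. The build_model helper is a hypothetical name.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras


def build_model():
    """Recreate the known network structure; the weights are loaded separately."""
    model = keras.Sequential([
        keras.Input(shape=(8,)),
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(8, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


# Synthetic stand-in data: 8 inputs, 1 binary output (not the real dataset).
rng = np.random.default_rng(7)
X = rng.random((200, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

# Stand-in for a checkpoint written by an earlier training run.
weights_path = os.path.join(tempfile.mkdtemp(), "weights.best.weights.h5")
trained = build_model()
trained.fit(X, y, epochs=3, batch_size=10, verbose=0)
trained.save_weights(weights_path)

# Later: rebuild the same structure, then load the saved weights into it.
restored = build_model()
restored.load_weights(weights_path)

# Evaluate the restored model on the entire dataset.
loss, accuracy = restored.evaluate(X, y, verbose=0)
print("accuracy: %.2f%%" % (accuracy * 100))

# The restored model should predict exactly like the one that was saved.
same = np.allclose(trained.predict(X, verbose=0),
                   restored.predict(X, verbose=0), atol=1e-6)
```

To continue training instead of evaluating, call fit() on the restored model after loading the weights.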

Running the example produces the following output:

Summary

In this post you have discovered the importance of checkpointing deep learning models for long training runs.

You learned two checkpointing strategies that you can use on your next deep learning project:

  1. Checkpoint Model Improvements.
  2. Checkpoint Best Model Only.

You also learned how to load a checkpointed model and make predictions.

Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments and I will do my best to answer.

180 Responses to How to Check-Point Deep Learning Models in Keras

  1. Gerrit Govaerts October 5, 2016 at 1:07 am #

    79.56% with a 3 hidden layer architecture , 24 neurons each
    Great blog , learned a lot

  2. Lau MingFei October 21, 2016 at 1:02 pm #

    Hi Jason, how can I checkpoint a model with my custom metric? The example code you gave above uses monitor = 'val_acc', but when I replace it with monitor = my_metric, it displays the following warning message:

    /usr/local/lib/python3.5/dist-packages/keras/callbacks.py:286: RuntimeWarning: Can save best model only with available, skipping.
    ‘skipping.’ % (self.monitor), RuntimeWarning)

    So what should I do about this?

    • Jason Brownlee October 22, 2016 at 6:55 am #

      Great question Lau,

      I have not tried to check point with a custom metric, sorry. I cannot give you good advice.

    • Jakub July 26, 2019 at 11:18 pm #

      Hi,
      Try something like this:

      model = load_model("your.model.h5",
                         custom_objects={'my_metric': my_metric})

  3. Xu Zhang November 8, 2016 at 11:21 am #

    Hi Jason,
    A great post!

    I saved the model and weights using callbacks, ModelCheckpoint. If I want to train it continuously from the last epoch, how to set the model.fit() command to start from the previous epoch? Sometimes, we need to change the learning rates after several epochs and to continue training from the last epoch. Your advice is highly appreciated.

    • Jason Brownlee November 9, 2016 at 9:48 am #

      Great question Xu Zhang,

      You can load the model weights back into your network and then start a new training process.

      • Lopez GG December 30, 2016 at 1:01 pm #

        Thank you Jason. I ran an epoch and got the loss down to 353.6888. The session got disconnected so I used the weights as follows. However, I don't see a change in loss. Am I loading the weights correctly? Here is my code:

        >>> filename = "weights-improvement-01--353.6888-embedding.hdf5"
        >>> model.load_weights(filename)
        >>> model.compile(loss='binary_crossentropy', optimizer='adam')
        >>> model.fit(dataX_Arr, dataY_Arr, batch_size=batch_size, nb_epoch=15, callbacks=callbacks_list)
        Epoch 1/15
        147744/147771 [============================>.] - ETA: 1s - loss: -353.6892
        Epoch 00000: loss improved from inf to -353.68885, saving model to weights-

        • Jason Brownlee December 31, 2016 at 7:02 am #

          It looks like you are loading the weights correctly.

          • Akshaya July 15, 2019 at 7:28 pm #

            Hi Jason, in continuation to the point above of continuing training from the saved checkpoint, is it required that I set the random seed initially and use the same when I train the second time? What I have noticed is that, the first time I train, the loss seems to reduce. But when I load the checkpoint and continue training, the performance suddenly becomes very poor.

          • Jason Brownlee July 16, 2019 at 8:14 am #

            No.

            You can expect variance in the model across runs, more here:
            https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code

  4. Anthony November 29, 2016 at 9:58 pm #

    Great post again Jason, thank you!

  5. Nasarudin January 31, 2017 at 1:48 pm #

    Hi Jason, thank you for your tutorial. I want to implement this checkpoint function in iris-flower model script but failed to do it. It keeps showing this error and I do not know how to solve it.

    I put the ‘model checkpoint’ line after the ‘baseline model’ function and add ‘callbacks’

    RuntimeError: Cannot clone object , as the constructor does not seem to set parameter callbacks

    Thank you for your help

    • Jason Brownlee February 1, 2017 at 10:41 am #

      Hi Nasarudin, sorry I am not sure of the cause of this error.

      I believe you cannot use callbacks like the checkpointing when using the KerasClassifier as is the case in the iris tutorial:
      http://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

      • Nasarudin February 6, 2017 at 1:50 am #

        Hi Jason. Thank you for your reply. I have a dataset that looks the same as the iris dataset but it is a lot bigger. I thought maybe I can use the callbacks when I train on the dataset.

        Since training on a large dataset might take a lot of time, do you have any suggestions for another function that can checkpoint when using KerasClassifier?

        By the way, you can refer to this link for the script that I wrote to implement the checkpoint function with KerasClassifier.

        http://stackoverflow.com/questions/41937719/checkpoint-deep-learning-models-in-keras

        Thank you.

        • Jason Brownlee February 6, 2017 at 9:44 am #

          Hi Nasarudin,

          I would recommend using a standalone Keras model rather than the sklearn wrapper if you wish to use checkpointing.

  6. Abner February 3, 2017 at 3:21 pm #

    Great tutorials love your page. I got a question: I am trying to optimize a model based on val_acc using ModelCheckpoint. However, I get great results rather quickly (example: val_acc = .98). This is my max validation, therefore, the model that will be saved (save_best_only). However several epochs gave me the same max validation and the latter epochs have higher training accuracy. Example Epoch 3: acc = .70, val_acc = .98, Epoch 50: acc = 1.00, val_acc = .98. Clearly, I would want to save the second one which generalizes on the data plus shows great training. How do I do this without having to save every epoch? Basically, I want to pass a second sorting parameter to monitor (monitor=val_acc,acc).

    Thanks.

    • Jason Brownlee February 4, 2017 at 9:57 am #

      Great question, off the cuff, I think you can pass multiple metrics for check-pointing.

      Try multiple checkpoint objects?

      Here’s the docs if that helps:
      https://keras.io/callbacks/#modelcheckpoint

      Ideally, you do want a balance between performance on the training dataset and on the validation dataset.

      • Abner February 7, 2017 at 5:13 am #

        Yes, it's not letting me pass a list or multiple parameters; it only expects one parameter based on the literature. For now, I have to settle for using val_acc for the bottleneck/top layers and val_loss for the final model, though I would prefer more control. Maybe I'll ask for it in a feature request.

        • Jason Brownlee February 7, 2017 at 10:23 am #

          Great idea, add an issue on the Keras project.

        • Damily March 22, 2017 at 10:12 pm #

          Hi Abner, have you solved this problem?
          I hope to receive your reply.

          • Schmax July 29, 2019 at 8:36 am #

            You could define a custom metric that incorporates both val_acc and val_loss

  7. Fatma February 9, 2017 at 3:31 pm #

    Hi Jason, thank you for your tutorial. I need to ask one question, if my input contains two images with different labels (the label represents the popularity of the image). I need to know how to feed this pair of images such that the first image pass through CNN1 and the second one pass through CNN2 Then I can merge them using the merge layer to classify which one is more popular than the other one. How can I use the library in order to handle the two different inputs?

    • Jason Brownlee February 10, 2017 at 9:48 am #

      Hi Fatma,

      Perhaps you can reframe your problem to output the popularity for one image and use an “if” statement to rank the relative popularities of each image separately?

      • Fatma February 10, 2017 at 1:53 pm #

        I need to compare between the popularity value of the two input images such that the output will be the first image is high popular than the second image or vice versa then when I feed one test image (unlabeled) it should be compared with some baseline of the training data to compute its popularity

  8. Nasarudin February 21, 2017 at 3:30 pm #

    Hi Jason, great tutorial as always.

    I want to ask regarding 'validation_split' in the script. What is the difference between this script, which has the 'validation_split' variable, and the one from here, which does not have the 'validation_split' variable: http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ ?

    • Jason Brownlee February 22, 2017 at 9:56 am #

      Great question Nasarudin.

      If you have a validation_split, it will give you some feedback during training about the performance of the model on “unseen” data. When using a check-point, this can be useful as you can stop training when performance on validation data stops improving (a sign of possible overfitting).

  9. jerpint March 18, 2017 at 7:46 am #

    thanks!

  10. Aleksandra Nabożny April 10, 2017 at 11:54 pm #

    Hi Jason! Thank you very much for the tutorial!

    Can you provide a sample code to visualize results as follows: plot some images from validation dataset with their 5 top label predictions?

    • Jason Brownlee April 11, 2017 at 9:33 am #

      Yes you could do this using the probabilistic output from the network.

      Sorry, I do not have examples at this stage.

  11. Pratik April 19, 2017 at 6:39 pm #

    Thanks for the tutorial Jason 🙂

  12. Krishan Chopra April 25, 2017 at 6:33 am #

    Hello Sir, i want to thank you for this cool tutorial,
    Currently I am checkpointing my model every 50 epochs. I also want to checkpoint any epoch whose val_acc is better but is not the 50th epoch. For example, I have checkpointed the 250th epoch, but val_acc for the 282nd epoch is better and I want to save it; because I have specified the period to be 50, I can't save the 282nd epoch.
    Should I use both save_best_only=True and period=50 in ModelCheckpoint?

    • Jason Brownlee April 25, 2017 at 7:53 am #

      Sorry, I’m not sure I follow.

      You can use the checkpoint to save any improvement to the model regardless of epoch. Perhaps that would be an approach to try?

  13. Zhihao Chen May 19, 2017 at 11:18 am #

    Hi Jason,

    I love your tutorial very much but just find a little bit confused about how to combine the check point with the cross validation. As is shown in your other tutorial, we don’t explicitly call model.fit when doing cross validation. Could you give me an example how to add check-point into this?

    • Jason Brownlee May 20, 2017 at 5:33 am #

      You may have to run the k-fold cross-validation process manually (e.g. loop the folds yourself).

  14. chiyuan May 28, 2017 at 3:36 pm #

    Love your articles. Learnt lots of things here.

    Please keep your tutorials going!

  15. Abolfazl June 22, 2017 at 3:00 am #

    Jason, your blog is amazing. Thanks for helping us out in learning this awesome field.

  16. Xiufeng Yang June 29, 2017 at 3:43 pm #

    Hi, great post! I want to save the training model not the validation model, how to set the parameters in checkpoint()?

    • Jason Brownlee June 30, 2017 at 8:08 am #

      The trained model is saved, there is no validation model. Just an evaluation of the model using validation data.

  17. Mike September 19, 2017 at 5:55 pm #

    Is it possible to checkpoint in between epochs?
    I have an enormous dataset that takes 20hrs per epoch and it’s failed before it finished an epoch. It would be great if I could checkpoint every fifth of an epoch or so.

    • Jason Brownlee September 20, 2017 at 5:55 am #

      I would recommend using a data generator and then using a checkpoint after a fixed number of batches. This might work.
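As a rough sketch of checkpointing after a fixed number of batches: recent versions of ModelCheckpoint accept an integer save_freq, which is counted in batches. The example below assumes TensorFlow's bundled Keras and uses a small in-memory array rather than a data generator to stay short; the filename is a hypothetical choice.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

# Synthetic stand-in data: 100 rows, 8 inputs, 1 binary output.
rng = np.random.default_rng(7)
X = rng.random((100, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

# An integer save_freq is counted in batches: here, save every 5 batches.
filepath = os.path.join(tempfile.mkdtemp(), "weights.latest.weights.h5")
checkpoint = keras.callbacks.ModelCheckpoint(
    filepath, save_weights_only=True, save_freq=5)

# 100 samples / batch_size 10 = 10 batches per epoch, so the file is
# rewritten partway through the epoch as well as at its end.
model.fit(X, y, epochs=1, batch_size=10, callbacks=[checkpoint], verbose=0)
print(os.path.exists(filepath))
```

Validation metrics are only computed at the end of an epoch, so a mid-epoch checkpoint like this is better suited to saving the latest weights than to save_best_only on a validation score.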

  18. Mounir October 31, 2017 at 10:30 pm #

    Hi Jason, great blog. Do you also happen to know how to save/store the epoch number of the last observed improvement during training? That would be very useful to study overfitting

    • Jason Brownlee November 1, 2017 at 5:45 am #

      Yes, you could add the “epoch” variable to the checkpoint filename.

      For example:
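A minimal sketch of such a filename template in plain Python, with made-up epochs and scores, including one way to read back the epoch of the last improvement from the saved filenames:

```python
import re

# Filename template with the epoch number included.
template = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Hypothetical checkpoint files left behind by a training run.
files = [template.format(epoch=e, val_accuracy=a)
         for e, a in [(1, 0.61), (4, 0.66), (17, 0.71)]]

# The highest epoch among the saved files is the last observed improvement.
pattern = re.compile(r"weights-improvement-(\d+)-")
last_epoch = max(int(pattern.match(name).group(1)) for name in files)
print(last_epoch)  # 17
```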

  19. Jes November 3, 2017 at 8:04 am #

    is it possible to use Keras checkpoints together with Gridsearch? in case Gridsearch crashes?

    • Jason Brownlee November 3, 2017 at 2:15 pm #

      Not a good idea Jes. I’d recommend doing a custom grid search.

  20. Abad December 3, 2017 at 10:10 pm #

    Receiving following error:
    TypeError Traceback (most recent call last)
    in ()
    72
    73 filepath=”weights.best.hdf5″
    —> 74 checkpoint = ModelCheckpoint(filepath, monitor=’val_acc’, verbose=0, save_best_only=True,node=’max’)
    75 callbacks_checkpt = [checkpoint]
    76

    TypeError: __init__() got an unexpected keyword argument ‘node’

    • Jason Brownlee December 4, 2017 at 7:47 am #

      Looks like a typo.

      Double check that you have copied the code exactly from the tutorial?

  21. Aditya December 24, 2017 at 8:21 am #

    Hi Jason!
    Thanks for the informative post.

    I have one question – in case of unexpected stoppage of the run, we have the best model weights for the epochs DONE SO FAR. How can we use this checkpoint as a STARTING point to continue with the remaining epochs?

    • Jason Brownlee December 25, 2017 at 5:22 am #

      You can load the weights, see the example in the tutorial of exactly this.

  22. Liaqat Ali December 27, 2017 at 3:53 am #

    Thanks for such a nice explanation. I want to ask: if we are performing some experiments and want the neural network model to achieve high accuracy on the test set, can we use this method to find the best tuned network or the highest possible accuracy?

    • Jason Brownlee December 27, 2017 at 5:21 am #

      This method can help find and save a well performing model during training.

  23. davenso January 8, 2018 at 4:54 pm #

    very cool.

  24. davenso January 8, 2018 at 4:57 pm #

    Based on this example, how long does it take typically to save or retrieve a checkpoint model?

    • Jason Brownlee January 9, 2018 at 5:24 am #

      Very fast, just a few seconds for large models.

      • davenso January 9, 2018 at 3:43 pm #

        Thanks, Jason. Could you kindly give an estimate, more than 10 secs?

  25. kszyniu January 12, 2018 at 5:59 am #

    Nice post.
    However, I have one question. Can I get number of epochs that model was trained for in other way than reading its filename?
    Here’s my use case: upon loading my model I want to restore the training exactly from the point I saved it (by passing a value to initial_epoch argument in .fit()) for the sake of better-looking graphs in TensorBoard.
    For example, I trained my model for 2 epochs (got 2 messages: “Epoch 1/5”, “Epoch 2/5”) and saved it. Now, I want to load that model and continue training from 3rd epoch (I expect getting message “Epoch 3/5” and so on).
    Is there a better way than saving epochs to filename and then getting it from there (which seems kinda messy)?

    • Jason Brownlee January 12, 2018 at 11:48 am #

      You could read this from the filename. It’s just a string.

      You could also have a callback that writes the last epoch completed to a file, and overwrite this file each time.
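The callback idea could look something like the sketch below. It assumes TensorFlow's bundled Keras and synthetic data; LastEpochWriter and read_last_epoch are hypothetical names, not part of Keras. The recorded value is passed to the initial_epoch argument of fit() so that epoch numbering resumes where it stopped.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras


class LastEpochWriter(keras.callbacks.Callback):
    """Hypothetical helper: overwrite a text file with the last completed epoch."""

    def __init__(self, path):
        super().__init__()
        self.path = path

    def on_epoch_end(self, epoch, logs=None):
        # Keras passes a 0-based epoch index; store the count of finished epochs.
        with open(self.path, "w") as handle:
            handle.write(str(epoch + 1))


def read_last_epoch(path):
    """Return the number of completed epochs, or 0 if nothing was recorded."""
    if not os.path.exists(path):
        return 0
    with open(path) as handle:
        return int(handle.read().strip())


# Synthetic stand-in data: 8 inputs, 1 binary output.
rng = np.random.default_rng(7)
X = rng.random((100, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

epoch_file = os.path.join(tempfile.mkdtemp(), "last_epoch.txt")

# First run: train 2 of the 5 planned epochs.
model.fit(X, y, epochs=2, callbacks=[LastEpochWriter(epoch_file)], verbose=0)

# Later run: resume the epoch numbering from the recorded value.
# (After a real restart, the weights would be loaded from a checkpoint first.)
start = read_last_epoch(epoch_file)
model.fit(X, y, epochs=5, initial_epoch=start,
          callbacks=[LastEpochWriter(epoch_file)], verbose=0)
print(start, read_last_epoch(epoch_file))  # 2 5
```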

  26. Max Jansen January 16, 2018 at 9:21 pm #

    Great post and a fantastic blog! I can’t thank you enough!

  27. Gabriel January 26, 2018 at 3:18 pm #

    Hi there. Great posts! Quick question: have you run into this problem? “callbacks.py:405: RuntimeWarning: Can save best model only with val_acc available, skipping.”

    Running on AWS GPU Compute instance, fyi. I am not going straight from the Keras.Sequence() model… instead, I am using the SearchGridCV as I am trying to perform some tuning tasks but want to save the best model. Any suggestions?

    • Jason Brownlee January 27, 2018 at 5:53 am #

      I would recommend not combining checkpointing with cross validation or grid search.

  28. Rob February 28, 2018 at 2:16 am #

    Hi Jason,

    Thanks a lot for all your posts, really helpful. Can you explain to me why some epochs improve validation accuracy whilst previous epochs did not? If you do not use any random sampling in your dataset (e.g. no dropout), how can it be that epoch 12 increases validation score while epoch 11 does not? Aren’t they based on the same starting point (the model outputted by epoch 10)?

    thanks!

    • Jason Brownlee February 28, 2018 at 6:07 am #

      The algorithm is stochastic, and not every update improves the scores across all of the data.

      This is a property of the learning algorithm, gradient descent, that is doing its best, but cannot “see” the whole state of the problem, but instead operates piece-wise and incrementally. This is a feature, not a bug. It often leads to better outcomes.

      Does that help?

  29. Pete March 2, 2018 at 11:50 pm #

    I was curious what the difference is between these two versions.
    It seems that only the filepath is different?

    • Jason Brownlee March 3, 2018 at 8:11 am #

      One keeps every improvement, one keeps only the best.

      • Pete March 4, 2018 at 11:40 pm #

        Why is the filepath variable so magic? Just changing the filepath variable gives these different results. How is this achieved?

  30. Pete March 3, 2018 at 1:20 am #

    Finally, if I add the line model.save_weights('weights.hdf5'), what is the difference between these weights and the ModelCheckpoint best weights?

    • Jason Brownlee March 3, 2018 at 8:18 am #

      The difference is you are saving the weights from the final model after the last epoch instead of when a change in model skill has occurred during training.

  31. Pete March 3, 2018 at 9:23 pm #

    When I add the line model.save_weights('weights.hdf5') to save the weights from the final model, and I also save the ModelCheckpoint best weights, I find that these two hdf5 files are not the same. I was confused about why the final model weights I saved weren't the best model weights.

    • Jason Brownlee March 4, 2018 at 6:02 am #

      Yes, that is the idea of checkpointing: the model at the last epoch may not be the best model, in fact it often is not.

      Checkpointing will save the best model along the way.

      Does that help? Perhaps re-read the tutorial to make this clearer?

      • Pete March 4, 2018 at 11:32 pm #

        Ok, thanks. In a sense, once we have used a checkpoint, it is meaningless to call model.save_weights('weights.hdf5') again.

  32. Kaushal Shetty March 29, 2018 at 6:19 am #

    Hi Jason,
    How do I checkpoint a regression model. Is my metric accuracy or mse in such a case? And what should I monitor in such case in the modelcheckpoint? I am training a time series model.

    Thanks

  33. Kakoli May 27, 2018 at 8:39 pm #

    Hi Jason
    While running for 50 epochs, I am checkpointing and saving the weights after every 5 epochs.
    Now after 27th, VM disconn.
    Then I reconnect and compile the model after loading the saved weights. Now if I evaluate, I shall get the score on the best weight till 27th epoch. But since only 25 epochs are considered, accuracy will not be good, right?
    In that case, how do I continue the remaining 25 epochs with the saved weights?
    Thanks

    • Jason Brownlee May 28, 2018 at 5:57 am #

      You can load a set of weights and continue training the model.

  34. Rahmad ars May 30, 2018 at 7:02 am #

    Hi jason, thanks for the tutorial. If i want to extract weight values of all layer from my model, how to do that? thanks

    • Jason Brownlee May 30, 2018 at 3:06 pm #

      I believe there is a get_weights() function on each layer.
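A small sketch of reading the weights layer by layer, assuming TensorFlow's bundled Keras and an untrained example network of my own choosing:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Each Dense layer returns [kernel, bias] as NumPy arrays.
shapes = []
for layer in model.layers:
    layer_shapes = [w.shape for w in layer.get_weights()]
    shapes.append(layer_shapes)
    print(layer.name, layer_shapes)
```

Calling model.get_weights() instead returns the same arrays as one flat list covering all layers.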

  35. michael alex June 18, 2018 at 2:23 pm #

    Hi Jason,
    Thanks for all the great tutorials, including this one. Is it possible to Grid Search model hyper-parameters and check-point the model at the same time? I can do either one independently, but not both at the same time. Can you show us?

    Thanks,
    Michael

    • Jason Brownlee June 18, 2018 at 3:13 pm #

      Yes, but you will have to write the for loops for the grid search yourself. sklearn won’t help.

  36. Hung June 18, 2018 at 6:40 pm #

    Thanks a lot Jason for excellent tutorial.

  37. Adam July 21, 2018 at 12:06 am #

    Where are the key values “02d” for epoch and “.2f” for val_acc coming from?

  38. IGOR STAVNITSER July 23, 2018 at 11:27 am #

    Is there a way to checkpoint model weights to memory instead of a file?
    Also in your example you are maximizing val_acc. What are the merits of maximizing val_acc vs minimizing val_loss?
    Thank you for a great post!

    • Jason Brownlee July 23, 2018 at 2:25 pm #

      I’m sure you could write a custom call back to do this. I don’t have an example.

      You can choose what is important to you, accuracy might be more relevant to you when using the model to make predictions.
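On the first question above, a custom callback that keeps the best weights in memory might look like the sketch below. It assumes TensorFlow's bundled Keras and synthetic data; BestWeightsInMemory is a hypothetical name, not a Keras class.

```python
import numpy as np
from tensorflow import keras


class BestWeightsInMemory(keras.callbacks.Callback):
    """Hypothetical helper: keep the best weights seen so far in a Python list."""

    def __init__(self, monitor="val_loss"):
        super().__init__()
        self.monitor = monitor
        self.best = float("inf")
        self.best_weights = None

    def on_epoch_end(self, epoch, logs=None):
        current = (logs or {}).get(self.monitor)
        # Lower is better for a loss; remember the weights when it improves.
        if current is not None and current < self.best:
            self.best = current
            self.best_weights = self.model.get_weights()


# Synthetic stand-in data: 8 inputs, 1 binary output.
rng = np.random.default_rng(7)
X = rng.random((200, 8)).astype("float32")
y = (X.sum(axis=1) > 4.0).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(8,)),
    keras.layers.Dense(12, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam")

keeper = BestWeightsInMemory()
model.fit(X, y, validation_split=0.33, epochs=5, batch_size=10,
          callbacks=[keeper], verbose=0)

# After fit() the model holds the last-epoch weights; restore the best ones.
model.set_weights(keeper.best_weights)
```

The built-in EarlyStopping callback also offers a restore_best_weights argument, which may be enough if early stopping is already in use.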

  39. Abhijeet Gokar August 9, 2018 at 9:27 pm #

    Thanks a lot, it is good practice to consciously design and implement checkpoints in our model.

  40. Yari August 14, 2018 at 3:19 am #

    Hi Jason, thanks for the post (as well as many others I have read!).
    I was wondering: after fitting with model.fit(....) and using checkpoint you will have the weights saved in an external file. But what’s the specific state of model instance after the training? Will it have the best weights or it will have the last weights calculated during the training?

    So, to sum up. If I want to do a prediction on a test set immediately after the training/fitting should I load the best weights from the external file and then do the prediction or I could directly use model.predict(...) immediately after model.fit(...) ?

    Thanks a lot for your support!

    • Jason Brownlee August 14, 2018 at 6:24 am #

      The file will have the model weights at the time of the checkpoint. The weights can be loaded and used directly for making predictions. This is the whole point of checkpointing – to save the best model found during training.

      • Yari August 14, 2018 at 3:40 pm #

        Hey Jason, thanks for getting back to me. Yes that was clear to me. What’s not clear is what weights model has at the end of the training.

        Lets suppose I’m using

        callbacks = [
        EarlyStopping(patience=15, monitor='val_loss', min_delta=0, mode='min'),
        ModelCheckpoint('best-weights.h5', monitor='val_loss', save_best_only=True, save_weights_only=True)
        ]

        after training is done I’ll have the best weights saved on best-weights.h5 but what are the weights stored in the model instance? If I do model.evaluate(...) (without loading best-weights.h5) will it use the best weights or just the weights corresponding to the last epoch?

        • Jason Brownlee August 15, 2018 at 5:57 am #

          They are the weights at the time the checkpoint is triggered. If it is triggered many times, you may have many weight files saved.

        • Anupam Singh September 26, 2018 at 6:01 am #

          model instance will have the weights of last epoch not the best one

  41. Mohammed September 13, 2018 at 5:21 pm #

    I have my own pretrained model (.ckpt and .meta files). How can I use these files to extract features of my dataset in the form of a matrix whose rows represent samples and columns represent features?

    • Jason Brownlee September 14, 2018 at 6:34 am #

      Perhaps load it in Python manually, then try to populate a defined Keras model manually?

  42. chamith November 4, 2018 at 7:28 am #

    Thanks a lot for the great description. But I have to clarify one thing regarding this. When I train the model using two callback functions model *ModelCheckpoint* and

  43. Daniel Penalva November 27, 2018 at 10:23 pm #

    Hi Jason,

    Thanks for the tutorial. I am wondering if this approach works with GridSearch and how I can use the checkpoint to track the results. Also, I am working with a Colab Research Notebook right now; is there a way to detect process interruption and use a checkpoint to save the model?

    Thank you for your help again!!

    • Jason Brownlee November 28, 2018 at 7:41 am #

      No, a grid search and checkpointing are at odds.

      I’m not familiar with “Colab Research Notebook”, what is it?

      • Daniel Penalva November 28, 2018 at 11:17 pm #

        https://colab.research.google.com. It is a Google initiative to promote notebooks with free virtual machines and kernels with GPU and TPU processing (still experimental). But it disconnects from the kernel after a short time of no use, and the virtual machine can be unmounted after some hours. So it's fundamental for deep learning applications that you checkpoint and save the state to keep going after reconnecting.

        Too bad for Grid Search. How can you do Fine Tuning (hyperparameters) and process babysitting (verify vizualization of results, possible overfits, variance-bias so on …) without checkpointing on Grid Search ?

        thank you !

        • Jason Brownlee November 29, 2018 at 7:41 am #

          When you grid search, you want to know about what hyperparameters give good performance. The models are discarded – no checkpointing needed. Later, you can use good config to fit a new model.

          • Daniel Penalva November 29, 2018 at 10:12 pm #

            Yes, that's right, but still, if I can't finish without being disconnected from the server I will never know the model, so the checkpoint still comes in handy. But if you have in mind any other way to do the tuning together with checkpoints, please let us know!

            Thanks ! 🙂

          • Jason Brownlee November 30, 2018 at 6:33 am #

            Why do you need the model?

            The models during a grid search are discarded. You only need the hyperparameters of the best performing model.

          • Daniel Penalva November 30, 2018 at 12:50 am #

            Update
            There seems to be an issue with Keras cross-validation and checkpointing that requires a workaround:

            https://github.com/keras-team/keras/issues/836

            I can’t understand why they closed the issue without a solution; I just dropped my question there.

            Will try to figure out how to do this …

          • Daniel Penalva November 30, 2018 at 12:17 pm #

            “The models during a grid search are discarded. You only need the hyperparameters of the best performing model.”

            Sorry I wasn’t clear. I can’t run Keras on my laptop; it’s more than 12 years old. The only free infrastructure I found is Colab Notebook. To run through all the models in GridSearchCV, I need to checkpoint and restart the computation, since the process on Google’s virtual machine won’t persist for more than a few hours, sometimes less than an hour.
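
One way to reconcile a grid search with a session that may disconnect is to checkpoint the search itself rather than the models: write each configuration's score to disk as soon as it is computed, and skip finished configurations on restart. The sketch below is only an illustration under assumptions. `run_grid` and `evaluate` are hypothetical names, `evaluate` stands in for code that fits and scores a model, the results file name is invented, and the loop is hand-rolled rather than scikit-learn's GridSearchCV.

```python
import itertools
import json
import os
import tempfile

def run_grid(param_grid, evaluate, results_path):
    """Evaluate every parameter combination, saving progress after each one."""
    # Reload scores from an earlier, interrupted run if the file exists.
    results = {}
    if os.path.exists(results_path):
        with open(results_path) as f:
            results = json.load(f)

    names = sorted(param_grid)
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        key = json.dumps(params, sort_keys=True)  # stable string key
        if key in results:
            continue  # already scored before the interruption
        results[key] = evaluate(params)
        # Persist immediately so a disconnect loses at most one configuration.
        with open(results_path, "w") as f:
            json.dump(results, f)

    best_key = max(results, key=results.get)
    return json.loads(best_key), results[best_key]

def evaluate(params):
    # Hypothetical scorer: replace with code that builds, fits and scores a model.
    return -abs(params["batch_size"] - 32) - abs(params["epochs"] - 50)

# A temporary directory is used here; on Colab a mounted drive would be needed
# for the file to survive a reset of the virtual machine.
results_file = os.path.join(tempfile.mkdtemp(), "grid_results.json")
grid = {"batch_size": [16, 32], "epochs": [10, 50]}
best_params, best_score = run_grid(grid, evaluate, results_file)
print(best_params, best_score)
```

Calling `run_grid` again with the same file re-evaluates nothing, so only the hyperparameter scores need to persist, which matches the point that the models themselves can be discarded.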

  44. William December 4, 2018 at 9:32 am #

    Hi Jason

    I have a question regarding compiling the model after weights are loaded.

    # load weights
    model.load_weights("weights.best.hdf5")
    # Compile model (required to make predictions)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    When I looked at the Keras API https://keras.io/models/model/#compile the description for the compile function says “Configures the model for training”. I am confused about why we need to compile the model to make predictions.

    Also, thought you might like to know that the Pima Indians onset of diabetes binary classification problem data set is no longer available.

    Thanks

    • Jason Brownlee December 4, 2018 at 2:35 pm #

      Thanks, you might not need to compile the model after loading any longer, perhaps the API has changed.

  45. kadir sharif December 25, 2018 at 6:54 pm #

    I am training a model for about 100 epochs. Now suppose the electricity goes out after the model has run for 30 epochs, and I have model checkpoints saved in HDF5 format with val_acc as the monitor.

    In this kind of situation, how can I load the checkpoint into the same model to continue training from where it was interrupted? And will it continue training the model from epoch 30 onward? It would be a great help if you could answer my questions.
    Thanks in advance.

  46. Riccardo January 10, 2019 at 3:06 am #

    Hi Jason, thanks for your tutorials, they are very helpful. I’ve implemented a model and then saved it with a checkpoint correctly. Unfortunately, when I reload the model (the same structure) with the saved weights, I can’t obtain the same predictions as before; they are slightly worse. My model is a simple model:
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    The checkpoint is:
    checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
    callbacks_list = [checkpoint]
    model.fit(train_x, train_y, validation_split=0.15, epochs=epochs, batch_size=batch_size,
    callbacks=callbacks_list, verbose=verbose)
    Then I rebuild the same structure and call:
    model.load_weights("filename")
    and the predictions are a little different. Thanks in advance.

    • Jason Brownlee January 10, 2019 at 7:56 am #

      That is surprising, the model should make identical predictions before and after a save. Anything else would be a bug either in your code or in Keras. Try narrowing it down, here are some ideas:
      https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

      • Riccardo January 10, 2019 at 7:34 pm #

        Could it be that when I reload the weights I reload the best result saved to file, while in the first run the weights at the end of model.fit() are different?

        • Riccardo January 10, 2019 at 8:24 pm #

          The way I can reload exactly the trained model is without using ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min'). I create the model, train it and then use model.save(file). Then with load_model(file) I get the same result. But here is a question: is it the best model? Because now I don’t specify monitor='val_loss', mode='min', save_best_only, etc. I simply fit the model and then save it.
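
The difference described in this thread can be reproduced without Keras. The sketch below imitates the comparison that `save_best_only=True` with `mode='min'` makes at the end of each epoch; the function name is hypothetical and the `val_loss` values are invented for illustration. It shows that the checkpoint file can hold weights from an earlier epoch than the ones the in-memory model has when `fit()` returns.

```python
def best_checkpoint_epoch(val_losses):
    """Return the 0-based epoch whose weights a 'min'-mode best-only checkpoint keeps."""
    best_loss = float("inf")
    best_epoch = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:  # only a strict improvement triggers a save
            best_loss = loss
            best_epoch = epoch
    return best_epoch

val_losses = [0.90, 0.62, 0.55, 0.58, 0.61]  # made-up validation losses
saved = best_checkpoint_epoch(val_losses)    # epoch stored in the checkpoint file
final = len(val_losses) - 1                  # epoch of the weights left in memory
print(saved, final)
```

With these numbers the file holds the weights from epoch index 2, while the model in memory after training has the weights from epoch index 4, so predictions from the two can differ slightly.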

  47. Jinwen Xi February 7, 2019 at 12:21 pm #

    Hi, Jason,

    Very helpful tutorial. Can I save the model after each mini-batch instead of each epoch?

    • Jason Brownlee February 7, 2019 at 2:07 pm #

      Yes, you could achieve that with a custom callback.

      • Jinwen Xi February 7, 2019 at 2:28 pm #

        Thanks for the reply.
        Is there a template showing what this custom callback would look like?
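
A minimal sketch of the logic such a callback needs is shown below. To keep it runnable without Keras installed, it is written as a plain class that receives a `save_fn` callable; `BatchCheckpoint`, `save_fn`, `every` and the file name pattern are all hypothetical names chosen for illustration. In a real script the class would subclass `keras.callbacks.Callback`, Keras would call `on_batch_end` during training, and `save_fn` would typically wrap `self.model.save_weights`.

```python
class BatchCheckpoint:
    """Call save_fn(path) every `every` batches (sketch of a Keras-style callback)."""

    def __init__(self, save_fn, every=1, pattern="weights-batch-{count:06d}.hdf5"):
        self.save_fn = save_fn
        self.every = every
        self.pattern = pattern
        self.count = 0  # batches seen across all epochs

    def on_batch_end(self, batch, logs=None):
        # Keras passes the batch index within the current epoch; a separate
        # running count keeps file names unique across epochs.
        self.count += 1
        if self.count % self.every == 0:
            self.save_fn(self.pattern.format(count=self.count))

saved_paths = []
callback = BatchCheckpoint(saved_paths.append, every=2)
for epoch in range(2):
    for batch in range(3):  # stand-in for the Keras training loop
        callback.on_batch_end(batch)
print(saved_paths)
```

Writing a file after every single batch can slow training considerably, which is why the sketch exposes the `every` interval.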

  48. Jinwen Xi February 11, 2019 at 6:16 pm #

    Thanks. I followed the instructions and created a custom callback to:

    (1) at the end of each epoch, save the whole model to a file (model_pre.h5), then post-process the file and dump it to model_post.h5
    (2) at the beginning of the next epoch, load the model from the post-processed file model_post.h5

    The main part of the code implementing (1) and (2) is below:

    But it seems like the model was not correctly updated using the ‘./model_post.h5’ file when I check the parameter values using HDF5View. Could you let me know if I did it in the correct way? Thanks.

    -Jinwen

    • Jason Brownlee February 12, 2019 at 7:55 am #

      Wow!

      I’m surprised that it is possible to load the model prior to each batch or epoch. Perhaps this is not working as expected?

  49. Wonbin February 15, 2019 at 1:29 am #

    Hi Jason, thanks for this great tutorial!
    In a regression task, among ‘val_loss’ and other metrics (like ‘mse’, ‘mape’ and so on) for the monitor argument, which one would be more important to finalize the model?

    Maybe I’m basically asking about the fundamental difference between loss and metrics..?
    I’ve been just guessing that I might have to choose a metric not the ‘val_loss’ to see the prediction performance of the final model. Would this be correct?

    • Jason Brownlee February 15, 2019 at 8:09 am #

      Loss and metrics can be the same thing or different.

      It comes down to what you and project stakeholders value in a final model (metric) and what must be optimized to achieve that (loss).

  50. Judson February 19, 2019 at 11:40 am #

    Hello Jason thanks for the posts.

    As a suggestion, could you show us how to checkpoint using XGBoost? I am having difficulty figuring it out on my own.

  51. Fredrick Ughimi February 27, 2019 at 2:51 am #

    Hello Jason,

    Straight as an arrow. You made it so easy to follow. This is really less abstract.

    Thank you.

    • Jason Brownlee February 27, 2019 at 7:34 am #

      Thanks, I’m happy it helped.

      • Matt July 12, 2019 at 8:35 am #

        I used this post as the base for a model. I’ve been using precision and recall from the keras_metrics module and can’t seem to get it working as the monitored metric for the checkpoint function. Just tells me:

        RuntimeWarning: Can save best model only when val_recall: available: skipping.

        But after each epoch the val_recall is calculated and displayed so I’m not quite sure what is wrong?

  52. Magnus Wik May 6, 2019 at 7:54 pm #

    Dear Jason,

    I am confused about Keras callback. You use save_best_only=True to save the weights, but according to Keras this is the setting for saving the latest best model. For weights it should be save_weights_only=True. By default, save_weights_only is set to false. Am I missing something?

    Also, in one of my Jupyter notebooks, my weights are not saved at all, but in another they are, although the code is the same. It is weird.

    • Magnus May 6, 2019 at 11:39 pm #

      Dear Jason,

      Now I understand. When you load “weights.best.hdf5”, you are actually loading the full model including the weights. So there is no need to create the model before.

      In section “Loading a Check-Pointed Neural Network Model” I skipped lines 11-14 and then:
      from keras.models import load_model
      model = load_model("weights.best.hdf5")

      and I got the same result.

      • Jason Brownlee May 7, 2019 at 6:17 am #

        Nice work.

        • Magnus Wik May 7, 2019 at 8:29 pm #

          Thanks.
          Maybe you should update the text, since now you write “The checkpoint only includes the model weights. It assumes you know the network structure.”, which is incorrect. The checkpoint includes the full model.
          If only the weights are saved it should be ‘save_weights_only=True’.
          What is confusing is that it is possible to treat the full model files as weights, using model.load_weights. Personally I think it should throw a warning.

    • Jason Brownlee May 7, 2019 at 6:16 am #

      Perhaps try running the example from the command line?

  53. Steven Gonzalez May 17, 2019 at 11:45 am #

    Thanks for a great tutorial!

  54. hayj May 27, 2019 at 12:56 am #

    Hello Jason,

    Great tutorial again!
    I wanted to ask you how to monitor my val_top_k_categorical_accuracy instead of val_acc.
    I can’t find anything about it on the internet.
    I tried different things: changing the position of my metrics (metrics=['top_k_categorical_accuracy', 'accuracy']), trying monitor="top_k_categorical_accuracy"… but nothing works.

    • Jason Brownlee May 27, 2019 at 6:51 am #

      Interesting. If you add the metric to the list of metrics does it appear in the history dict?

      If so, you can use that name.

  55. hayj May 28, 2019 at 1:37 am #

    Anyway, I solved the problem by defining my own callback, which saves the model whenever I observe any “val_*” improvement.

  56. Dang Nguyen Hong June 27, 2019 at 6:23 am #

    Each time i search for the answer of a question, your blog solves it ! Great work, thanks so much!

  57. zeinab July 22, 2019 at 5:08 am #

    As usual, great tutorial

  58. alilouche August 6, 2019 at 8:10 am #

    I can’t save my model using your instructions. It does not report an error, but nothing is saved.
    Thanks for your help.

    • Jason Brownlee August 6, 2019 at 2:00 pm #

      Sorry to hear that. Perhaps try reducing your example to the simplest possible code example?

      Perhaps try posting your code and issue to stackoverflow?

  59. zeinab August 13, 2019 at 12:40 pm #

    How can I load weights when I use a custom metrics?

    • Jason Brownlee August 13, 2019 at 2:37 pm #

      As follows:

      • zeinab August 14, 2019 at 6:55 pm #

        Thank you, Jason.

        But what about the case of load_weights?
        model.load_weights("bestweights.hdf5")
        Unfortunately, this function cannot see the custom metric.

        • Jason Brownlee August 15, 2019 at 7:59 am #

          Yes, you must define the metric and load the weights in the same scope.

          • zeinab August 15, 2019 at 8:13 am #

            Sorry, but I do not understand. What do you mean?

  60. zeinab August 14, 2019 at 9:32 pm #

    For a certain problem, I try to solve it using a CNN model with 5-fold cross-validation. I measure its performance as the average loss and the average accuracy over the 5 folds. (I stop training when I reach the minimum validation loss.)
    1- Is this the right way to calculate a model’s performance?

    Then I evaluate the same problem on the same dataset but using an LSTM.

    2- Now, I want to select the best model. How?

    3- Is the best model the one with the lowest loss, the highest accuracy, or something else?

  61. zeinab August 15, 2019 at 2:43 am #

    My problem is regression, not classification.

  62. zeinab August 15, 2019 at 8:19 am #

    What about choosing the best model (CNN, LSTM, …) applied on the same problem and the same dataset?

  63. zeinab August 15, 2019 at 8:35 am #

    I apply more than one algorithm (CNN, LSTM) to solve my problem (text similarity). How can I select the best algorithm?

    Is the best algorithm the one that has the lowest validation loss?

    • Jason Brownlee August 15, 2019 at 2:18 pm #

      It is common to choose a model based both on complexity (minimized) and on skill (optimizing a domain-specific metric).

      Loss is a good proxy for the domain specific metric, but hard to communicate to subject matter experts/stakeholders.

  64. ylnhari August 15, 2019 at 9:27 pm #

    Hi,

    If I want to checkpoint both the best model and the model at the last epoch, in case training halts in the middle for some other reason, how do I do that?

    • Jason Brownlee August 16, 2019 at 7:51 am #

      Perhaps configure a different ModelCheckpoint instance for each case?

  65. kumar August 16, 2019 at 9:12 pm #

    Dear Jason,

    We are using the following statements to save the model.

    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])
    filepath="model_lstm_10M_{epoch:02d}.h5"
    checkpoint = ModelCheckpoint(filepath, period=1000, verbose=1, save_best_only=False)
    #tbCallBack = TensorBoard(log_dir='./log', histogram_freq=0, write_graph=True, write_images=True)
    callbacks_list = [checkpoint]

    # fit model
    model.fit(train_data, target, epochs=10000,batch_size=4,callbacks=callbacks_list,verbose=2)
    model.save('model_10M_lstm_100.h5')

    We stopped training at 2300 epochs. When we start again, it starts from epoch 1, but I want to continue from the previous epoch (2300). What changes do we have to make to achieve this?

    Thanks in advance

    • Jason Brownlee August 17, 2019 at 5:42 am #

      By default, training starts at epoch 0, although fit() accepts an initial_epoch argument to change this.

      If you load the model and start training again, it will start with the weights from the end of the last run. Only the epoch number will reset, not the weights.

      If you know how many epochs were completed, you can subtract that from the number of epochs you wish to use in the second run.
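
If the checkpoint file names include the epoch number, as with the `{epoch:02d}` pattern in the question above, the number of completed epochs can be recovered from the newest file name. The helper below is a sketch; the function name and the regular expression are assumptions tied to that naming pattern. The recovered number could then be passed to `fit()` through its `initial_epoch` argument so that the epoch counter continues instead of restarting.

```python
import re

def last_saved_epoch(filenames, pattern=r"model_lstm_10M_(\d+)\.h5$"):
    """Return (epoch, filename) for the highest epoch found, or (0, None)."""
    best = (0, None)
    for name in filenames:
        match = re.search(pattern, name)
        if match and int(match.group(1)) > best[0]:
            best = (int(match.group(1)), name)
    return best

# Example directory listing; in practice this could come from os.listdir().
files = ["model_lstm_10M_1000.h5", "model_lstm_10M_2000.h5", "notes.txt"]
epoch, path = last_saved_epoch(files)
print(epoch, path)

# With Keras installed, training could then resume roughly like this:
#   model = load_model(path)
#   model.fit(train_data, target, epochs=10000, initial_epoch=epoch, ...)
```

Note that with `period=1000`, a run stopped at epoch 2300 would resume from the file written at epoch 2000, so the last 300 epochs would be repeated.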

  66. Jacky QIN October 17, 2019 at 1:47 pm #

    Hi, Jason. Your course is pretty good, I have learned a lot. But there is a little bug; I guess the API has been updated. The code:

    checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

    The argument ‘monitor’ should be ‘val_acc’ not ‘val_accuracy’

    • Jason Brownlee October 17, 2019 at 1:54 pm #

      The examples assume Keras 2.3 or higher where you must use val_accuracy.

      For older versions of Keras, you must use val_acc.
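
A script that has to run on both older and newer Keras versions can avoid hard-coding the name by checking which key the library actually reports. The sketch below assumes the keys come from `history.history` after a short trial fit; `resolve_monitor` is a hypothetical helper, not part of Keras.

```python
def resolve_monitor(available_keys, candidates=("val_accuracy", "val_acc")):
    """Return the first candidate name present in the reported keys, else None."""
    for name in candidates:
        if name in available_keys:
            return name
    return None

# Keys as they might appear in history.history for newer and older versions.
newer_keys = ["loss", "accuracy", "val_loss", "val_accuracy"]
older_keys = ["loss", "acc", "val_loss", "val_acc"]
print(resolve_monitor(newer_keys), resolve_monitor(older_keys))
```

The returned string can be passed as the `monitor` argument of `ModelCheckpoint`. A result of `None` suggests the metric is not being computed at all, for example because no validation data was given to `fit()`.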

  67. Eduardo October 28, 2019 at 2:32 am #

    Hi Jason, do you know of a way to also save the epoch number of the last checkpointed model?

    What I want to do is, after training, evaluate the difference between Train and Validation loss of the training epoch corresponding to the “best checkpointed model” to have a reference of the error gap at that exact point.

    Right now I’m calculating that “error gap” from the results of the model validation: train loss vs test loss.

    Thank you in advance!
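
The `History` object returned by `fit()` records the per-epoch losses, so the epoch that a `val_loss`-monitoring checkpoint kept can be looked up after training. The following is a sketch, assuming the checkpoint monitored `val_loss` in `min` mode and that training ran with validation data; the helper name is hypothetical and the numbers are invented.

```python
def best_epoch_gap(history):
    """Return (epoch, train_loss, val_loss, gap) at the lowest val_loss."""
    val = history["val_loss"]
    index = min(range(len(val)), key=val.__getitem__)  # first minimum wins
    train_loss, val_loss = history["loss"][index], val[index]
    # Report the epoch 1-based, matching the numbers Keras prints.
    return index + 1, train_loss, val_loss, val_loss - train_loss

# Invented numbers shaped like the dict in history.history.
history = {"loss": [0.80, 0.50, 0.40, 0.30], "val_loss": [0.85, 0.60, 0.55, 0.65]}
epoch, train_loss, val_loss, gap = best_epoch_gap(history)
print(epoch, round(gap, 2))
```

One caveat: the training loss Keras reports for an epoch is averaged over the batches while the weights are still changing, so it only approximates the training loss of the weights saved at the end of that epoch.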

  68. Walid November 7, 2019 at 3:03 am #

    Great, clear post.
    Isn’t ModelCheckpoint saving the full model with weights?

    I think “save_weights_only” is by default false
