How to Check-Point Deep Learning Models in Keras

Deep learning models can take hours, days or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library.

Let’s get started.

  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Photo by saragoldsmith, some rights reserved.

Checkpointing Neural Network Models

Application checkpointing is a fault tolerance technique for long running processes.

It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.

When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

The Keras library provides a checkpointing capability via its callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named, and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename that you use to store the weights can include variables like the epoch number or metric.

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.
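
For example, a minimal sketch of wiring the callback into training (assuming a compiled model and arrays X and Y already exist; 'val_acc' is the metric name in the Keras 2-era API used in this post):

from keras.callbacks import ModelCheckpoint

# save the weights whenever the monitored metric improves
checkpoint = ModelCheckpoint("weights-{epoch:02d}.hdf5", monitor="val_acc",
                             mode="max", save_best_only=True, verbose=1)
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10,
          callbacks=[checkpoint])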

Note: you may need to install the h5py library to output network weights in HDF5 format.
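
If it is missing, it can typically be installed with pip:

sudo pip install h5py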

Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the pima-indians-diabetes.csv file is in your working directory. You can download the dataset from the UCI Machine Learning Repository (update: download from here). The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_acc' and mode='max'). The weights are stored in a file that includes the epoch and score in the filename (weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5).
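
A sketch of the full example under those assumptions (Keras 2-era API; the small network topology is illustrative, not prescriptive):

# Checkpoint the weights when validation accuracy improves
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
import numpy

# fix the random seed for reproducibility
numpy.random.seed(7)

# load the Pima Indians onset of diabetes dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]

# define a small fully-connected network
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint: write a new weights file each time validation accuracy improves
filepath = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]

# fit the model, holding back 33% of the data for validation
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10,
          callbacks=callbacks_list, verbose=0)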

Running the example prints a message at the end of each epoch in which the validation accuracy improved and the weights were saved (output truncated for brevity).

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:
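
The exact filenames depend on the epochs and scores seen in your run; following the pattern above, they will look something like this (values illustrative):

weights-improvement-01-0.65.hdf5
weights-improvement-03-0.69.hdf5
weights-improvement-10-0.72.hdf5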

This is a very simple checkpointing strategy. It may create a lot of unnecessary check-point files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure that you have a snapshot of the best model discovered during your run.

Checkpoint Best Neural Network Model Only

A simpler checkpoint strategy is to save the model weights to the same file if and only if the validation accuracy improves.

This can be done using the same code from above, changing only the output filename to be fixed (not including score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.
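
Only the checkpoint setup changes from the previous example; a sketch of the altered lines (same data, model and assumptions as above):

# checkpoint: overwrite a single file with the best weights seen so far
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10,
          callbacks=callbacks_list, verbose=0)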

Running this example prints a similar log of improvements, with the single weights file overwritten each time (output truncated for brevity).

You should see the weight file in your local directory.

This is a handy checkpoint strategy to always use during your experiments. It will ensure that the best model from the run is saved for you to use later if you wish. It saves you from writing code to manually track and serialize the best model during training.

Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.

The checkpoint only includes the model weights. It assumes you know the network structure. The structure too can be serialized to file, in JSON or YAML format.
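
For instance, the architecture alone can be written out once and kept alongside the weight files (a small sketch; model is the network defined earlier):

# serialize only the architecture; the checkpoint files hold the weights
json_string = model.to_json()
with open("model.json", "w") as f:
    f.write(json_string)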

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then used to make predictions on the entire dataset.
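
A sketch of the loading example under the same assumptions as above (network structure and filenames as described):

# Load checkpointed weights and evaluate the model
from keras.models import Sequential
from keras.layers import Dense
import numpy

# re-create the same network structure used during training
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# load the checkpointed weights and compile for evaluation
model.load_weights("weights.best.hdf5")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")

# estimate accuracy on the whole dataset using the loaded weights
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]
scores = model.evaluate(X, Y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))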

Running the example loads the saved weights, evaluates the model on the whole dataset, and prints the resulting classification accuracy.

Summary

In this post you have discovered the importance of checkpointing deep learning models for long training runs.

You learned two checkpointing strategies that you can use on your next deep learning project:

  1. Checkpoint Model Improvements.
  2. Checkpoint Best Model Only.

You also learned how to load a checkpointed model and make predictions.

Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments and I will do my best to answer.

86 Responses to How to Check-Point Deep Learning Models in Keras

  1. Gerrit Govaerts October 5, 2016 at 1:07 am #

    79.56% with a 3-hidden-layer architecture, 24 neurons each.
    Great blog, learned a lot.

  2. Lau MingFei October 21, 2016 at 1:02 pm #

    Hi Jason, how can I checkpoint a model with my custom metric? The example code you gave above uses monitor='val_acc', but when I replace it with monitor=my_metric, it displays the following warning message:

    /usr/local/lib/python3.5/dist-packages/keras/callbacks.py:286: RuntimeWarning: Can save best model only with available, skipping.
    ‘skipping.’ % (self.monitor), RuntimeWarning)

    So what should I do about this?

    • Jason Brownlee October 22, 2016 at 6:55 am #

      Great question Lau,

      I have not tried to check point with a custom metric, sorry. I cannot give you good advice.

  3. Xu Zhang November 8, 2016 at 11:21 am #

    Hi Jason,
    A great post!

    I saved the model and weights using callbacks, ModelCheckpoint. If I want to train it continuously from the last epoch, how to set the model.fit() command to start from the previous epoch? Sometimes, we need to change the learning rates after several epochs and to continue training from the last epoch. Your advice is highly appreciated.

    • Jason Brownlee November 9, 2016 at 9:48 am #

      Great question Xu Zhang,

      You can load the model weights back into your network and then start a new training process.

      • Lopez GG December 30, 2016 at 1:01 pm #

        Thank you Jason. I ran an epoch and got the loss down to 353.6888. The session got disconnected, so I used the weights as follows. However, I don't see a change in loss. Am I loading the weights correctly? Here is my code:

        >>> filename = "weights-improvement-01-353.6888-embedding.hdf5"
        >>> model.load_weights(filename)
        >>> model.compile(loss='binary_crossentropy', optimizer='adam')
        >>> model.fit(dataX_Arr, dataY_Arr, batch_size=batch_size, nb_epoch=15, callbacks=callbacks_list)
        Epoch 1/15
        147744/147771 [============================>.] - ETA: 1s - loss: -353.6892
        Epoch 00000: loss improved from inf to -353.68885, saving model to weights-

        • Jason Brownlee December 31, 2016 at 7:02 am #

          It looks like you are loading the weights correctly.

  4. Anthony November 29, 2016 at 9:58 pm #

    Great post again Jason, thank you!

  5. Nasarudin January 31, 2017 at 1:48 pm #

    Hi Jason, thank you for your tutorial. I want to implement this checkpoint function in the iris-flower model script but failed to do it. It keeps showing this error and I do not know how to solve it.

    I put the 'model checkpoint' line after the 'baseline model' function and added 'callbacks'.

    RuntimeError: Cannot clone object, as the constructor does not seem to set parameter callbacks

    Thank you for your help

    • Jason Brownlee February 1, 2017 at 10:41 am #

      Hi Nasarudin, sorry I am not sure of the cause of this error.

      I believe you cannot use callbacks like the checkpointing when using the KerasClassifier as is the case in the iris tutorial:
      http://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

      • Nasarudin February 6, 2017 at 1:50 am #

        Hi Jason. Thank you for your reply. I have a dataset that looks the same as the iris dataset but is a lot bigger. I thought maybe I could use the callbacks when I train on it.

        Since training a large dataset might take a lot of time, do you have any suggestion for another way to checkpoint when using KerasClassifier?

        By the way, you can refer to this link for the script that I wrote to implement the checkpoint function with KerasClassifier.

        http://stackoverflow.com/questions/41937719/checkpoint-deep-learning-models-in-keras

        Thank you.

        • Jason Brownlee February 6, 2017 at 9:44 am #

          Hi Nasarudin,

          I would recommend using a standalone Keras model rather than the sklearn wrapper if you wish to use checkpointing.

  6. Abner February 3, 2017 at 3:21 pm #

    Great tutorials, love your page. I got a question: I am trying to optimize a model based on val_acc using ModelCheckpoint. However, I get great results rather quickly (example: val_acc = .98). This is my max validation score, and therefore the model that will be saved (save_best_only). However, several epochs gave me the same max validation score, and the later epochs have higher training accuracy. Example: Epoch 3: acc = .70, val_acc = .98; Epoch 50: acc = 1.00, val_acc = .98. Clearly, I would want to save the second one, which generalizes on the data plus shows great training. How do I do this without having to save every epoch? Basically, I want to pass a second sorting parameter to monitor (monitor=val_acc,acc).

    Thanks.

    • Jason Brownlee February 4, 2017 at 9:57 am #

      Great question, off the cuff, I think you can pass multiple metrics for check-pointing.

      Try multiple checkpoint objects?

      Here’s the docs if that helps:
      https://keras.io/callbacks/#modelcheckpoint

      Ideally, you do want a balance between performance on the training dataset and on the validation dataset.

      • Abner February 7, 2017 at 5:13 am #

        Yes, it's not letting me pass an array/list or multiple parameters; it's only expecting one parameter based on the docs. For now, I have to settle for using val_acc for the bottlenecks/top layer and val_loss for the final model, though I would prefer more control. Maybe I'll ask for it in a feature request.

        • Jason Brownlee February 7, 2017 at 10:23 am #

          Great idea, add an issue on the Keras project.

        • Damily March 22, 2017 at 10:12 pm #

          Hi Abner, did you solve this problem?
          Hope to receive your reply.

  7. Fatma February 9, 2017 at 3:31 pm #

    Hi Jason, thank you for your tutorial. I need to ask one question: my input contains two images with different labels (the label represents the popularity of the image). I need to know how to feed this pair of images such that the first image passes through CNN1 and the second one passes through CNN2. Then I can merge them using the merge layer to classify which one is more popular than the other. How can I use the library to handle the two different inputs?

    • Jason Brownlee February 10, 2017 at 9:48 am #

      Hi Fatma,

      Perhaps you can reframe your problem to output the popularity for one image and use an “if” statement to rank the relative popularities of each image separately?

      • Fatma February 10, 2017 at 1:53 pm #

        I need to compare the popularity values of the two input images, such that the output is whether the first image is more popular than the second or vice versa. Then, when I feed one unlabeled test image, it should be compared with some baseline of the training data to compute its popularity.

  8. Nasarudin February 21, 2017 at 3:30 pm #

    Hi Jason, great tutorial as always.

    I want to ask about 'validation_split' in the script. What is the difference between this script, which has the 'validation_split' argument, and the one from here, which does not: http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ ?

    • Jason Brownlee February 22, 2017 at 9:56 am #

      Great question Nasarudin.

      If you have a validation_split, it will give you some feedback during training about the performance of the model on "unseen" data. When using a checkpoint, this can be useful, as you can stop training when performance on the validation data stops improving (a sign of possible overfitting).

  9. jerpint March 18, 2017 at 7:46 am #

    thanks!

  10. Aleksandra Nabożny April 10, 2017 at 11:54 pm #

    Hi Jason! Thank you very much for the tutorial!

    Can you provide a sample code to visualize results as follows: plot some images from validation dataset with their 5 top label predictions?

    • Jason Brownlee April 11, 2017 at 9:33 am #

      Yes you could do this using the probabilistic output from the network.

      Sorry, I do not have examples at this stage.

  11. Pratik April 19, 2017 at 6:39 pm #

    Thanks for the tutorial Jason 🙂

  12. Krishan Chopra April 25, 2017 at 6:33 am #

    Hello Sir, I want to thank you for this cool tutorial.
    Currently I am checkpointing my model every 50 epochs. I also want to checkpoint any epoch whose val_acc is better but is not a 50th epoch. For example, I have checkpointed the 250th epoch, but the val_acc for the 282nd epoch is better and I want to save it; as I have specified the period to be 50, I can't save the 282nd epoch.
    Should I set both save_best_only=True and period=50 in ModelCheckpoint?

    • Jason Brownlee April 25, 2017 at 7:53 am #

      Sorry, I’m not sure I follow.

      You can use the checkpoint to save any improvement to the model regardless of epoch. Perhaps that would be an approach to try?

  13. Zhihao Chen May 19, 2017 at 11:18 am #

    Hi Jason,

    I love your tutorial very much but am just a little bit confused about how to combine the checkpoint with cross validation. As shown in your other tutorial, we don't explicitly call model.fit when doing cross validation. Could you give me an example of how to add a checkpoint to this?

    • Jason Brownlee May 20, 2017 at 5:33 am #

      You may have to run the k-fold cross-validation process manually (e.g. loop the folds yourself).
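
      A sketch of that manual loop (assumes a hypothetical create_model() helper that builds and compiles a fresh network, and numpy arrays X and Y):

      from sklearn.model_selection import StratifiedKFold
      from keras.callbacks import ModelCheckpoint

      kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
      for i, (train_ix, test_ix) in enumerate(kfold.split(X, Y)):
          model = create_model()  # hypothetical: a fresh model per fold
          checkpoint = ModelCheckpoint("fold-%d.best.hdf5" % i, monitor="val_acc",
                                       mode="max", save_best_only=True, verbose=0)
          model.fit(X[train_ix], Y[train_ix],
                    validation_data=(X[test_ix], Y[test_ix]),
                    epochs=150, batch_size=10, callbacks=[checkpoint], verbose=0)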

  14. chiyuan May 28, 2017 at 3:36 pm #

    Love your articles. Learnt lots of things here.

    Please keep your tutorials going!

  15. Abolfazl June 22, 2017 at 3:00 am #

    Jason, your blog is amazing. Thanks for helping us out in learning this awesome field.

  16. Xiufeng Yang June 29, 2017 at 3:43 pm #

    Hi, great post! I want to save the training model, not the validation model; how do I set the parameters in checkpoint()?

    • Jason Brownlee June 30, 2017 at 8:08 am #

      The trained model is saved, there is no validation model. Just an evaluation of the model using validation data.

  17. Mike September 19, 2017 at 5:55 pm #

    Is it possible to checkpoint in between epochs?
    I have an enormous dataset that takes 20hrs per epoch and it’s failed before it finished an epoch. It would be great if I could checkpoint every fifth of an epoch or so.

    • Jason Brownlee September 20, 2017 at 5:55 am #

      I would recommend using a data generator and then using a checkpoint after a fixed number of batches. This might work.
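
      A hypothetical sketch of the batch-interval idea as a custom callback (class name and interval are illustrative; assumes a compiled model and data X, Y):

      from keras.callbacks import Callback

      class BatchCheckpoint(Callback):
          # save the weights every `interval` batches, overwriting the file
          def __init__(self, filepath, interval):
              super(BatchCheckpoint, self).__init__()
              self.filepath = filepath
              self.interval = interval
              self.seen = 0

          def on_batch_end(self, batch, logs=None):
              self.seen += 1
              if self.seen % self.interval == 0:
                  self.model.save_weights(self.filepath, overwrite=True)

      model.fit(X, Y, epochs=5, batch_size=10,
                callbacks=[BatchCheckpoint("partial.hdf5", interval=1000)])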

  18. Mounir October 31, 2017 at 10:30 pm #

    Hi Jason, great blog. Do you also happen to know how to save/store the epoch number of the last observed improvement during training? That would be very useful to study overfitting

    • Jason Brownlee November 1, 2017 at 5:45 am #

      Yes, you could add the “epoch” variable to the checkpoint filename.

      For example:
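
      a sketch of such a filename template, where the {epoch:02d} placeholder writes the epoch number of each improvement into the saved filename:

      filepath = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"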

  19. Jes November 3, 2017 at 8:04 am #

    Is it possible to use Keras checkpoints together with grid search, in case the grid search crashes?

    • Jason Brownlee November 3, 2017 at 2:15 pm #

      Not a good idea Jes. I’d recommend doing a custom grid search.

  20. Abad December 3, 2017 at 10:10 pm #

    Receiving the following error:

    TypeError Traceback (most recent call last)
    in ()
         72
         73 filepath = "weights.best.hdf5"
    ---> 74 checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True, node='max')
         75 callbacks_checkpt = [checkpoint]
         76

    TypeError: __init__() got an unexpected keyword argument 'node'

    • Jason Brownlee December 4, 2017 at 7:47 am #

      Looks like a typo: 'node' should be 'mode'.

      Double check that you have copied the code exactly from the tutorial?

  21. Aditya December 24, 2017 at 8:21 am #

    Hi Jason!
    Thanks for the informative post.

    I have one question – in case of unexpected stoppage of the run, we have the best model weights for the epochs DONE SO FAR. How can we use this checkpoint as a STARTING point to continue with the remaining epochs?

    • Jason Brownlee December 25, 2017 at 5:22 am #

      You can load the weights; see the example of exactly this in the tutorial.

  22. Liaqat Ali December 27, 2017 at 3:53 am #

    Thanks for such a nice explanation. I want to ask: if we are performing some experiments and want the neural network model to achieve high accuracy on the test set, can we use this method to find the best tuned network or the highest possible accuracy?

    • Jason Brownlee December 27, 2017 at 5:21 am #

      This method can help find and save a well performing model during training.

  23. davenso January 8, 2018 at 4:54 pm #

    very cool.

  24. davenso January 8, 2018 at 4:57 pm #

    Based on this example, how long does it take typically to save or retrieve a checkpoint model?

    • Jason Brownlee January 9, 2018 at 5:24 am #

      Very fast, just a few seconds for large models.

      • davenso January 9, 2018 at 3:43 pm #

        Thanks, Jason. Could you kindly give an estimate, more than 10 secs?

  25. kszyniu January 12, 2018 at 5:59 am #

    Nice post.
    However, I have one question: can I get the number of epochs the model was trained for in some way other than reading its filename?
    Here's my use case: upon loading my model, I want to resume training exactly from the point I saved it (by passing a value to the initial_epoch argument of .fit()) for the sake of better-looking graphs in TensorBoard.
    For example, I trained my model for 2 epochs (got 2 messages: "Epoch 1/5", "Epoch 2/5") and saved it. Now I want to load that model and continue training from the 3rd epoch (I expect to see "Epoch 3/5" and so on).
    Is there a better way than saving the epoch to the filename and then parsing it from there (which seems kind of messy)?

    • Jason Brownlee January 12, 2018 at 11:48 am #

      You could read this from the filename. It’s just a string.

      You could also have a callback that writes the last epoch completed to a file, and overwrite this file each time.
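
      A minimal sketch of that idea using a LambdaCallback (the last_epoch.txt filename is illustrative; assumes a compiled model and data X, Y):

      from keras.callbacks import LambdaCallback

      def write_last_epoch(epoch, logs):
          # overwrite a small text file with the 1-based index of the last completed epoch
          with open("last_epoch.txt", "w") as f:
              f.write(str(epoch + 1))

      log_epoch = LambdaCallback(on_epoch_end=write_last_epoch)
      model.fit(X, Y, epochs=150, batch_size=10, callbacks=[log_epoch])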

  26. Max Jansen January 16, 2018 at 9:21 pm #

    Great post and a fantastic blog! I can’t thank you enough!

  27. Gabriel January 26, 2018 at 3:18 pm #

    Hi there. Great posts! Quick question: have you run into this problem? “callbacks.py:405: RuntimeWarning: Can save best model only with val_acc available, skipping.”

    Running on an AWS GPU Compute instance, FYI. I am not going straight from the Keras Sequential() model… instead, I am using GridSearchCV, as I am trying to perform some tuning tasks but want to save the best model. Any suggestions?

    • Jason Brownlee January 27, 2018 at 5:53 am #

      I would recommend not combining checkpointing with cross validation or grid search.

  28. Rob February 28, 2018 at 2:16 am #

    Hi Jason,

    Thanks a lot for all your posts, really helpful. Can you explain to me why some epochs improve validation accuracy whilst previous epochs did not? If you do not use any random sampling in your dataset (e.g. no dropout), how can it be that epoch 12 increases validation score while epoch 11 does not? Aren’t they based on the same starting point (the model outputted by epoch 10)?

    thanks!

    • Jason Brownlee February 28, 2018 at 6:07 am #

      The learning algorithm is stochastic; not every update improves the scores across all of the data.

      This is a property of the learning algorithm, gradient descent, that is doing its best, but cannot “see” the whole state of the problem, but instead operates piece-wise and incrementally. This is a feature, not a bug. It often leads to better outcomes.

      Does that help?

  29. Pete March 2, 2018 at 11:50 pm #

    I was curious: what is the difference between these two versions?
    It seems that just the filepath was different?

    • Jason Brownlee March 3, 2018 at 8:11 am #

      One keeps every improvement, one keeps only the best.

      • Pete March 4, 2018 at 11:40 pm #

        Why is the filepath variable so magic? How can just changing the filepath produce these different results? How is this achieved?

  30. Pete March 3, 2018 at 1:20 am #

    At the end, if I add the line model.save_weights('weights.hdf5'), what is the difference between these weights and the ModelCheckpoint best weights?

    • Jason Brownlee March 3, 2018 at 8:18 am #

      The difference is you are saving the weights from the final model after the last epoch instead of when a change in model skill has occurred during training.

  31. Pete March 3, 2018 at 9:23 pm #

    When I added model.save_weights('weights.hdf5') to save the weights from the final model, and also saved the ModelCheckpoint best weights, I found that the two hdf5 files were not the same. I was confused about why the final model weights I saved weren't the best model weights.

    • Jason Brownlee March 4, 2018 at 6:02 am #

      Yes, that is the idea of checkpointing: the model at the last epoch may not be the best model; in fact, it often is not.

      Checkpointing will save the best model along the way.

      Does that help? Perhaps re-read the tutorial to make this clearer?

      • Pete March 4, 2018 at 11:32 pm #

        Ok, thanks. In a sense, once we have used a checkpoint, it is redundant to call model.save_weights('weights.hdf5') again.

  32. Kaushal Shetty March 29, 2018 at 6:19 am #

    Hi Jason,
    How do I checkpoint a regression model? Is my metric accuracy or mse in such a case, and what should I monitor in the ModelCheckpoint? I am training a time series model.

    Thanks

  33. Kakoli May 27, 2018 at 8:39 pm #

    Hi Jason
    While running for 50 epochs, I am checkpointing and saving the weights every 5 epochs.
    After the 27th epoch, the VM disconnected.
    Then I reconnected and compiled the model after loading the saved weights. Now if I evaluate, I get the score of the best weights up to the 27th epoch. But since only 25 epochs are considered, accuracy will not be good, right?
    In that case, how do I continue the remaining 25 epochs with the saved weights?
    Thanks

    • Jason Brownlee May 28, 2018 at 5:57 am #

      You can load a set of weights and continue training the model.

  34. Rahmad ars May 30, 2018 at 7:02 am #

    Hi Jason, thanks for the tutorial. If I want to extract the weight values of all layers from my model, how do I do that? Thanks!

    • Jason Brownlee May 30, 2018 at 3:06 pm #

      I believe there is a get_weights() function on each layer.

  35. michael alex June 18, 2018 at 2:23 pm #

    Hi Jason,
    Thanks for all the great tutorials, including this one. Is it possible to Grid Search model hyper-parameters and check-point the model at the same time? I can do either one independently, but not both at the same time. Can you show us?

    Thanks,
    Michael

    • Jason Brownlee June 18, 2018 at 3:13 pm #

      Yes, but you will have to write the for loops for the grid search yourself. sklearn won’t help.

  36. Hung June 18, 2018 at 6:40 pm #

    Thanks a lot, Jason, for the excellent tutorial.
