How to Check-Point Deep Learning Models in Keras

Deep learning models can take hours, days or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library.

Let’s get started.

  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

Photo by saragoldsmith, some rights reserved.

Checkpointing Neural Network Models

Application checkpointing is a fault-tolerance technique for long-running processes.

It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.

When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

The Keras library provides a checkpointing capability via a callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named, and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename that you use to store the weights can include variables like the epoch number or metric.

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.

Note, you may need to install the h5py library to output network weights in HDF5 format.
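
For example, with pip:

pip install h5py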


Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the pima-indians-diabetes.csv file is in your working directory.

You can download the dataset from here:

The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_acc' and mode='max'). The weights are stored in a file that includes the epoch number and the validation accuracy in the filename (weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5).
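
A minimal sketch of this setup (the layer sizes, number of epochs and batch size are illustrative assumptions; the dataset file is expected to have eight input columns followed by the binary class column):

# Checkpoint the weights when validation accuracy improves (a sketch)
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
import numpy

# fix the random seed for reproducibility
numpy.random.seed(7)

# load the Pima Indians dataset (8 inputs, 1 binary output)
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]

# define a small fully connected network
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# checkpoint: save the weights each time validation accuracy improves
filepath = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]

# fit the model, holding back 33% of the data for validation
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)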

Running the example produces the following output (truncated for brevity):

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:

This is a very simple checkpointing strategy. It may create a lot of unnecessary check-point files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure that you have a snapshot of the best model discovered during your run.

Checkpoint Best Neural Network Model Only

A simpler check-point strategy is to save the model weights to the same file, if and only if the validation accuracy improves.

This can be done easily using the same code from above and changing the output filename to be fixed (not including score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.
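
Only the checkpoint definition and filename change from the sketch above; for example:

# checkpoint: overwrite a single file with the best weights seen so far
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)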

Running this example provides the following output (truncated for brevity):

You should see the weight file in your local directory.

This is a handy checkpoint strategy to always use during your experiments. It will ensure that the best model from the run is saved for you to use later if you wish, and it avoids the need to write your own code to track and serialize the best model during training.

Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.

The checkpoint only includes the model weights. It assumes you know the network structure, which can itself be serialized to file in JSON or YAML format.

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then used to make predictions on the entire dataset.
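
A minimal sketch, assuming the same small network structure used in the training sketches above:

# Load check-pointed weights into a network with the same structure (a sketch)
from keras.models import Sequential
from keras.layers import Dense
import numpy

# re-create the network structure used when the checkpoint was written
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# load the best weights and compile the model so it can make predictions
model.load_weights("weights.best.hdf5")
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")

# estimate accuracy on the entire dataset using the loaded weights
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
Y = dataset[:, 8]
scores = model.evaluate(X, Y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))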

Running the example produces the following output:

Summary

In this post you have discovered the importance of checkpointing deep learning models for long training runs.

You learned two checkpointing strategies that you can use on your next deep learning project:

  1. Checkpoint Model Improvements.
  2. Checkpoint Best Model Only.

You also learned how to load a checkpointed model and make predictions.

Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments and I will do my best to answer.


128 Responses to How to Check-Point Deep Learning Models in Keras

  1. Gerrit Govaerts October 5, 2016 at 1:07 am #

    79.56% with a 3 hidden layer architecture , 24 neurons each
    Great blog , learned a lot

  2. Lau MingFei October 21, 2016 at 1:02 pm #

    Hi Jason, how can I checkpoint a model with my custom metric? The example codes you gave above is monitor = ‘val_acc’ , but when I replace it with monitor = my_metric , it displays the following warning message:

    /usr/local/lib/python3.5/dist-packages/keras/callbacks.py:286: RuntimeWarning: Can save best model only with available, skipping.
    ‘skipping.’ % (self.monitor), RuntimeWarning)

    So how should I do with this?

    • Jason Brownlee October 22, 2016 at 6:55 am #

      Great question Lau,

      I have not tried to check point with a custom metric, sorry. I cannot give you good advice.

  3. Xu Zhang November 8, 2016 at 11:21 am #

    Hi Jason,
    A great post!

    I saved the model and weights using callbacks, ModelCheckpoint. If I want to train it continuously from the last epoch, how to set the model.fit() command to start from the previous epoch? Sometimes, we need to change the learning rates after several epochs and to continue training from the last epoch. Your advice is highly appreciated.

    • Jason Brownlee November 9, 2016 at 9:48 am #

      Great question Xu Zhang,

      You can load the model weights back into your network and then start a new training process.

      • Lopez GG December 30, 2016 at 1:01 pm #

        Thank you Jason. I ran a epoch and got the loss down to 353.6888. The session got disconnected so I used the weights as follows. However, I dont see a change in loss. I am loading the weights correctly ? Here is my code

        >>> filename = "weights-improvement-01-353.6888-embedding.hdf5"
        >>> model.load_weights(filename)
        >>> model.compile(loss='binary_crossentropy', optimizer='adam')
        >>> model.fit(dataX_Arr, dataY_Arr, batch_size=batch_size, nb_epoch=15, callbacks=callbacks_list)
        Epoch 1/15
        147744/147771 [============================>.] - ETA: 1s - loss: -353.6892Epoch 00000: loss improved from inf to -353.68885, saving model to weights-

        • Jason Brownlee December 31, 2016 at 7:02 am #

          It looks like you are loading the weights correctly.

  4. Anthony November 29, 2016 at 9:58 pm #

    Great post again Jason, thank you!

  5. Nasarudin January 31, 2017 at 1:48 pm #

    Hi Jason, thank you for your tutorial. I want to implement this checkpoint function in iris-flower model script but failed to do it. It keeps showing this error and I do not know how to solve it.

    I put the ‘model checkpoint’ line after the ‘baseline model’ function and add ‘callbacks’

    RuntimeError: Cannot clone object , as the constructor does not seem to set parameter callbacks

    Thank you for your help

    • Jason Brownlee February 1, 2017 at 10:41 am #

      Hi Nasarudin, sorry I am not sure of the cause of this error.

      I believe you cannot use callbacks like the checkpointing when using the KerasClassifier as is the case in the iris tutorial:
      http://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

      • Nasarudin February 6, 2017 at 1:50 am #

        Hi Jason. Thank you for your reply. I have some dataset that looks same with the iris dataset but it is a lot bigger. I thought maybe I can use the callbacks when I train the dataset.

        Since training a large dataset might take a lot of time, do you have any suggestion to implement another function that can do checkpointing when using KerasClassifier?

        By the way, you can refer to this link for the script that I wrote to implement the checkpoint function with KerasClassifier.

        http://stackoverflow.com/questions/41937719/checkpoint-deep-learning-models-in-keras

        Thank you.

        • Jason Brownlee February 6, 2017 at 9:44 am #

          Hi Nasarudin,

          I would recommend using a standalone Keras model rather than the sklearn wrapper if you wish to use checkpointing.

  6. Abner February 3, 2017 at 3:21 pm #

    Great tutorials love your page. I got a question: I am trying to optimize a model based on val_acc using ModelCheckpoint. However, I get great results rather quickly (example: val_acc = .98). This is my max validation, therefore, the model that will be saved (save_best_only). However several epochs gave me the same max validation and the latter epochs have higher training accuracy. Example Epoch 3: acc = .70, val_acc = .98, Epoch 50: acc = 1.00, val_acc = .98. Clearly, I would want to save the second one which generalizes on the data plus shows great training. How do I do this without having to save every epoch? Basically, I want to pass a second sorting parameter to monitor (monitor=val_acc,acc).

    Thanks.

    • Jason Brownlee February 4, 2017 at 9:57 am #

      Great question, off the cuff, I think you can pass multiple metrics for check-pointing.

      Try multiple checkpoint objects?

      Here’s the docs if that helps:
      https://keras.io/callbacks/#modelcheckpoint

      Ideally, you do want a balance between performance on the training dataset and on the validation dataset.

      • Abner February 7, 2017 at 5:13 am #

        Yes, it’s not letting me pass an array list or multiple parameters it’s only expecting 1 parameter base on literature, for now, I have to settle for using val_acc for bottlenecks/top layer and val_loss for the final model, though I would prefer more control. maybe I’ll ask for it in a feature request.

        • Jason Brownlee February 7, 2017 at 10:23 am #

          Great idea, add an issue on the Keras project.

        • Damily March 22, 2017 at 10:12 pm #

          Hi Abner, do you solve this problem now?
          hope to receive your reply.

  7. Fatma February 9, 2017 at 3:31 pm #

    Hi Jason, thank you for your tutorial. I need to ask one question, if my input contains two images with different labels (the label represents the popularity of the image). I need to know how to feed this pair of images such that the first image pass through CNN1 and the second one pass through CNN2 Then I can merge them using the merge layer to classify which one is more popular than the other one. How can I use the library in order to handle the two different inputs?

    • Jason Brownlee February 10, 2017 at 9:48 am #

      Hi Fatma,

      Perhaps you can reframe your problem to output the popularity for one image and use an “if” statement to rank the relative popularities of each image separately?

      • Fatma February 10, 2017 at 1:53 pm #

        I need to compare between the popularity value of the two input images such that the output will be the first image is high popular than the second image or vice versa then when I feed one test image (unlabeled) it should be compared with some baseline of the training data to compute its popularity

  8. Nasarudin February 21, 2017 at 3:30 pm #

    Hi Jason, great tutorial as always.

    I want to ask regarding ‘validation_split’ in the script. What is the difference between this script that has ‘validation_script’ variable and the one from here which did not have ‘validation_split’ variable http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ ?

    • Jason Brownlee February 22, 2017 at 9:56 am #

      Great question Nasarudin.

      If you have a validation_split, it will give you some feedback during training about the performance on the model on “unseen” data. When using a check-point, this can be useful as you can stop training when performance on validation data stops improving (a sign of possible overfitting).

  9. jerpint March 18, 2017 at 7:46 am #

    thanks!

  10. Aleksandra Nabożny April 10, 2017 at 11:54 pm #

    Hi Jason! Thank you very much for the tutorial!

    Can you provide a sample code to visualize results as follows: plot some images from validation dataset with their 5 top label predictions?

    • Jason Brownlee April 11, 2017 at 9:33 am #

      Yes you could do this using the probabilistic output from the network.

      Sorry, I do not have examples at this stage.

  11. Pratik April 19, 2017 at 6:39 pm #

    Thanks for the tutorial Jason 🙂

  12. Krishan Chopra April 25, 2017 at 6:33 am #

    Hello Sir, i want to thank you for this cool tutorial,
    Currently i am checkpointing my model every 50 epochs. I also want to checkpoint any epoch whose val_acc is better but is not the 50th epoch. For ex. i have checkpointed the 250th epoch, but val_acc for the 282nd epoch is better and i have to save it, but as i have specified the period to be 50, i cant save the 282nd epoch.
    Should i implement both attributes save_best_only=true and period=50 in ModelCheckPoint ?

    • Jason Brownlee April 25, 2017 at 7:53 am #

      Sorry, I’m not sure I follow.

      You can use the checkpoint to save any improvement to the model regardless of epoch. Perhaps that would be an approach to try?

  13. Zhihao Chen May 19, 2017 at 11:18 am #

    Hi Jason,

    I love your tutorial very much but just find a little bit confused about how to combine the check point with the cross validation. As is shown in your other tutorial, we don’t explicitly call model.fit when doing cross validation. Could you give me an example how to add check-point into this?

    • Jason Brownlee May 20, 2017 at 5:33 am #

      You may have to run the k-fold cross-validation process manually (e.g. loop the folds yourself).

  14. chiyuan May 28, 2017 at 3:36 pm #

    Love your articles. Learnt lots of things here.

    Please keep your tutorials going!

  15. Abolfazl June 22, 2017 at 3:00 am #

    Jason, your blog is amazing. Thanks for helping us out in learning this awesome field.

  16. Xiufeng Yang June 29, 2017 at 3:43 pm #

    Hi, great post! I want to save the training model not the validation model, how to set the parameters in checkpoint()?

    • Jason Brownlee June 30, 2017 at 8:08 am #

      The trained model is saved, there is no validation model. Just an evaluation of the model using validation data.

  17. Mike September 19, 2017 at 5:55 pm #

    Is it possible to checkpoint in between epochs?
    I have an enormous dataset that takes 20hrs per epoch and it’s failed before it finished an epoch. It would be great if I could checkpoint every fifth of an epoch or so.

    • Jason Brownlee September 20, 2017 at 5:55 am #

      I would recommend using a data generator and then using a checkpoint after a fixed number of batches. This might work.
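
      A minimal sketch of such a callback (BatchCheckpoint is a hypothetical name, not a built-in Keras class):

      from keras.callbacks import Callback

      class BatchCheckpoint(Callback):
          # save the model weights every save_every batches
          def __init__(self, filepath, save_every=1000):
              super(BatchCheckpoint, self).__init__()
              self.filepath = filepath
              self.save_every = save_every

          def on_batch_end(self, batch, logs=None):
              if batch > 0 and batch % self.save_every == 0:
                  self.model.save_weights(self.filepath)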

  18. Mounir October 31, 2017 at 10:30 pm #

    Hi Jason, great blog. Do you also happen to know how to save/store the epoch number of the last observed improvement during training? That would be very useful to study overfitting

    • Jason Brownlee November 1, 2017 at 5:45 am #

      Yes, you could add the “epoch” variable to the checkpoint filename.

      For example, something along these lines (a sketch; the placeholders are filled in by Keras when saving):
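
      # the {epoch:02d} and {val_acc:.2f} placeholders are filled in by Keras when the file is written
      filepath = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
      checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')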

  19. Jes November 3, 2017 at 8:04 am #

    is it possible to use Keras checkpoints together with Gridsearch? in case Gridsearch crashes?

    • Jason Brownlee November 3, 2017 at 2:15 pm #

      Not a good idea Jes. I’d recommend doing a custom grid search.

  20. Abad December 3, 2017 at 10:10 pm #

    Receiving following error:
    TypeError Traceback (most recent call last)
    in ()
    72
    73 filepath=”weights.best.hdf5″
    —> 74 checkpoint = ModelCheckpoint(filepath, monitor=’val_acc’, verbose=0, save_best_only=True,node=’max’)
    75 callbacks_checkpt = [checkpoint]
    76

    TypeError: __init__() got an unexpected keyword argument ‘node’

    • Jason Brownlee December 4, 2017 at 7:47 am #

      Looks like a typo.

      Double check that you have copied the code exactly from the tutorial?

  21. Aditya December 24, 2017 at 8:21 am #

    Hi Jason!
    Thanks for the informative post.

    I have one question – in case of unexpected stoppage of the run, we have the best model weights for the epochs DONE SO FAR. How can we use this checkpoint as a STARTING point to continue with the remaining epochs?

    • Jason Brownlee December 25, 2017 at 5:22 am #

      You can load the weights, see the example in the tutorial of exactly this.

  22. Liaqat Ali December 27, 2017 at 3:53 am #

    Thanks for such a nice explanation. I want to ask: if we are performing some experiments and want the neural network model to achieve high accuracy on the test set, can we use this method to find the best tuned network or the highest possible accuracy?????

    • Jason Brownlee December 27, 2017 at 5:21 am #

      This method can help find and save a well performing model during training.

  23. davenso January 8, 2018 at 4:54 pm #

    very cool.

  24. davenso January 8, 2018 at 4:57 pm #

    Based on this example, how long does it take typically to save or retrieve a checkpoint model?

    • Jason Brownlee January 9, 2018 at 5:24 am #

      Very fast, just a few seconds for large models.

      • davenso January 9, 2018 at 3:43 pm #

        Thanks, Jason. Could you kindly give an estimate, more than 10 secs?

  25. kszyniu January 12, 2018 at 5:59 am #

    Nice post.
    However, I have one question. Can I get number of epochs that model was trained for in other way than reading its filename?
    Here’s my use case: upon loading my model I want to restore the training exactly from the point I saved it (by passing a value to initial_epoch argument in .fit()) for the sake of better-looking graphs in TensorBoard.
    For example, I trained my model for 2 epochs (got 2 messages: “Epoch 1/5”, “Epoch 2/5”) and saved it. Now, I want to load that model and continue training from 3rd epoch (I expect getting message “Epoch 3/5” and so on).
    Is there a better way than saving epochs to filename and then getting it from there (which seems kinda messy)?

    • Jason Brownlee January 12, 2018 at 11:48 am #

      You could read this from the filename. It’s just a string.

      You could also have a callback that writes the last epoch completed to a file, and overwrite this file each time.
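
      A minimal sketch of that idea, using a LambdaCallback ('last_epoch.txt' is a hypothetical filename):

      from keras.callbacks import LambdaCallback

      # overwrite a small text file with the index of the last completed epoch
      epoch_logger = LambdaCallback(
          on_epoch_end=lambda epoch, logs: open('last_epoch.txt', 'w').write(str(epoch)))

      Pass epoch_logger in the callbacks list alongside the ModelCheckpoint.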

  26. Max Jansen January 16, 2018 at 9:21 pm #

    Great post and a fantastic blog! I can’t thank you enough!

  27. Gabriel January 26, 2018 at 3:18 pm #

    Hi there. Great posts! Quick question: have you run into this problem? “callbacks.py:405: RuntimeWarning: Can save best model only with val_acc available, skipping.”

    Running on AWS GPU Compute instance, fyi. I am not going straight from the Keras.Sequence() model… instead, I am using the SearchGridCV as I am trying to perform some tuning tasks but want to save the best model. Any suggestions?

    • Jason Brownlee January 27, 2018 at 5:53 am #

      I would recommend not combining checkpointing with cross validation or grid search.

  28. Rob February 28, 2018 at 2:16 am #

    Hi Jason,

    Thanks a lot for all your posts, really helpful. Can you explain to me why some epochs improve validation accuracy whilst previous epochs did not? If you do not use any random sampling in your dataset (e.g. no dropout), how can it be that epoch 12 increases validation score while epoch 11 does not? Aren’t they based on the same starting point (the model outputted by epoch 10)?

    thanks!

    • Jason Brownlee February 28, 2018 at 6:07 am #

      The algorithm is stochastic, and not every update improves the scores across all of the data.

      This is a property of the learning algorithm, gradient descent, that is doing its best, but cannot “see” the whole state of the problem, but instead operates piece-wise and incrementally. This is a feature, not a bug. It often leads to better outcomes.

      Does that help?

  29. Pete March 2, 2018 at 11:50 pm #

    I was curiosity what the different of these two version???
    It was seemed that just filepath was different??

    • Jason Brownlee March 3, 2018 at 8:11 am #

      One keeps every improvement, one keeps only the best.

      • Pete March 4, 2018 at 11:40 pm #

        Why the filepath variable was so magic? Just change the filepath variable can make these different result. How is this achieved?

  30. Pete March 3, 2018 at 1:20 am #

    In the last, if I add one code model.save_weights('weights.hdf5'), what the difference of this weights and ModelCheckpoint best weight??

    • Jason Brownlee March 3, 2018 at 8:18 am #

      The difference is you are saving the weights from the final model after the last epoch instead of when a change in model skill has occurred during training.

  31. Pete March 3, 2018 at 9:23 pm #

    When I add one code model.save_weights('weights.hdf5') to save weight from the final model; and I also save ModelCheckpoint best weight, I found that these two hdf5 file were not the same. I was confusing that why the final model weight I saved wasn’t the best model weight.

    • Jason Brownlee March 4, 2018 at 6:02 am #

      Yes, that is the idea of checkpointing: the model at the last epoch may not be the best model, and in fact it often is not.

      Checkpointing will save the best model along the way.

      Does that help? Perhaps re-read the tutorial to make this clearer?

      • Pete March 4, 2018 at 11:32 pm #

        Ok, thanks. In a sense, once we have used a checkpoint, it is meaningless to use model.save_weights('weights.hdf5') again.

  32. Kaushal Shetty March 29, 2018 at 6:19 am #

    Hi Jason,
    How do I checkpoint a regression model. Is my metric accuracy or mse in such a case? And what should I monitor in such case in the modelcheckpoint? I am training a time series model.

    Thanks

  33. Kakoli May 27, 2018 at 8:39 pm #

    Hi Jason
    While running for 50 epochs, I am checkpointing and saving the weights after every 5 epochs.
    Now after 27th, VM disconn.
    Then I reconnect and compile the model after loading the saved weights. Now if I evaluate, I shall get the score on the best weight till 27th epoch. But since only 25 epochs are considered, accuracy will not be good, right?
    In that case, how do I continue the remaining 25 epochs with the saved weights?
    Thanks

    • Jason Brownlee May 28, 2018 at 5:57 am #

      You can load a set of weights and continue training the model.

  34. Rahmad ars May 30, 2018 at 7:02 am #

    Hi jason, thanks for the tutorial. If i want to extract weight values of all layer from my model, how to do that? thanks

    • Jason Brownlee May 30, 2018 at 3:06 pm #

      I believe there is a get_weights() function on each layer.

  35. michael alex June 18, 2018 at 2:23 pm #

    Hi Jason,
    Thanks for all the great tutorials, including this one. Is it possible to Grid Search model hyper-parameters and check-point the model at the same time? I can do either one independently, but not both at the same time. Can you show us?

    Thanks,
    Michael

    • Jason Brownlee June 18, 2018 at 3:13 pm #

      Yes, but you will have to write the for loops for the grid search yourself. sklearn won’t help.

  36. Hung June 18, 2018 at 6:40 pm #

    Thanks a lot Jason for excellent tutorial.

  37. Adam July 21, 2018 at 12:06 am #

    Where are the key values “02d” for epoch and “.2f” for val_acc coming from?

  38. IGOR STAVNITSER July 23, 2018 at 11:27 am #

    Is there a way to checkpoint model weights to memory instead of a file?
    Also in your example you are maximizing val_acc. What are the merits of maximizing val_acc vs minimizing val_loss?
    Thank you for a great post!

    • Jason Brownlee July 23, 2018 at 2:25 pm #

      I’m sure you could write a custom call back to do this. I don’t have an example.

      You can choose what is important to you, accuracy might be more relevant to you when using the model to make predictions.

  39. Abhijeet Gokar August 9, 2018 at 9:27 pm #

    Thanks a lot, it is good practice to consciously design and implement checkpoints in our model.

  40. Yari August 14, 2018 at 3:19 am #

    Hi Jason, thanks for the post (as well as many others I have read!).
    I was wondering: after fitting with model.fit(....) and using checkpoint you will have the weights saved in an external file. But what’s the specific state of model instance after the training? Will it have the best weights or it will have the last weights calculated during the training?

    So, to sum up. If I want to do a prediction on a test set immediately after the training/fitting should I load the best weights from the external file and then do the prediction or I could directly use model.predict(...) immediately after model.fit(...) ?

    Thanks a lot for your support!

    • Jason Brownlee August 14, 2018 at 6:24 am #

      The file will have the model weights at the time of the checkpoint. The weights can be loaded and used directly for making predictions. This is the whole point of checkpointing – to save the best model found during training.

      • Yari August 14, 2018 at 3:40 pm #

        Hey Jason, thanks for getting back to me. Yes that was clear to me. What’s not clear is what weights model has at the end of the training.

        Lets suppose I’m using

        callbacks = [
        EarlyStopping(patience=15, monitor='val_loss', min_delta=0, mode='min'),
        ModelCheckpoint('best-weights.h5', monitor='val_loss', save_best_only=True, save_weights_only=True)
        ]

        after training is done I’ll have the best weights saved on best-weights.h5 but what are the weights stored in the model instance? If I do model.evaluate(...) (without loading best-weights.h5) will it use the best weights or just the weights corresponding to the last epoch?

        • Jason Brownlee August 15, 2018 at 5:57 am #

          They are the weights at the time the checkpoint is triggered. If it is triggered many times, you may have many weight files saved.

        • Anupam Singh September 26, 2018 at 6:01 am #

          model instance will have the weights of last epoch not the best one

  41. Mohammed September 13, 2018 at 5:21 pm #

    I have my owned pretrained model (.ckpt and .meta files). How to use such files to extract features of my dataset in form of matrix which rows represent samples and columns represent features?

    • Jason Brownlee September 14, 2018 at 6:34 am #

      Perhaps load in Python manually then try to populate a defined Keras model manually?

  42. chamith November 4, 2018 at 7:28 am #

    Thanks a lot for the great description. But I have to clarify one thing regarding this. When I train the model using two callback functions model *ModelCheckpoint* and

  43. Daniel Penalva November 27, 2018 at 10:23 pm #

    Hi Jason,

    Thanks for the tutorial. I am wondering if this approach works with GridSearch and how I can put the checkpoint to track the results. Also, I am working with Colab Research Notebook right now; is there a way to detect process interruption and use a checkpoint to save the model?

    Thank you for your help again!!

    • Jason Brownlee November 28, 2018 at 7:41 am #

      No, a grid search and checkpointing are at odds.

      I’m not familiar with “Colab Research Notebook”, what is it?

      • Daniel Penalva November 28, 2018 at 11:17 pm #

        https://colab.research.google.com . Google initiative to promote notebooks with free virtual machines, kernels with GPU and TPU processing (yet experimental). But it disconnects from the kernel after a short time of no use, and the virtual machine can be unmounted after some hours. So it's fundamental for deep learning applications that you checkpoint and save the state to keep going after reconnecting.

        Too bad for Grid Search. How can you do Fine Tuning (hyperparameters) and process babysitting (verify vizualization of results, possible overfits, variance-bias so on …) without checkpointing on Grid Search ?

        thank you !

        • Jason Brownlee November 29, 2018 at 7:41 am #

          When you grid search, you want to know about what hyperparameters give good performance. The models are discarded – no checkpointing needed. Later, you can use good config to fit a new model.

          • Daniel Penalva November 29, 2018 at 10:12 pm #

            Yah, that's right, but still, without being able to finish without being disconnected from the server I will never know the model, so the checkpoint still comes in handy. But if you have in mind any other way to do the tuning together with checkpoints, please let us know!

            Thanks ! 🙂

          • Jason Brownlee November 30, 2018 at 6:33 am #

            Why do you need the model?

            The models during a grid search are discarded. You only need the hyperparameters of the best performing model.

          • Daniel Penalva November 30, 2018 at 12:50 am #

            Update
            There seems to be an issue with Keras Cross Validation and Checkpointing that requires some gimmick turn-around:

            https://github.com/keras-team/keras/issues/836

            Cant understand why they closed the issue without a solution, just dropped my question there.

            Will try to figure out how to do this …

          • Daniel Penalva November 30, 2018 at 12:17 pm #

            “The models during a grid search are discarded. You only need the hyperparameters of the best performing model.”

            Sorry i wasnt clear, i cant run Keras on my laptop, its more than 12 years old. The only free infra i found is Colab Notebook, to fully run through all models in GridSearchCV i need to checkpoint and restart the computing since the persistence of the process in google’s virtual machine wont last few hours, sometimes less than hours.

  44. William December 4, 2018 at 9:32 am #

    Hi Jason

    I have a question regarding compiling the model after weights are loaded.

    # load weights
    model.load_weights("weights.best.hdf5")
    # Compile model (required to make predictions)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    When I looked at the Keras API https://keras.io/models/model/#compile the description for the compile function says “Configures the model for training”. I am confused on why we need to compile the model to make predictions?

    Also, thought you might like to know that the Pima Indians onset of diabetes binary classification problem data set is no longer available.

    Thanks

    • Jason Brownlee December 4, 2018 at 2:35 pm #

      Thanks, you might not need to compile the model after loading any longer, perhaps the API has changed.

  45. kadir sharif December 25, 2018 at 6:54 pm #

    i am training a model about 100 epochs.. Now suppose the electricity gone. and i have a model checkpoints that is saved in hdf5 format… and the model run 30 epochs… but i have the model checkpoints saved with val_acc monitor.

    In this kind of situation how can i load the checkpoint in the same model to continue the training where it was interrupted… and is it gonna continue training the model after 30 epochs… it will be a great help if you answer my questions.
    Thanks in advance.

  46. Riccardo January 10, 2019 at 3:06 am #

    Hi Jason thanks for your tutorials, they are very helpful. I’ve implemented a model and then saved it with a checkpoint correctly. Unfortunately when I reload the model (the same structure) with the saved weights I can’t obtain the same predictions as before, they are slightly worse. My model is a simple model
    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(n_outputs))
    model.compile(loss='mse', optimizer='adam')
    The checkpoint is:
    checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')
    callbacks_list = [checkpoint]
    model.fit(train_x, train_y, validation_split=0.15, epochs=epochs, batch_size=batch_size,
    callbacks=callbacks_list, verbose=verbose)
    Then i rebuild the same structure and then call:
    model.load_weights("filename")
    and the predictions are a little different. Thanks in advance.

    • Jason Brownlee January 10, 2019 at 7:56 am #

      That is surprising, the model should make identical predictions before and after a save. Anything else would be a bug either in your code or in Keras. Try narrowing it down, here are some ideas:
      https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

      • Riccardo January 10, 2019 at 7:34 pm #

        Maybe it can be that when I reload the weights I reload the best result saved to file, while in the first run the weights are different at the end of model.fit()?

        • Riccardo January 10, 2019 at 8:24 pm #

          The way I can reload exactly the model trained is without using ModelCheckpoint(filename, monitor=’val_loss’, verbose=1, save_best_only=True, mode=’min’). I create the model, train it and then use model.save(file). Then with model.load(file) I have the same result. But here a question: it is the best model? Because now I don’t say to it monitor=’val_loss’, mode=’min’, save_best_only ecc.. I simply fit the model and then save it.

  47. Jinwen Xi February 7, 2019 at 12:21 pm #

    Hi, Jason,

    Very helpful tutorial. Can I save the model after each mini-batch instead of each epoch?

    • Jason Brownlee February 7, 2019 at 2:07 pm #

      Yes, you could achieve that with a custom callback.

      • Jinwen Xi February 7, 2019 at 2:28 pm #

        Thanks for the reply.
        Is there any template about how this custom callback will look like?

  48. Jinwen Xi February 11, 2019 at 6:16 pm #

    Thanks. I followed the instructions and created a custom callback to:

    (1) at the end of each epoch, save the whole model to a file(model_pre.h5), and post-process the file and dump it to model_post.h5
    (2) at the begin of next epoch, load the model from the post-processed file model_post.h5

    The main part of the code implementing (1) and (2) is below:

    But it seems like the model was not correctly updated using the ‘./model_post.h5’ file when I check the parameter values using HDF5View. Could you let me know if I did it in the correct way? Thanks.

    -Jinwen

    • Jason Brownlee February 12, 2019 at 7:55 am #

      Wow!

      I’m surprised that it is possible to load the model prior to each batch or epoch. Perhaps this is not working as expected?

  49. Wonbin February 15, 2019 at 1:29 am #

    Hi Jason, thanks for this great tutorial!
    In a regression task, among ‘val_loss’ and other metrics (like ‘mse’, ‘mape’ and so on) for the monitor argument, which one would be more important to finalize the model?

    Maybe I’m basically asking about the fundamental difference between loss and metrics..?
    I’ve been just guessing that I might have to choose a metric not the ‘val_loss’ to see the prediction performance of the final model. Would this be correct?

    • Jason Brownlee February 15, 2019 at 8:09 am #

      Loss and metrics can be the same thing or different.

      It comes down to what you and project stakeholders value in a final model (metric) and what must be optimized to achieve that (loss).

  50. Judson February 19, 2019 at 11:40 am #

    Hello Jason thanks for the posts.

    as a suggestion could you show us how to checkpoint using Xgboost. Having difficulty figuring it out on my own.

Leave a Reply