How to Checkpoint Deep Learning Models in Keras

Deep learning models can take hours, days, or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post, you will discover how to checkpoint your deep learning models during training in Python using the Keras library.

Kick-start your project with my new book Deep Learning With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Jun/2016: First published
  • Update Mar/2017: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
  • Update Mar/2018: Added alternate link to download the dataset
  • Update Sep/2019: Updated for Keras 2.2.5 API
  • Update Oct/2019: Updated for Keras 2.3.0 API
  • Update Jul/2022: Updated for TensorFlow 2.x API and added a mention of EarlyStopping
How to checkpoint deep learning models in Keras
Photo by saragoldsmith, some rights reserved.

Checkpointing Neural Network Models

Application checkpointing is a fault tolerance technique for long-running processes.

In this approach, a snapshot of the state of the system is taken as protection against failure. If there is a problem, not all is lost. The checkpoint may be used directly or as the starting point for a new run, picking up where it left off.

When training deep learning models, the checkpoint captures the weights of the model. These weights can be used to make predictions as is or as the basis for ongoing training.

The Keras library provides checkpointing via its callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how to name the file, and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename you use to store the weights can include variables like the epoch number or metric.

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.
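
For example, a minimal sketch of this pattern (assuming a compiled Keras model named model and training arrays X and y) might look like the following:

from tensorflow.keras.callbacks import ModelCheckpoint

# define where and under what conditions to save, then hand the callback to fit()
checkpoint = ModelCheckpoint("model.hdf5", monitor="val_loss", verbose=1, save_best_only=True, mode="min")
model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, callbacks=[checkpoint])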

Note that you may need to install the h5py library to output network weights in HDF5 format.

Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the pima-indians-diabetes.csv file is in your working directory.

You can download the dataset from here:

The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (monitor='val_accuracy' and mode='max'). The weights are stored in a file that includes the epoch number and the score in the filename (weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5).
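
A minimal sketch of such an example, assuming TensorFlow 2.x and that pima-indians-diabetes.csv has eight numeric input columns and a binary class label in the last column (the exact hyperparameters, such as 150 epochs and a batch size of 10, are illustrative):

# Checkpoint the model weights when validation accuracy improves
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.callbacks import ModelCheckpoint

# load the Pima Indians onset of diabetes dataset (8 inputs, 1 binary output)
dataset = loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
y = dataset[:, 8]

# define a small fully connected network
model = Sequential()
model.add(Dense(12, input_dim=8, activation="relu"))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# save the weights each time validation accuracy improves, with the epoch and score in the filename
filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="val_accuracy", verbose=1, save_best_only=True, mode="max")

# hold back 33% of the data for validation
model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, callbacks=[checkpoint], verbose=0)

Because the filename changes with the epoch and score, each improvement writes a new file rather than overwriting the previous one.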

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output (truncated for brevity):

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:

This is a very simple checkpointing strategy.

It may create a lot of unnecessary checkpoint files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure you have a snapshot of the best model discovered during your run.

Checkpoint Best Neural Network Model Only

A simpler checkpoint strategy is to save the model weights to the same file if and only if the validation accuracy improves.

This can be done easily using the same code from above and changing the output filename to be fixed (not including score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.
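
Relative to the sketch above, only the checkpoint definition needs to change; for example:

# overwrite a single file whenever validation accuracy improves
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="val_accuracy", verbose=1, save_best_only=True, mode="max")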

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output (truncated for brevity):

You should see the weight file in your local directory.

This is a handy checkpoint strategy to always use during your experiments.

It will ensure that your best model is saved for the run for you to use later if you wish. It avoids needing to include any code to manually keep track and serialize the best model when training.

Use EarlyStopping Together with Checkpoint

In the examples above, the model was fit for 150 epochs. In reality, it is not easy to tell how many epochs you need to train your model. One way to address this problem is to overestimate the number of epochs, but this may take significant time. After all, if you are checkpointing the best model only, you may find that over a run of several thousand epochs, the best model was already achieved in the first hundred epochs, and no more checkpoints are made afterward.

It is quite common to use the ModelCheckpoint callback together with EarlyStopping. It helps to stop the training once no metric improvement is seen for several epochs. The example below adds the callback es to make the training stop early once it does not see the validation accuracy improve for five consecutive epochs:
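
A sketch of the two callbacks used together (assuming the same model and data as in the earlier sketch; the callback names es and checkpoint are illustrative):

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# stop training once validation accuracy has not improved for 5 consecutive epochs
es = EarlyStopping(monitor="val_accuracy", patience=5, mode="max", verbose=1)

# still keep the best weights seen so far on disk
checkpoint = ModelCheckpoint("weights.best.hdf5", monitor="val_accuracy", verbose=1, save_best_only=True, mode="max")

model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10, callbacks=[es, checkpoint], verbose=0)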

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output:

This training process stopped after epoch 22 as no better accuracy was achieved for the last five epochs.

Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a check-pointed model.

The example uses only the model weights from the checkpoint file and assumes you know the network structure. The network structure, too, can be serialized to a file in JSON or YAML format.

In the example below, the model structure is known, and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then used to make predictions on the entire dataset.
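
A sketch of this loading step (assuming the same network structure and dataset as above) might look like the following:

# Load the checkpointed weights into a freshly built network and evaluate it
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# re-create the network structure used during training
model = Sequential()
model.add(Dense(12, input_dim=8, activation="relu"))
model.add(Dense(8, activation="relu"))
model.add(Dense(1, activation="sigmoid"))

# load the best weights saved by the checkpoint callback and compile
model.load_weights("weights.best.hdf5")
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# estimate accuracy on the entire dataset
dataset = loadtxt("pima-indians-diabetes.csv", delimiter=",")
X = dataset[:, 0:8]
y = dataset[:, 8]
scores = model.evaluate(X, y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))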

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output:

Summary

In this post, you discovered the importance of checkpointing deep learning models for long training runs.

You learned two check-pointing strategies that you can use on your next deep learning project:

  1. Checkpoint Model Improvements
  2. Checkpoint Best Model Only

You also learned how to load a check-pointed model and make predictions.

Do you have any questions about checkpointing deep learning models or this post? Ask your questions in the comments, and I will do my best to answer.

213 Responses to How to Checkpoint Deep Learning Models in Keras

  1. Avatar
    Gerrit Govaerts October 5, 2016 at 1:07 am #

    79.56% with a 3 hidden layer architecture , 24 neurons each
    Great blog , learned a lot

    • Avatar
      Jason Brownlee October 5, 2016 at 8:31 am #

      Thanks Gerrit, I’m glad you found it useful.

    • Avatar
      ENAS March 2, 2021 at 7:54 pm #

      excuse me i want to ask about what the accuracy_score in multilabel classification mean? ,and which metric i should consider in multilabel classification
      thanks

      • Avatar
        Jason Brownlee March 3, 2021 at 5:30 am #

        Good question!

        Not sure I have good advice for you. I have not thought deeply about the implications of accuracy for multilabel – sorry.

  2. Avatar
    Lau MingFei October 21, 2016 at 1:02 pm #

    Hi Jason, how can I checkpoint a model with my custom metric? The example codes you gave above is monitor = ‘val_acc’ , but when I replace it with monitor = my_metric , it displays the following warning message:

    /usr/local/lib/python3.5/dist-packages/keras/callbacks.py:286: RuntimeWarning: Can save best model only with available, skipping.
    ‘skipping.’ % (self.monitor), RuntimeWarning)

    So how should I do with this?

    • Avatar
      Jason Brownlee October 22, 2016 at 6:55 am #

      Great question Lau,

      I have not tried to check point with a custom metric, sorry. I cannot give you good advice.

    • Avatar
      Jakub July 26, 2019 at 11:18 pm #

      Hi,
      Try something like this:

      model = load_model( “your.model.h5”,
      custom_objects={‘my_metric’: my_metric })

    • Avatar
      Volkan Yurtseven September 13, 2020 at 5:29 am #

      how about model.add_loss() & model.add_metric()?

  3. Avatar
    Xu Zhang November 8, 2016 at 11:21 am #

    Hi Jason,
    A great post!

    I saved the model and weights using callbacks, ModelCheckpoint. If I want to train it continuously from the last epoch, how to set the model.fit() command to start from the previous epoch? Sometimes, we need to change the learning rates after several epochs and to continue training from the last epoch. Your advice is highly appreciated.

    • Avatar
      Jason Brownlee November 9, 2016 at 9:48 am #

      Great question Xu Zhang,

      You can load the model weights back into your network and then start a new training process.

      • Avatar
        Lopez GG December 30, 2016 at 1:01 pm #

        Thank you Jason. I ran a epoch and got the loss down to 353.6888. The session got disconnected so I used the weights as follows. However, I dont see a change in loss. I am loading the weights correctly ? Here is my code

        >>> filename = “weights-improvement-01–353.6888-embedding.hdf5”
        >>> model.load_weights(filename)
        >>> model.compile(loss=’binary_crossentropy’, optimizer=’adam’)
        >>> model.fit(dataX_Arr, dataY_Arr, batch_size=batch_size, nb_epoch=15, callbacks=callbacks_list)
        Epoch 1/15
        147744/147771 [============================>.] – ETA: 1s – loss: -353.6892Epoch 00000: loss improved from inf to -353.68885, saving model to weights-

        • Avatar
          Jason Brownlee December 31, 2016 at 7:02 am #

          It looks like you are loading the weights correctly.

          • Avatar
            Akshaya July 15, 2019 at 7:28 pm #

            Hi Jason, in continuation to the point above of continuing training from the saved checkpoint, is it required that I set the random seed initially and use the same when I train the second time? What I have noticed is that, the first time I train, the loss seems to reduce. But when I load the checkpoint and continue training, the performance suddenly becomes very poor.

          • Avatar
            Jason Brownlee July 16, 2019 at 8:14 am #

            No.

            You can expect variance in the model across runs, more here:
            https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code

  4. Avatar
    Anthony November 29, 2016 at 9:58 pm #

    Great post again Jason, thank you!

  5. Avatar
    Nasarudin January 31, 2017 at 1:48 pm #

    Hi Jason, thank you for your tutorial. I want to implement this checkpoint function in iris-flower model script but failed to do it. It keeps showing this error and I do not know how to solve it.

    I put the ‘model checkpoint’ line after the ‘baseline model’ function and add ‘callbacks’

    RuntimeError: Cannot clone object , as the constructor does not seem to set parameter callbacks

    Thank you for your help

    • Avatar
      Jason Brownlee February 1, 2017 at 10:41 am #

      Hi Nasarudin, sorry I am not sure of the cause of this error.

      I believe you cannot use callbacks like the checkpointing when using the KerasClassifier as is the case in the iris tutorial:
      https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

      • Avatar
        Nasarudin February 6, 2017 at 1:50 am #

        Hi Jason. Thank you for your reply. I have some dataset that looks same with the iris dataset but it is a lot bigger. I thought maybe I can use the callbacks when I train the dataset.

        Since training on a large dataset might take a lot of time, do you have any suggestion for implementing another function that can do checkpointing when using KerasClassifier?

        By the way, you can refer to this link for the script that I wrote to implement the checkpoint function with KerasClassifier.

        http://stackoverflow.com/questions/41937719/checkpoint-deep-learning-models-in-keras

        Thank you.

        • Avatar
          Jason Brownlee February 6, 2017 at 9:44 am #

          Hi Nasarudin,

          I would recommend using a standalone Keras model rather than the sklearn wrapper if you wish to use checkpointing.

  6. Avatar
    Abner February 3, 2017 at 3:21 pm #

    Great tutorials love your page. I got a question: I am trying to optimize a model based on val_acc using ModelCheckpoint. However, I get great results rather quickly (example: val_acc = .98). This is my max validation, therefore, the model that will be saved (save_best_only). However several epochs gave me the same max validation and the latter epochs have higher training accuracy. Example Epoch 3: acc = .70, val_acc = .98, Epoch 50: acc = 1.00, val_acc = .98. Clearly, I would want to save the second one which generalizes on the data plus shows great training. How do I do this without having to save every epoch? Basically, I want to pass a second sorting parameter to monitor (monitor=val_acc,acc).

    Thanks.

    • Avatar
      Jason Brownlee February 4, 2017 at 9:57 am #

      Great question, off the cuff, I think you can pass multiple metrics for check-pointing.

      Try multiple checkpoint objects?

      Here’s the docs if that helps:
      https://keras.io/callbacks/#modelcheckpoint

      Ideally, you do want a balance between performance on the training dataset and on the validation dataset.

      • Avatar
        Abner February 7, 2017 at 5:13 am #

        Yes, it’s not letting me pass an array list or multiple parameters it’s only expecting 1 parameter base on literature, for now, I have to settle for using val_acc for bottlenecks/top layer and val_loss for the final model, though I would prefer more control. maybe I’ll ask for it in a feature request.

        • Avatar
          Jason Brownlee February 7, 2017 at 10:23 am #

          Great idea, add an issue on the Keras project.

        • Avatar
          Damily March 22, 2017 at 10:12 pm #

          Hi Abner, do you solve this problem now?
          hope to receive your reply.

          • Avatar
            Schmax July 29, 2019 at 8:36 am #

            You could define a custom metric that encorporates both val_acc and val_loss

  7. Avatar
    Fatma February 9, 2017 at 3:31 pm #

    Hi Jason, thank you for your tutorial. I need to ask one question, if my input contains two images with different labels (the label represents the popularity of the image). I need to know how to feed this pair of images such that the first image pass through CNN1 and the second one pass through CNN2 Then I can merge them using the merge layer to classify which one is more popular than the other one. How can I use the library in order to handle the two different inputs?

    • Avatar
      Jason Brownlee February 10, 2017 at 9:48 am #

      Hi Fatma,

      Perhaps you can reframe your problem to output the popularity for one image and use an “if” statement to rank the relative popularities of each image separately?

      • Avatar
        Fatma February 10, 2017 at 1:53 pm #

        I need to compare between the popularity value of the two input images such that the output will be the first image is high popular than the second image or vice versa then when I feed one test image (unlabeled) it should be compared with some baseline of the training data to compute its popularity

  8. Avatar
    Nasarudin February 21, 2017 at 3:30 pm #

    Hi Jason, great tutorial as always.

    I want to ask regarding ‘validation_split’ in the script. What is the difference between this script that has ‘validation_script’ variable and the one from here which did not have ‘validation_split’ variable https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ ?

    • Avatar
      Jason Brownlee February 22, 2017 at 9:56 am #

      Great question Nasarudin.

      If you have a validation_split, it will give you some feedback during training about the performance on the model on “unseen” data. When using a check-point, this can be useful as you can stop training when performance on validation data stops improving (a sign of possible overfitting).

  9. Avatar
    jerpint March 18, 2017 at 7:46 am #

    thanks!

  10. Avatar
    Aleksandra Nabożny April 10, 2017 at 11:54 pm #

    Hi Jason! Thank you very much for the tutorial!

    Can you provide a sample code to visualize results as follows: plot some images from validation dataset with their 5 top label predictions?

    • Avatar
      Jason Brownlee April 11, 2017 at 9:33 am #

      Yes you could do this using the probabilistic output from the network.

      Sorry, I do not have examples at this stage.

  11. Avatar
    Pratik April 19, 2017 at 6:39 pm #

    Thanks for the tutorial Jason 🙂

  12. Avatar
    Krishan Chopra April 25, 2017 at 6:33 am #

    Hello Sir, i want to thank you for this cool tutorial,
    Currently i am checkpointing my model every 50 epochs. I also want to checkpoint any epoch whose val_acc is better but is not the 50th epoch. For ex. i have checkpointed 250th epoch, but val_acc for 282nd epoch is better and i have to save it but as i have specified the period to be 50, i cant save the 282nd epoch.
    Should i implement both attributes save_best_only=true and period=50 in ModelCheckPoint ?

    • Avatar
      Jason Brownlee April 25, 2017 at 7:53 am #

      Sorry, I’m not sure I follow.

      You can use the checkpoint to save any improvement to the model regardless of epoch. Perhaps that would be an approach to try?

  13. Avatar
    Zhihao Chen May 19, 2017 at 11:18 am #

    Hi Jason,

    I love your tutorial very much but just find a little bit confused about how to combine the check point with the cross validation. As is shown in your other tutorial, we don’t explicitly call model.fit when doing cross validation. Could you give me an example how to add check-point into this?

    • Avatar
      Jason Brownlee May 20, 2017 at 5:33 am #

      You may have to run the k-fold cross-validation process manually (e.g. loop the folds yourself).

  14. Avatar
    chiyuan May 28, 2017 at 3:36 pm #

    Love your articles. Learnt lost of things here.

    Please keep your tutorials going!

  15. Avatar
    Abolfazl June 22, 2017 at 3:00 am #

    Jason, your blog is amazing. Thanks for helping us out in learning this awesome field.

  16. Avatar
    Xiufeng Yang June 29, 2017 at 3:43 pm #

    Hi, great post! I want to save the training model not the validation model, how to set the parameters in checkpoint()?

    • Avatar
      Jason Brownlee June 30, 2017 at 8:08 am #

      The trained model is saved, there is no validation model. Just an evaluation of the model using validation data.

  17. Avatar
    Mike September 19, 2017 at 5:55 pm #

    Is it possible to checkpoint in between epochs?
    I have an enormous dataset that takes 20hrs per epoch and it’s failed before it finished an epoch. It would be great if I could checkpoint every fifth of an epoch or so.

    • Avatar
      Jason Brownlee September 20, 2017 at 5:55 am #

      I would recommend using a data generator and then using a checkpoint after a fixed number of batches. This might work.
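
      A rough sketch of such a callback (all names here are illustrative; only the Callback base class is part of the Keras API):

      from tensorflow.keras.callbacks import Callback

      class BatchCheckpoint(Callback):
          # save the model weights every `every` training batches (the batch index resets each epoch)
          def __init__(self, filepath, every=1000):
              super().__init__()
              self.filepath = filepath  # e.g. "weights-batch-{batch:06d}.hdf5"
              self.every = every

          def on_train_batch_end(self, batch, logs=None):
              if (batch + 1) % self.every == 0:
                  self.model.save_weights(self.filepath.format(batch=batch + 1))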

  18. Avatar
    Mounir October 31, 2017 at 10:30 pm #

    Hi Jason, great blog. Do you also happen to know how to save/store the epoch number of the last observed improvement during training? That would be very useful to study overfitting

    • Avatar
      Jason Brownlee November 1, 2017 at 5:45 am #

      Yes, you could add the “epoch” variable to the checkpoint filename.

      For example:
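
      # a sketch, not the original snippet: the format codes fill in the epoch number and score
      from tensorflow.keras.callbacks import ModelCheckpoint

      filepath = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
      checkpoint = ModelCheckpoint(filepath, monitor="val_accuracy", verbose=1, save_best_only=True, mode="max")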

  19. Avatar
    Jes November 3, 2017 at 8:04 am #

    is it possible to use Keras checkpoints together with Gridsearch? in case Gridsearch crashes?

    • Avatar
      Jason Brownlee November 3, 2017 at 2:15 pm #

      Not a good idea Jes. I’d recommend doing a custom grid search.

  20. Avatar
    Abad December 3, 2017 at 10:10 pm #

    Receiving following error:
    TypeError Traceback (most recent call last)
    in ()
    72
    73 filepath=”weights.best.hdf5″
    —> 74 checkpoint = ModelCheckpoint(filepath, monitor=’val_acc’, verbose=0, save_best_only=True,node=’max’)
    75 callbacks_checkpt = [checkpoint]
    76

    TypeError: __init__() got an unexpected keyword argument ‘node’

    • Avatar
      Jason Brownlee December 4, 2017 at 7:47 am #

      Looks like a typo.

      Double check that you have copied the code exactly from the tutorial?

  21. Avatar
    Aditya December 24, 2017 at 8:21 am #

    Hi Jason!
    Thanks for the informative post.

    I have one question – in case of unexpected stoppage of the run, we have the best model weights for the epochs DONE SO FAR. How can we use this checkpoint as a STARTING point to continue with the remaining epochs?

    • Avatar
      Jason Brownlee December 25, 2017 at 5:22 am #

      You can load the weights, see the example in the tutorial of exactly this.

  22. Avatar
    Liaqat Ali December 27, 2017 at 3:53 am #

    Thanks for such a nice explanation. I want to ask if we are performing some experiments & want the neural network model to achieve high accuracy on for test set. Can we use this method to find the best tuned network or the highest possible accuracy.?????

    • Avatar
      Jason Brownlee December 27, 2017 at 5:21 am #

      This method can help find and save a well performing model during training.

  23. Avatar
    davenso January 8, 2018 at 4:54 pm #

    very cool.

  24. Avatar
    davenso January 8, 2018 at 4:57 pm #

    Based on this example, how long does it take typically to save or retrieve a checkpoint model?

    • Avatar
      Jason Brownlee January 9, 2018 at 5:24 am #

      Very fast, just a few seconds for large models.

      • Avatar
        davenso January 9, 2018 at 3:43 pm #

        Thanks, Jason. Could you kindly give an estimate, more than 10 secs?

  25. Avatar
    kszyniu January 12, 2018 at 5:59 am #

    Nice post.
    However, I have one question. Can I get number of epochs that model was trained for in other way than reading its filename?
    Here’s my use case: upon loading my model I want to restore the training exactly from the point I saved it (by passing a value to initial_epoch argument in .fit()) for the sake of better-looking graphs in TensorBoard.
    For example, I trained my model for 2 epochs (got 2 messages: “Epoch 1/5”, “Epoch 2/5”) and saved it. Now, I want to load that model and continue training from 3rd epoch (I expect getting message “Epoch 3/5” and so on).
    Is there a better way than saving epochs to filename and then getting it from there (which seems kinda messy)?

    • Avatar
      Jason Brownlee January 12, 2018 at 11:48 am #

      You could read this from the filename. It’s just a string.

      You could also have a callback that writes the last epoch completed to a file, and overwrite this file each time.
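
      A rough sketch of that idea (illustrative only):

      from tensorflow.keras.callbacks import LambdaCallback

      def record_epoch(epoch, logs):
          # overwrite a small text file with the last completed epoch number
          with open("last_epoch.txt", "w") as f:
              f.write(str(epoch + 1))

      epoch_logger = LambdaCallback(on_epoch_end=record_epoch)
      # then: model.fit(..., callbacks=[epoch_logger, checkpoint])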

  26. Avatar
    Max Jansen January 16, 2018 at 9:21 pm #

    Great post and a fantastic blog! I can’t thank you enough!

  27. Avatar
    Gabriel January 26, 2018 at 3:18 pm #

    Hi there. Great posts! Quick question: have you run into this problem? “callbacks.py:405: RuntimeWarning: Can save best model only with val_acc available, skipping.”

    Running on AWS GPU Compute instance, fyi. I am not going straight from the Keras.Sequence() model… instead, I am using the SearchGridCV as I am trying to perform some tuning tasks but want to save the best model. Any suggestions?

    • Avatar
      Jason Brownlee January 27, 2018 at 5:53 am #

      I would recommend not combining checkpointing with cross validation or grid search.

  28. Avatar
    Rob February 28, 2018 at 2:16 am #

    Hi Jason,

    Thanks a lot for all your posts, really helpful. Can you explain to me why some epochs improve validation accuracy whilst previous epochs did not? If you do not use any random sampling in your dataset (e.g. no dropout), how can it be that epoch 12 increases validation score while epoch 11 does not? Aren’t they based on the same starting point (the model outputted by epoch 10)?

    thanks!

    • Avatar
      Jason Brownlee February 28, 2018 at 6:07 am #

      The algorithm is stochastic, so not every update improves the scores across all of the data.

      This is a property of the learning algorithm, gradient descent, which is doing its best but cannot “see” the whole state of the problem; instead, it operates piece-wise and incrementally. This is a feature, not a bug. It often leads to better outcomes.

      Does that help?

  29. Avatar
    Pete March 2, 2018 at 11:50 pm #

    I was curiosity what the different of these two version???
    It was seemed that just filepath was different??

    • Avatar
      Jason Brownlee March 3, 2018 at 8:11 am #

      One keeps every improvement, one keeps only the best.

      • Avatar
        Pete March 4, 2018 at 11:40 pm #

        Why the filepath variable was so magic? Just change the filepath variable can make these different result. How is this achieved?

  30. Avatar
    Pete March 3, 2018 at 1:20 am #

    In the last, if I add one code model.save_weights('weights.hdf5'), what the difference of this weights and ModelCheckpoint best weight??

    • Avatar
      Jason Brownlee March 3, 2018 at 8:18 am #

      The difference is you are saving the weights from the final model after the last epoch instead of when a change in model skill has occurred during training.

  31. Avatar
    Pete March 3, 2018 at 9:23 pm #

    When I add one code model.save_weights('weights.hdf5') to save weight from the final model; and I also save ModelCheckpoint best weight, I found that these two hdf5 file were not the same. I was confusing that why the final model weight I saved wasn’t the best model weight.

    • Avatar
      Jason Brownlee March 4, 2018 at 6:02 am #

      Yes, that is the idea of checkpointing: the model at the last epoch may not be the best model; in fact, it often is not.

      Checkpointing will save the best model along the way.

      Does that help? Perhaps re-read the tutorial to make this clearer?

      • Avatar
        Pete March 4, 2018 at 11:32 pm #

        Ok, thanks. In a sense, when we have used checkpointing, it is meaningless to use model.save_weights('weights.hdf5') again.

      • Avatar
        Demeke June 29, 2022 at 7:20 pm #

        Hello dear Jason. this blog is amazing keep it up
        I have a question. The question is how to run jupyter code(i.e. next word prediction) to browser? and how to made user interface?

        Thanks a lot!

        • Avatar
          James Carmichael June 30, 2022 at 12:17 pm #

          Hi Demeke…Are you interested in how to use Jupyter Notebooks?

  32. Avatar
    Kaushal Shetty March 29, 2018 at 6:19 am #

    Hi Jason,
    How do I checkpoint a regression model. Is my metric accuracy or mse in such a case? And what should I monitor in such case in the modelcheckpoint? I am training a time series model.

    Thanks

  33. Avatar
    Kakoli May 27, 2018 at 8:39 pm #

    Hi Jason
    While running for 50 epochs, I am checkpointing and saving the weights after every 5 epochs.
    Now after 27th, VM disconn.
    Then I reconnect and compile the model after loading the saved weights. Now if I evaluate, I shall get the score on the best weight till 27th epoch. But since only 25 epochs are considered, accuracy will not be good, right?
    In that case, how do I continue the remaining 25 epochs with the saved weights?
    Thanks

    • Avatar
      Jason Brownlee May 28, 2018 at 5:57 am #

      You can load a set of weights and continue training the model.

  34. Avatar
    Rahmad ars May 30, 2018 at 7:02 am #

    Hi jason, thanks for the tutorial. If i want to extract weight values of all layer from my model, how to do that? thanks

    • Avatar
      Jason Brownlee May 30, 2018 at 3:06 pm #

      I believe there is a get_weights() function on each layer.

  35. Avatar
    michael alex June 18, 2018 at 2:23 pm #

    Hi Jason,
    Thanks for all the great tutorials, including this one. Is it possible to Grid Search model hyper-parameters and check-point the model at the same time? I can do either one independently, but not both at the same time. Can you show us?

    Thanks,
    Michael

    • Avatar
      Jason Brownlee June 18, 2018 at 3:13 pm #

      Yes, but you will have to write the for loops for the grid search yourself. sklearn won’t help.

  36. Avatar
    Hung June 18, 2018 at 6:40 pm #

    Thanks a lot Jason for excellent tutorial.

  37. Avatar
    Adam July 21, 2018 at 12:06 am #

    Where are the key values “02d” for epoch and “.2f” for val_acc coming from?

  38. Avatar
    IGOR STAVNITSER July 23, 2018 at 11:27 am #

    Is there a way to checkpoint model weights to memory instead of a file?
    Also in your example you are maximizing val_acc. What are the merits of maximizing val_acc vs minimizing val_loss?
    Thank you for a great post!

    • Avatar
      Jason Brownlee July 23, 2018 at 2:25 pm #

      I’m sure you could write a custom call back to do this. I don’t have an example.

      You can choose what is important to you, accuracy might be more relevant to you when using the model to make predictions.

  39. Avatar
    Abhijeet Gokar August 9, 2018 at 9:27 pm #

    Thanks a lot, it is good practise to conciously design and implement checkpoints , in our model.

  40. Avatar
    Yari August 14, 2018 at 3:19 am #

    Hi Jason, thanks for the post (as well as many others I have read!).
    I was wondering: after fitting with model.fit(....) and using checkpoint you will have the weights saved in an external file. But what’s the specific state of model instance after the training? Will it have the best weights or it will have the last weights calculated during the training?

    So, to sum up. If I want to do a prediction on a test set immediately after the training/fitting should I load the best weights from the external file and then do the prediction or I could directly use model.predict(...) immediately after model.fit(...) ?

    Thanks a lot for your support!

    • Avatar
      Jason Brownlee August 14, 2018 at 6:24 am #

      The file will have the model weights at the time of the checkpoint. The weights can be loaded and used directly for making predictions. This is the whole point of checkpointing – to save the best model found during training.

      • Avatar
        Yari August 14, 2018 at 3:40 pm #

        Hey Jason, thanks for getting back to me. Yes that was clear to me. What’s not clear is what weights model has at the end of the training.

        Lets suppose I’m using

        callbacks = [
        EarlyStopping(patience=15, monitor='val_loss', min_delta=0, mode='min'),
        ModelCheckpoint('best-weights.h5', monitor='val_loss', save_best_only=True, save_weights_only=True)
        ]

        after training is done I’ll have the best weights saved on best-weights.h5 but what are the weights stored in the model instance? If I do model.evaluate(...) (without loading best-weights.h5) will it use the best weights or just the weights corresponding to the last epoch?

        • Avatar
          Jason Brownlee August 15, 2018 at 5:57 am #

          They are the weights at the time the checkpoint is triggered. If it is triggered many times, you may have many weight files saved.

        • Avatar
          Anupam Singh September 26, 2018 at 6:01 am #

          model instance will have the weights of last epoch not the best one

  41. Avatar
    Mohammed September 13, 2018 at 5:21 pm #

    I have my owned pretrained model (.ckpt and .meta files). How to use such files to extract features of my dataset in form of matrix which rows represent samples and columns represent features?

    • Avatar
      Jason Brownlee September 14, 2018 at 6:34 am #

      Perhaps load the files in Python manually, then try to populate a defined Keras model manually?

  42. Avatar
    chamith November 4, 2018 at 7:28 am #

    Thanks a lot for the great description. But I have to clarify one thing regarding this. When I train the model using two callback functions model *ModelCheckpoint* and

  43. Avatar
    Daniel Penalva November 27, 2018 at 10:23 pm #

    Hi Jason,

    Thanks for tutorial. Iam wondering if this approach works with GridSearch and how can i put the checkpoint to track the results. Also, iam working with Colab Research Notebook right now, is there a way to detect process interruption and use a checkpoint to save the model ?

    Thank you for your help again!!

    • Avatar
      Jason Brownlee November 28, 2018 at 7:41 am #

      No, a grid search and checkpointing are at odds.

      I’m not familiar with “Colab Research Notebook”, what is it?

      • Avatar
        Daniel Penalva November 28, 2018 at 11:17 pm #

        https://colab.research.google.com . A Google initiative to promote notebooks with free virtual machines, kernels with GPU and TPU processing (yet experimental). But it disconnects from the kernel after a short time of no use, and the virtual machine can be unmounted after some hours. So it is fundamental for deep learning applications that you checkpoint and save the state to keep going after reconnecting.

        Too bad for Grid Search. How can you do Fine Tuning (hyperparameters) and process babysitting (verify vizualization of results, possible overfits, variance-bias so on …) without checkpointing on Grid Search ?

        thank you !

        • Avatar
          Jason Brownlee November 29, 2018 at 7:41 am #

          When you grid search, you want to know about what hyperparameters give good performance. The models are discarded – no checkpointing needed. Later, you can use good config to fit a new model.

          • Avatar
            Daniel Penalva November 29, 2018 at 10:12 pm #

            Yah, thats right, but still, without being able to finish without being disconnected from the server i will never know the model, so still the checkpoint comes in hand. But if you have in mind any other think to do the tuning together with checkpoints, please let us know !

            Thanks ! 🙂

          • Avatar
            Jason Brownlee November 30, 2018 at 6:33 am #

            Why do you need the model?

            The models during a grid search are discarded. You only need the hyperparameters of the best performing model.

          • Avatar
            Daniel Penalva November 30, 2018 at 12:50 am #

            Update
            There seems to be a issue with Keras Cross Validation and Checkpointing that requires some gimmick turn-around:

            https://github.com/keras-team/keras/issues/836

            Cant understand why they closed the issue without a solution, just dropped my question there.

            Will try to figure out how to do this …

          • Avatar
            Daniel Penalva November 30, 2018 at 12:17 pm #

            “The models during a grid search are discarded. You only need the hyperparameters of the best performing model.”

            Sorry i wasnt clear, i cant run Keras on my laptop, its more than 12 years old. The only free infra i found is Colab Notebook, to fully run through all models in GridSearchCV i need to checkpoint and restart the computing since the persistence of the process in google’s virtual machine wont last few hours, sometimes less than hours.

  44. Avatar
    William December 4, 2018 at 9:32 am #

    Hi Jason

    I have a question regarding compiling the model after weights are loaded.

    # load weights
    model.load_weights(“weights.best.hdf5”)
    # Compile model (required to make predictions)
    model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])

    When I looked at the Keras API https://keras.io/models/model/#compile the description for the compile function says “Configures the model for training”. I am confused on why we need to compile the model to make predictions?

    Also, thought you might like to know that the Pima Indians onset of diabetes binary classification problem data set is no longer available.

    Thanks

    • Avatar
      Jason Brownlee December 4, 2018 at 2:35 pm #

      Thanks, you might not need to compile the model after loading any longer, perhaps the API has changed.

  45. Avatar
    kadir sharif December 25, 2018 at 6:54 pm #

    i am training a model about 100 epochs.. Now suppose the electricity gone. and i have a model checkpoints that is saved in hdf5 format… and the model run 30 epochs… but i have the model checkpoints saved with val_acc monitor.

    In this kind of situation how can i load the checkpoint in the same model to continue the training where it interrupted… and is it gonna continue training the model after 30 epochs… it will be a great help it you answer my questions.
    Thanks in advance.

  46. Avatar
    Riccardo January 10, 2019 at 3:06 am #

    Hi Jason thanks for your tutorials, they are very helpful. I’ve implemented a model and then saved it with a checkpoint correctly. Unfortunately when I reload the model (the same structure) with the weights saved I can’t obtain the same predictions as before, there are slightly worse. My model is a simple model
    model = Sequential()
    model.add(LSTM(200, activation=’relu’, input_shape=(n_timesteps, n_features)))
    model.add(Dense(100, activation=’relu’))
    model.add(Dense(n_outputs))
    model.compile(loss=’mse’, optimizer=’adam’)
    The checkpoint is:
    checkpoint = ModelCheckpoint(filename, monitor=’val_loss’, verbose=1, save_best_only=True, mode=’min’)
    callbacks_list = [checkpoint]
    model.fit(train_x, train_y, validation_split=0.15, epochs=epochs, batch_size=batch_size,
    callbacks=callbacks_list, verbose=verbose)
    Then i rebuild the same structure and then call:
    model.load_weights(“filename”)
    and the predictions are a little different. Thanks in advance.

    • Avatar
      Jason Brownlee January 10, 2019 at 7:56 am #

      That is surprising, the model should make identical predictions before and after a save. Anything else would be a bug either in your code or in Keras. Try narrowing it down, here are some ideas:
      https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

      • Avatar
        Riccardo January 10, 2019 at 7:34 pm #

        Maybe it can be that when I reload the weights I reload the best result saved to file, while in the first run the weights are different at the end of model.fit()?

        • Avatar
          Riccardo January 10, 2019 at 8:24 pm #

          The way I can reload exactly the model trained is without using ModelCheckpoint(filename, monitor=’val_loss’, verbose=1, save_best_only=True, mode=’min’). I create the model, train it and then use model.save(file). Then with model.load(file) I have the same result. But here a question: it is the best model? Because now I don’t say to it monitor=’val_loss’, mode=’min’, save_best_only ecc.. I simply fit the model and then save it.

  47. Avatar
    Jinwen Xi February 7, 2019 at 12:21 pm #

    Hi, Jason,

    Very helpful tutorial. Can I save the model after each mini-batch instead of each epoch?

    • Avatar
      Jason Brownlee February 7, 2019 at 2:07 pm #

      Yes, you could achieve that with a custom callback.

      • Avatar
        Jinwen Xi February 7, 2019 at 2:28 pm #

        Thanks for the reply.
        Is there any template about how this custom callback will look like?

  48. Avatar
    Jinwen Xi February 11, 2019 at 6:16 pm #

    Thanks. I followed the instructions and created a custom callback to:

    (1) at the end of each epoch, save the whole model to a file(model_pre.h5), and post-process the file and dump it to model_post.h5
    (2) at the begin of next epoch, load the model from the post-processed file model_post.h5

    The main part of the code implementing (1) and (2) is below:

    But it seems like the model was not correctly updated using the ‘./model_post.h5’ file when I check the parameter values using HDF5View. Could you let me know if I did it in the correct way? Thanks.

    -Jinwen

    • Avatar
      Jason Brownlee February 12, 2019 at 7:55 am #

      Wow!

      I’m surprised that it is possible to load the model prior to each batch or epoch. Perhaps this is not working as expected?

  49. Avatar
    Wonbin February 15, 2019 at 1:29 am #

    Hi Jason, thanks for this great tutorial!
    In a regression task, among ‘val_loss’ and other metrics (like ‘mse’, ‘mape’ and so on) for the monitor argument, which one would be more important to finalize the model?

    Maybe I’m basically asking about the fundamental difference between loss and metrics..?
    I’ve been just guessing that I might have to choose a metric not the ‘val_loss’ to see the prediction performance of the final model. Would this be correct?

    • Avatar
      Jason Brownlee February 15, 2019 at 8:09 am #

      Loss and metrics can be the same thing or different.

      It comes down to what you and project stakeholders value in a final model (metric) and what must be optimized to achieve that (loss).

  50. Avatar
    Judson February 19, 2019 at 11:40 am #

    Hello Jason thanks for the posts.

    as a suggestion could you show us how to checkpoint using Xgboost. Having difficulty figuring it out on my own.

  51. Avatar
    Fredrick Ughimi February 27, 2019 at 2:51 am #

    Hello Jason,

    Straight as an arrow. You made it so easy to follow. This is really less abstract.

    Thank you.

    • Avatar
      Jason Brownlee February 27, 2019 at 7:34 am #

      Thanks, I’m happy it helped.

      • Avatar
        Matt July 12, 2019 at 8:35 am #

        I used this post as the base for a model. I’ve been using precision and recall from the keras_metrics module and can’t seem to get it working as the monitored metric for the checkpoint function. Just tells me:

        RuntimeWarning: Can save best model only when val_recall: available: skipping.

        But after each epoch the val_recall is calculated and displayed so I’m not quite sure what is wrong?

  52. Avatar
    Magnus Wik May 6, 2019 at 7:54 pm #

    Dear Jason,

    I am confused about Keras callback. You use save_best_only=True to save the weights, but according to Keras this is the setting for saving the latest best model. For weights it should be save_weights_only=True. By default, save_weights_only is set to false. Am I missing something?

    Also, in one of my Jupyter notebooks, my weights are not saved at all, but in another they are, although the code is the same. It is weird.

    • Avatar
      Magnus May 6, 2019 at 11:39 pm #

      Dear Jason,

      Now I understand. When you load “weights.best.hdf5”, you are actually loading the full model including the weights. So there is no need to create the model before.

      In section “Loading a Check-Pointed Neural Network Model” I skipped lines 11-14 and then:
      from keras.models import load_model
      model = load_model(“weights.best.hdf5”)

      and I got the same result.

      • Avatar
        Jason Brownlee May 7, 2019 at 6:17 am #

        Nice work.

        • Avatar
          Magnus Wik May 7, 2019 at 8:29 pm #

          Thanks.
          Maybe you should update the text, since now you write “The checkpoint only includes the model weights. It assumes you know the network structure.”, which is incorrect. The checkpoint includes the full model.
          If only the weights are saved it should be ‘save_weights_only=True’.
          What is confusing is that it is possible to treat the full model files as weights, using model.load_weights. Personally I think it should throw a warning.

    • Avatar
      Jason Brownlee May 7, 2019 at 6:16 am #

      Perhaps try running the example from the command line?

  53. Avatar
    Steven Gonzalez May 17, 2019 at 11:45 am #

    Thanks for a great tutorial!

  54. Avatar
    hayj May 27, 2019 at 12:56 am #

    Hello Jason,

    Great tutorial again!
    I wanted to ask you how to monitor my val_top_k_categorical_accuracy instead of val_acc ?
    I can’t find anything about it on internet.
    I tryed differents things change the position of my metrics metrics=['top_k_categorical_accuracy', 'accuracy'], try monitor="top_k_categorical_accuracy"… but nothing works

    • Avatar
      Jason Brownlee May 27, 2019 at 6:51 am #

      Interesting. If you add the metric to the list of metrics does it appear in the history dict?

      If so, you can use that name.

  55. Avatar
    hayj May 28, 2019 at 1:37 am #

    Anyway I solved the problem by defining my own callback which save a model when I observe any “val_*” improvement

  56. Avatar
    Dang Nguyen Hong June 27, 2019 at 6:23 am #

    Each time i search for the answer of a question, your blog solves it ! Great work, thanks so much!

  57. Avatar
    zeinab July 22, 2019 at 5:08 am #

    As usual, great tutorial

  58. Avatar
    alilouche August 6, 2019 at 8:10 am #

    I can’t save my model using your instructions, it does not inform me an error but does not register
    and thanks for your help

    • Avatar
      Jason Brownlee August 6, 2019 at 2:00 pm #

      Sorry to hear that. Perhaps try reducing your example to the simplest possible code example?

      Perhaps try posting your code and issue to stackoverflow?

  59. Avatar
    zeinab August 13, 2019 at 12:40 pm #

    How can I load weights when I use a custom metrics?

    • Avatar
      Jason Brownlee August 13, 2019 at 2:37 pm #

      As follows:
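
      # a sketch of one way to do it (not necessarily the original snippet):
      # keep the custom metric definition in scope and tell Keras how to resolve it
      from tensorflow.keras import backend as K
      from tensorflow.keras.models import load_model

      def my_metric(y_true, y_pred):  # hypothetical custom metric
          return K.mean(K.abs(y_true - y_pred))

      model = load_model("bestweights.hdf5", custom_objects={"my_metric": my_metric})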

      • Avatar
        zeinab August 14, 2019 at 6:55 pm #

        Thank you, Jason.

        But what about the case of load_weights:
        model.load_weights("bestweights.hdf5")
        unfortunately, this function, cannot see the custom metric.

        • Avatar
          Jason Brownlee August 15, 2019 at 7:59 am #

          Yes, you must define the metric and load the weights in the same scope.

          • Avatar
            zeinab August 15, 2019 at 8:13 am #

            sorry, but I donot understand what do you mean?

  60. Avatar
    zeinab August 14, 2019 at 9:32 pm #

    For a certain problem, I try to solve it using CNN model on 5 cross validation, I find its performance through the average of the loss of the 5 folds and the average of the 5 folds accuracy. (I stop training when i reach the minimum validation loss)
    1- Is this is a right way for calculating a model performance?

    Then I evaluate the same problem on the same dataset but using LSTM.

    2- Now, I want to select the best model, How?

    3- Does the best model is the model with the lowest loss or the highest accuracy or what?

  61. Avatar
    zeinab August 15, 2019 at 2:43 am #

    My problem is regression not classification?

  62. Avatar
    zeinab August 15, 2019 at 8:19 am #

    What about choosing the best model (CNN, LSTM, …) applied on the same problem and the same dataset?

  63. Avatar
    zeinab August 15, 2019 at 8:35 am #

    I apply more than one algorithm(CNN, LSTM) to solve my problem (text similarity). How can I select the best algorithm?

    Does the best algorithm is the algorithm that has the lowest validation loss?

    • Avatar
      Jason Brownlee August 15, 2019 at 2:18 pm #

      It is common to choose a model based both on complexity (minimized) and on skill (optimizing a domain-specific metric).

      Loss is a good proxy for the domain specific metric, but hard to communicate to subject matter experts/stakeholders.

  64. Avatar
    ylnhari August 15, 2019 at 9:27 pm #

    Hi,

    If i want to check point both the best model and model at last epoch when the training halted in middle because of some other reason , how to do that ?

    • Avatar
      Jason Brownlee August 16, 2019 at 7:51 am #

      Perhaps configure a different ModelCheckpoint instance for each case?

  65. Avatar
    kumar August 16, 2019 at 9:12 pm #

    Dear Jason,

    We are using the following statements to save the model.

    model.compile(optimizer=’adam’, loss=’mse’, metrics=[‘accuracy’])
    filepath=”model_lstm_10M_{epoch:02d}.h5″
    checkpoint = ModelCheckpoint(filepath, period=1000, verbose=1, save_best_only=False)
    #tbCallBack = TensorBoard(log_dir=’./log’, histogram_freq=0, write_graph=True, write_images=True)
    callbacks_list = [checkpoint]

    # fit model
    model.fit(train_data, target, epochs=10000,batch_size=4,callbacks=callbacks_list,verbose=2)
    model.save(‘model_10M_lstm_100.h5’)

    We stop training at 2300 epochs, when we start again, it starts from epoch 1 I want to continue from previous epochs (2300) . What are the changes we have to do to achieve this?

    Thanks inadvance

    • Avatar
      Jason Brownlee August 17, 2019 at 5:42 am #

      Training will always start at epoch 0.

      If you load the model and start training again, it will start with the weights from the end of the last run. Only the epoch number will reset, not the weights.

      If you know how many epochs were completed, you can subtract that from the number of epochs you wish to use in the second run.

  66. Avatar
    Jacky QIN October 17, 2019 at 1:47 pm #

    Hi, Jason. Your course is pretty good, I get a lot. But there is a little bug, I guess that’s the API have been updated. The code:

    checkpoint = ModelCheckpoint(filepath, monitor=’val_accuracy’, verbose=1, save_best_only=True, mode=’max’)

    The argument ‘monitor’ should be ‘val_acc’ not ‘val_accuracy’

    • Avatar
      Jason Brownlee October 17, 2019 at 1:54 pm #

      The examples assume Keras 2.3 or higher where you must use val_accuracy.

      For older versions of Keras, you must use val_acc.

  67. Avatar
    Eduardo October 28, 2019 at 2:32 am #

    Hi Jason, do you know of a way to also save the epoch number of the last checkpointed model?

    What I want to do is, after training, evaluate the difference between Train and Validation loss of the training epoch corresponding to the “best checkpointed model” to have a reference of the error gap at that exact point.

    Right now I’m calculating that “error gap” from the results of the model validation: train loss vs test loss.

    Thank you in advance!

  68. Avatar
    Walid November 7, 2019 at 3:03 am #

    Great clear post
    is not ModelCheckpoint saving the full model with weights?

    I think “save_weights_only” is by default false

  69. Avatar
    Yogeeshwari February 11, 2020 at 11:06 pm #

    I am trying to save my model after every epoch. I am monitoring for accuracy. After first epoch, I get an OS error stating “Unable to open file: name = /logs/weights-improvement-01-0.55.hdf5′, errno = 2, error message = ‘No such file or directory’, flags = 13, o_flags = 242)”. Kindly let me know.

    • Avatar
      Jason Brownlee February 12, 2020 at 5:47 am #

      Perhaps try saving to a different location, e.g. /tmp/

  70. Avatar
    David April 16, 2020 at 8:44 pm #

    Hey. Great blog. I wanted to ask how can I do something (e.g save an image) when it is best epoch.

    • Avatar
      Jason Brownlee April 17, 2020 at 6:19 am #

      Thanks!

      You could create your own call back, and perform any action you like on any condition you like.

  71. Avatar
    Yunes April 22, 2020 at 2:43 pm #

    Hi Jason! Thank you very much for the tutorial!
    I have a doubt, selecting the model weights that get the best performance over the validation set (monitor = val_accuracy, mode = ‘max’), wouldn’t that be a somewhat optimistic result?
    On the other hand, will the control points always be applied based on the validation set?

    • Avatar
      Jason Brownlee April 23, 2020 at 5:56 am #

      It could be if the validation dataset is small or not representative.

  72. Avatar
    Maria June 24, 2020 at 3:42 am #

    Hi Jason, and thank you for your contribution.

    I just need to clarify something.

    After designing a DNN, we
    1. compile()
    2. fit()
    3. evaluate()

    If the model at the last epoch is not the one with the best weights, why we do not always use callback to save the best model, and evaluate with those ones:
    1. compile()
    2.fit() with callback to save the best weights
    3. load best weights
    4. compile()
    5. evaluate()

    • Avatar
      Jason Brownlee June 24, 2020 at 6:37 am #

      Sounds right.

    • Avatar
      Durga August 18, 2022 at 4:31 pm #

      Hi @Jason, I see two options to load the model and evaluate it.

      1. if we set restore_best_weights = True in EarlyStopping, we can use model.evaluate as it restores the best weight (not necessarily from the last epoch) or we can model.save(model_path) and realod later to evaluate.

      early_stopping = EarlyStopping(
      …..
      restore_best_weights=True)

      fit1 = model.fit(….., callbacks = [early_stopping])
      eval1 = mode.evaluate(test_data)
      model.save(model_path)
      saved_model = tf.keras.models.load_model(model_path)
      saved_model.evaluate(test_data)

      2. Other option is to use checkpoint
      model_checkpoint = ModelCheckpoint(monitor=’val_loss’, verbose=2, save_best_only=True,
      filepath=checkpoint_path)
      fit2 = model.fit(….., callbacks = [model_checkpoint])

      checkpoint_model = tf.keras.models.load_model(checkpoint_path)
      eval2 = checkpoint_model.evaluate(test_daa)

      Generally we used both callbacks model.fit(….., callbacks = [early_stopping,model_checkpoint])

      In this case if restore_best_weights=True, both eval1 and eval2 will be same.

      However if restore_best_weights=False, eval1 is worse than eval2 if last epoch of fit1 is not the best one.

  73. Avatar
    sei September 18, 2020 at 8:50 pm #

    Hii @Jason …during training does the model improved or only weights…?

    • Avatar
      Jason Brownlee September 19, 2020 at 6:53 am #

      The weights are the model. Training changes the weights.

  74. Avatar
    Antonio February 18, 2021 at 9:22 am #

    Hi Jason,

    Nice tutorial, bu I can’t run it.
    I get the following error in line #15 model.add(Dense(12, input_dim=8, activation=’relu’))

    module ‘tensorflow.python.framework.ops’ has no attribute ‘_TensorLike’
    File “/Users/mact/Projects/src/test.py”, line 18, in get_model
    model.add(Dense(12, input_dim=8, activation=’relu’))

    Thx in advance

  75. Avatar
    tiennguyen February 20, 2021 at 1:53 pm #

    Thank for sharing Jason Brownlee.
    I have a question. Currently, checkpoint saved every each epochs and improvement, If I want to save after N epochs or N batch_size how do I have to do. Can you suggest me an idea? thanks a lot.

    • Avatar
      tiennguyen February 20, 2021 at 2:07 pm #

      And I see the parameter save_weights_only = False, so the last example of topic, you do not have to create model by some following commands: # create model
      model = Sequential()
      model.add(Dense(12, input_dim=8, activation=’relu’))
      model.add(Dense(8, activation=’relu’))
      model.add(Dense(1, activation=’sigmoid’))
      because in file saved model and weights.

      • Avatar
        Jason Brownlee February 21, 2021 at 6:06 am #

        Correct, we can load the architecture and weights in one go.

    • Avatar
      Jason Brownlee February 21, 2021 at 6:06 am #

      You’re welcome.

      You might have to enumerate epochs manually and save with an if-statement, or write a custom callback.

  76. Avatar
    Asif Munir August 19, 2022 at 8:07 pm #

    Dear Sir,
    May I ask your help as I found performance fluctuations a lot in my model implementation. I used a dataset with 430 samples and 146 predictors.

    Below are the few few epochs showing fluctuations in both validation and training accuracies.

    Epoch 230: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4888 – accuracy: 0.7093 – val_loss: 11.5274 – val_accuracy: 0.5756 – 57ms/epoch – 9ms/step
    Epoch 231/500

    Epoch 231: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4977 – accuracy: 0.6570 – val_loss: 11.3789 – val_accuracy: 0.5930 – 57ms/epoch – 10ms/step
    Epoch 232/500

    Epoch 232: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4876 – accuracy: 0.6744 – val_loss: 11.9553 – val_accuracy: 0.6047 – 55ms/epoch – 9ms/step
    Epoch 233/500

    Epoch 233: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4989 – accuracy: 0.7035 – val_loss: 12.4012 – val_accuracy: 0.6047 – 68ms/epoch – 11ms/step
    Epoch 234/500

    Epoch 234: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.5058 – accuracy: 0.6628 – val_loss: 11.9185 – val_accuracy: 0.6047 – 61ms/epoch – 10ms/step
    Epoch 235/500

    Epoch 235: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4998 – accuracy: 0.6221 – val_loss: 12.0681 – val_accuracy: 0.6570 – 57ms/epoch – 9ms/step
    Epoch 236/500

    Epoch 236: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4847 – accuracy: 0.7093 – val_loss: 12.7811 – val_accuracy: 0.5988 – 53ms/epoch – 9ms/step
    Epoch 237/500

    Epoch 237: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4825 – accuracy: 0.7035 – val_loss: 12.3784 – val_accuracy: 0.6279 – 75ms/epoch – 12ms/step
    Epoch 238/500

    Epoch 238: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4771 – accuracy: 0.7267 – val_loss: 12.3361 – val_accuracy: 0.5814 – 55ms/epoch – 9ms/step
    Epoch 239/500

    Epoch 239: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4811 – accuracy: 0.6919 – val_loss: 11.9446 – val_accuracy: 0.5756 – 60ms/epoch – 10ms/step
    Epoch 240/500

    Epoch 240: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4620 – accuracy: 0.7384 – val_loss: 12.3077 – val_accuracy: 0.6105 – 48ms/epoch – 8ms/step
    Epoch 241/500

    Epoch 241: val_accuracy did not improve from 0.69767
    6/6 – 0s – loss: 0.4947 – accuracy: 0.7093 – val_loss: 12.6721 – val_accuracy: 0.5988 – 57ms/epoch – 10ms/step

  77. Avatar
    Understander September 16, 2022 at 2:57 am #

    thank you Jason for very good explanation.
    Is it possible to use the saved weight for training the model again (by model.fit() ) with these weights ? How?

    Best regards

  78. Avatar
    Understander September 17, 2022 at 11:59 pm #

    Thank you

  79. Avatar
    HN November 20, 2022 at 8:21 am #

    Hi. Thank you for this wonderful blog!

    Is it normal for each epoch to take around 3 hours for image classification? I have a total of 200 epochs so it’ll take 600 hours to train the model. I’m max pooling in 2,2 strides for every layer with an initial image size of 244, 244.

    Also, I’m having to use cloud computing for this and it’s free for up to 12 hours in one go. So once it times out should I simply reload my saved weights and continue the training with a reduced epoch number.

    • Avatar
      James Carmichael November 20, 2022 at 11:39 am #

      Hi HN…Have you tried Google Colab with a GPU option?

  80. Avatar
    understander April 10, 2023 at 10:58 pm #

    Thank you Jason for very good explanation,
    1. in the “Checkpoint Best Neural Network Model Only” section, you say model weights are written to the file “weights.best.hdf5” but in ModelCheckpoint class you don’t set save_weights_only = True ; what is the reason for this?

    2. For saving, is this string: “weights.best.hdf5” fixed?
    for example, can this: “version3.weights.best.hdf5” also be used ?

    3. Can “weights.best.h5” be used instead of “weights.best.hdf5” ?
    Does it give the same result?

  81. Avatar
    Jeff Winchell December 3, 2023 at 9:44 am #

    The point of checkpoints in deep learning is mainly to resume training after a failure stops it before it is done.

    With keras, see this function for a simple method to be able to resume in such a case.

    https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/BackupAndRestore
