Last Updated on August 27, 2020

Deep learning models can take hours, days or even weeks to train.

If the run is stopped unexpectedly, you can lose a lot of work.

In this post you will discover how you can check-point your deep learning models during training in Python using the Keras library.

**Kick-start your project** with my new book Deep Learning With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

- **Update Mar/2017**: Updated for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
- **Update Mar/2018**: Added alternate link to download the dataset.
- **Update Sep/2019**: Updated for Keras 2.2.5 API.
- **Update Oct/2019**: Updated for Keras 2.3.0 API.

## Checkpointing Neural Network Models

Application checkpointing is a fault tolerance technique for long running processes.

It is an approach where a snapshot of the state of the system is taken in case of system failure. If there is a problem, not all is lost. The checkpoint may be used directly, or used as the starting point for a new run, picking up where it left off.

When training deep learning models, the checkpoint is the weights of the model. These weights can be used to make predictions as is, or used as the basis for ongoing training.

The Keras library provides a checkpointing capability through its callback API.

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how the file should be named and under what circumstances to make a checkpoint of the model.

The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset. You can specify whether to look for an improvement in maximizing or minimizing the score. Finally, the filename that you use to store the weights can include variables like the epoch number or metric.

The ModelCheckpoint can then be passed to the training process when calling the fit() function on the model.

Note, you may need to install the h5py library to output network weights in HDF5 format.
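As a rough illustration of the filename variables mentioned above, the sketch below expands a checkpoint filename template with Python's `str.format`. The `logs` dictionary and its values are made up for illustration, and it is assumed here that Keras fills in the template in a similar way, using the epoch number and the metrics recorded at the end of each epoch.

```python
# Sketch: how a checkpoint filename template expands.
# The metric values below are hypothetical, not real training output.
template = "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

# Metrics as they might be reported at the end of one epoch (made-up values).
logs = {"loss": 0.4512, "accuracy": 0.7821, "val_loss": 0.4803, "val_accuracy": 0.83858}

# {epoch:02d} pads the epoch to at least two digits;
# {val_accuracy:.2f} rounds the score to two decimal places.
# Keys in logs that the template does not mention are simply ignored.
filename = template.format(epoch=140, **logs)
print(filename)  # weights-improvement-140-0.84.hdf5
```

Only names that appear in the template are used, so the same metrics can be combined with a different template, for example one that includes `val_loss` instead.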


## Checkpoint Neural Network Model Improvements

A good use of checkpointing is to output the model weights each time an improvement is observed during training.

The example below creates a small neural network for the Pima Indians onset of diabetes binary classification problem. The example assumes that the *pima-indians-diabetes.csv* file is in your working directory.

You can download the dataset from here:

The example uses 33% of the data for validation.

Checkpointing is set up to save the network weights only when there is an improvement in classification accuracy on the validation dataset (`monitor='val_accuracy'` and `mode='max'`). The weights are stored in a file that includes the epoch number and the score in the filename (`weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5`).

```python
# Checkpoint the weights when validation accuracy improves
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
import numpy
# fix the NumPy random seed (the value 7 is an arbitrary choice)
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkpoint: write a new file each time validation accuracy improves
filepath="weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output (truncated for brevity).

```
...
Epoch 00134: val_accuracy did not improve
Epoch 00135: val_accuracy did not improve
Epoch 00136: val_accuracy did not improve
Epoch 00137: val_accuracy did not improve
Epoch 00138: val_accuracy did not improve
Epoch 00139: val_accuracy did not improve
Epoch 00140: val_accuracy improved from 0.83465 to 0.83858, saving model to weights-improvement-140-0.84.hdf5
Epoch 00141: val_accuracy did not improve
Epoch 00142: val_accuracy did not improve
Epoch 00143: val_accuracy did not improve
Epoch 00144: val_accuracy did not improve
Epoch 00145: val_accuracy did not improve
Epoch 00146: val_accuracy improved from 0.83858 to 0.84252, saving model to weights-improvement-146-0.84.hdf5
Epoch 00147: val_accuracy did not improve
Epoch 00148: val_accuracy improved from 0.84252 to 0.84252, saving model to weights-improvement-148-0.84.hdf5
Epoch 00149: val_accuracy did not improve
```

You will see a number of files in your working directory containing the network weights in HDF5 format. For example:

```
...
weights-improvement-53-0.76.hdf5
weights-improvement-71-0.76.hdf5
weights-improvement-77-0.78.hdf5
weights-improvement-99-0.78.hdf5
```

This is a very simple checkpointing strategy.

It may create a lot of unnecessary check-point files if the validation accuracy moves up and down over training epochs. Nevertheless, it will ensure that you have a snapshot of the best model discovered during your run.

## Checkpoint Best Neural Network Model Only

A simpler check-point strategy is to save the model weights to the same file, if and only if the validation accuracy improves.

This can be done easily using the same code from above and changing the output filename to be fixed (not include score or epoch information).

In this case, model weights are written to the file “weights.best.hdf5” only if the classification accuracy of the model on the validation dataset improves over the best seen so far.

```python
# Checkpoint the weights for best model on validation accuracy
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import ModelCheckpoint
import numpy
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# checkpoint: fixed filename, so each improvement overwrites the previous file
filepath="weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
# Fit the model
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10, callbacks=callbacks_list, verbose=0)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output (truncated for brevity).

```
...
Epoch 00139: val_accuracy improved from 0.79134 to 0.79134, saving model to weights.best.hdf5
Epoch 00140: val_accuracy did not improve
Epoch 00141: val_accuracy did not improve
Epoch 00142: val_accuracy did not improve
Epoch 00143: val_accuracy did not improve
Epoch 00144: val_accuracy improved from 0.79134 to 0.79528, saving model to weights.best.hdf5
Epoch 00145: val_accuracy improved from 0.79528 to 0.79528, saving model to weights.best.hdf5
Epoch 00146: val_accuracy did not improve
Epoch 00147: val_accuracy did not improve
Epoch 00148: val_accuracy did not improve
Epoch 00149: val_accuracy did not improve
```

You should see the weight file in your local directory.

```
weights.best.hdf5
```

This is a handy checkpoint strategy to always use during your experiments.

It will ensure that your best model is saved for the run for you to use later if you wish. It avoids you needing to include code to manually keep track and serialize the best model when training.
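The difference between the two strategies can be sketched in plain Python. This is a simplified illustration of "save only when the monitored score improves", not the Keras implementation; the helper `files_written` and the validation accuracies are made up for the example.

```python
# Simplified sketch of "save only on improvement" (not the Keras source code).
# The validation accuracies below are made-up values for illustration.
val_accuracy_per_epoch = [0.70, 0.74, 0.73, 0.76, 0.76, 0.78, 0.77]

def files_written(scores, template):
    """Return the filenames written, in order, when a file is saved each
    time the score beats the best seen so far (as with mode='max')."""
    best = float("-inf")
    written = []
    for epoch, score in enumerate(scores, start=1):
        if score > best:  # strict improvement only in this sketch
            best = score
            written.append(template.format(epoch=epoch, val_accuracy=score))
    return written

# Strategy 1: epoch and score in the filename, one new file per improvement.
per_improvement = files_written(
    val_accuracy_per_epoch,
    "weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5")

# Strategy 2: fixed filename, each improvement overwrites the previous file.
best_only = files_written(val_accuracy_per_epoch, "weights.best.hdf5")

print(len(set(per_improvement)))  # 4 distinct files left on disk
print(len(set(best_only)))        # 1 file left on disk
```

Both strategies save at the same epochs; the only difference is whether earlier snapshots are kept or overwritten.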

## Loading a Check-Pointed Neural Network Model

Now that you have seen how to checkpoint your deep learning models during training, you need to review how to load and use a checkpointed model.

The checkpoint only includes the model weights. It assumes you know the network structure. This too can be serialized to file in JSON or YAML format.

In the example below, the model structure is known and the best weights are loaded from the previous experiment, stored in the working directory in the weights.best.hdf5 file.

The model is then used to make predictions on the entire dataset.

```python
# How to load and use weights from a checkpoint
from keras.models import Sequential
from keras.layers import Dense
import numpy
# create model (must match the architecture used when the weights were saved)
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# load weights
model.load_weights("weights.best.hdf5")
# Compile model (required to make predictions)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print("Created model and loaded weights from file")
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# estimate accuracy on whole dataset using loaded weights
scores = model.evaluate(X, Y, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example produces the following output.

```
Created model and loaded weights from file
acc: 77.73%
```

## Summary

In this post you have discovered the importance of checkpointing deep learning models for long training runs.

You learned two checkpointing strategies that you can use on your next deep learning project:

- Checkpoint Model Improvements.
- Checkpoint Best Model Only.

You also learned how to load a checkpointed model and make predictions.

Do you have any questions about checkpointing deep learning models or about this post? Ask your questions in the comments and I will do my best to answer.

79.56% with a 3 hidden layer architecture , 24 neurons each

Great blog , learned a lot

Thanks Gerrit, I’m glad you found it useful.

Hi Jason, how can I checkpoint a model with my custom metric? The example codes you gave above is monitor = ‘val_acc’ , but when I replace it with monitor = my_metric , it displays the following warning message:

/usr/local/lib/python3.5/dist-packages/keras/callbacks.py:286: RuntimeWarning: Can save best model only with available, skipping.

‘skipping.’ % (self.monitor), RuntimeWarning)

So how should I do with this?

Great question Lau,

I have not tried to check point with a custom metric, sorry. I cannot give you good advice.

Hi,

Try something like this:

```python
model = load_model("your.model.h5", custom_objects={"my_metric": my_metric})
```

Exactly! Thanks for sharing.

how about model.add_loss() & model.add_metric()?

Hi Jason,

A great post!

I saved the model and weights using callbacks, ModelCheckpoint. If I want to train it continuously from the last epoch, how to set the model.fit() command to start from the previous epoch? Sometimes, we need to change the learning rates after several epochs and to continue training from the last epoch. Your advice is highly appreciated.

Great question Xu Zhang,

You can load the model weights back into your network and then start a new training process.

Thank you Jason. I ran a epoch and got the loss down to 353.6888. The session got disconnected so I used the weights as follows. However, I dont see a change in loss. I am loading the weights correctly ? Here is my code

```
>>> filename = "weights-improvement-01--353.6888-embedding.hdf5"
>>> model.load_weights(filename)
>>> model.compile(loss='binary_crossentropy', optimizer='adam')
>>> model.fit(dataX_Arr, dataY_Arr, batch_size=batch_size, nb_epoch=15, callbacks=callbacks_list)
Epoch 1/15
147744/147771 [============================>.] - ETA: 1s - loss: -353.6892
Epoch 00000: loss improved from inf to -353.68885, saving model to weights-
```

It looks like you are loading the weights correctly.

Hi Jason, in continuation to the point above of continuing training from the saved checkpoint, is it required that I set the random seed initially and use the same when I train the second time? What I have noticed is that, the first time I train, the loss seems to reduce. But when I load the checkpoint and continue training, the performance suddenly becomes very poor.

No.

You can expect variance in the model across runs, more here:

https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code

Great post again Jason, thank you!

I’m glad you found it useful Anthony.

Hi Jason, thank you for your tutorial. I want to implement this checkpoint function in iris-flower model script but failed to do it. It keeps showing this error and I do not know how to solve it.

I put the ‘model checkpoint’ line after the ‘baseline model’ function and add ‘callbacks’

RuntimeError: Cannot clone object , as the constructor does not seem to set parameter callbacks

Thank you for your help

Hi Nasarudin, sorry I am not sure of the cause of this error.

I believe you cannot use callbacks like the checkpointing when using the KerasClassifier as is the case in the iris tutorial:

http://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

Hi Jason. Thank you for your reply. I have some dataset that looks same with the iris dataset but it is a lot bigger. I thought maybe I can use the callbacks when I train the dataset.

Since training a large dataset might take a lot of time, do you have any suggestion to implement other function that can do checkpoint when using KerasClassifier?

By the way, you can refer to this link for the script that I wrote to implement the checkpoint function with KerasClassifier.

http://stackoverflow.com/questions/41937719/checkpoint-deep-learning-models-in-keras

Thank you.

Hi Nasarudin,

I would recommend using a standalone Keras model rather than the sklearn wrapper if you wish to use checkpointing.

Great tutorials love your page. I got a question: I am trying to optimize a model based on val_acc using ModelCheckpoint. However, I get great results rather quickly (example: val_acc = .98). This is my max validation, therefore, the model that will be saved (save_best_only). However several epochs gave me the same max validation and the latter epochs have higher training accuracy. Example Epoch 3: acc = .70, val_acc = .98, Epoch 50: acc = 1.00, val_acc = .98. Clearly, I would want to save the second one which generalizes on the data plus shows great training. How do I do this without having to save every epoch? Basically, I want to pass a second sorting parameter to monitor (monitor=val_acc,acc).

Thanks.

Great question, off the cuff, I think you can pass multiple metrics for check-pointing.

Try multiple checkpoint objects?

Here’s the docs if that helps:

https://keras.io/callbacks/#modelcheckpoint

Ideally, you do want a balance between performance on the training dataset and on the validation dataset.

Yes, it’s not letting me pass an array list or multiple parameters it’s only expecting 1 parameter base on literature, for now, I have to settle for using val_acc for bottlenecks/top layer and val_loss for the final model, though I would prefer more control. maybe I’ll ask for it in a feature request.

Great idea, add an issue on the Keras project.

Hi Abner, do you solve this problem now？

hope to receive your reply.

You could define a custom metric that incorporates both val_acc and val_loss

Hi Jason, thank you for your tutorial. I need to ask one question, if my input contains two images with different labels (the label represents the popularity of the image). I need to know how to feed this pair of images such that the first image pass through CNN1 and the second one pass through CNN2 Then I can merge them using the merge layer to classify which one is more popular than the other one. How can I use the library in order to handle the two different inputs?

Hi Fatma,

Perhaps you can reframe your problem to output the popularity for one image and use an “if” statement to rank the relative popularities of each image separately?

I need to compare between the popularity value of the two input images such that the output will be the first image is high popular than the second image or vice versa then when I feed one test image (unlabeled) it should be compared with some baseline of the training data to compute its popularity

Hi Jason, great tutorial as always.

I want to ask regarding ‘validation_split’ in the script. What is the difference between this script that has ‘validation_split’ variable and the one from here which did not have ‘validation_split’ variable http://machinelearningmastery.com/tutorial-first-neural-network-python-keras/ ?

Great question Nasarudin.

If you have a validation_split, it will give you some feedback during training about the performance on the model on “unseen” data. When using a check-point, this can be useful as you can stop training when performance on validation data stops improving (a sign of possible overfitting).

thanks!

You’re very welcome!

Hi Jason! Thank you very much for the tutorial!

Can you provide a sample code to visualize results as follows: plot some images from validation dataset with their 5 top label predictions?

Yes you could do this using the probabilistic output from the network.

Sorry, I do not have examples at this stage.

Thanks for the tutorial Jason 🙂

I’m glad you found it useful Pratik.

Hello Sir, i want to thank you for this cool tutorial,

Currently i am checkpointing my model every 50 epochs. I also want to checkpoint any epoch whose val_acc is better but is not the 50th epoch. For ex. i have checkpointed 250th epoch, but val_acc for 282nd epoch is better and i have to save it but as i have specified the period to be 50, i cant save the 282nd epoch.

Should i implement both attributes save_best_only=true and period=50 in ModelCheckPoint ?

Sorry, I’m not sure I follow.

You can use the checkpoint to save any improvement to the model regardless of epoch. Perhaps that would be an approach to try?

Hi Jason,

I love your tutorial very much but just find a little bit confused about how to combine the check point with the cross validation. As is shown in your other tutorial, we don’t explicitly call model.fit when doing cross validation. Could you give me an example how to add check-point into this?

You may have to run the k-fold cross-validation process manually (e.g. loop the folds yourself).
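A minimal sketch of what looping the folds yourself might look like, with one checkpoint file per fold. Everything here is illustrative: `kfold_indices`, `run_cross_validation` and the filename pattern are made-up names, and `train_fold` is a hypothetical function you would write yourself to build a fresh model, create a ModelCheckpoint with the given path, and call `model.fit()` on the rows selected by the indices.

```python
# Sketch: loop over k folds manually so each fold can use its own checkpoint file.
# `train_fold` is a hypothetical callable supplied by you; it is expected to build
# a fresh model, create ModelCheckpoint(checkpoint_path, ...) and call model.fit().

def kfold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds (no shuffling)."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def run_cross_validation(n_samples, k, train_fold):
    """Call train_fold once per fold and collect whatever score it returns."""
    scores = []
    for fold, (train_idx, val_idx) in enumerate(kfold_indices(n_samples, k), start=1):
        # A separate file per fold, so one fold does not overwrite another.
        checkpoint_path = "weights.fold{:d}.best.hdf5".format(fold)
        scores.append(train_fold(train_idx, val_idx, checkpoint_path))
    return scores
```

In practice you would probably shuffle the data first, or take the index splits from a library such as scikit-learn's `KFold`, and keep only the per-fold loop and checkpoint naming from this sketch.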

Love your articles. Learnt lots of things here.

Please keep your tutorials going!

Thanks, I’m glad to hear that.

Jason, your blog is amazing. Thanks for helping us out in learning this awesome field.

I’m glad you find it useful Abolfazl.

Hi, great post! I want to save the training model not the validation model, how to set the parameters in checkpoint()?

The trained model is saved, there is no validation model. Just an evaluation of the model using validation data.

Is it possible to checkpoint in between epochs?

I have an enormous dataset that takes 20hrs per epoch and it’s failed before it finished an epoch. It would be great if I could checkpoint every fifth of an epoch or so.

I would recommend using a data generator and then using a checkpoint after a fixed number of batches. This might work.

Hi Jason, great blog. Do you also happen to know how to save/store the epoch number of the last observed improvement during training? That would be very useful to study overfitting

Yes, you could add the “epoch” variable to the checkpoint filename.

For example:

is it possible to use Keras checkpoints together with Gridsearch? in case Gridsearch crashes?

Not a good idea Jes. I’d recommend doing a custom grid search.

Receiving following error:

```
TypeError Traceback (most recent call last)
in ()
72
73 filepath="weights.best.hdf5"
---> 74 checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=0, save_best_only=True,node='max')
75 callbacks_checkpt = [checkpoint]
76
TypeError: __init__() got an unexpected keyword argument 'node'
```

Looks like a typo: the argument should be mode=’max’, not node=’max’.

Double check that you have copied the code exactly from the tutorial?

Hi Jason!

Thanks for the informative post.

I have one question – in case of unexpected stoppage of the run, we have the best model weights for the epochs DONE SO FAR. How can we use this checkpoint as a STARTING point to continue with the remaining epochs?

You can load the weights, see the example in the tutorial of exactly this.

Thanks for such a nice explanation. I want to ask if we are performing some experiments & want the neural network model to achieve high accuracy on for test set. Can we use this method to find the best tuned network or the highest possible accuracy.?????

This method can help find and save a well performing model during training.

very cool.

Thanks.

Based on this example, how long does it take typically to save or retrieve a checkpoint model?

Very fast, just a few seconds for large models.

Thanks, Jason. Could you kindly give an estimate, more than 10 secs?

Large models can take about a minute.

Nice post.

However, I have one question. Can I get number of epochs that model was trained for in other way than reading its filename?

Here’s my use case: upon loading my model I want to restore the training exactly from the point I saved it (by passing a value to initial_epoch argument in .fit()) for the sake of better-looking graphs in TensorBoard.

For example, I trained my model for 2 epochs (got 2 messages: “Epoch 1/5”, “Epoch 2/5”) and saved it. Now, I want to load that model and continue training from 3rd epoch (I expect getting message “Epoch 3/5” and so on).

Is there a better way than saving epochs to filename and then getting it from there (which seems kinda messy)?

You could read this from the filename. It’s just a string.

You could also have a callback that writes the last epoch completed to a file, and overwrite this file each time.
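The first suggestion, reading the epoch back from the filename, could be sketched as follows. The helpers `parse_checkpoint` and `latest_checkpoint` are hypothetical names, and the pattern assumes the filename template used earlier in this post with a non-negative score.

```python
import re

# Assumes filenames follow the template used earlier in this post:
# weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5
PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def parse_checkpoint(filename):
    """Return (epoch, score) parsed from a checkpoint filename, or None."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def latest_checkpoint(filenames):
    """Return (filename, epoch) for the checkpoint with the highest epoch, or None."""
    candidates = []
    for name in filenames:
        parsed = parse_checkpoint(name)
        if parsed is not None:
            candidates.append((parsed[0], name))
    if not candidates:
        return None
    epoch, name = max(candidates)
    return name, epoch

files = ["weights-improvement-53-0.76.hdf5", "weights-improvement-99-0.78.hdf5",
         "weights-improvement-77-0.78.hdf5", "notes.txt"]
print(latest_checkpoint(files))  # ('weights-improvement-99-0.78.hdf5', 99)
```

The list of names would normally come from something like `os.listdir()`. If the epoch in the filename counts completed epochs starting from 1, which is assumed here, the parsed value could be passed on as `initial_epoch` when training is resumed.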

Great post and a fantastic blog! I can’t thank you enough!

Thanks, I’m glad to hear that.

Hi there. Great posts! Quick question: have you run into this problem? “callbacks.py:405: RuntimeWarning: Can save best model only with val_acc available, skipping.”

Running on AWS GPU Compute instance, fyi. I am not going straight from the Keras.Sequence() model… instead, I am using the SearchGridCV as I am trying to perform some tuning tasks but want to save the best model. Any suggestions?

I would recommend not combining checkpointing with cross validation or grid search.

Hi Jason,

Thanks a lot for all your posts, really helpful. Can you explain to me why some epochs improve validation accuracy whilst previous epochs did not? If you do not use any random sampling in your dataset (e.g. no dropout), how can it be that epoch 12 increases validation score while epoch 11 does not? Aren’t they based on the same starting point (the model outputted by epoch 10)?

thanks!

The algorithm is stochastic, so not every update improves the scores across all of the data.

This is a property of the learning algorithm, gradient descent, that is doing its best, but cannot “see” the whole state of the problem, but instead operates piece-wise and incrementally. This is a feature, not a bug. It often leads to better outcomes.

Does that help?

I was curiosity what the different of these two version???

It was seemed that just filepath was different??

One keeps every improvement, one keeps only the best.

Why the filepath variable was so magic? Just change the filepath variable can make these different result. How is this achieved?

It is the Keras API.

In the last, if I add one line of code, `model.save_weights('weights.hdf5')`, what is the difference between these weights and the ModelCheckpoint best weights?

The difference is you are saving the weights from the final model after the last epoch instead of when a change in model skill has occurred during training.

When I add one line of code, `model.save_weights('weights.hdf5')`, to save the weights from the final model, and I also save the ModelCheckpoint best weights, I found that these two hdf5 files were not the same. I was confused about why the final model weights I saved weren’t the best model weights.

Yes, that is the idea of checkpointing: the model at the last epoch may not be the best model, in fact it often is not.

Checkpointing will save the best model along the way.

Does that help? Perhaps re-read the tutorial to make this clearer?

Ok, thanks. In a sense, when we have used checkpoint, it is meaningless to use `model.save_weights('weights.hdf5')` again.

Generally, this is true.

Hi Jason,

How do I checkpoint a regression model. Is my metric accuracy or mse in such a case? And what should I monitor in such case in the modelcheckpoint? I am training a time series model.

Thanks

The metric will be loss or error. You can specify a suite of metrics to record during training:

https://machinelearningmastery.com/custom-metrics-deep-learning-keras-python/

Pick one and use it to checkpoint.

Hi Jason

While running for 50 epochs, I am checkpointing and saving the weights after every 5 epochs.

Now after 27th, VM disconn.

Then I reconnect and compile the model after loading the saved weights. Now if I evaluate, I shall get the score on the best weight till 27th epoch. But since only 25 epochs are considered, accuracy will not be good, right?

In that case, how do I continue the remaining 25 epochs with the saved weights?

Thanks

You can load a set of weights and continue training the model.
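A minimal sketch of how the remaining epochs might be run, assuming `model` is a Keras model that has been rebuilt with the same architecture and compiled. The helper name `resume_training` and the weights filename in the usage comment are placeholders; `initial_epoch` is the `fit()` argument another reader mentions in the comments above.

```python
# Sketch: continue an interrupted run from the last saved weights.
# `model` is assumed to be a Keras model that has already been rebuilt with the
# same architecture and compiled; paths and epoch numbers are placeholders.

def resume_training(model, X, Y, weights_path, completed_epochs, total_epochs, **fit_kwargs):
    """Load saved weights, then train from completed_epochs up to total_epochs."""
    model.load_weights(weights_path)
    # initial_epoch tells fit() where to start counting; epochs is the index of
    # the final epoch, not the number of additional epochs to run.
    return model.fit(X, Y, initial_epoch=completed_epochs, epochs=total_epochs, **fit_kwargs)

# Example for the case above: weights saved at epoch 25 of a planned 50
# (the filename is hypothetical).
# resume_training(model, X, Y, "weights-25.hdf5", completed_epochs=25, total_epochs=50,
#                 validation_split=0.33, batch_size=10, callbacks=callbacks_list)
```

Note that a weights-only checkpoint does not include the optimizer state, so the first updates after resuming may not match what an uninterrupted run would have done.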

Hi jason, thanks for the tutorial. If i want to extract weight values of all layer from my model, how to do that? thanks

I believe there is a get_weights() function on each layer.

Hi Jason,

Thanks for all the great tutorials, including this one. Is it possible to Grid Search model hyper-parameters and check-point the model at the same time? I can do either one independently, but not both at the same time. Can you show us?

Thanks,

Michael

Yes, but you will have to write the for loops for the grid search yourself. sklearn won’t help.

Thanks a lot Jason for excellent tutorial.

I’m glad it helped.

Where are the key values “02d” for epoch and “.2f” for val_acc coming from?

String formatting rules. They have been around for perhaps 50 years in perhaps all languages.

Maybe this will help:

https://mkaz.blog/code/python-string-format-cookbook/

Is there a way to checkpoint model weights to memory instead of a file?

Also in your example you are maximizing val_acc. What are the merits of maximizing val_acc vs minimizing val_loss?

Thank you for a great post!

I’m sure you could write a custom call back to do this. I don’t have an example.

You can choose what is important to you, accuracy might be more relevant to you when using the model to make predictions.

Thanks a lot, it is good practice to consciously design and implement checkpoints in our model.

Thanks.

Hi Jason, thanks for the post (as well as many others I have read!).

I was wondering: after fitting with `model.fit(....)` and using checkpoint you will have the weights saved in an external file. But what’s the specific state of the `model` instance after the training? Will it have the best weights or it will have the last weights calculated during the training?

So, to sum up. If I want to do a prediction on a test set immediately after the training/fitting should I load the best weights from the external file and then do the prediction or I could directly use `model.predict(...)` immediately after `model.fit(...)`?

Thanks a lot for your support!

The file will have the model weights at the time of the checkpoint. The weights can be loaded and used directly for making predictions. This is the whole point of checkpointing – to save the best model found during training.

Hey Jason, thanks for getting back to me. Yes that was clear to me. What’s not clear is what weights `model` has at the end of the training.

Lets suppose I’m using

```python
callbacks = [
    EarlyStopping(patience=15, monitor='val_loss', min_delta=0, mode='min'),
    ModelCheckpoint('best-weights.h5', monitor='val_loss', save_best_only=True, save_weights_only=True)
]
```

after training is done I’ll have the best weights saved on `best-weights.h5` but what are the weights stored in the `model` instance? If I do `model.evaluate(...)` (without loading `best-weights.h5`) will it use the best weights or just the weights corresponding to the last epoch?

They are the weights at the time the checkpoint is triggered. If it is triggered many times, you may have many weight files saved.

model instance will have the weights of last epoch not the best one

I have my owned pretrained model (.ckpt and .meta files). How to use such files to extract features of my dataset in form of matrix which rows represent samples and columns represent features?

Perhaps load the weights in Python manually, then try to populate a defined Keras model manually?

Thanks a lot for the great description. But I have to clarify one thing regarding this. When I train the model using two callback functions model *ModelCheckpoint* and

I don’t follow, can you elaborate please?

Hi Jason,

Thanks for tutorial. Iam wondering if this approach works with GridSearch and how can i put the checkpoint to track the results. Also, iam working with Colab Research Notebook right now, is there a way to detect process interruption and use a checkpoint to save the model ?

Thank you for your help again!!

No, a grid search and checkpointing are at odds.

I’m not familiar with “Colab Research Notebook”, what is it?

https://colab.research.google.com . Google initiative to promove notebooks with free virtual machines, kernels with GPU and TPU processing (yet experimental). But it disconnects from the kernel after a short time of no use, and the virtual machine can be unmounted after some hours. So its fundamental for deeplearning applications that you checkpoint and save the state to keep going after reconnecting.

Too bad for Grid Search. How can you do Fine Tuning (hyperparameters) and process babysitting (verify vizualization of results, possible overfits, variance-bias so on …) without checkpointing on Grid Search ?

thank you !

When you grid search, you want to know about what hyperparameters give good performance. The models are discarded – no checkpointing needed. Later, you can use good config to fit a new model.

Yah, thats right, but still, without being able to finish without being disconnected from the server i will never know the model, so still the checkpoint comes in hand. But if you have in mind any other think to do the tuning together with checkpoints, please let us know !

Thanks ! 🙂

Why do you need the model?

The models during a grid search are discarded. You only need the hyperparameters of the best performing model.

Update

There seems to be an issue with Keras cross validation and checkpointing that requires a workaround:

https://github.com/keras-team/keras/issues/836

I can’t understand why they closed the issue without a solution, so I just dropped my question there.

Will try to figure out how to do this …

“The models during a grid search are discarded. You only need the hyperparameters of the best performing model.”

Sorry, I wasn’t clear. I can’t run Keras on my laptop, it’s more than 12 years old. The only free infrastructure I found is Colab Notebook. To run through all the models in GridSearchCV I need to checkpoint and restart the computation, since the process in Google’s virtual machine won’t persist for more than a few hours, sometimes less.
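One way to reconcile the two points above is to checkpoint the search itself rather than the models: record each configuration’s score on disk as soon as it finishes, and skip the finished configurations after a restart. The sketch below is a minimal illustration using only the standard library. The function `evaluate_config`, the parameter grid and the results filename are hypothetical placeholders, not part of the tutorial or of scikit-learn’s GridSearchCV. On a hosted notebook, the results file would need to live on storage that survives the virtual machine.

```python
import itertools
import json
import os
import tempfile


def evaluate_config(config):
    # Hypothetical stand-in: in practice, build, fit and score a Keras model here.
    return 1.0 / (1.0 + abs(config["batch_size"] - 32) + 0.01 * config["epochs"])


def resumable_grid_search(grid, results_path, evaluate=evaluate_config):
    """Score every configuration in grid, saving progress after each one."""
    # Load scores from a previous (possibly interrupted) run, if any.
    results = {}
    if os.path.exists(results_path):
        with open(results_path) as f:
            results = json.load(f)

    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        key = json.dumps(config, sort_keys=True)
        if key in results:
            continue  # already scored before the interruption
        results[key] = evaluate(config)
        # Write to a temporary file first so a crash cannot corrupt the log.
        tmp_path = results_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, results_path)

    best_key = max(results, key=results.get)
    return json.loads(best_key), results[best_key]


# Hypothetical grid and results location.
grid = {"batch_size": [16, 32, 64], "epochs": [10, 20]}
results_path = os.path.join(tempfile.mkdtemp(), "grid_results.json")
best_config, best_score = resumable_grid_search(grid, results_path)
print(best_config, best_score)
```

Calling `resumable_grid_search` again with the same `results_path` re-evaluates nothing and returns the same best configuration, which can then be used to fit a final model.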

Hi Jason

I have a question regarding compiling the model after weights are loaded.

# load weights

model.load_weights("weights.best.hdf5")

# Compile model (required to make predictions)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

When I looked at the Keras API https://keras.io/models/model/#compile the description for the compile function says “Configures the model for training”. I am confused on why we need to compile the model to make predictions?

Also, thought you might like to know that the Pima Indians onset of diabetes binary classification problem data set is no longer available.

Thanks

Thanks, you might not need to compile the model after loading any longer, perhaps the API has changed.

I am training a model for about 100 epochs. Now suppose the electricity goes out after the model has run for 30 epochs, and I have model checkpoints saved in HDF5 format, made while monitoring val_acc.

In this kind of situation, how can I load the checkpoint into the same model to continue training from where it was interrupted? And will it continue training the model after epoch 30? It would be a great help if you could answer my questions.

Thanks in advance.

Yes, you can continue training.

Hi Jason, thanks for your tutorials, they are very helpful. I’ve implemented a model and then saved it with a checkpoint correctly. Unfortunately, when I reload the model (the same structure) with the saved weights I can’t obtain the same predictions as before; they are slightly worse. My model is a simple model:

model = Sequential()

model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))

model.add(Dense(100, activation='relu'))

model.add(Dense(n_outputs))

model.compile(loss='mse', optimizer='adam')

The checkpoint is:

checkpoint = ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min')

callbacks_list = [checkpoint]

model.fit(train_x, train_y, validation_split=0.15, epochs=epochs, batch_size=batch_size, callbacks=callbacks_list, verbose=verbose)

Then i rebuild the same structure and then call:

model.load_weights("filename")

and the predictions are a little different. Thanks in advance.

That is surprising, the model should make identical predictions before and after a save. Anything else would be a bug either in your code or in Keras. Try narrowing it down, here are some ideas:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

Could it be that when I reload the weights I get the best result saved to file, while in the first run the weights at the end of model.fit() are different?

The only way I can reload exactly the model I trained is without using ModelCheckpoint(filename, monitor='val_loss', verbose=1, save_best_only=True, mode='min'). I create the model, train it and then use model.save(file). Then with model.load(file) I get the same result. But here is a question: is it the best model? Because now I don’t specify monitor='val_loss', mode='min', save_best_only, etc. I simply fit the model and then save it.

I see, I believe I cover this problem in this post on early stopping:

https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

Hi, Jason,

Very helpful tutorial. Can I save the model after each mini-batch instead of each epoch?

Yes, you could achieve that with a custom callback.

Thanks for the reply.

Is there a template for what this custom callback would look like?

Yes, you can learn more here (under “Create a callback”):

https://keras.io/callbacks/
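As a rough illustration of what such a callback could look like, here is a minimal sketch that saves the weights every N training batches. The class name `BatchCheckpoint` and its `every_n_batches` argument are made up for this example. It assumes a Keras version that calls `on_train_batch_end` (2.3 or later, as far as I know; older releases used `on_batch_end`), and it has not been run against every release.

```python
try:
    from keras.callbacks import Callback
except ImportError:
    # Fallback so the sketch can be read and exercised without Keras installed.
    Callback = object


class BatchCheckpoint(Callback):
    """Save the model weights every `every_n_batches` training batches."""

    def __init__(self, filepath, every_n_batches=100):
        super().__init__()
        self.filepath = filepath
        self.every_n_batches = every_n_batches
        self.batches_seen = 0

    def on_train_batch_end(self, batch, logs=None):
        # `batch` restarts at 0 every epoch, so keep a running total instead.
        self.batches_seen += 1
        if self.batches_seen % self.every_n_batches == 0:
            # self.model is attached by Keras before training starts.
            self.model.save_weights(self.filepath)
```

It would be passed to training like any other callback, for example `model.fit(X, Y, callbacks=[BatchCheckpoint("batch.weights.h5", every_n_batches=50)])`. Saving after every single batch (`every_n_batches=1`) is possible but will slow training considerably, because each save writes the full set of weights to disk.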

Thanks. I followed the instructions and created a custom callback to:

(1) at the end of each epoch, save the whole model to a file (model_pre.h5), then post-process the file and dump it to model_post.h5

(2) at the beginning of the next epoch, load the model from the post-processed file model_post.h5

The main part of the code implementing (1) and (2) is below:

But it seems like the model was not correctly updated using the ‘./model_post.h5’ file when I check the parameter values using HDF5View. Could you let me know if I did it in the correct way? Thanks.

-Jinwen

Wow!

I’m surprised that it is possible to load the model prior to each batch or epoch. Perhaps this is not working as expected?

Hi Jason, thanks for this great tutorial!

In a regression task, among ‘val_loss’ and other metrics (like ‘mse’, ‘mape’ and so on) for the monitor argument, which one would be more important to finalize the model?

Maybe I’m basically asking about the fundamental difference between loss and metrics..?

I’ve been just guessing that I might have to choose a metric not the ‘val_loss’ to see the prediction performance of the final model. Would this be correct?

Loss and metrics can be the same thing or different.

It comes down to what you and project stakeholders value in a final model (metric) and what must be optimized to achieve that (loss).

Hello Jason thanks for the posts.

As a suggestion, could you show us how to checkpoint using XGBoost? I am having difficulty figuring it out on my own.

Thanks for the suggestion. Maybe this post can help as a starting point:

https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

Hello Jason,

Straight as an arrow. You made it so easy to follow. This is really less abstract.

Thank you.

Thanks, I’m happy it helped.

I used this post as the base for a model. I’ve been using precision and recall from the keras_metrics module and can’t seem to get it working as the monitored metric for the checkpoint function. Just tells me:

RuntimeWarning: Can save best model only when val_recall: available: skipping.

But after each epoch the val_recall is calculated and displayed so I’m not quite sure what is wrong?

Is “monitor” set to your custom metric?

Dear Jason,

I am confused about Keras callback. You use save_best_only=True to save the weights, but according to Keras this is the setting for saving the latest best model. For weights it should be save_weights_only=True. By default, save_weights_only is set to false. Am I missing something?

Also, in one of my Jupyter notebooks, my weights are not saved at all, but in another they are, although the code is the same. It is weird.

Dear Jason,

Now I understand. When you load “weights.best.hdf5”, you are actually loading the full model including the weights. So there is no need to create the model before.

In section “Loading a Check-Pointed Neural Network Model” I skipped lines 11-14 and then:

from keras.models import load_model

model = load_model("weights.best.hdf5")

and I got the same result.

Nice work.

Thanks.

Maybe you should update the text, since now you write “The checkpoint only includes the model weights. It assumes you know the network structure.”, which is incorrect. The checkpoint includes the full model.

If only the weights are saved it should be ‘save_weights_only=True’.

What is confusing is that it is possible to treat the full model files as weights, using model.load_weights. Personally I think it should throw a warning.

Thanks.

Perhaps try running the example from the command line?

Thanks for a great tutorial!

You’re welcome, I’m glad it helped.

Hello Jason,

Great tutorial again!

I wanted to ask you how to monitor my val_top_k_categorical_accuracy instead of val_acc ?

I can’t find anything about it on internet.

I tried different things, such as changing the position of my metrics

`metrics=['top_k_categorical_accuracy', 'accuracy']`

and trying `monitor="top_k_categorical_accuracy"`

… but nothing works.

Interesting. If you add the metric to the list of metrics does it appear in the history dict?

If so, you can use that name.
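A quick way to apply this check is to compare the intended `monitor` string against the keys of the history dict before setting up the checkpoint. The helper below is only a sketch: `check_monitor` is a made-up name, and the `history` dict is hypothetical sample data standing in for `history.history` as returned by `fit()`.

```python
import difflib


def check_monitor(monitor, history):
    """Return `monitor` if it is a key of the history dict, else raise with hints."""
    if monitor in history:
        return monitor
    hints = difflib.get_close_matches(monitor, list(history), n=3, cutoff=0.5)
    raise KeyError(
        "%r is not reported during training; closest names: %s" % (monitor, hints)
    )


# Hypothetical contents of history.history after fitting with a validation set.
history = {
    "loss": [0.9, 0.7],
    "accuracy": [0.6, 0.7],
    "val_loss": [1.0, 0.8],
    "val_accuracy": [0.55, 0.65],
}

print(check_monitor("val_accuracy", history))
```

If the name passes this check and the checkpoint still skips saving, the mismatch is probably somewhere else, for example in the Keras version or in how the metric is registered.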

Good idea but no, I get “val_top_k_categorical_accuracy” in the history dict but it doesn’t work for the checkpoint monitor.

That is surprising. Perhaps try posting code and issue to the Keras user group:

https://machinelearningmastery.com/get-help-with-keras/

Anyway, I solved the problem by defining my own callback, which saves a model when I observe any “val_*” improvement.

Nicely done, I would have done the same 🙂

Each time i search for the answer of a question, your blog solves it ! Great work, thanks so much!

Thanks!

As usual, great tutorial

Thanks!

I can’t save my model using your instructions. It does not report an error, but nothing is saved.

and thanks for your help

Sorry to hear that. Perhaps try reducing your example to the simplest possible code example?

Perhaps try posting your code and issue to stackoverflow?

How can I load weights when I use a custom metric?

As follows:

Thank you, Jason.

But what about the case of load_weights:

`model.load_weights("bestweights.hdf5")`

Unfortunately, this function cannot see the custom metric.

Yes, you must define the metric and load the weights in the same scope.

Sorry, but I do not understand what you mean.

For a certain problem, I try to solve it using a CNN model with 5-fold cross validation. I measure its performance as the average loss and the average accuracy over the 5 folds. (I stop training when I reach the minimum validation loss.)

1- Is this the right way to calculate a model’s performance?

Then I evaluate the same problem on the same dataset but using an LSTM.

2- Now, I want to select the best model. How?

3- Is the best model the one with the lowest loss, the highest accuracy, or something else?

It is reasonable for an MLP. Cross validation is not reasonable for LSTMs fit on time series data; there you must use walk-forward validation:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

My problem is not time series. It is NLP text similarity.

Can cross validation for LSTM in this case be reasonable?

Yes, it may be.

My problem is regression not classification?

What about choosing the best model (CNN, LSTM, …) applied on the same problem and the same dataset?

I apply more than one algorithm(CNN, LSTM) to solve my problem (text similarity). How can I select the best algorithm?

Is the best algorithm the one with the lowest validation loss?

It is common to choose a model based both on complexity (minimized) and on skill (optimizing a domain-specific metric).

Loss is a good proxy for the domain specific metric, but hard to communicate to subject matter experts/stakeholders.

Hi,

If I want to checkpoint both the best model and the model at the last epoch, in case training is halted in the middle for some other reason, how do I do that?

Perhaps configure a different ModelCheckpoint instance for each case?
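To make the suggestion concrete, here is a sketch of two checkpoint configurations: one that keeps only the best model and one that is overwritten at the end of every epoch. The filenames are placeholders, `val_accuracy` assumes Keras 2.3 or higher, and Keras is imported lazily inside `make_callbacks()` purely so the two configurations can be inspected without Keras installed.

```python
# Keyword arguments for two independent checkpoints (filenames are placeholders).
BEST_CHECKPOINT = dict(
    filepath="model.best.h5",
    monitor="val_accuracy",
    mode="max",
    save_best_only=True,   # overwrite only when val_accuracy improves
    verbose=1,
)
LAST_CHECKPOINT = dict(
    filepath="model.last.h5",
    save_best_only=False,  # overwrite at the end of every epoch
    verbose=0,
)


def make_callbacks():
    """Build both callbacks; Keras is only needed when this is called."""
    from keras.callbacks import ModelCheckpoint
    return [ModelCheckpoint(**BEST_CHECKPOINT), ModelCheckpoint(**LAST_CHECKPOINT)]
```

The list returned by `make_callbacks()` would then be passed as `callbacks=` to `fit()`. If training is interrupted, `model.last.h5` holds the state at the end of the last completed epoch, while `model.best.h5` holds the best epoch seen so far.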

Dear Jason,

We are using the following statements to save the model.

model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])

filepath = "model_lstm_10M_{epoch:02d}.h5"

checkpoint = ModelCheckpoint(filepath, period=1000, verbose=1, save_best_only=False)

# tbCallBack = TensorBoard(log_dir='./log', histogram_freq=0, write_graph=True, write_images=True)

callbacks_list = [checkpoint]

# fit model

model.fit(train_data, target, epochs=10000, batch_size=4, callbacks=callbacks_list, verbose=2)

model.save('model_10M_lstm_100.h5')

We stop training at 2300 epochs. When we start again, it starts from epoch 1, but I want to continue from the previous epoch (2300). What changes do we have to make to achieve this?

Thanks in advance

By default, training will always start at epoch 0.

If you load the model and start training again, it will start with the weights from the end of the last run. Only the epoch number will reset, not the weights.

If you know how many epochs were completed, you can subtract that from the number of epochs you wish to use in the second run.
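For the setup in the question above, the number of completed epochs can be read back from the checkpoint filenames, since the `{epoch:02d}` placeholder is filled in at save time. This is a sketch under assumptions: the directory listing is invented, and the regular expression simply mirrors the `model_lstm_10M_{epoch:02d}.h5` pattern from the question.

```python
import re

# Mirrors the filepath pattern "model_lstm_10M_{epoch:02d}.h5" from the question.
CHECKPOINT_PATTERN = re.compile(r"model_lstm_10M_(\d+)\.h5$")


def latest_checkpoint(filenames):
    """Return (epoch, filename) of the checkpoint with the highest epoch number."""
    found = []
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match:
            found.append((int(match.group(1)), name))
    if not found:
        raise FileNotFoundError("no checkpoint files matched the pattern")
    return max(found)


# Hypothetical directory listing after stopping at epoch 2300 with period=1000.
files = ["model_lstm_10M_1000.h5", "model_lstm_10M_2000.h5", "notes.txt"]
completed, path = latest_checkpoint(files)
remaining = 10000 - completed
print(path, completed, remaining)
```

The second run would then load `path` with `load_model()` and call `fit()` with `epochs=remaining`. Note that the work done between the last checkpoint (epoch 2000) and the interruption (epoch 2300) is lost. As an alternative to subtracting, `fit()` also accepts an `initial_epoch` argument; passing `initial_epoch=completed` together with `epochs=10000` should keep the epoch numbering continuous.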

Hi, Jason. Your course is pretty good, I have learned a lot. But there is a little bug; I guess the API has been updated. The code:

checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')

The argument ‘monitor’ should be ‘val_acc’ not ‘val_accuracy’

The examples assume Keras 2.3 or higher where you must use val_accuracy.

For older versions of Keras, you must use val_acc.
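If the same script has to run on both older and newer installations, the name can be chosen from the installed version. The helper below is a small sketch of that rule only; `accuracy_monitor_name` is a made-up name, and it assumes standalone Keras with the model compiled using `metrics=['accuracy']`.

```python
def accuracy_monitor_name(keras_version, validation=True):
    """Pick the accuracy key reported by standalone Keras of the given version."""
    major, minor = (int(part) for part in keras_version.split(".")[:2])
    name = "accuracy" if (major, minor) >= (2, 3) else "acc"
    return "val_" + name if validation else name


print(accuracy_monitor_name("2.2.5"))
print(accuracy_monitor_name("2.3.0"))
```

In a real script the version string would come from `keras.__version__`, and the result would be passed as the `monitor` argument of ModelCheckpoint.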

Hi Jason, do you know of a way to also save the epoch number of the last checkpointed model?

What I want to do is, after training, evaluate the difference between Train and Validation loss of the training epoch corresponding to the “best checkpointed model” to have a reference of the error gap at that exact point.

Right now I’m calculating that “error gap” from the results of the model validation: train loss vs test loss.

Thank you in advance!

You can save it into the filename of the model
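As a sketch of how that could work for the error gap described above: the epoch number written into the filename can be parsed back out and used to index the per-epoch losses. The loss values here are invented sample data standing in for `history.history`, and the filename template follows the style used in the tutorial.

```python
import re

# Same placeholder style as the tutorial's checkpoint filenames.
template = "weights-improvement-{epoch:02d}-{val_loss:.4f}.hdf5"

# Hypothetical per-epoch losses, as found in history.history after fit().
history = {
    "loss": [0.90, 0.60, 0.45, 0.38, 0.30],
    "val_loss": [0.95, 0.70, 0.52, 0.55, 0.60],
}

# Name the checkpoint would get at the best epoch (epochs are numbered from 1).
val_losses = history["val_loss"]
best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
filename = template.format(epoch=best_index + 1, val_loss=val_losses[best_index])

# Recover the epoch from the filename and look up the gap at that epoch.
epoch = int(re.search(r"weights-improvement-(\d+)-", filename).group(1))
gap = history["val_loss"][epoch - 1] - history["loss"][epoch - 1]
print(filename, epoch, round(gap, 4))
```

Keep in mind that the training loss Keras reports is averaged over the batches of the epoch while the weights are still changing, so this gap is an approximation rather than an exact evaluation of the saved weights.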

Great clear post

Isn’t ModelCheckpoint saving the full model with weights?

I think “save_weights_only” is by default false

Thanks.

Yes, it looks like it will save the entire model:

https://keras.io/callbacks/#modelcheckpoint

I am trying to save my model after every epoch. I am monitoring accuracy. After the first epoch, I get an OS error stating “Unable to open file: name = '/logs/weights-improvement-01-0.55.hdf5', errno = 2, error message = 'No such file or directory', flags = 13, o_flags = 242)”. Kindly let me know.

Perhaps try saving to a different location, e.g. /tmp/
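Error number 2 means that some part of the path does not exist. A likely cause here (an assumption, since the full code is not shown) is that the `/logs` directory at the filesystem root was never created, and in the Keras versions covered by this tutorial the HDF5 writer does not appear to create missing directories. One option is to create the directory before training, as in this sketch, which uses a temporary base directory as a stand-in for wherever the checkpoints should go.

```python
import os
import tempfile

# Hypothetical base directory; "/logs" at the filesystem root often does not exist.
base_dir = tempfile.mkdtemp()
filepath = os.path.join(
    base_dir, "logs", "weights-improvement-{epoch:02d}-{accuracy:.2f}.hdf5"
)

# Create the parent directory up front; exist_ok avoids an error on reruns.
checkpoint_dir = os.path.dirname(filepath)
os.makedirs(checkpoint_dir, exist_ok=True)

print(os.path.isdir(checkpoint_dir))
```

The resulting `filepath` can then be passed to ModelCheckpoint as usual. A relative path such as `logs/...` (without the leading slash) may also be what was intended.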

Hey. Great blog. I wanted to ask how I can do something (e.g. save an image) when an epoch is the best so far.

Thanks!

You could create your own callback, and perform any action you like on any condition you like.

Hi Jason! Thank you very much for the tutorial!

I have a question: if we select the model weights that get the best performance on the validation set (monitor = val_accuracy, mode = ‘max’), wouldn’t that give a somewhat optimistic result?

On the other hand, will the checkpoints always be chosen based on the validation set?

It could be if the validation dataset is small or not representative.

Hi Jason, and thank you for your contribution.

I just need to clarify something.

After designing a DNN, we

1. compile()

2. fit()

3. evaluate()

If the model at the last epoch is not the one with the best weights, why do we not always use a callback to save the best model and evaluate with those weights:

1. compile()

2. fit() with callback to save the best weights

3. load best weights

4. compile()

5. evaluate()

Sounds right.

Hi @Jason, during training does the model improve, or only the weights?

The weights are the model. Training changes the weights.