How to Diagnose Overfitting and Underfitting of LSTM Models

It can be difficult to determine whether your Long Short-Term Memory model is performing well on your sequence prediction problem.

You may be achieving a good model skill score, but it is important to know whether your model is a good fit for your data, or whether it is underfit or overfit and could do better with a different configuration.

In this tutorial, you will discover how you can diagnose the fit of your LSTM model on your sequence prediction problem.

After completing this tutorial, you will know:

  • How to gather and plot training history of LSTM models.
  • How to diagnose an underfit, good fit, and overfit model.
  • How to develop more robust diagnostics by averaging multiple model runs.

Let’s get started.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. Training History in Keras
  2. Diagnostic Plots
  3. Underfit Example
  4. Good Fit Example
  5. Overfit Example
  6. Multiple Runs Example

1. Training History in Keras

You can learn a lot about the behavior of your model by reviewing its performance over time.

LSTM models are trained by calling the fit() function. This function returns a History object that contains a trace of the loss and any other metrics specified during the compilation of the model. These scores are recorded at the end of each epoch.

For example, if your model was compiled to optimize the log loss (binary_crossentropy) and measure accuracy each epoch, then the log loss and accuracy will be calculated and recorded in the history trace for each training epoch.

Each score is accessed by a key in the history object returned from calling fit(). By default, the loss optimized when fitting the model is keyed as “loss” and accuracy as “acc” (more recent versions of Keras use “accuracy”).

Keras also allows you to specify a separate validation dataset while fitting your model that can also be evaluated using the same loss and metrics.

This can be done by setting the validation_split argument on fit() to use a portion of the training data as a validation dataset.

This can also be done by setting the validation_data argument and passing a tuple of X and y datasets.

The metrics evaluated on the validation dataset are keyed using the same names, with a “val_” prefix.
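
As a rough sketch of how this fits together (the tiny contrived dataset, layer sizes, and epoch count below are illustrative assumptions, not a recommended configuration), the snippet below compiles a model to optimize log loss with an accuracy metric, holds back part of the training data for validation, and prints the keys recorded in the history:

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense

# contrived binary classification problem: 10 sequences of 5 timesteps, 1 feature
X = array([[(i + j) / 20.0 for j in range(5)] for i in range(10)]).reshape((10, 5, 1))
y = array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# define and compile the model to optimize log loss and track accuracy
model = Sequential()
model.add(LSTM(5, input_shape=(5, 1)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# hold back 30% of the training data as a validation set
history = model.fit(X, y, epochs=10, validation_split=0.3, verbose=0)

# keys include the loss and each metric, plus 'val_' prefixed copies for the validation set
# (older Keras versions report 'acc'/'val_acc'; newer versions report 'accuracy'/'val_accuracy')
print(history.history.keys())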

2. Diagnostic Plots

The training history of your LSTM models can be used to diagnose the behavior of your model.

You can plot the performance of your model using the Matplotlib library. For example, you can plot training loss vs validation loss as follows:
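
A minimal sketch of such a plot, assuming a history object like the one returned by fit() in the previous example:

from matplotlib import pyplot

# plot the loss recorded on the training and validation sets each epoch
pyplot.plot(history.history['loss'])
pyplot.plot(history.history['val_loss'])
pyplot.title('model train vs validation loss')
pyplot.ylabel('loss')
pyplot.xlabel('epoch')
pyplot.legend(['train', 'validation'], loc='upper right')
pyplot.show()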

Creating and reviewing these plots can help to inform you about possible new configurations to try in order to get better performance from your model.

Next, we will look at some examples. We will consider model skill on the train and validation sets in terms of loss that is minimized. You can use any metric that is meaningful on your problem.

3. Underfit Example

An underfit model is one that has not adequately learned the training dataset and could still be improved, either by training for longer or by increasing the capacity of the model.

One form of this can be diagnosed from a plot where the validation loss sits above the training loss and is still trending downward at the end of the run, suggesting that further improvement is possible.

A small contrived example of an underfit LSTM model is provided below.
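
A sketch of such an example is given below. The contrived next-value dataset, the hidden layer of 10 memory cells, and the 100-epoch budget are assumptions for illustration; the model is deliberately stopped while the loss curves are still falling, and the exact shape of the curves will vary from run to run.

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense
from matplotlib import pyplot

# contrived problem: predict the next value in the series 0.0, 0.1, ..., 1.0
seq = array([[i / 10.0, (i + 1) / 10.0] for i in range(10)])
X, y = seq[:, 0].reshape((10, 1, 1)), seq[:, 1]
trainX, trainy = X[:5], y[:5]
valX, valy = X[5:], y[5:]

# define the model
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

# deliberately stop training while the loss is still improving
history = model.fit(trainX, trainy, epochs=100, validation_data=(valX, valy), shuffle=False, verbose=0)

# plot train and validation loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()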

Running this example produces a plot of train and validation loss showing the characteristic of an underfit model. In this case, performance may be improved by increasing the number of training epochs.

Diagnostic Line Plot Showing an Underfit Model

Alternately, a model may be underfit if performance on the training set is better than the validation set and performance has leveled off.

Below is an example of an underfit model with insufficient memory cells.
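
A sketch of this variation is shown below, reusing the same contrived data as the previous example but shrinking the hidden layer to a single memory cell and training for longer; again, the specific layer size and epoch count are assumptions for illustration.

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense
from matplotlib import pyplot

# same contrived next-value problem as before
seq = array([[i / 10.0, (i + 1) / 10.0] for i in range(10)])
X, y = seq[:, 0].reshape((10, 1, 1)), seq[:, 1]
trainX, trainy = X[:5], y[:5]
valX, valy = X[5:], y[5:]

# a severely under-provisioned model: a single memory cell
model = Sequential()
model.add(LSTM(1, input_shape=(1, 1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

# train long enough for the loss to level off at a poor value
history = model.fit(trainX, trainy, epochs=300, validation_data=(valX, valy), shuffle=False, verbose=0)

# plot train and validation loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()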

Running this example shows the characteristic of an underfit model that appears under-provisioned.

In this case, performance may be improved by increasing the capacity of the model, such as the number of memory cells in a hidden layer or number of hidden layers.

Diagnostic Line Plot Showing an Underfit Model via Status

4. Good Fit Example

A good fit is a case where the performance of the model is good on both the train and validation sets.

This can be diagnosed from a plot where the train and validation loss decrease and stabilize around the same point.

The small example below demonstrates an LSTM model with a good fit.
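
A sketch of such a model is given below, again on the contrived next-value problem, with enough memory cells and enough epochs for both loss curves to decrease and flatten out together; the layer size of 10 cells and the 800-epoch budget are illustrative assumptions.

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense
from matplotlib import pyplot

# contrived problem: predict the next value in the series 0.0, 0.1, ..., 1.0
seq = array([[i / 10.0, (i + 1) / 10.0] for i in range(10)])
X, y = seq[:, 0].reshape((10, 1, 1)), seq[:, 1]
trainX, trainy = X[:5], y[:5]
valX, valy = X[5:], y[5:]

# a model with sufficient capacity, trained until the loss stabilizes
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')
history = model.fit(trainX, trainy, epochs=800, validation_data=(valX, valy), shuffle=False, verbose=0)

# plot train and validation loss
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='validation')
pyplot.legend()
pyplot.show()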

Running the example creates a line plot showing the train and validation loss meeting.

Ideally, we would like to see model performance like this if possible, although this may not be possible on challenging problems with a lot of data.

Diagnostic Line Plot Showing a Good Fit for a Model

5. Overfit Example

An overfit model is one where performance on the train set is good and continues to improve, whereas performance on the validation set improves to a point and then begins to degrade.

This can be diagnosed from a plot where the train loss slopes down and the validation loss slopes down, hits an inflection point, and starts to slope up again.

The example below demonstrates an overfit LSTM model.
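
One way to sketch this on the contrived problem is to train a model for far too many epochs on very little data, then plot only the later portion of the history where the divergence is easier to see. The model size, the 1200-epoch budget, and the decision to plot from epoch 500 onward are assumptions for illustration; whether and when the validation loss turns upward will vary by run.

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense
from matplotlib import pyplot

# same contrived next-value problem, with very little data
seq = array([[i / 10.0, (i + 1) / 10.0] for i in range(10)])
X, y = seq[:, 0].reshape((10, 1, 1)), seq[:, 1]
trainX, trainy = X[:5], y[:5]
valX, valy = X[5:], y[5:]

# define the model
model = Sequential()
model.add(LSTM(10, input_shape=(1, 1)))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='adam')

# train for far too many epochs
history = model.fit(trainX, trainy, epochs=1200, validation_data=(valX, valy), shuffle=False, verbose=0)

# plot the later part of the history, where the divergence is easier to see
pyplot.plot(history.history['loss'][500:], label='train')
pyplot.plot(history.history['val_loss'][500:], label='validation')
pyplot.legend()
pyplot.show()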

Running this example creates a plot showing the characteristic inflection point in validation loss of an overfit model.

This may be a sign of too many training epochs.

In this case, the model training could be stopped at the inflection point. Alternately, the number of training examples could be increased.

Diagnostic Line Plot Showing an Overfit Model

6. Multiple Runs Example

LSTM models are trained using stochastic procedures, such as random weight initialization, meaning that you will get a somewhat different diagnostic plot each time the same experiment is run.

It can be useful to repeat the diagnostic run multiple times (e.g. 5, 10, or 30). The train and validation traces from each run can then be plotted to give a more robust idea of the behavior of the model over time.

The example below runs the same experiment a number of times before plotting the trace of train and validation loss for each run.
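
A sketch of the repeated experiment is given below, reusing the underfit configuration from earlier and overlaying the loss traces from each run; the number of runs, the colours, and the 100-epoch budget are illustrative assumptions.

from numpy import array
from keras.models import Sequential
from keras.layers import LSTM, Dense
from matplotlib import pyplot

# contrived problem: predict the next value in the series 0.0, 0.1, ..., 1.0
seq = array([[i / 10.0, (i + 1) / 10.0] for i in range(10)])
X, y = seq[:, 0].reshape((10, 1, 1)), seq[:, 1]
trainX, trainy = X[:5], y[:5]
valX, valy = X[5:], y[5:]

# repeat the experiment and overlay the loss traces from each run
num_runs = 5
for _ in range(num_runs):
    # define and fit a fresh model each run
    model = Sequential()
    model.add(LSTM(10, input_shape=(1, 1)))
    model.add(Dense(1, activation='linear'))
    model.compile(loss='mse', optimizer='adam')
    history = model.fit(trainX, trainy, epochs=100, validation_data=(valX, valy), shuffle=False, verbose=0)
    # train loss in blue, validation loss in orange
    pyplot.plot(history.history['loss'], color='blue')
    pyplot.plot(history.history['val_loss'], color='orange')
pyplot.show()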

In the resulting plot, we can see that the general trend of underfitting holds across all 5 runs, which makes a stronger case for increasing the number of training epochs.

Diagnostic Line Plot Showing Multiple Runs for a Model

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to diagnose the fit of your LSTM model on your sequence prediction problem.

Specifically, you learned:

  • How to gather and plot training history of LSTM models.
  • How to diagnose an underfit, good fit, and overfit model.
  • How to develop more robust diagnostics by averaging multiple model runs.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



38 Responses to How to Diagnose Overfitting and Underfitting of LSTM Models

  1. Shiraz Hazrat September 1, 2017 at 6:26 am #

    Very useful insight. Thanks for sharing!

  2. Jon September 1, 2017 at 8:36 am #

    Some good info thanks. Are you sure your x axis units are correct though?

    • Jason Brownlee September 1, 2017 at 8:42 am #

      Yes, in one case (Overfit Example) I truncated the data for readability.

  3. Irati September 1, 2017 at 5:26 pm #

    Hi Jason!

    And what if the loss fluctuates? I get up and down peaks after the second epoch and even with 25 epochs the loss in the validation set is greater than 0.5

    Could you give me some clue of what is happening?

    Thanks!

    • Jason Brownlee September 2, 2017 at 6:04 am #

      Good question.

      I would recommend looking at the trends in loss over 10s to 100s of epochs, not over very short periods.

  4. Eshika Roy September 1, 2017 at 10:31 pm #

    Thank you for posting this informative blog on how to diagnose LSTM models. I will definitely try this nifty trick. Please keep on sharing more helpful tips and suggestions in the upcoming posts.

    • Jason Brownlee September 2, 2017 at 6:08 am #

      Thanks. I hope it helps.

      What other types of tips would you like me to write about?

  5. Louis September 2, 2017 at 12:57 am #

    Hi, James! Any tips on how to detect overfitting without a validation set (when I have Dropout layers)?
    (I am beginner at deep learning)

    • Jason Brownlee September 2, 2017 at 6:14 am #

      The idea of overfitting the training set only has meaning in the context of another dataset, such as a test or validation set.

      Also, my name is Jason, not James.

      • Long October 17, 2017 at 9:41 am #

        Hi Jason,

        Could a test dataset be used to detect overfitting or underfitting to a training dataset without a validation dataset? How would that be done? Is it different from the method that uses a validation dataset?

        Thanks a lot. BTW, your lessons are a great benefit and very helpful for studying machine learning.

        • Jason Brownlee October 17, 2017 at 4:04 pm #

          Perhaps, but one data point (as opposed to an evaluation each epoch) might not be sufficient to make claims/diagnose model behavior.

  6. Andrei September 2, 2017 at 11:28 pm #

    Hi Jason,

    Hyper-parameter tuning for LSTMs is something really useful, especially in the context of time series. Looking forward to a blog post on this topic.

    Best,
    Andrei

    • Jason Brownlee September 3, 2017 at 5:44 am #

      What would you like to see exactly? What parameters?

      I have a few posts on tuning LSTMs.

  7. Amin September 2, 2017 at 11:42 pm #

    Thank you for your post Jason.
    There is also another case, when the val loss goes below the training loss! This case indicates a highly non-stationary time series with growing mean (or variance), wherein the network focuses on the meaty part of the signal which happens to fall in the val set.

  8. Andrey Sharypov September 3, 2017 at 4:47 am #

    Hi Jason!
    Thank you for this very useful way of diagnosis of LSTM. I’m working on human activity recognition. Now my plot looks like this https://imgur.com/a/55p9b. What can you advise?
    I’m trying to classify fitness exercises.

    • Jason Brownlee September 3, 2017 at 5:49 am #

      Great work!

      Maybe try early stopping around epoch 10?

      • Andrey Sharypov September 3, 2017 at 4:30 pm #

        In this case my accuracy will be:
        Train on 21608 samples, validate on 5403 samples

        21608/21608 [==============================] – 802s – loss: 0.2115 – acc: 0.9304 – val_loss: 0.1949 – val_acc: 0.9337
        Epoch 7/50
        21608/21608 [==============================] – 849s – loss: 0.1803 – acc: 0.9424 – val_loss: 0.2132 – val_acc: 0.9249
        Epoch 8/50
        21608/21608 [==============================] – 786s – loss: 0.1632 – acc: 0.9473 – val_loss: 0.2222 – val_acc: 0.9297
        Epoch 9/50
        21608/21608 [==============================] – 852s – loss: 0.1405 – acc: 0.9558 – val_loss: 0.1563 – val_acc: 0.9460
        Epoch 10/50
        21608/21608 [==============================] – 799s – loss: 0.1267 – acc: 0.9590 – val_loss: 0.1453 – val_acc: 0.9606
        Epoch 11/50
        21608/21608 [==============================] – 805s – loss: 0.1147 – acc: 0.9632 – val_loss: 0.1490 – val_acc: 0.9567
        Epoch 12/50
        21608/21608 [==============================] – 788s – loss: 0.1069 – acc: 0.9645 – val_loss: 0.1176 – val_acc: 0.9626
        Epoch 13/50
        21608/21608 [==============================] – 838s – loss: 0.1028 – acc: 0.9667 – val_loss: 0.1279 – val_acc: 0.9578
        Epoch 14/50
        21608/21608 [==============================] – 808s – loss: 0.0889 – acc: 0.9707 – val_loss: 0.1183 – val_acc: 0.9648
        Epoch 15/50
        21608/21608 [==============================] – 785s – loss: 0.0843 – acc: 0.9729 – val_loss: 0.1000 – val_acc: 0.9706

        After 50 epochs accuracy:
        Epoch 50/50
        21608/21608 [==============================] – 793s – loss: 0.0177 – acc: 0.9950 – val_loss: 0.0772 – val_acc: 0.9832

        Also I didn’t use dropout and regularization.
        One of my class (rest) have much more samples than other (exercises) https://imgur.com/a/UxEPr.
        Confusion matrix – https://imgur.com/a/LYxUu.

        I use the following model:
        model = Sequential()
        model.add(Bidirectional(LSTM(128), input_shape=(None, 3)))
        model.add(Dense(9, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

        mcp = ModelCheckpoint('best_model_50_epochs.hd5', monitor='val_acc',
        save_best_only=True, save_weights_only=False)
        history = model.fit(X_train,
        y_train,
        batch_size=32,
        epochs=50,
        validation_split=0.2,
        callbacks=[mcp])

        And I tried dropout:
        model.add(Bidirectional(LSTM(128, dropout=0.5, recurrent_dropout=0.5), input_shape=(None, 3)))
        train_loss vs val_loss - https://imgur.com/a/k5TVU
        accuracy at 50 epochs:
        loss: 0.2269 - acc: 0.9244 - val_loss: 0.1574 - val_acc: 0.9558

      • Andrey Sharypov September 7, 2017 at 6:34 pm #

        Hi Jason! What can you advise to increase accuracy when I have multiple classes and one class takes 50% of the samples?
        (I showed my model above)
        Thank you!

  9. Andrei September 3, 2017 at 6:35 pm #

    Activation, batch_size (I noticed it correlates with the test size but not always), loss function, number of hidden layers, number of memory cells, optimizer type, input series history length (http://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/)

    For all of these parameters I do trial/error testing. Would be nice to know if there is a more systematic way to find the best parameters.

  10. Andrei September 4, 2017 at 4:38 pm #

    Thank you, Jason! Great work with your blog!

  11. Leonard September 4, 2017 at 6:16 pm #

    Hi Jason,

    Great post for helping us know how to diagnose our programs! I have been playing with RNNs for a while now, and I started by reading your previous posts. They were great for beginners to start off with!

    Now, as I want to move to a more advanced level, I hope I could get your insights/summary on more interesting recent/state-of-the-art models like WaveNet, Deep Speech, using Attention with Keras, etc.

    Thanks!

  12. Mazen September 9, 2017 at 12:15 am #

    Hi Jason,
    thank you for this very useful and clear post.
    I have a question.
    When we set validation_data in fit(), is it just to see the behaviour of the model when fit() is done (this is what I guess!), or does Keras use this validation_data somehow while optimizing from one epoch to another to best fit such validation_data? (This is what I hope :-))
    I usually prepare three disjoint data sets: training data to train, validation data to optimize the hyper-parameters, and at the end testing data, as out-of-sample data, to test the model. Thus, if Keras optimizes the model based on validation_data, then I don’t have to optimize by myself!

    • Jason Brownlee September 9, 2017 at 11:58 am #

      Validation data is only used to give insight into the skill of the model on unseen data during training.

  13. Long September 11, 2017 at 11:16 pm #

    Hi Jason,

    Why are the accuracy values always 0 when I do regression? The loss values decreased. What is the reason for this? Thank you.

    model.compile(optimizer='adam', loss='mse', metrics=['mean_squared_error', 'accuracy'])

    • Jason Brownlee September 13, 2017 at 12:23 pm #

      We cannot measure accuracy for regression. It is a measure of correct label predictions and there are no labels in regression.

  14. Shabnam October 23, 2017 at 8:15 pm #

    Hello,

    Thank you for your helpful posts.
    Is overfitting always solvable?

    In other words, what are the necessary and sufficient conditions for a dataset to be trainable?
    Maybe we have enough data, but because the data is not trainable, we get overfitting.

  15. Farzad November 7, 2017 at 6:41 am #

    Hi Jason,

    Thanks for your great and useful post.

    I have a kind of weird problem in my LSTM train/validation loss plot. As you can see here [1], the validation loss starts increasing right after the first (or first few) epoch(s) while the training loss decreases constantly and finally becomes zero. I used dropout to deal with this severe overfitting problem; however, in the best case, the validation error remains at the same value as it was in the first epoch. I’ve also tried changing various parameters of the model, but in all configurations I see such an increasing trend in the validation loss. Is this really an overfitting problem, or is something wrong with my implementation or problem framing?

    [1] https://imgur.com/mZXh4lh

    • Jason Brownlee November 7, 2017 at 9:55 am #

      You could explore models with more representational capacity, perhaps more neurons or more layers?

  16. Cliff November 16, 2017 at 1:57 pm #

    Hi, Jason. I’m quite shocked by your posts. I’m facing a hard situation where validation loss >> training loss when using an LSTM, so I googled it. Here https://github.com/karpathy/char-rnn/issues/160 and here https://www.reddit.com/r/MachineLearning/comments/3rmqxd/determing_if_rnn_model_is_underfitting_vs/ they suggested that this can be overfitting. But in your posts, this should be underfitting. I’m confused, can you explain it?
