How to Use Dropout with LSTM Networks for Time Series Forecasting

Long Short-Term Memory (LSTM) models are a type of recurrent neural network capable of learning sequences of observations.

This may make them well suited to time series forecasting.

An issue with LSTMs is that they can easily overfit training data, reducing their predictive skill.

Dropout is a regularization method where input and recurrent connections to LSTM units are probabilistically excluded from activation and weight updates while training a network. This has the effect of reducing overfitting and improving model performance.

In this tutorial, you will discover how to use dropout with LSTM networks and design experiments to test for its effectiveness for time series forecasting.

After completing this tutorial, you will know:

  • How to design a robust test harness for evaluating LSTM networks for time series forecasting.
  • How to design, execute, and interpret the results from using input weight dropout with LSTMs.
  • How to design, execute, and interpret the results from using recurrent weight dropout with LSTMs.

Let’s get started.

How to Use Dropout with LSTM Networks for Time Series Forecasting
Photo by Jonas Bengtsson, some rights reserved.

Tutorial Overview

This tutorial is broken down into 5 parts. They are:

  1. Shampoo Sales Dataset
  2. Experimental Test Harness
  3. Input Dropout
  4. Recurrent Dropout
  5. Review of Results

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Next, let’s take a look at a standard time series forecasting problem that we can use as context for this experiment.

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

You can download and learn more about the dataset here.

The example below loads and creates a plot of the loaded dataset.
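As a minimal sketch of the loading step, the function below reads the series with Pandas. It assumes the dataset has been saved as a CSV file with a header row and two columns named "Month" and "Sales"; the column names and the load_sales() helper are assumptions for illustration, so adjust them to match your copy of the file.

```python
from pandas import read_csv


def load_sales(source):
    """Load the sales series from a CSV path or file-like object.

    Assumes a header row with the columns "Month" and "Sales"
    (an assumption about the file layout, not confirmed here).
    """
    frame = read_csv(source, header=0, index_col=0)
    # Selecting one column of the DataFrame returns a Pandas Series.
    return frame["Sales"]
```

With the file in the working directory, calling head() on the returned Series shows the first 5 rows, and calling plot() on it followed by Matplotlib's pyplot.show() draws the line plot.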

Running the example loads the dataset as a Pandas Series and prints the first 5 rows.

A line plot of the series is then created showing a clear increasing trend.

Line Plot of Shampoo Sales Dataset

Next, we will take a look at the model configuration and test harness used in the experiment.

Experimental Test Harness

This section describes the test harness used in this tutorial.

Data Split

We will split the Shampoo Sales dataset into two parts: a training and a test set.

The first two years of data will be taken for the training dataset and the remaining one year of data will be used for the test set.

Models will be developed using the training dataset and will make predictions on the test dataset.

The persistence forecast (naive forecast) on the test dataset achieves an error of 136.761 monthly shampoo sales. This provides a baseline for acceptable performance on the test set: a useful model should achieve a lower error.
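For reference, a persistence forecast can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used to produce the figure above; the persistence_rmse() name is hypothetical and the error is assumed to be measured as RMSE.

```python
import numpy as np


def persistence_rmse(values, n_test):
    """RMSE of a persistence forecast over the last n_test observations.

    Each test observation is predicted with the observation just before it.
    """
    values = np.asarray(values, dtype=float)
    actual = values[-n_test:]
    predicted = values[-n_test - 1:-1]
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))
```

For this tutorial, values would be the 36 monthly observations and n_test would be 12.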

Model Evaluation

A rolling-forecast scenario will be used, also called walk-forward model validation.

Each time step of the test dataset will be walked one at a time. A model will be used to make a forecast for the time step, then the actual expected value from the test set will be taken and made available to the model for the forecast on the next time step.

This mimics a real-world scenario where new Shampoo Sales observations would be available each month and used in the forecasting of the following month.

This will be simulated by the structure of the train and test datasets.

All forecasts on the test dataset will be collected and an error score calculated to summarize the skill of the model. The root mean squared error (RMSE) will be used as it punishes large errors and results in a score that is in the same units as the forecast data, namely monthly shampoo sales.
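The walk-forward procedure can be sketched as a short loop. In this sketch, forecast is a hypothetical callable standing in for a fitted model, and mean_of_last_three() is only a placeholder forecaster so that the example runs on its own.

```python
import numpy as np


def walk_forward_rmse(train, test, forecast):
    """Forecast each test step in turn, then reveal the true value.

    forecast maps the history observed so far to a one-step prediction.
    """
    history = list(train)
    predictions = []
    for actual in test:
        predictions.append(forecast(history))
        # The real observation becomes available before the next forecast.
        history.append(actual)
    errors = np.asarray(test, dtype=float) - np.asarray(predictions, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))


def mean_of_last_three(history):
    """Placeholder forecaster: the mean of the three most recent values."""
    return float(np.mean(history[-3:]))
```

Squaring the errors before averaging is what makes RMSE penalize large misses more heavily, and taking the square root returns the score to the units of the data.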

Data Preparation

Before we can fit a model to the dataset, we must transform the data.

The following three data transforms are performed on the dataset prior to fitting a model and making a forecast.

  1. Transform the time series data so that it is stationary. Specifically, a lag=1 differencing to remove the increasing trend in the data.
  2. Transform the time series into a supervised learning problem. Specifically, the organization of data into input and output patterns where the observation at the previous time step is used as an input to forecast the observation at the current time step.
  3. Transform the observations to have a specific scale. Specifically, to rescale the data to values between -1 and 1.

These transforms are inverted on forecasts to return them to their original scale before calculating an error score.
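The three transforms and their inverses can be sketched with plain NumPy as follows. This is an illustration under simplifying assumptions, not the tutorial's own helper code: the function names are hypothetical and the scaler is a hand-rolled min-max rescaling.

```python
import numpy as np


def difference(values, lag=1):
    """Remove trend by subtracting the observation lag steps earlier."""
    values = np.asarray(values, dtype=float)
    return values[lag:] - values[:-lag]


def to_supervised(values):
    """Pair each value (output) with the value before it (input)."""
    values = np.asarray(values, dtype=float)
    return np.column_stack([values[:-1], values[1:]])


def fit_scaler(values, low=-1.0, high=1.0):
    """Return (scale, invert) functions fitted to the range of values."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()

    def scale(x):
        x = np.asarray(x, dtype=float)
        return low + (x - v_min) * (high - low) / (v_max - v_min)

    def invert(x):
        x = np.asarray(x, dtype=float)
        return v_min + (x - low) * (v_max - v_min) / (high - low)

    return scale, invert


def invert_difference(previous_observation, predicted_difference):
    """Turn a forecast difference back into a forecast on the raw scale."""
    return previous_observation + predicted_difference
```

A forecast is inverted in the reverse order of the transforms: first undo the scaling, then add the previous raw observation to undo the differencing.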

LSTM Model

We will use a base stateful LSTM model with 1 neuron fit for 1000 epochs.

A batch size of 1 is required as we will be using walk-forward validation and making one-step forecasts for each of the final 12 months of test data.

A batch size of 1 means that the model will be fit using online training (as opposed to batch training or mini-batch training). As a result, it is expected that the model fit will have some variance.

Ideally, more training epochs would be used (such as 1500), but this was truncated to 1000 to keep run times reasonable.

The model will be fit using the efficient ADAM optimization algorithm and the mean squared error loss function.

Experimental Runs

Each experimental scenario will be run 30 times and the RMSE score on the test set will be recorded at the end of each run.
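Repeating a scenario and summarizing the scores could look like the sketch below. Here run_once is a hypothetical callable that fits a fresh model and returns its test RMSE; the summary keys are chosen for illustration.

```python
import numpy as np


def summarize_runs(run_once, n_repeats=30):
    """Call run_once() n_repeats times and summarize the returned scores."""
    scores = np.array([run_once() for _ in range(n_repeats)], dtype=float)
    return {
        "count": int(scores.size),
        "mean": float(scores.mean()),
        # Sample standard deviation (divides by n - 1).
        "std": float(scores.std(ddof=1)),
        "min": float(scores.min()),
        "median": float(np.median(scores)),
        "max": float(scores.max()),
    }
```

Because neural network training is stochastic, comparing the mean and spread over many runs is more reliable than comparing single runs.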

Let’s dive into the experiments.

Baseline LSTM Model

Let’s start off with the baseline LSTM model.

The baseline LSTM model for this problem has the following configuration:

  • Lag inputs: 1
  • Epochs: 1000
  • Units in LSTM hidden layer: 3
  • Batch Size: 4
  • Repeats: 30

The complete code listing is provided below.

This code listing will be used as the basis for all following experiments, with only the changes to this code listing provided in subsequent sections.

Running the experiment prints summary statistics for the test RMSE for all repeats.

We can see that on average this model configuration achieved a test RMSE of about 92 monthly shampoo sales with a standard deviation of 5.

A box and whisker plot is also created from the distribution of test RMSE results and saved to a file.

The plot provides a clear depiction of the spread of the results, highlighting the middle 50% of values (the box) and the median (green line).

Box and Whisker Plot of Baseline Performance on the Shampoo Sales Dataset

Another angle to consider with a network configuration is how it behaves over time as the model is being fit.

We can evaluate the model on the training and test datasets after each training epoch to get an idea of whether the configuration is overfitting or underfitting the problem.

We will use this diagnostic approach on the top result from each set of experiments. A total of 10 repeats of the configuration will be run and the train and test RMSE scores after each training epoch plotted on a line plot.
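The shape of such a diagnostic can be sketched independently of any particular model. In the sketch below, train_one_epoch and predict are hypothetical callables that wrap whatever model is being fit; they are assumptions for illustration and not functions from the tutorial's listing.

```python
import numpy as np


def rmse(actual, predicted):
    """Root mean squared error between two sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def epoch_diagnostics(train_one_epoch, predict, train, test, n_epochs):
    """Record (train RMSE, test RMSE) after every training epoch.

    train_one_epoch() advances the model by one epoch and predict(X)
    returns forecasts for X. train and test are (X, y) pairs.
    """
    history = []
    for _ in range(n_epochs):
        train_one_epoch()
        train_score = rmse(train[1], predict(train[0]))
        test_score = rmse(test[1], predict(test[0]))
        history.append((train_score, test_score))
    return history
```

Plotting the two columns of the returned history against the epoch number gives the train and test traces for one run; repeating this for each of the 10 runs overlays the traces on one plot.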

In this case, we will use this diagnostic on the LSTM fit for 1000 epochs.

The complete diagnostic code listing is provided below.

As with the previous code listing, the code below will be used as the basis for all diagnostics in this tutorial and only the changes to this listing will be provided in subsequent sections.

Running the diagnostic prints the final train and test RMSE for each run. More interesting is the final line plot created.

The line plot shows the train RMSE (blue) and test RMSE (orange) after each training epoch.

In this case, the diagnostic plot shows a steady decrease in train and test RMSE to about 400-500 epochs, after which time it appears some overfitting may be occurring. This is signified by a continued decrease in train RMSE and an increase in test RMSE.

Diagnostic Line Plot of the Baseline Model on the Shampoo Sales Dataset

Input Dropout

Dropout can be applied to the input connection within the LSTM nodes.

A dropout on the input means that for a given probability, the data on the input connection to each LSTM block will be excluded from node activation and weight updates.

In Keras, this is specified with a dropout argument when creating an LSTM layer. The dropout value is a fraction between 0 (no dropout) and 1 (no connection).
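A minimal sketch of how the dropout argument might be passed is shown below. The helper names, the functional-API model layout, and the input shape of one time step with n_lag features are assumptions for illustration; this is not the tutorial's fit_lstm() function.

```python
# Input dropout rates compared in this experiment (0.0 means no dropout).
DROPOUT_RATES = [0.0, 0.2, 0.4, 0.6]


def lstm_layer_kwargs(n_neurons, dropout):
    """Keyword arguments for a stateful LSTM layer with input dropout."""
    if not 0.0 <= dropout <= 1.0:
        raise ValueError("dropout must be a fraction between 0 and 1")
    return {"units": n_neurons, "stateful": True, "dropout": dropout}


def build_model(n_batch, n_neurons, dropout, n_lag=1):
    """Build and compile a small LSTM network (hypothetical helper)."""
    # Imported here so the helper above can be used without Keras installed.
    from keras.layers import Dense, Input, LSTM
    from keras.models import Model

    # Assumed input layout: one time step with n_lag features per sample.
    inputs = Input(batch_shape=(n_batch, 1, n_lag))
    hidden = LSTM(**lstm_layer_kwargs(n_neurons, dropout))(inputs)
    outputs = Dense(1)(hidden)
    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model
```

Looping over DROPOUT_RATES and building one model per rate gives the four configurations compared in this section.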

In this experiment, we will compare no dropout to input dropout rates of 20%, 40% and 60%.

Listed below are the updated fit_lstm(), experiment(), and run() functions for using input dropout with LSTMs.

Running this experiment prints descriptive statistics for each evaluated configuration.

The results suggest that on average an input dropout of 40% results in better performance, but the difference between the average result for a dropout of 20%, 40%, and 60% is very minor. All seemed to outperform no dropout.

A box and whisker plot is also created to compare the distributions of results for each configuration.

The plot shows the spread of results decreasing with the increase of input dropout. The plot also suggests that the input dropout of 20% may have a slightly lower median test RMSE.

The results do encourage the use of some input dropout for the chosen LSTM configuration, perhaps set to 40%.

Box and Whisker Plot of Input Dropout Performance on the Shampoo Sales Dataset

We can review how input dropout of 40% affects the dynamics of the model while being fit to the training data.

The code below summarizes the updates to the fit_lstm() and run() functions compared to the baseline version of the diagnostic script.

Running the updated diagnostic creates a plot of the train and test RMSE performance of the model with input dropout after each training epoch.

The results show a clear addition of bumps to the train and test RMSE traces, which is more pronounced on the test RMSE scores.

We can also see that the symptoms of overfitting have been addressed with test RMSE continuing to go down over the entire 1000 epochs, perhaps suggesting the need for additional training epochs to capitalize on the behavior.

Diagnostic Line Plot of Input Dropout Performance on the Shampoo Sales Dataset

Recurrent Dropout

Dropout can also be applied to the recurrent input signal on the LSTM units.

In Keras, this is achieved by setting the recurrent_dropout argument when defining a LSTM layer.
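To make the difference between the two kinds of dropout concrete, the NumPy sketch below runs a simplified recurrent layer with separate masks on the input signal and on the recurrent signal. It is a plain tanh recurrent cell without gates, used as a stand-in for an LSTM unit. Drawing each mask once per sequence and reusing it at every time step is a choice made for this sketch, not a statement about how any particular Keras version implements recurrent_dropout.

```python
import numpy as np


def recurrent_forward(inputs, w_in, w_rec,
                      input_rate=0.0, recurrent_rate=0.0, rng=None):
    """Run a gate-free tanh recurrent layer with optional dropout masks.

    inputs has shape (time steps, features), w_in has shape
    (features, units) and w_rec has shape (units, units).
    Both rates must be less than 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_features, n_units = w_in.shape
    # Inverted dropout: surviving signals are scaled up by 1 / (1 - rate).
    input_mask = (rng.random(n_features) >= input_rate) / (1.0 - input_rate)
    recurrent_mask = (rng.random(n_units) >= recurrent_rate) / (1.0 - recurrent_rate)
    state = np.zeros(n_units)
    for x in inputs:
        # Input dropout masks x; recurrent dropout masks the previous state.
        state = np.tanh((x * input_mask) @ w_in + (state * recurrent_mask) @ w_rec)
    return state
```

Setting only recurrent_rate leaves the data entering the layer untouched and perturbs only the signal carried from one time step to the next.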

In this experiment, we will compare no dropout to the recurrent dropout rates of 20%, 40%, and 60%.

Listed below are the updated fit_lstm(), experiment(), and run() functions for using recurrent dropout with LSTMs.

Running this experiment prints descriptive statistics for each evaluated configuration.

The average results suggest that a recurrent dropout of 20% or 40% is preferred, but overall the results are not much better than the baseline.

A box and whisker plot is also created to compare the distributions of results for each configuration.

The plot highlights the tighter distribution with a recurrent dropout of 40% compared to 20% and the baseline, perhaps making this configuration preferable. The plot also highlights that the min (best) test RMSE in the distribution appears to have been affected when using recurrent dropout, providing worse performance.

Box and Whisker Plot of Recurrent Dropout Performance on the Shampoo Sales Dataset

We can review how a recurrent dropout of 40% affects the dynamics of the model while being fit to the training data.

The code below summarizes the updates to the fit_lstm() and run() functions compared to the baseline version of the diagnostic script.

Running the updated diagnostic creates a plot of the train and test RMSE performance of the model with recurrent dropout after each training epoch.

The plot shows the addition of bumps on the test RMSE traces, with little effect on the training RMSE traces. The plot also suggests a plateau, if not an increasing trend in test RMSE after about 500 epochs.

At least on this LSTM configuration and on this problem, recurrent dropout may not add much value.

Diagnostic Line Plot of Recurrent Dropout Performance on the Shampoo Sales Dataset

Extensions

This section lists some ideas for further experiments you might like to consider exploring after completing this tutorial.

  • Input Layer Dropout. It may be worth exploring the use of dropout on the input layer and how this impacts the performance and overfitting of the LSTM.
  • Combine Input and Recurrent. It may be worth exploring the combination of both input and recurrent dropout to see if any additional benefit can be provided.
  • Other Regularization Methods. It may be worth exploring other regularization methods with LSTM networks, such as various input, recurrent, and bias weight regularization functions.

Further Reading

For more on dropout with MLP models in Keras, see the post:

Below are some papers on dropout with LSTM networks that you might find useful for further reading.

Summary

In this tutorial, you discovered how to use dropout with LSTMs for time series forecasting.

Specifically, you learned:

  • How to design a robust test harness for evaluating LSTM networks for time series forecasting.
  • How to configure input weight dropout on LSTMs for time series forecasting.
  • How to configure recurrent weight dropout on LSTMs for time series forecasting.

Do you have any questions about using dropout with LSTM networks?
Ask your questions in the comments below and I will do my best to answer.


15 Responses to How to Use Dropout with LSTM Networks for Time Series Forecasting

  1. Kunpeng Zhang April 28, 2017 at 12:59 pm #

    Hi Jason,
    I get an idea.
    # transform data to be supervised learning
    supervised = timeseries_to_supervised(diff_values, n_lag)
    supervised_values = supervised.values[n_lag:,:]
    # split data into train and test-sets
    train, test = supervised_values[0:-12], supervised_values[-12:]
    In fact, the first pair is: [ 0. -120.1]. what about throw away the first pair from the train data?
    Starting from the true data instead of 0.

  2. Andrew April 28, 2017 at 7:46 pm #

    Hi Jason,
    Are there plans to extend these tutorials to the multivariate case
    Many thanks,
    Best,
    Andrew

    • Jason Brownlee April 29, 2017 at 7:23 am #

      Yes, there are a few scheduled on the blog in about a month’s time.

  3. Henry May 29, 2017 at 8:55 pm #

    Hi jason,
    I would like to know how I can pass features learnt by one deep learning model to another in Keras, for instance, features learnt with a convolutional neural network passed to a recurrent neural network before making prediction or classification results. This may involve using two deep learning models to develop projects.
    Best
    Henry

    • Jason Brownlee June 2, 2017 at 12:27 pm #

      Yes, they are just weighted inputs (a functional transform of inputs).

      Save the network and calculate the weighted output of inputs and use as inputs to another network.

  4. Utkana June 26, 2017 at 11:40 pm #

    On running program,
    Error appeared

    Using TensorFlow backend.
    Traceback (most recent call last):
    File "lstm_time_series_keras.py", line 134, in
    run()
    File "lstm_time_series_keras.py", line 126, in run
    results['results'] = experiment(series, n_lag, n_repeats, n_epochs, n_batch, n_neurons)
    File "lstm_time_series_keras.py", line 93, in experiment
    lstm_model = fit_lstm(train_trimmed, n_batch, n_epochs, n_neurons)
    File "lstm_time_series_keras.py", line 72, in fit_lstm
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
    File "/usr/local/lib/python2.7/dist-packages/keras/models.py", line 870, in fit
    initial_epoch=initial_epoch)
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1435, in fit
    batch_size=batch_size)
    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1333, in _standardize_user_data
    str(x[0].shape[0]) + ' samples')
    ValueError: In a stateful network, you should only pass inputs with a number of samples that can be divided by the batch size. Found: 19 samples

    • Jason Brownlee June 27, 2017 at 8:30 am #

      Try changing the batch size so that the number of samples is divisible by it.

  5. al July 10, 2017 at 3:11 pm #

    Hi Jason,

    Great blog!
    Can you share more on the significance of the bumps in the RMSE plots? Is it an acceptable condition, or should we try to fix it to make the RMSE plots smooth?

    • Jason Brownlee July 11, 2017 at 10:27 am #

      The smoothness is not really a concern. It is a result of the regularization of the network.

  6. tp July 18, 2017 at 3:15 am #

    Hi Jason, thanks for the tutorials. What’s your opinion on recurrent dropout generally? I’ve seen a few sources saying it’s not a good idea.
    e.g.
    https://arxiv.org/pdf/1409.2329.pdf – “Standard dropout perturbs the recurrent connections, which makes it difficult for the LSTM to learn to store information for long periods of time. By not using dropout on the recurrent connections, the LSTM can benefit from dropout regularization without sacrificing its valuable memorization ability”

    • Jason Brownlee July 18, 2017 at 8:47 am #

      I focus on the effect to model skill, I’ve found generally that dropout on input weights and weight regularization on inputs both result in better skill for simple sequence prediction tasks.

  7. Michele August 3, 2017 at 6:37 pm #

    Hi Jason,
    Thank you for your great article. I’m finding it very useful.
    Just one question: in my case I’m scaling the input data between -1 and 1, but at the output of model.predict() the data range is not between -1 and 1. I have some strange values like -1.00688391. Any idea?
    thank you

    • Jason Brownlee August 4, 2017 at 6:56 am #

      The network may have a linear activation on the output layer. You could just round it.

  8. Tryfon August 17, 2017 at 7:11 am #

    Hey Jason,

    why do you rescale the input to the range (-1, 1)?

    By default the activation function of the LSTM is ‘linear’. Shouldn’t you have changed it to tanh?

    The same question would easily apply to your post “How to Scale Data for Long Short-Term Memory Networks in Python”. In that post you rescale to (0,1). So in that case do you assume that the activation function is ‘sigmoid’?

    Thanks in advance!

    • Jason Brownlee August 17, 2017 at 4:50 pm #

      I should have normalized the data in [0,1], internal gates use sigmoid.
