How to Use Weight Regularization with LSTM Networks for Time Series Forecasting

Long Short-Term Memory (LSTM) models are a type of recurrent neural network capable of learning sequences of observations.

This may make them well suited to time series forecasting.

An issue with LSTMs is that they can easily overfit training data, reducing their predictive skill.

Weight regularization is a technique for imposing constraints (such as L1 or L2) on the weights within LSTM nodes. This has the effect of reducing overfitting and improving model performance.

In this tutorial, you will discover how to use weight regularization with LSTM networks and design experiments to test for its effectiveness for time series forecasting.

After completing this tutorial, you will know:

  • How to design a robust test harness for evaluating LSTM networks for time series forecasting.
  • How to design, execute, and interpret the results from using bias weight regularization with LSTMs.
  • How to design, execute, and interpret the results from using input and recurrent weight regularization with LSTMs.

Let’s get started.

Photo by Julian Fong, some rights reserved.

Tutorial Overview

This tutorial is broken down into 6 parts. They are:

  1. Shampoo Sales Dataset
  2. Experimental Test Harness
  3. Bias Weight Regularization
  4. Input Weight Regularization
  5. Recurrent Weight Regularization
  6. Review of Results

Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post.

Next, let’s take a look at a standard time series forecasting problem that we can use as context for this experiment.


Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

You can download and learn more about the dataset here.

The example below loads the dataset and creates a line plot of it.
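The listing below is a minimal sketch of loading and plotting the series; it assumes the dataset has been saved locally as shampoo-sales.csv (the filename is an assumption).

# load and plot the shampoo sales dataset
from datetime import datetime
from pandas import read_csv
from matplotlib import pyplot

# date-time parsing function: dates are stored like '1-01' (year-month)
def parser(x):
    return datetime.strptime('190' + x, '%Y-%m')

# load the dataset as a Pandas Series (squeeze=True collapses the single column)
series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
    index_col=0, squeeze=True, date_parser=parser)
# print the first 5 rows
print(series.head())
# line plot of the series
series.plot()
pyplot.show()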

Running the example loads the dataset as a Pandas Series and prints the first 5 rows.

A line plot of the series is then created showing a clear increasing trend.

Line Plot of Shampoo Sales Dataset

Next, we will take a look at the model configuration and test harness used in the experiment.

Experimental Test Harness

This section describes the test harness used in this tutorial.

Data Split

We will split the Shampoo Sales dataset into two parts: a training and a test set.

The first two years of data will be taken for the training dataset and the remaining one year of data will be used for the test set.

Models will be developed using the training dataset and will make predictions on the test dataset.

The persistence forecast (naive forecast) on the test dataset achieves an error of 136.761 monthly shampoo sales. This provides a lower acceptable bound of performance on the test set.
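As a quick sketch, the persistence error can be reproduced as follows, assuming series is the Series loaded in the earlier example.

from math import sqrt
from sklearn.metrics import mean_squared_error

# the last 12 months are the test set
test = series.values[-12:]
# persistence forecast: the observation from the prior month is the prediction
predictions = series.values[-13:-1]
rmse = sqrt(mean_squared_error(test, predictions))
print('Persistence RMSE: %.3f' % rmse)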

Model Evaluation

A rolling-forecast scenario will be used, also called walk-forward model validation.

The test dataset will be stepped through one time step at a time. A model will be used to make a forecast for the time step, then the actual value from the test set will be taken and made available to the model for the forecast on the next time step.

This mimics a real-world scenario where new Shampoo Sales observations would be available each month and used in the forecasting of the following month.

This will be simulated by the structure of the train and test datasets.

All forecasts on the test dataset will be collected and an error score calculated to summarize the skill of the model. The root mean squared error (RMSE) will be used as it punishes large errors and results in a score that is in the same units as the forecast data, namely monthly shampoo sales.

Data Preparation

Before we can fit a model to the dataset, we must transform the data.

The following three data transforms are performed on the dataset prior to fitting a model and making a forecast.

  1. Transform the time series data so that it is stationary. Specifically, a lag=1 differencing to remove the increasing trend in the data.
  2. Transform the time series into a supervised learning problem. Specifically, the organization of data into input and output patterns where the observation at the previous time step is used as an input to forecast the observation at the current time step.
  3. Transform the observations to have a specific scale. Specifically, to rescale the data to values between -1 and 1.

These transforms are inverted on forecasts to return them to their original scale before calculating an error score.
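The helper functions below are a minimal sketch of these three transforms and their inversions (the function names are this sketch's own conventions).

from pandas import DataFrame, Series, concat
from sklearn.preprocessing import MinMaxScaler

# 1. remove the trend with lag=1 differencing
def difference(dataset, interval=1):
    return Series([dataset[i] - dataset[i - interval] for i in range(interval, len(dataset))])

# invert a differenced forecast by adding back the prior observation
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

# 2. frame the series as supervised learning: value at t-1 is the input, t is the output
def timeseries_to_supervised(data, lag=1):
    df = DataFrame(data)
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    df = concat(columns, axis=1)
    df.fillna(0, inplace=True)
    return df

# 3. rescale the data to [-1, 1], fitting the scaler on the training data only
def scale(train, test):
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler = scaler.fit(train)
    return scaler, scaler.transform(train), scaler.transform(test)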

LSTM Model

We will use a base stateful LSTM model fit for 1000 epochs.

Ideally, a batch size of 1 would be used for walk-forward validation. For speed, we will instead predict the whole year at once and then step through the predictions. As such, we can use any batch size that evenly divides the number of samples; in this case, we will use a value of 4.

Ideally, more training epochs would be used (such as 1500), but this was truncated to 1000 to keep run times reasonable.

The model will be fit using the efficient ADAM optimization algorithm and the mean squared error loss function.
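A sketch of the model-building and training loop, assuming the scaled, supervised training data produced by the transforms above:

from keras.models import Sequential
from keras.layers import Dense, LSTM

# fit a stateful LSTM to the scaled training data
def fit_lstm(train, n_batch, nb_epoch, n_neurons):
    X, y = train[:, 0:-1], train[:, -1]
    # reshape input to [samples, time steps, features]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    model = Sequential()
    # note: a stateful LSTM requires the sample count to be divisible by n_batch
    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(1))
    # mean squared error loss with the ADAM optimizer
    model.compile(loss='mean_squared_error', optimizer='adam')
    # run each epoch manually, resetting internal state at the end of each
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
        model.reset_states()
    return model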

Experimental Runs

Each experimental scenario will be run 30 times and the RMSE score on the test set will be recorded at the end of each run.

Let’s dive into the experiments.

Baseline LSTM Model

Let’s start off with the baseline LSTM model.

The baseline LSTM model for this problem has the following configuration:

  • Lag inputs: 1
  • Epochs: 1000
  • Units in LSTM hidden layer: 3
  • Batch Size: 4
  • Repeats: 30

The complete code listing is provided below.
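The sketch below condenses the experiment() and run() functions, building on the helpers and fit_lstm() defined above; it is a reconstruction under those assumptions rather than a verbatim listing.

from math import sqrt
import numpy
from pandas import DataFrame, read_csv
from sklearn.metrics import mean_squared_error
from matplotlib import pyplot

# run a repeated experiment and return a list of test RMSE scores
def experiment(series, repeats, n_batch=4, n_epochs=1000, n_neurons=3):
    # prepare data: difference, supervised framing, train/test split, scaling
    raw_values = series.values
    diff_values = difference(raw_values, 1)
    supervised_values = timeseries_to_supervised(diff_values, 1).values[1:, :]
    train, test = supervised_values[0:-12], supervised_values[-12:]
    scaler, train_scaled, test_scaled = scale(train, test)
    error_scores = list()
    for r in range(repeats):
        # trim training data so the sample count is divisible by the batch size
        train_trimmed = train_scaled[2:, :]
        model = fit_lstm(train_trimmed, n_batch, n_epochs, n_neurons)
        # forecast the whole test year in one batch
        X = test_scaled[:, 0:-1].reshape(len(test_scaled), 1, 1)
        output = model.predict(X, batch_size=n_batch)
        predictions = list()
        for i in range(len(output)):
            yhat = output[i, 0]
            # invert scaling, then invert differencing
            yhat = scaler.inverse_transform(numpy.array([[test_scaled[i, 0], yhat]]))[0, -1]
            yhat = inverse_difference(raw_values, yhat, len(test_scaled) + 1 - i)
            predictions.append(yhat)
        rmse = sqrt(mean_squared_error(raw_values[-12:], predictions))
        print('%d) Test RMSE: %.3f' % (r + 1, rmse))
        error_scores.append(rmse)
    return error_scores

# summarize the scores and save a box and whisker plot
def run():
    series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
        index_col=0, squeeze=True, date_parser=parser)
    results = DataFrame()
    results['baseline'] = experiment(series, repeats=30)
    print(results.describe())
    results.boxplot()
    pyplot.savefig('experiment_baseline.png')

# entry point
run()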

This code listing will be used as the basis for all following experiments, with only the changes to this code provided in subsequent sections.

Running the experiment prints summary statistics for the test RMSE for all repeats.

We can see that on average, this model configuration achieves a test RMSE of about 92 monthly shampoo sales with a standard deviation of 5.

A box and whisker plot is also created from the distribution of test RMSE results and saved to a file.

The plot provides a clear depiction of the spread of the results, highlighting the middle 50% of values (the box) and the median (green line).

Box and Whisker Plot of Baseline Performance on the Shampoo Sales Dataset

Bias Weight Regularization

Weight regularization can be applied to the bias connection within the LSTM nodes.

In Keras, this is specified with a bias_regularizer argument when creating an LSTM layer. The regularizer is defined as an instance of one of the L1, L2, or L1L2 classes.

More details on these classes are available in the Keras regularizers API documentation.

In this experiment, we will compare L1, L2, and L1L2 with a default value of 0.01 against the baseline model. We can specify all configurations using the L1L2 class, as follows:

  • L1L2(0.0, 0.0) [i.e. baseline]
  • L1L2(0.01, 0.0) [i.e. L1]
  • L1L2(0.0, 0.01) [i.e. L2]
  • L1L2(0.01, 0.01) [i.e. L1L2 or elasticnet]

The updated fit_lstm(), experiment(), and run() functions for using bias weight regularization with LSTMs are listed below.
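As a sketch, assuming the baseline harness above, the key changes are an extra reg argument on fit_lstm() and the bias_regularizer argument on the LSTM layer; experiment() is assumed to be updated to accept reg and forward it to fit_lstm().

from keras.regularizers import L1L2

# updated fit_lstm() with a bias_regularizer on the LSTM layer
def fit_lstm(train, n_batch, nb_epoch, n_neurons, reg):
    X, y = train[:, 0:-1], train[:, -1]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    model = Sequential()
    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]),
        stateful=True, bias_regularizer=reg))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
        model.reset_states()
    return model

# updated run(): compare each regularizer configuration
def run():
    series = read_csv('shampoo-sales.csv', header=0, parse_dates=[0],
        index_col=0, squeeze=True, date_parser=parser)
    regularizers = [L1L2(l1=0.0, l2=0.0), L1L2(l1=0.01, l2=0.0),
        L1L2(l1=0.0, l2=0.01), L1L2(l1=0.01, l2=0.01)]
    results = DataFrame()
    for reg in regularizers:
        # label each column with its L1/L2 configuration
        name = ('l1 %.2f,l2 %.2f' % (reg.l1, reg.l2))
        results[name] = experiment(series, repeats=30, reg=reg)
    print(results.describe())
    results.boxplot()
    pyplot.savefig('experiment_reg_bias.png')

# entry point
run()

The L1L2 class exposes l1 and l2 attributes, which makes it convenient to label each result column with its configuration.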

Running this experiment prints descriptive statistics for each evaluated configuration.

The results suggest that on average, the default of no bias regularization results in better performance compared to all of the other configurations considered.

A box and whisker plot is also created to compare the distributions of results for each configuration.

The plot shows that all configurations have about the same spread and that the addition of bias regularization was uniformly unhelpful on this problem.

Box and Whisker Plots of Bias Weight Regularization Performance on the Shampoo Sales Dataset

Input Weight Regularization

We can also apply regularization to input connections on each LSTM unit.

In Keras, this is achieved by setting the kernel_regularizer argument to a regularizer class.

We will test the same regularizer configurations as were used in the previous section, specifically:

  • L1L2(0.0, 0.0) [i.e. baseline]
  • L1L2(0.01, 0.0) [i.e. L1]
  • L1L2(0.0, 0.01) [i.e. L2]
  • L1L2(0.01, 0.01) [i.e. L1L2 or elasticnet]

The updated fit_lstm(), experiment(), and run() functions for using input weight regularization with LSTMs are listed below.
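Relative to the bias regularization sketch above, the only change is which regularizer argument is set on the LSTM layer, for example:

# in fit_lstm(): regularize the input weights instead of the bias
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]),
    stateful=True, kernel_regularizer=reg))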

Running this experiment prints descriptive statistics for each evaluated configuration.

The results suggest that adding weight regularization to input connections does offer benefit across the board on this setup.

We can see that the test RMSE is approximately 10 units lower for all configurations with perhaps more benefit when both L1 and L2 are combined into an elasticnet type constraint.

A box and whisker plot is also created to compare the distributions of results for each configuration.

The plot shows the generally lower distribution of error for input regularization. The results also suggest a tighter spread with regularization, which may be most pronounced for the L1L2 configuration that achieved the best results.

This is an encouraging finding, suggesting that additional experiments with different L1L2 values for input regularization would be well worth investigating.

Box and Whisker Plots of Input Weight Regularization Performance on the Shampoo Sales Dataset

Recurrent Weight Regularization

Finally, we can also apply regularization to recurrent connections on each LSTM unit.

In Keras, this is achieved by setting the recurrent_regularizer argument to a regularizer class.

We will test the same regularizer configurations as were used in the previous section, specifically:

  • L1L2(0.0, 0.0) [i.e. baseline]
  • L1L2(0.01, 0.0) [i.e. L1]
  • L1L2(0.0, 0.01) [i.e. L2]
  • L1L2(0.01, 0.01) [i.e. L1L2 or elasticnet]

The updated fit_lstm(), experiment(), and run() functions for using recurrent weight regularization with LSTMs are listed below.
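Again, relative to the sketches above, only the regularizer argument on the LSTM layer changes, for example:

# in fit_lstm(): regularize the recurrent weights
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]),
    stateful=True, recurrent_regularizer=reg))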

Running this experiment prints descriptive statistics for each evaluated configuration.

The results suggest no obvious benefit from using regularization on the recurrent connection with LSTMs on this problem.

On average, all of the variations tried performed worse than the baseline model.

A box and whisker plot is also created to compare the distributions of results for each configuration.

The plot shows the same story as the summary statistics, suggesting little benefit from using recurrent weight regularization.

Box and Whisker Plots of Recurrent Weight Regularization Performance on the Shampoo Sales Dataset

Extensions

This section lists ideas for follow-up experiments to extend the work in this tutorial.

  • Input Weight Regularization. The experimental results for input weight regularization on this problem showed great promise in lifting performance. This could be investigated further by grid searching different L1 and L2 values to find an optimal configuration.
  • Behavior Dynamics. The dynamic behavior of each weight regularization scheme could be investigated by plotting train and test RMSE over training epochs to get an idea of the effect of weight regularization on overfitting and underfitting behavior patterns.
  • Combine Regularization. Experiments could be designed to explore the effect of combining different weight regularization schemes.
  • Activation Regularization. Keras also supports activation regularization, and this could be another avenue to explore for imposing constraints on the LSTM and reducing overfitting.

Summary

In this tutorial, you discovered how to use weight regularization with LSTMs for time series forecasting.

Specifically, you learned:

  • How to design a robust test harness for evaluating LSTM networks for time series forecasting.
  • How to configure bias weight regularization on LSTMs for time series forecasting.
  • How to configure input and recurrent weight regularization on LSTMs for time series forecasting.

Do you have any questions about using weight regularization with LSTM networks?
Ask your questions in the comments below and I will do my best to answer.



15 Responses to How to Use Weight Regularization with LSTM Networks for Time Series Forecasting

  1. Alex May 5, 2017 at 5:57 am #

    Great job like every time.

    But I want to understand why we always use RMSE and don’t use accuracy metrics?

    • Jason Brownlee May 5, 2017 at 7:35 am #

      You cannot measure accuracy on regression problems (unless you transform them to become a classification problem).

      In general, problems that have a real-valued quantity as the output variable are regression problems, those that have a category or label output are classification.

  2. Alex May 5, 2017 at 7:51 am #

    Ah OK, yes, that is logical thinking. Thanks

  3. shobhit May 5, 2017 at 9:08 pm #

    hello jason,
    how can stock market data be predicted using multiple input columns and a single output column?

  4. shobhit May 5, 2017 at 9:18 pm #

    import numpy as np
    import pandas as pd
    import tensorflow as tf
    #tf.logging.set_verbosity(tf.logging.ERROR)
    from pandas_datareader import data as web
    import matplotlib.pyplot as plt
    def get_data():
        feature_cols = {'ret_%s' % i: tf.constant(idata['ret_%s' % i].values) for i in lags}
        labels = tf.constant((idata['returns'] > 0).astype(int).values, shape=[idata['returns'].size, 1])
        return feature_cols, labels

    symbol = '^GSPC'
    data = web.DataReader(symbol, data_source='yahoo', start='2014-01-01', end='2016-10-31')['Adj Close']
    data = pd.DataFrame(data)
    data.rename(columns={'Adj Close': 'price'}, inplace=True)
    data['returns'] = np.log(data / data.shift(1))
    lags = range(1, 6)
    for i in lags:
        data['ret_%s' % i] = np.sign(data['returns'].shift(i))
    data.dropna(inplace=True)
    print data.round(4).tail()
    cutoff = '2015-1-1'
    training_data = data[data.index < cutoff].copy()
    test_data = data[data.index >= cutoff].copy()
    #def get_data():
    #    feature_cols = {'ret_%s' % i: tf.constant(data['ret_%s' % i].values) for i in lags}
    #    labels = tf.constant((data['returns'] > 0).astype(int).values, shape=[data['returns'].size, 1])
    #    return feature_cols, labels
    fc = [tf.contrib.layers.real_valued_column('ret_%s' % i, dimension=1) for i in lags]
    model = tf.contrib.learn.DNNClassifier(feature_columns=fc, n_classes=2, hidden_units=[100, 100])
    idata = training_data
    model.fit(input_fn=get_data, steps=500)
    model.evaluate(input_fn=get_data, steps=1)
    pred = model.predict(input_fn=get_data)
    pred[:30]
    training_data['prediction'] = np.where(pred > 0, 1, -1)
    training_data['strategy'] = training_data['prediction'] * training_data['returns']
    training_data[['returns', 'strategy']].cumsum().apply(np.exp).plot(figsize=(10, 6))
    idata = test_data
    model.evaluate(input_fn=get_data, steps=1)
    pred = model.predict(input_fn=get_data)
    test_data['prediction'] = np.where(pred > 0, 1, -1)
    test_data['strategy'] = test_data['prediction'] * test_data['returns']
    test_data[['returns', 'strategy']].cumsum().apply(np.exp).plot(figsize=(10, 6))
    #pred[:1]
    if __name__ == '__main__':
        get_data()

    hello jason,
    i just ran this code but i got an error that i cannot understand.
    the error is: ----> 1 pred[:30]

    TypeError: 'generator' object is not subscriptable

    please kindly help me.

    • Jason Brownlee May 6, 2017 at 7:43 am #

      Perhaps contact the author of the tensorflow code you have pasted?

  5. Birkey May 10, 2017 at 10:48 am #

    Great article!
    however, what confused me is, in the section LSTM Model: “A batch size of 1 is required as we will be using walk-forward validation and making one-step forecasts for each of the final 12 months of test data.”, but then you use a batch size of 4 (n_batch=4).

    I guess you mean “A time step of 1 is required”, am I right? Because one forecast per month means one time step.

    • Jason Brownlee May 11, 2017 at 8:26 am #

      You make a good point.

      A batch size of 1 would be required if we made predictions each time step of the test data. Here we predict the whole year first, then work through the predictions to see what they mean.

      I have updated the post correcting the error.

  6. Laurent May 11, 2017 at 5:58 pm #

    Thanks for all your pedagogic material, it’s really easy to learn with your articles!

    I have a question: in all your examples of time series deep learning, you’re always predicting the next time step, one at a time.

    How would you approach the case of training with a time series, and forecasting the next 12 steps, for example?

  7. jinhua zhang May 12, 2017 at 4:31 am #

    Hi, your article is very useful for me! But when I run the Bias Weight Regularization procedure, Spyder reports the fault: ”D:\Program Files\Anaconda3\lib\site-packages\matplotlib\__init__.py:1357: UserWarning: This call to matplotlib.use() has no effect
    because the backend has already been chosen;
    matplotlib.use() must be called *before* pylab, matplotlib.pyplot,
    or matplotlib.backends is imported for the first time.

    warnings.warn(_use_error_msg)”
    I don’t know how to solve the fault, could you help me?

    • Jason Brownlee May 12, 2017 at 7:47 am #

      Are you able to run the script from the command line instead of within the IDE?

      • jinhua zhang May 16, 2017 at 1:07 am #

        Thank you very much! It was because I missed the command
        “# entry point
        run()”.
        When I added the command, the procedure could run.
