Recurrent neural networks are a type of neural network that add the explicit handling of order in input observations.

This capability suggests that the promise of recurrent neural networks is to learn the temporal context of input sequences in order to make better predictions. That is, the suite of lagged observations required to make a prediction no longer has to be diagnosed and specified, as it does in traditional time series forecasting or even in forecasting with classical neural networks. Instead, the temporal dependence can be learned, and perhaps changes to this dependence can also be learned.

In this post, you will discover the promised capability of recurrent neural networks for time series forecasting. After reading this post, you will know:

- The focus and the implicit, if not explicit, limitations of traditional time series forecasting methods.
- The capabilities provided in using traditional feed-forward neural networks for time series forecasting.
- The additional promise that recurrent neural networks make on top of traditional neural nets and hints of what this may mean in practice.

Let’s get started.

## Time Series Forecasting

Time series forecasting is difficult.

Unlike the simpler problems of classification and regression, time series problems add the complexity of order or temporal dependence between observations.

This can be difficult because specialized handling of the data is required when fitting and evaluating models. The temporal structure also aids in modeling, providing additional structure like trends and seasonality that can be leveraged to improve model skill.

Traditionally, time series forecasting has been dominated by linear methods like ARIMA because they are well understood and effective on many problems. But these traditional methods also suffer from some limitations, such as:

- **Focus on complete data**: missing or corrupt data is generally unsupported.
- **Focus on linear relationships**: assuming a linear relationship excludes more complex joint distributions.
- **Focus on fixed temporal dependence**: the relationship between observations at different times, and in turn the number of lag observations provided as input, must be diagnosed and specified.
- **Focus on univariate data**: many real-world problems have multiple input variables.
- **Focus on one-step forecasts**: many real-world problems require forecasts with a long time horizon.

Existing techniques often depended on hand-crafted features that were expensive to create and required expert knowledge of the field.

— John Gamboa, Deep Learning for Time-Series Analysis, 2017

Note that some specialized techniques have been developed to address some of these limitations.

## Neural Networks for Time Series

Neural networks approximate a mapping function from input variables to output variables.

This general capability is valuable for time series for a number of reasons.

- **Robust to Noise**. Neural networks are robust to noise in input data and in the mapping function and can even support learning and prediction in the presence of missing values.
- **Nonlinear**. Neural networks do not make strong assumptions about the mapping function and readily learn linear and nonlinear relationships.

… one important contribution of neural networks – namely their elegant ability to approximate arbitrary non-linear functions. This property is of high value in time series processing and promises more powerful applications, especially in the subfield of forecasting …

— Georg Dorffner, Neural Networks for Time Series Processing, 1996.

More specifically, neural networks can be configured to support an arbitrarily defined but fixed number of inputs and outputs in the mapping function. This means that:

- **Multivariate Inputs**. An arbitrary number of input features can be specified, providing direct support for multivariate forecasting.
- **Multi-Step Forecasts**. An arbitrary number of output values can be specified, providing direct support for multi-step and even multivariate forecasting.

For these capabilities alone, feed-forward neural networks are widely used for time series forecasting.

Implicit in the usage of neural networks is the requirement that there is indeed a meaningful mapping from inputs to outputs to learn. Modeling a mapping of a random walk will perform no better than a persistence model (e.g. using the last seen observation as the forecast).
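To make the random walk point concrete, here is a minimal sketch using only NumPy. The series is synthetic (Gaussian steps with unit standard deviation, an assumption made purely for illustration), and the comparison against a constant mean forecast is just one naive alternative, not a claim about any particular model.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic random walk: each value is the previous value plus Gaussian noise.
steps = rng.normal(loc=0.0, scale=1.0, size=1000)
series = np.cumsum(steps)

# Persistence forecast: predict x[t] with the last seen observation x[t-1].
actual = series[1:]
persistence = series[:-1]
persistence_rmse = np.sqrt(np.mean((actual - persistence) ** 2))

# Naive alternative for comparison: always predict the mean of the series.
mean_forecast = np.full_like(actual, series.mean())
mean_rmse = np.sqrt(np.mean((actual - mean_forecast) ** 2))

print(f"persistence RMSE: {persistence_rmse:.3f}")
print(f"mean forecast RMSE: {mean_rmse:.3f}")
```

The persistence error is simply the size of the random steps themselves, which is the part of a random walk that no mapping from past values can predict.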

This expectation of a learnable mapping function also makes one of the limitations clear: the mapping function is fixed or static.

- **Fixed inputs**. The number of lag input variables is fixed, in the same way as traditional time series forecasting methods.
- **Fixed outputs**. The number of output variables is also fixed; although a more subtle issue, it means that for each input pattern, one output must be produced.
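The fixed inputs and outputs described above can be sketched as a sliding window over a univariate series. The helper name `series_to_supervised` and the parameters `n_in` and `n_out` are hypothetical, chosen here for illustration; the point is that both sizes must be decided before any model is fit.

```python
import numpy as np

def series_to_supervised(series, n_in, n_out):
    """Frame a univariate series as fixed-size input/output patterns.

    n_in  -- number of lag observations used as input (fixed up front)
    n_out -- number of future observations to predict (fixed up front)
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # Each row of X holds n_in consecutive observations.
    X = np.array([series[i:i + n_in] for i in range(n_samples)])
    # Each row of y holds the n_out observations that follow that window.
    y = np.array([series[i + n_in:i + n_in + n_out] for i in range(n_samples)])
    return X, y

# Toy series 1..10 with 3 lag inputs and a 2-step forecast.
X, y = series_to_supervised(range(1, 11), n_in=3, n_out=2)
for inputs, outputs in zip(X, y):
    print(inputs, "->", outputs)
```

Changing `n_in` or `n_out` changes the shape of the data, which is why a feed-forward network trained on one framing cannot be reused for another without retraining.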

Sequences pose a challenge for [deep neural networks] because they require that the dimensionality of the inputs and outputs is known and fixed.

— Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Sequence to Sequence Learning with Neural Networks, 2014

Feed-forward neural networks do offer great capability but still suffer from this key limitation of having to specify the temporal dependence upfront in the design of the model.

This dependence is almost always unknown and must be discovered through detailed analysis and then specified in a fixed form.

## Recurrent Neural Networks for Time Series

Recurrent neural networks like the Long Short-Term Memory network add the explicit handling of order between observations when learning a mapping function from inputs to outputs.

The addition of sequence is a new dimension to the function being approximated. Instead of mapping inputs to outputs alone, the network is capable of learning a mapping function for the inputs over time to an output.

This capability unlocks time series for neural networks.

Long Short-Term Memory (LSTM) is able to solve many time series tasks unsolvable by feed-forward networks using fixed size time windows.

— Felix A. Gers, Douglas Eck, Jürgen Schmidhuber, Applying LSTM to Time Series Predictable through Time-Window Approaches, 2001

In addition to the general benefits of using neural networks for time series forecasting, recurrent neural networks can also learn the temporal dependence from the data.

**Learned Temporal Dependence**. The context of observations over time is learned.

That is, in the simplest case, the network is shown one observation at a time from a sequence and can learn which of the observations it has seen previously are relevant, and how they are relevant, to forecasting.
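The idea of seeing one observation at a time can be sketched with a simple recurrent cell in NumPy. This is an assumption-heavy illustration: it uses a plain tanh recurrence rather than an LSTM, the weights are random and untrained, and the names `rnn_forward`, `W_x`, `W_h`, `b`, and `w_out` are made up for the example. It only shows how a hidden state carries context forward, so the same weights can process sequences of any length and the result depends on order.

```python
import numpy as np

def rnn_forward(sequence, W_x, W_h, b, w_out):
    """Run a simple recurrent cell over a univariate sequence.

    The hidden state h summarizes everything seen so far and is updated
    once per observation; a single output is read from the final state.
    """
    h = np.zeros(W_h.shape[0])
    for x_t in sequence:
        # The new state depends on the current input and the previous state.
        h = np.tanh(W_x * x_t + W_h @ h + b)
    return float(w_out @ h)

rng = np.random.default_rng(0)
hidden_size = 4
W_x = rng.normal(size=hidden_size)                       # input-to-hidden weights
W_h = rng.normal(size=(hidden_size, hidden_size)) * 0.5  # hidden-to-hidden weights
b = np.zeros(hidden_size)                                # hidden bias
w_out = rng.normal(size=hidden_size)                     # hidden-to-output weights

# The same weights handle sequences of different lengths.
out_short = rnn_forward([0.1, 0.2, 0.3], W_x, W_h, b, w_out)
out_long = rnn_forward([0.5, 0.4, 0.1, 0.2, 0.3], W_x, W_h, b, w_out)

# The same values in a different order give a different output.
out_reversed = rnn_forward([0.3, 0.2, 0.1], W_x, W_h, b, w_out)

print(out_short, out_long, out_reversed)
```

Training would adjust the weights so that the state retains whatever past context helps the forecast; an LSTM adds gating to this recurrence so that context can be kept over longer spans.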

Because of this ability to learn long term correlations in a sequence, LSTM networks obviate the need for a pre-specified time window and are capable of accurately modelling complex multivariate sequences.

— Pankaj Malhotra, et al., Long Short Term Memory Networks for Anomaly Detection in Time Series, 2015

The promise of recurrent neural networks is that the temporal dependence in the input data can be learned, so that a fixed set of lagged observations does not need to be specified.

Implicit within this promise is that a temporal dependence that varies with circumstance can also be learned.

But, recurrent neural networks may be capable of more.

It is good practice to manually identify and remove systematic structures such as trend and seasonality from time series data to make the problem easier to model (e.g. make the series stationary), and this may still be a best practice when using recurrent neural networks. But the general capability of these networks suggests that this may not be a requirement for a skillful model.
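As an illustration of removing systematic structure by hand, the sketch below differences a series and then inverts the operation. Differencing is one common way to remove a trend; the function names `difference` and `invert_difference` are hypothetical, and the toy series is invented for the example.

```python
import numpy as np

def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    series = np.asarray(series, dtype=float)
    return series[lag:] - series[:-lag]

def invert_difference(first_values, diffed, lag=1):
    """Rebuild the original series from its first `lag` values and the differences."""
    restored = [float(v) for v in first_values[:lag]]
    for i, d in enumerate(diffed):
        # restored[i] is the value `lag` steps before the one being rebuilt.
        restored.append(restored[i] + float(d))
    return np.array(restored)

# Toy series with a linear trend.
series = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
diffed = difference(series, lag=1)
restored = invert_difference(series[:1], diffed, lag=1)

print(diffed)
print(restored)
```

A lag of 1 targets a trend, while a lag equal to the seasonal period (for example 12 for monthly data with a yearly cycle) targets seasonality. Forecasts made on the differenced scale need the inverse step to return to the original units.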

Technically, the available context may allow recurrent neural networks to learn:

- **Trend**. An increasing or decreasing level to a time series and even variation in these changes.
- **Seasonality**. Consistently repeating patterns over time.

What do you think the promise is for LSTMs on time series forecasting problems?

## Summary

In this post, you discovered the promise of recurrent neural networks for time series forecasting.

Specifically, you learned:

- Traditional time series forecasting methods focus on univariate data with linear relationships and fixed and manually-diagnosed temporal dependence.
- Neural networks add the capability to learn possibly noisy and nonlinear relationships with arbitrarily defined but fixed numbers of inputs and outputs supporting multivariate and multi-step forecasting.
- Recurrent neural networks add the explicit handling of ordered observations and the promise of learning temporal dependence from context.

Do you disagree with my thoughts on the promise of LSTMs for time series forecasting?

Leave a comment below and join the discussion.

Thank you Jason for sharing. I am making a gentle start in Deep Learning. Currently gathering very generic information.

Great, stick with it Benson.

Great article! I am currently working on my thesis and this is very similar to what I am writing but only a bit better. Thank you for the clear summary of a somewhat complex theory about time series predictions!

Thanks Sander, I’m glad to hear that.

Thanks for the helpful article Jason. I used RNN in classification with the same random seed on the same data. But running the same code multiple times gives me different results. So far, most results are the same except for maybe 1 or 2 times. But the results are drastically different (76% accuracy vs. 48%). Have you had similar experience? If so, what did you do to mitigate it?

If you are using the TensorFlow backend, you will also need to seed the TensorFlow random number generator.

If you are using Keras, turn off shuffling in the fit method.
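A minimal sketch of seeding in one place, assuming Python's `random` module, NumPy, and optionally TensorFlow 2 are the sources of randomness. The helper name `seed_everything` is hypothetical. Seeding alone may not remove every source of variation (some GPU operations can be nondeterministic), so treat this as a starting point.

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed the random number generators that a typical Keras script relies on."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        # TensorFlow is optional here; tf.random.set_seed is the TensorFlow 2 call.
        import tensorflow as tf
        tf.random.set_seed(seed)
    except (ImportError, AttributeError):
        pass

# Re-seeding reproduces the same random draws.
seed_everything(42)
first_draw = np.random.rand(3)
seed_everything(42)
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))
```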

With all the recent posts on time series forecasting, I would like to remind you that there are numerous time series that will not bow to even the most sophisticated RNN LSTM approach, no matter what you try. Sometimes the necessary data to make an accurate prediction are not contained in the time series data, but rather in exogenous variables. If you think you can predict the oil price of tomorrow based on historical oil price data, I have a bridge in Brooklyn for sale for you. As always, there is no “free lunch” here, no black magick.

Great point Gerrit.

On the other hand – if oil prices trade at $50.0 for a few years we can safely assume that they will not start trading at “Microwave Oven” or “Wisp of sentient hydrogen gas at the peripheries of a white dwarf” or “The abstract concept of self identity as characterized by a lonely internet message” – in short one must certainly temper expectations and assume in part that life can be rather unpredictable and yet also understand that often it is not and within certain tolerances can be almost outright boring.

Hi Jason,

Have you had a chance to evaluate HTM models (https://numenta.org), they seem to fly under the radar. Maybe a future article?

>>Technically, the available context may allow recurrent neural networks to learn ..

Are there any actual measurements or research you can point to? I notice that in most of your recent articles you are differencing the series.

>>Scaling

Also – any measurements or references?

Thanks again, please keep the great work going – I am an avid customer for your PDF books as well.

Yes, I dived deep into HTM back when they were first “launched”, 2008 or 09 perhaps.

Sorry, not sure I understand your questions. Perhaps you could restate them?

Thanks Jason, would love to see your input on HTMs

Sorry about my questions. I meant: can you point me to research or tests/measurements on the exact influence of differencing (and whether it is beneficial at all in LSTMs) and feature scaling?

No, I have not seen this literature. Time series with LSTMs is a very new area.

Hi Jason,

I am currently doing my final year masters project using LSTM for battery life cycle prediction, your posts on LSTM for time series have been so helpful.

At the moment I am trying to determine how my model is affected by window size, but I want to approach it more systematically than just running empirical tests. I was just wondering, when you say “the temporal dependence in the input data can be learned,” how would this relate to the window size? Could you expand on this a little for me and maybe direct me to some papers.

Thank you 🙂

Yes, see this post on autocorrelation:

http://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/
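As a rough illustration of how autocorrelation can inform a window size, here is a sketch of the sample autocorrelation computed with NumPy. The function name `autocorrelation` and the synthetic sine series with a period of 12 are assumptions for the example; lags with strong correlation are candidates for inclusion in an input window.

```python
import numpy as np

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    n = len(x)
    return np.array([
        np.sum(x[lag:] * x[:n - lag]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic seasonal series with a period of 12 time steps.
t = np.arange(120)
series = np.sin(2 * np.pi * t / 12)
acf = autocorrelation(series, max_lag=24)

# Lags with the strongest positive correlation (excluding lag 0).
strongest = np.argsort(acf[1:])[::-1][:3] + 1
print(strongest)
```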

Hi Jason,

I am currently working for my internship on forecasting time series using neural networks, with Keras. Your blog has been so helpful so far.

But I still have one big question that I have some trouble to deal with; when you say “the promise of recurrent neural networks is that the temporal dependence in the input data can be learned. That a fixed set of lagged observations does not need to be specified”, I feel a contradiction on how we train the model.

For instance, let’s suppose we have a univariate time series x1, x2, …

The goal is to predict at the current time the next two values. In several of your posts using Keras, you shape your inputs/outputs in this way:

[x1, x2, x3, … , xt] -> [xt+1, xt+2]

[x2, x3, x4, … , xt+1] -> [xt+2, xt+3]

…

But we are clearly using a time window to predict the next values? I think I have some trouble understanding the difference between the temporal dependence you are talking about, and the temporal dependence induced by the size of the window…

Thank you

Not really.

We must provide sequence data to the model to learn. The vectorization of the data required by the lib imposes some limit. In fact, BPTT imposes a reasonable limit of 200-400 time steps anyway. These are not a window, but a limit on the number of time steps shown to the model before updating weights during training.

It is different to a window of 5 or 10 time steps.

That being said, LSTMs may not be the best choice for autoregression problems:

http://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Hi,

Thank you for your answer Jason, it’s starting to get a little clearer!

Concerning your last remark: LSTMs would not be the best choice for autoregression problems, but many of your blog posts are related to time series forecasting with LSTMs. Is it a recent analysis you made? Or am I missing something here?

Thank you, Thibault

Yes, both experience and some research turned me against LSTMs for autoregression.

This is the rationale why my latest book on LSTMs is not focused on time series, but instead on LSTM architectures.

Hi Jason, this is one of the best articles I have read about LSTMs and I got the most out of it.

Thanks Akash.

Hi Jason, firstly I’d like to thank you for sharing as much knowledge as you do in all of your blog posts, I’ve read many of them and they’ve been tremendously helpful.

I’m currently trying to come up with a model to predict the power output from a wind turbine. My dataset has about 35,000 entries with 3 input variables and 1 output variable. I’m using a 5 layer LSTM network since it’s the one that’s been giving me the best results so far.

My doubt is the following: I understand there’s “trends” in the output, in the sense that, the speed of the wind for instance doesn’t suddenly go from 100mph to 0mph in 1 second. Therefore the power being generated at time t will depend on the value of the input variables at time t, but it will also depend partly on the power generated at t-1, and t-2, and so on…

How can I account for this in my neural network? Does using LSTM layers already account for this or is there a way for me to improve the results of my model? Thanks in advance!!

Hi Javier, you are describing “serial dependence” which is a core concept in time series problems.

Consider reading up a little on autocorrelation and other time series concepts here:

http://machinelearningmastery.com/start-here/#timeseries

Hi,

Could you please try to help me if it’s possible.

I am training my data using an RNN. How can I give evidence that I have found the best network?

For example, can I ensure I have obtained the best network by checking its predictions, or by plotting the regression or MSE?

regards,

Evaluating lots of other models and configurations will give you evidence that perhaps you have a good model.

Thanks for the nice blog post. I am just wondering, since there are some linear models such as seasonal ARIMA that model data based not only on previous observations but also on previous seasonal patterns, are RNNs capable of using seasonal information in the way SARIMA models do? Are they able to remember those patterns if they occurred so long ago that they do not fit into the training window?

Yes, it is possible. I do not have a good example though.

I have found MLPs outperform LSTMs on autoregression problems during my own testing.