Recurrent neural networks are a type of neural network that add the explicit handling of order in input observations.

This capability suggests that the promise of recurrent neural networks is to learn the temporal context of input sequences in order to make better predictions. That is, that the suite of lagged observations required to make a prediction no longer must be diagnosed and specified as in traditional time series forecasting, or even forecasting with classical neural networks. Instead, the temporal dependence can be learned, and perhaps changes to this dependence can also be learned.

In this post, you will discover the promised capability of recurrent neural networks for time series forecasting. After reading this post, you will know:

- The focus and implicit, if not explicit, limitations on traditional time series forecasting methods.
- The capabilities provided in using traditional feed-forward neural networks for time series forecasting.
- The additional promise that recurrent neural networks make on top of traditional neural nets and hints of what this may mean in practice.

Let’s get started.

## Time Series Forecasting

Time series forecasting is difficult.

Unlike the simpler problems of classification and regression, time series problems add the complexity of order or temporal dependence between observations.

This can be difficult because specialized handling of the data is required when fitting and evaluating models. The temporal structure also aids in modeling, providing additional structure like trends and seasonality that can be leveraged to improve model skill.

Traditionally, time series forecasting has been dominated by linear methods like ARIMA because they are well understood and effective on many problems. But these traditional methods also suffer from some limitations, such as:

- **Focus on complete data**: missing or corrupt data is generally unsupported.
- **Focus on linear relationships**: assuming a linear relationship excludes more complex joint distributions.
- **Focus on fixed temporal dependence**: the relationship between observations at different times, and in turn the number of lag observations provided as input, must be diagnosed and specified.
- **Focus on univariate data**: many real-world problems have multiple input variables.
- **Focus on one-step forecasts**: many real-world problems require forecasts with a long time horizon.

Existing techniques often depended on hand-crafted features that were expensive to create and required expert knowledge of the field.

— John Gamboa, Deep Learning for Time-Series Analysis, 2017

Note that some specialized techniques have been developed to address some of these limitations.


## Neural Networks for Time Series

Neural networks approximate a mapping function from input variables to output variables.

This general capability is valuable for time series for a number of reasons.

- **Robust to Noise**. Neural networks are robust to noise in input data and in the mapping function and can even support learning and prediction in the presence of missing values.
- **Nonlinear**. Neural networks do not make strong assumptions about the mapping function and readily learn linear and nonlinear relationships.

… one important contribution of neural networks – namely their elegant ability to approximate arbitrary non-linear functions. This property is of high value in time series processing and promises more powerful applications, especially in the subfield of forecasting …

— Georg Dorffner, Neural Networks for Time Series Processing, 1996.

More specifically, neural networks can be configured to support an arbitrary defined but fixed number of inputs and outputs in the mapping function. This means that:

- **Multivariate Inputs**. An arbitrary number of input features can be specified, providing direct support for multivariate forecasting.
- **Multi-Step Forecasts**. An arbitrary number of output values can be specified, providing direct support for multi-step and even multivariate forecasting.

For these capabilities alone, feed-forward neural networks are widely used for time series forecasting.

Implicit in the usage of neural networks is the requirement that there is indeed a meaningful mapping from inputs to outputs to learn. Modeling a mapping of a random walk will perform no better than a persistence model (e.g. using the last seen observation as the forecast).
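To make the persistence baseline concrete, here is a minimal sketch in plain Python. The helper names (`random_walk`, `persistence_forecast`, `rmse`) and the choice of unit-variance Gaussian steps are assumptions made for illustration; they are not taken from any library.

```python
import random

def random_walk(n, seed=1):
    """Generate a random walk: each value is the previous value plus Gaussian noise."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(series[-1] + rng.gauss(0.0, 1.0))
    return series

def persistence_forecast(history):
    """Forecast the next value as the last observed value."""
    return history[-1]

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5

series = random_walk(1000)
actuals = series[1:]

# Walk-forward evaluation: forecast series[t] using only series[:t].
persistence_preds = [persistence_forecast(series[:t]) for t in range(1, len(series))]
persistence_rmse = rmse(actuals, persistence_preds)

# A naive alternative for comparison: forecast the mean of the history so far.
mean_preds = [sum(series[:t]) / t for t in range(1, len(series))]
mean_rmse = rmse(actuals, mean_preds)

print("persistence RMSE:", round(persistence_rmse, 3))
print("history-mean RMSE:", round(mean_rmse, 3))
```

Because each step of the walk is pure noise, the persistence error sits close to the standard deviation of the steps, and a model fit to such a series has no structure to exploit beyond that.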

This expectation of a learnable mapping function also makes one of the limitations clear: the mapping function is fixed or static.

- **Fixed inputs**. The number of lag input variables is fixed, in the same way as traditional time series forecasting methods.
- **Fixed outputs**. The number of output variables is also fixed; although a more subtle issue, it means that for each input pattern, one output must be produced.
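The fixed inputs and outputs described above are typically imposed when the series is framed as a supervised learning problem. The following is a rough sketch of that framing, assuming a univariate series; `series_to_supervised`, `n_lags`, and `n_steps` are hypothetical names chosen for this example.

```python
def series_to_supervised(series, n_lags, n_steps):
    """Split a univariate series into fixed-size (input window, output window) pairs."""
    X, y = [], []
    for i in range(len(series) - n_lags - n_steps + 1):
        X.append(series[i:i + n_lags])                      # n_lags past observations
        y.append(series[i + n_lags:i + n_lags + n_steps])   # n_steps future observations
    return X, y

series = [10, 20, 30, 40, 50, 60, 70]
X, y = series_to_supervised(series, n_lags=3, n_steps=2)
for inputs, outputs in zip(X, y):
    print(inputs, "->", outputs)
```

Both `n_lags` and `n_steps` have to be chosen before the model is fit, which is exactly the upfront specification of temporal dependence that a feed-forward network requires.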

Sequences pose a challenge for [deep neural networks] because they require that the dimensionality of the inputs and outputs is known and fixed.

— Ilya Sutskever, Oriol Vinyals, Quoc V. Le, Sequence to Sequence Learning with Neural Networks, 2014

Feed-forward neural networks do offer great capability but still suffer from this key limitation of having to specify the temporal dependence upfront in the design of the model.

This dependence is almost always unknown and must be discovered and teased out from detailed analysis in a fixed form.

## Recurrent Neural Networks for Time Series

Recurrent neural networks like the Long Short-Term Memory network add the explicit handling of order between observations when learning a mapping function from inputs to outputs.

The addition of sequence is a new dimension to the function being approximated. Instead of mapping inputs to outputs alone, the network is capable of learning a mapping function for the inputs over time to an output.

This capability unlocks time series for neural networks.

Long Short-Term Memory (LSTM) is able to solve many time series tasks unsolvable by feed-forward networks using fixed size time windows.

— Felix A. Gers, Douglas Eck, Jürgen Schmidhuber, Applying LSTM to Time Series Predictable through Time-Window Approaches, 2001

In addition to the general benefits of using neural networks for time series forecasting, recurrent neural networks can also learn the temporal dependence from the data.

**Learned Temporal Dependence**. The context of observations over time is learned.

That is, in the simplest case, the network is shown one observation at a time from a sequence and can learn what observations it has seen previously are relevant and how they are relevant to forecasting.
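The idea of seeing one observation at a time while carrying context forward can be illustrated with a single recurrent unit. This is only a sketch: it is a simple tanh recurrent unit rather than an LSTM, and the weights `w_in`, `w_rec`, and `bias` are fixed by hand here as assumptions, whereas a real network would learn them from data.

```python
import math

def rnn_forward(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a single-unit recurrent cell over a sequence, one observation per step.

    Returns the hidden state after each time step.
    """
    state = 0.0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state.
        state = math.tanh(w_in * x + w_rec * state + bias)
        states.append(state)
    return states

with_early_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])

print(with_early_signal)
print(without_signal)
```

The last two inputs are identical in both sequences, yet the final states differ, because the state still carries a trace of the first observation. The same function also accepts sequences of any length, with no window size in its signature.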

Because of this ability to learn long term correlations in a sequence, LSTM networks obviate the need for a pre-specified time window and are capable of accurately modelling complex multivariate sequences.

— Pankaj Malhotra, et al., Long Short Term Memory Networks for Anomaly Detection in Time Series, 2015

The promise of recurrent neural networks is that the temporal dependence in the input data can be learned. That a fixed set of lagged observations does not need to be specified.

Implicit within this promise is that a temporal dependence that varies with circumstance can also be learned.

But, recurrent neural networks may be capable of more.

It is good practice to manually identify and remove systematic structures such as trend and seasonality from time series data to make the problem easier to model (e.g. make the series stationary), and this may still be a best practice when using recurrent neural networks. But the general capability of these networks suggests that this may not be a requirement for a skillful model.
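One common way to remove such structure by hand is differencing, which is assumed here as the technique; the post does not prescribe a specific method. The helper names `difference` and `invert_difference` are hypothetical.

```python
def difference(series, interval=1):
    """Subtract the observation `interval` steps back from each observation."""
    return [series[i] - series[i - interval] for i in range(interval, len(series))]

def invert_difference(first_values, diffs, interval=1):
    """Rebuild the original series from its first `interval` values and the differences."""
    restored = list(first_values)
    for d in diffs:
        restored.append(restored[-interval] + d)
    return restored

# A series with a linear trend: first-order differencing leaves a constant.
trend_series = [2 * t + 5 for t in range(10)]
trend_diffs = difference(trend_series)
print(trend_diffs)

# A series with a repeating pattern of length 3: seasonal differencing removes it.
seasonal_series = [1, 2, 3, 1, 2, 3, 1, 2, 3]
seasonal_diffs = difference(seasonal_series, interval=3)
print(seasonal_diffs)
```

Forecasts made on the differenced scale can be mapped back with `invert_difference`, so a model trained on the stationary series can still produce predictions in the original units.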

Technically, the available context may allow recurrent neural networks to learn:

- **Trend**. An increasing or decreasing level to a time series and even variation in these changes.
- **Seasonality**. Consistently repeating patterns over time.

What do you think the promise is for LSTMs on time series forecasting problems?

## Summary

In this post, you discovered the promise of recurrent neural networks for time series forecasting.

Specifically, you learned:

- Traditional time series forecasting methods focus on univariate data with linear relationships and fixed and manually-diagnosed temporal dependence.
- Neural networks add the capability to learn possibly noisy and nonlinear relationships with arbitrarily defined but fixed numbers of inputs and outputs supporting multivariate and multi-step forecasting.
- Recurrent neural networks add the explicit handling of ordered observations and the promise of learning temporal dependence from context.

Do you disagree with my thoughts on the promise of LSTMs for time series forecasting?

Leave a comment below and join the discussion.

Thank you Jason for sharing. I am making a gentle start in Deep Learning. Currently gathering very generic information.

Great, stick with it Benson.

Great article! I am currently working on my thesis and this is very similar to what I am writing, only a bit better. Thank you for the clear summary of a somewhat complex theory about time series predictions!

Thanks Sander, I’m glad to hear that.

Thanks for the helpful article Jason. I used an RNN for classification with the same random seed on the same data. But running the same code multiple times gives me different results. So far, most results are the same except for maybe 1 or 2 times. But the results are drastically different (76% accuracy vs. 48%). Have you had a similar experience? If so, what did you do to mitigate it?

If you are using the tensorflow backend, you will also need to seed the tensorflow random number generator.

If you are using Keras turn off shuffling in the fit method.

With all the recent posts on time series forecasting, I would like to remind you that there are numerous time series that will not bow to even the most sophisticated RNN LSTM approach, no matter what you try. Sometimes the necessary data to make an accurate prediction are not contained in the time series data, but rather in exogenous variables. If you think you can predict the oil price of tomorrow based on historical oil price data, I have a bridge in Brooklyn for sale for you. As always, there is no “free lunch” here, no black magick.

Great point Gerrit.

On the other hand – if oil prices trade at $50.0 for a few years we can safely assume that they will not start trading at “Microwave Oven” or “Wisp of sentient hydrogen gas at the peripheries of a white dwarf” or “The abstract concept of self identity as characterized by a lonely internet message” – in short one must certainly temper expectations and assume in part that life can be rather unpredictable and yet also understand that often it is not and within certain tolerances can be almost outright boring.

Hi Jason,

Have you had a chance to evaluate HTM models (https://numenta.org), they seem to fly under the radar. Maybe a future article?

>>Technically, the available context may allow recurrent neural networks to learn ..

Are there any actual measurements or research you can point to? I notice that in most of your recent articles you are differencing the series.

>>Scaling

Also – any measurements or references?

Thanks again, please keep the great work going – I am an avid customer for your PDF books as well.

Yes, I dived deep into HTM back when they were first “launched”, 2008 or 09 perhaps.

Sorry, not sure I understand your questions. Perhaps you could restate them?

Thanks Jason, would love to see your input on HTMs

Sorry about my questions, I meant: can you point me to research or tests/measurements on the exact influence of differencing (and whether it is beneficial at all in LSTMs) and feature scaling?

No, I have not seen this literature. Time series with LSTMs is a very new area.

Hi Jason,

I am currently doing my final year masters project using LSTM for battery life cycle prediction, your posts on LSTM for time series have been so helpful.

At the moment I am trying to determine how my model is affected by window size, but I want to approach it more systematically than just running empirical tests. I was just wondering, when you say “the temporal dependence in the input data can be learned,” how would this relate to the window size? Could you expand on this a little for me and maybe direct me to some papers.

Thank you 🙂

Yes, see this post on autocorrelation:

http://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/

Hi Jason,

I am currently working on forecasting time series using neural networks, with Keras, for my internship. Your blog has been so helpful so far.

But I still have one big question that I have some trouble to deal with; when you say “the promise of recurrent neural networks is that the temporal dependence in the input data can be learned. That a fixed set of lagged observations does not need to be specified”, I feel a contradiction on how we train the model.

For instance, let’s suppose we have a univariate time series x1, x2, …

The goal is to predict at the current time the next two values. In several of your posts using Keras, you shape your inputs/outputs in this way:

[x1, x2, x3, … , xt] -> [xt+1, xt+2]

[x2, x3, x4, … , xt+1] -> [xt+2, xt+3]

.

.

.

But are we not clearly using a time window to predict the next values? I think I have some trouble understanding the difference between the temporal dependence you are talking about and the temporal dependence induced by the size of the window …

Thank you

Not really.

We must provide sequence data to the model to learn. The vectorization of the data required by the library imposes some limit. In fact, BPTT imposes a reasonable limit of 200-400 time steps anyway. These are not a window, but a limit on the number of time steps shown to the model before updating weights during training.

It is different to a window of 5 or 10 time steps.

That being said, LSTMs may not be the best choice for autoregression problems:

http://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Hi,

Thank you for your answer Jason, it’s starting to get a little clearer!

Concerning your last remark; LSTM’s would not be the best choice for autoregression problems, but many of your blog posts are related to time series forecasting with LSTMs. Is it a recent analysis you made ? Or am I missing something here ?

Thank you, Thibault

Yes, both experience and some research turned me against LSTMs for autoregression.

This is the rationale why my latest book on LSTMs is not focused on time series, but instead on LSTM architectures.

Hi Jason, this is one of the best articles I have read about LSTM and I got the most out of it.

Thanks Akash.

Hi Jason, firstly I’d like to thank you for sharing as much knowledge as you do in all of your blog posts, I’ve read many of them and they’ve been tremendously helpful.

I’m currently trying to come up with a model to predict the power output from a wind turbine. My dataset has about 35,000 entries with 3 input variables and 1 output variable. I’m using a 5 layer LSTM network since it’s the one that’s been giving me the best results so far.

My doubt is the following: I understand there’s “trends” in the output, in the sense that, the speed of the wind for instance doesn’t suddenly go from 100mph to 0mph in 1 second. Therefore the power being generated at time t will depend on the value of the input variables at time t, but it will also depend partly on the power generated at t-1, and t-2, and so on…

How can I account for this in my neural network? Does using LSTM layers already account for this or is there a way for me to improve the results of my model? Thanks in advance!!

Hi Javier, you are describing “serial dependence” which is a core concept in time series problems.

Consider reading up a little on autocorrelation and other time series concepts here:

http://machinelearningmastery.com/start-here/#timeseries

Hi,

Could you please try to help me if it’s possible.

I am training on my data using an RNN. How can I give evidence that I have found the best network?

For example, can I ensure I have obtained the best network by checking its predictions, or can I plot the regression or MSE?

regards,

Evaluating lots of other models and configurations will give you evidence that perhaps you have a good model.

Thanks for the nice blog post. I am just wondering: since there are some linear models such as seasonal ARIMA that model data based not only on previous observations but also on previous seasonal patterns, are RNNs capable of using seasonal information, as SARIMA models do, to model data? Are they able to remember those patterns if they occurred so long ago that they do not fit into the training window?

Yes, it is possible. I do not have a good example though.

I have found MLPs outperform LSTMs on autoregression problems during my own testing.

Hi Jason, Thank you for your post. I am implementing LSTM for time series forecasting. The length of my series is 300. I have applied vanilla LSTM, stacked LSTM, MLP, and ARIMA to forecast my weekly time series data but LSTM is not performing better than ARIMA and MLP. I have used ‘adam’ optimizer as discussed in your post to train LSTM. Can you please give me some tips to optimize it better. I have also applied regularization and varied the number of epochs.

However, on another data set of mine, which has a time series of length 60, the LSTM performed better, but just marginally.

Generally I find MLPs outperform LSTMs on time series forecasting. Stick with the MLP.

Hello Jason,

I’ve data for patients with some inputs(demographics etc) and 1 or more outputs for each of their visits to the hospital. I’ve done some experiments using NN in Matlab by considering all patients together which may not make more sense. Now I would like to do some time series analysis for each patient and seeing their behavior. I’ve some problems though

1. The data is not enough like many patients have for example 2 or 3 visits and I’ve only few patients

2. Each patient has different no. of visits

I’m still trying to think how could I make a time series out of it? Which model to use? ARMAX, NarxNet, etc. and several others.

I would be glad if you could guide me to some solutions.

Thanks.

There is no single best answer. I would recommend testing a suite of different methods to see what works for your specific data.

Thanks for the reply but …

Assume a simple and hypothetical scenario (similar to my problem) like

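| ID | VisitDate | Weight | Height | LowBP | HighBP |
|----|-----------|--------|--------|-------|--------|
| 1 | Jan 1, 2010 | 76 | 5 | 76 | 119 |
| 1 | Mar 10, 2010 | 77 | 5 | 73 | 119 |
| 1 | July 1, 2010 | 76 | 5 | 76 | 120 |
| 2 | Feb 2, 2009 | 55 | 5.5 | 70 | 132 |
| 2 | Mar 5, 2009 | 60 | 5.5 | 70 | 132 |
| 2 | Aug 2, 2009 | 57 | 5.5 | 71 | 130 |

….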

I would like to predict LowBP and HighBP after 1 month, 2 months, etc.

As you can see the baseline for each patient will be different and the interval is not equal as well. In addition to that I’ve less data and also some missing data.

I would like your opinion about tackling this problem especially the initial stage of preparing the data. How could I make it a time series?

Do I need to create 2 more columns as future values of LowBP and HighBP and copy the next record of each patient as a future value. In that case the last record will have no future value and I may need to delete it.

After that how could I split for train and test data. Do I need to keep all the last records of each patient as test data.

Once I get an initial start I can then try to apply different techniques and models. I was reading your post about time series and also somewhere about LSTM. I have Matlab 2016, and I believe LSTM is supported from 2017. If needed I’ll try to update.

Thanks again.

Maybe this post will help:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
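As a rough sketch of the framing asked about above, each visit can be paired with the same patient's next visit, so the last visit of each patient has no target and is dropped. The visit values are copied from the example in the question; the function name `make_supervised_pairs` and the idea of adding the gap in days as an input (to account for unequal intervals) are assumptions made for illustration, not a recommendation from the post.

```python
from collections import defaultdict
from datetime import date

# Visits from the example above: (patient_id, visit_date, low_bp, high_bp)
visits = [
    (1, date(2010, 1, 1), 76, 119),
    (1, date(2010, 3, 10), 73, 119),
    (1, date(2010, 7, 1), 76, 120),
    (2, date(2009, 2, 2), 70, 132),
    (2, date(2009, 3, 5), 70, 132),
    (2, date(2009, 8, 2), 71, 130),
]

def make_supervised_pairs(records):
    """Pair each visit with the same patient's next visit."""
    by_patient = defaultdict(list)
    for patient_id, visit_date, low_bp, high_bp in records:
        by_patient[patient_id].append((visit_date, low_bp, high_bp))

    pairs = []
    for patient_id, rows in by_patient.items():
        rows.sort()  # chronological order within each patient
        for (d0, low0, high0), (d1, low1, high1) in zip(rows, rows[1:]):
            gap_days = (d1 - d0).days  # assumption: expose the unequal interval as a feature
            pairs.append({
                "patient": patient_id,
                "inputs": (low0, high0, gap_days),
                "targets": (low1, high1),
            })
    return pairs

pairs = make_supervised_pairs(visits)
for pair in pairs:
    print(pair)
```

With three visits per patient this yields two pairs per patient, which shows how little training data such a framing leaves when each patient has only a handful of records.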

Thanks again. In addition to that post I also read this

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

and for sure will try to apply windows method and also walk-forward.

Just trying to remove one confusion regarding prediction whether it will be for each ID or for all. What I mean whether I take for example first ID and try to create different models using walk forward say for 15 times. Then I take second ID and do the same for it. Correct me if I’m wrong doing this way I’ll get predictive models that are dependent on IDs. So what about if a new patient comes in? How do we predict for him? Do we need previous values for this patient and need to create again new models for this patient?

You have options. You can choose to model per patient, per patient group, across all patients or some other variation.

There is no one way, so I’d encourage you to explore a few different framings of your problem and see what works best for your specific data.

Ok I’ll try these different options.

Do you have any post for increasing the amount of data. I was thinking of copula or with some regression coefficients. What is your opinion?

As I mentioned earlier, I have little data, maybe around 3 to 4 records per patient. I’m not sure whether I understood correctly, or was clear in my previous reply: when I use walk-forward to predict the next value and then use this value to predict the following value, are we actually increasing the amount of data? If yes, then how can the model be validated?

Perhaps it would be worth taking some time to really nail down what you want to predict and what data could be used:

http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

I’ve generated some synthetic data using uniform distribution with same mean and variance. Also I use copula fit to generate with Gaussian distribution. Now how do I verify that it is not biased. Is there any post?

Another thing regarding creating the model patient wise. When I create model (for example NarNet) and predict for one patient and then move to next patient, how to merge the previous model with this patient. Or if I create separate model for each patient then how to merge to create a final single model?

You can plot the residual errors to see if they are Gaussian.

Perhaps ensemble, or perhaps combine data and train one large model. You are only limited by your imagination.

Hi Jason,

I am new to neural networks. I have closing prices of a market index and I want to model volatility using an LSTM neural network. How do I implement an LSTM in R to forecast the volatility of that index? What steps should I undertake to get a good out-of-sample forecast (preventing over/under fitting and using a robust method)? Is there any R code showing how to model a univariate time series using an LSTM?

I have a few posts on time series forecasting with LSTMs in Python that might help, start here:

https://machinelearningmastery.com/start-here/#lstm

Hello Jason, nice article. I have a question: can I use RNNs for multi-sensor data fusion? I have data in the form of distances, velocities and accelerations.

Perhaps. I am not familiar with the domain sorry. I would recommend searching on google scholar for some similar examples.

RNNs are suited to sequence prediction generally:

https://machinelearningmastery.com/sequence-prediction/

Hi Jason. I am trying to wrap my head around the different activation methods for neurons and how that can be useful for a more complex LSTM-based RNN. I am building a market close price predictor using OHLC, volume, # trades, RSI values, and SMAs. SMA’s and OHLC are all prices, but I am still struggling on how to tell the LSTM that a high RSI value has an inverse relationship to price. All my data points are normalized to a 0-1 scale but it feels like the LSTM just grabs all the numbers and treats them the same because the predictions are too far off. Im thinking maybe feeding the price inputs to one LSTM, and the RSI to another that uses inverse activation or something might help. Any thoughts? Thank you. GE

You might need a lot more data or a lot more training.

Perhaps start with an MLP and move to an LSTM only if it can outperform the MLP. Often for time series problems, LSTMs are not the right tool.

I’m working on a project to predict the usage of all the files in a filesystem in near future based on the metadata of the file system for past 6 months. I’ve got the following attributes about the files with me :

1. The temporal sequence of file usage for last 6 months(whenever the file was read/written/modified and by whom)

2. All the users who are on the server and can access the files.

3. Last modified/written/read epoch time and by whom

4. File creation epoch time and by whom

5. Any compliance regulations on the file(whether the file contains any confidential data)

6. Size, name, extension, version, type of the file

7. Number of users who can access the file

8. File path

9. Total number of times accessed

10. Permitted users

Now, I plan to use LSTM but for standard LSTMs, the input is temporal sequence only. However, all the attributes that I have seem significant in predicting the future usage of the file.

How should I also make use of the attributes of the file that I have? Should I train a Feedforward Neural Network, disregarding the fact that it usually fails on temporal sequences? How should I proceed?

Does a variant of LSTM exist that can take into account the attributes of the file as well and predict the usage of the file in near future?

Thanks in advance 🙂 !!

You could provide all features each time step or have a multi-headed model, with one head the sequence with an LSTM and another head a Dense with the vector.

Try both and see which works best.

Thanks for the reply Jason. Can you please point me to any research paper/ other resources which solve a similar problem?

That can help me immensely in working my way out.

This might help as a start:

https://machinelearningmastery.com/keras-functional-api-deep-learning/

I do have a few more examples of multi-headed models on the blog.

Hi Jason,

Great article! Thank you for sharing!!

You said that “Feed-forward neural networks do offer great capability but still suffer from this key limitation of having to specify the temporal dependence upfront in the design of the model. And LSTM can overcome this limitation”

However, it seems that the parameter ‘timestep’ should be selected when using LSTM. Therefore, I wonder whether the ‘timestep’ is related to the ‘ temporal dependence’ or ‘the number of lag observations’? And how to select the ‘timestep’ when using LSTM?

Thank you!!

It is the number of lag observations, but it can vary from sample to sample via zero-padding. The network processes one time step at a time.
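To illustrate the zero-padding mentioned above, here is a minimal sketch that pads sequences of different lengths to a common number of time steps. `pad_sequences_left` is a hypothetical helper written for this example, and padding at the front while keeping the most recent observations is an assumed design choice; Keras also provides a `pad_sequences` utility for the same purpose.

```python
def pad_sequences_left(sequences, max_len, pad_value=0.0):
    """Pad (or truncate) each sequence to exactly max_len time steps.

    Assumes max_len >= 1. Padding is added at the front; longer sequences
    keep only their most recent max_len observations.
    """
    padded = []
    for seq in sequences:
        recent = list(seq)[-max_len:]
        padded.append([pad_value] * (max_len - len(recent)) + recent)
    return padded

samples = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
padded = pad_sequences_left(samples, max_len=4, pad_value=0)
for row in padded:
    print(row)
```

Every sample now has the same number of time steps, so the batch can be vectorized, while the number of real lag observations still differs from sample to sample.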