Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras

Time series prediction problems are a difficult type of predictive modeling problem.

Unlike regression predictive modeling, time series also adds the complexity of a sequence dependence among the input variables.

A powerful type of neural network designed to handle sequence dependence is the recurrent neural network. The Long Short-Term Memory network, or LSTM network, is a type of recurrent neural network used in deep learning because very large architectures can be successfully trained.

In this post, you will discover how to develop LSTM networks in Python using the Keras deep learning library to address a demonstration time-series prediction problem.

After completing this tutorial you will know how to implement and develop LSTM networks for your own time series prediction problems and other more general sequence problems. You will know:

  • About the International Airline Passengers time-series prediction problem.
  • How to develop LSTM networks for regression, window and time-step based framing of time series prediction problems.
  • How to develop and make predictions using LSTM networks that maintain state (memory) across very long sequences.

In this tutorial, we will develop a number of LSTMs for a standard time series prediction problem.

The problem and the chosen configuration for the LSTM networks are for demonstration purposes only; they are not optimized.

These examples will show you exactly how you can develop your own differently structured LSTM networks for time series predictive modeling problems.

Let’s get started.

  • Update Oct/2016: There was an error in the way that RMSE was calculated in each example. Reported RMSEs were just plain wrong. Now, RMSE is calculated directly from predictions and both RMSE and graphs of predictions are in the units of the original dataset. Models were evaluated using Keras 1.1.0, TensorFlow 0.10.0 and scikit-learn v0.18. Thanks to all those that pointed out the issue, and to Philip O’Brien for helping to point out the fix.
  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.
  • Update Apr/2017: For a more complete and better explained tutorial of LSTMs for time series forecasting see the post Time Series Forecasting with the Long Short-Term Memory Network in Python.

Photo by Margaux-Marguerite Duquesnoy, some rights reserved.

Problem Description

The problem we are going to look at in this post is the International Airline Passengers prediction problem.

This is a problem where, given a year and a month, the task is to predict the number of international airline passengers in units of 1,000. The data ranges from January 1949 to December 1960, or 12 years, with 144 observations.

The dataset is available for free from the DataMarket webpage as a CSV download with the filename “international-airline-passengers.csv”.

Below is a sample of the first few lines of the file.

We can load this dataset easily using the Pandas library. We are not interested in the date, given that each observation is separated by the same interval of one month. Therefore, when we load the dataset we can exclude the first column.

The downloaded dataset also has footer information that we can exclude with the skipfooter argument to pandas.read_csv() set to 3 for the 3 footer lines. Once loaded we can easily plot the whole dataset. The code to load and plot the dataset is listed below.
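A minimal sketch of that loading step, assuming the file has a header row, the date in the first column and the passenger count in the second. The helper name `load_passengers` is hypothetical, and the plotting calls are shown only as commented usage lines.

```python
import pandas as pd


def load_passengers(source):
    """Load the passenger counts, dropping the date column and 3 footer lines."""
    # usecols=[1] keeps only the second column (the counts).
    # skipfooter is only supported by the pure-Python parser engine.
    return pd.read_csv(source, usecols=[1], engine="python", skipfooter=3)


# Usage (assumes the CSV sits next to the script and Matplotlib is installed):
# import matplotlib.pyplot as plt
# frame = load_passengers("international-airline-passengers.csv")
# plt.plot(frame.values)
# plt.show()
```

The function accepts a path or any file-like object, which makes it easy to try on a few pasted lines before pointing it at the real download.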

You can see an upward trend in the dataset over time.

You can also see some periodicity to the dataset that probably corresponds to the Northern Hemisphere vacation period.

Plot of the Airline Passengers Dataset

We are going to keep things simple and work with the data as-is.

Normally, it is a good idea to investigate various data preparation techniques to rescale the data and to make it stationary.

Long Short-Term Memory Network

The Long Short-Term Memory network, or LSTM network, is a recurrent neural network that is trained using Backpropagation Through Time and overcomes the vanishing gradient problem.

As such, it can be used to create large recurrent networks that in turn can be used to address difficult sequence problems in machine learning and achieve state-of-the-art results.

Instead of neurons, LSTM networks have memory blocks that are connected through layers.

A block has components that make it smarter than a classical neuron and a memory for recent sequences. A block contains gates that manage the block’s state and output. A block operates upon an input sequence, and each gate within a block uses sigmoid activation units to control whether it is triggered or not, making the change of state and the addition of information flowing through the block conditional.

There are three types of gates within a unit:

  • Forget Gate: conditionally decides what information to throw away from the block.
  • Input Gate: conditionally decides which values from the input to update the memory state.
  • Output Gate: conditionally decides what to output based on input and the memory of the block.

Each unit is like a mini-state machine where the gates of the units have weights that are learned during the training procedure.
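To make the three gates concrete, here is a rough NumPy sketch of one time step through a layer of LSTM blocks. It follows the standard LSTM equations rather than any particular library's internals; the weights are random instead of learned, and the ordering of the gate rows inside `W`, `U` and `b` is an arbitrary choice made for this illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, U, b):
    """One time step for a layer of LSTM blocks.

    W: (4*units, features), U: (4*units, units), b: (4*units,)
    Rows are stacked as: forget, input, candidate, output (illustrative order).
    """
    units = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0 * units:1 * units])   # forget gate: what to throw away
    i = sigmoid(z[1 * units:2 * units])   # input gate: which values to update
    g = np.tanh(z[2 * units:3 * units])   # candidate values to write
    o = sigmoid(z[3 * units:4 * units])   # output gate: what to expose
    c = f * c_prev + i * g                # new memory (cell) state
    h = o * np.tanh(c)                    # block output
    return h, c


rng = np.random.default_rng(0)
units, features = 4, 1
W = rng.normal(scale=0.5, size=(4 * units, features))
U = rng.normal(scale=0.5, size=(4 * units, units))
b = np.zeros(4 * units)

h, c = np.zeros(units), np.zeros(units)
for value in [0.1, 0.2, 0.3]:             # a short, already scaled sequence
    h, c = lstm_step(np.array([value]), h, c, W, U, b)
print(h)
```

Training adjusts `W`, `U` and `b`; the gate structure itself stays fixed.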

You can see how you may achieve sophisticated learning and memory from a layer of LSTMs, and it is not hard to imagine how higher-order abstractions may be layered with multiple such layers.

LSTM Network for Regression

We can phrase the problem as a regression problem.

That is, given the number of passengers (in units of thousands) this month, what is the number of passengers next month?

We can write a simple function to convert our single column of data into a two-column dataset: the first column containing this month’s (t) passenger count and the second column containing next month’s (t+1) passenger count, to be predicted.

Before we get started, let’s first import all of the functions and classes we intend to use. This assumes a working SciPy environment with the Keras deep learning library installed.

Before we do anything, it is a good idea to fix the random number seed to ensure our results are reproducible.

We can also use the code from the previous section to load the dataset as a Pandas dataframe. We can then extract the NumPy array from the dataframe and convert the integer values to floating point values, which are more suitable for modeling with a neural network.

LSTMs are sensitive to the scale of the input data, specifically when the sigmoid or tanh activation functions are used (the Keras LSTM layer uses tanh by default for the block output and a sigmoid-type function for the gates). It can be a good practice to rescale the data to the range of 0-to-1, also called normalizing. We can easily normalize the dataset using the MinMaxScaler preprocessing class from the scikit-learn library.
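The arithmetic behind 0-to-1 normalization is small enough to write out by hand. The sketch below uses plain NumPy and hypothetical helper names; the five values are a short illustrative series, not the full dataset.

```python
import numpy as np


def minmax_fit(data):
    """Return the per-column minimum and maximum."""
    return data.min(axis=0), data.max(axis=0)


def minmax_transform(data, data_min, data_max):
    """Rescale so that data_min maps to 0 and data_max maps to 1."""
    return (data - data_min) / (data_max - data_min)


def minmax_inverse(scaled, data_min, data_max):
    """Undo minmax_transform, returning values in the original units."""
    return scaled * (data_max - data_min) + data_min


dataset = np.array([[112.0], [118.0], [132.0], [129.0], [121.0]], dtype="float32")
data_min, data_max = minmax_fit(dataset)
scaled = minmax_transform(dataset, data_min, data_max)
print(scaled.ravel())
```

scikit-learn's `MinMaxScaler(feature_range=(0, 1))` packages the same arithmetic behind `fit_transform()` and `inverse_transform()`.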

After we model our data and estimate the skill of our model on the training dataset, we need to get an idea of the skill of the model on new unseen data. For a normal classification or regression problem, we would do this using cross validation.

With time series data, the sequence of values is important. A simple method that we can use is to split the ordered dataset into train and test datasets. The code below calculates the index of the split point and separates the data into a training dataset with 67% of the observations that we can use to train our model, leaving the remaining 33% for testing the model.
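A sketch of the ordered split, using a placeholder array of 144 values in place of the real series. The function name `split_ordered` is an assumption made for illustration.

```python
import numpy as np


def split_ordered(dataset, train_fraction=0.67):
    """Split without shuffling: the first part trains, the remainder tests."""
    train_size = int(len(dataset) * train_fraction)
    return dataset[:train_size], dataset[train_size:]


# Placeholder data with the same length as the airline series.
dataset = np.arange(144, dtype="float32").reshape(-1, 1)
train, test = split_ordered(dataset)
print(len(train), len(test))  # prints 96 48
```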

Now we can define a function to create a new dataset, as described above.

The function takes two arguments: the dataset, which is a NumPy array that we want to convert into a dataset, and the look_back, which is the number of previous time steps to use as input variables to predict the next time period (the default is 1).

This default will create a dataset where X is the number of passengers at a given time (t) and Y is the number of passengers at the next time (t + 1).

The look_back value can be changed, and we will do so in the next section to construct a differently shaped dataset.
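One way such a function could look is sketched below. The loop bound is an assumption: it stops at `len(dataset) - look_back` so that the final observation is still used as a target. The six values are a short illustrative series in thousands of passengers.

```python
import numpy as np


def create_dataset(dataset, look_back=1):
    """Turn an (n, 1) array into X of shape (m, look_back) and y of shape (m,)."""
    data_x, data_y = [], []
    for i in range(len(dataset) - look_back):
        data_x.append(dataset[i:i + look_back, 0])   # values at t-look_back+1 .. t
        data_y.append(dataset[i + look_back, 0])     # value at t+1
    return np.array(data_x), np.array(data_y)


series = np.array([112, 118, 132, 129, 121, 135], dtype="float32").reshape(-1, 1)
X, y = create_dataset(series, look_back=1)
for x_row, target in zip(X, y):
    print(x_row, target)
```

Each printed row pairs the value at time t with the value at t+1, so the second column is the first column shifted up by one step.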

Let’s take a look at the effect of this function on the first rows of the dataset (shown in the unnormalized form for clarity).

If you compare these first 5 rows to the original dataset sample listed in the previous section, you can see the X=t and Y=t+1 pattern in the numbers.

Let’s use this function to prepare the train and test datasets for modeling.

The LSTM network expects the input data (X) to be provided with a specific array structure in the form of: [samples, time steps, features].

Currently, our data is in the form: [samples, features] and we are framing the problem as one time step for each sample. We can transform the prepared train and test input data into the expected structure using numpy.reshape() as follows:
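As a small illustration of that reshape, with four made-up scaled samples standing in for the prepared training inputs:

```python
import numpy as np

# Four samples, one feature each: shape (samples, features).
train_x = np.array([[0.1], [0.2], [0.3], [0.4]], dtype="float32")

# Insert a time-step axis of length 1: (samples, time steps, features).
train_x_3d = np.reshape(train_x, (train_x.shape[0], 1, train_x.shape[1]))
print(train_x.shape, "->", train_x_3d.shape)  # prints (4, 1) -> (4, 1, 1)
```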

We are now ready to design and fit our LSTM network for this problem.

The network has a visible layer with 1 input, a hidden layer with 4 LSTM blocks or neurons, and an output layer that makes a single value prediction. The default activation functions are used for the LSTM blocks (in Keras, tanh for the block output and a sigmoid-type function for the gates). The network is trained for 100 epochs and a batch size of 1 is used.
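A sketch of how this network might be declared with the Keras API bundled in TensorFlow. The mean squared error loss and Adam optimizer are assumptions, as are the function names; Keras is imported inside the function so that the small weight-counting helper can be used on its own.

```python
def lstm_param_count(units, features):
    """Trainable weights in one standard LSTM layer (three gates plus candidate)."""
    return 4 * (units * (features + units) + units)


def build_model(look_back=1, units=4):
    """1 input time step with look_back features, 4 LSTM blocks, 1 output."""
    # Imported lazily so lstm_param_count works without TensorFlow installed.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(1, look_back)),   # (time steps, features)
        keras.layers.LSTM(units),
        keras.layers.Dense(1),
    ])
    model.compile(loss="mean_squared_error", optimizer="adam")  # assumed settings
    return model


print(lstm_param_count(units=4, features=1))  # prints 96

# Usage, assuming train_x has shape (samples, 1, look_back):
# model = build_model(look_back=1)
# model.fit(train_x, train_y, epochs=100, batch_size=1, verbose=2)
```

With 4 blocks and 1 input feature the LSTM layer holds 96 weights and the output layer adds 5 more, so this is a very small network by deep learning standards.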

Once the model is fit, we can estimate the performance of the model on the train and test datasets. This will give us a point of comparison for new models.

Note that we invert the predictions before calculating error scores to ensure that performance is reported in the same units as the original data (thousands of passengers per month).
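The order of operations matters here: undo the scaling first, then compute the error. A NumPy sketch with made-up numbers follows; the function name and the sample values are assumptions for illustration only.

```python
import numpy as np


def rmse_original_units(y_true_scaled, y_pred_scaled, data_min, data_max):
    """Invert 0-to-1 scaling, then return RMSE in the original units."""
    span = data_max - data_min
    y_true = np.asarray(y_true_scaled) * span + data_min
    y_pred = np.asarray(y_pred_scaled) * span + data_min
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Made-up example: true values 100 and 200, predictions 110 and 190.
score = rmse_original_units([0.0, 1.0], [0.1, 0.9], data_min=100.0, data_max=200.0)
print(f"RMSE: {score:.2f}")  # prints RMSE: 10.00
```

Computing the error on the scaled values and rescaling the score afterwards would not give the same number, because the inverse transform also adds the dataset minimum.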

Finally, we can generate predictions using the model for both the train and test dataset to get a visual indication of the skill of the model.

Because of how the dataset was prepared, we must shift the predictions so that they align on the x-axis with the original dataset. Once prepared, the data is plotted, showing the original dataset in blue, the predictions for the training dataset in green, and the predictions on the unseen test dataset in red.
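One possible way to do the shifting is to place each set of predictions into an array of NaN values that is as long as the full dataset. The sketch assumes that prediction number i in each split corresponds to the observation at index i + look_back within that split; adjust the offsets if your input/output pairs were built differently.

```python
import numpy as np


def align_for_plot(n_total, train_pred, test_pred, look_back, train_size):
    """Return two arrays of length n_total, NaN wherever no prediction exists."""
    train_pred = np.ravel(train_pred)     # accept (n, 1) model output
    test_pred = np.ravel(test_pred)
    train_plot = np.full(n_total, np.nan)
    test_plot = np.full(n_total, np.nan)
    # The first look_back observations of each split have no prediction.
    train_plot[look_back:look_back + len(train_pred)] = train_pred
    start = train_size + look_back
    test_plot[start:start + len(test_pred)] = test_pred
    return train_plot, test_plot


train_plot, test_plot = align_for_plot(
    n_total=10,
    train_pred=np.arange(5.0),
    test_pred=np.arange(3.0) + 100.0,
    look_back=1,
    train_size=6,
)
print(train_plot)
print(test_plot)
```

Matplotlib leaves gaps at NaN entries, so both arrays can be plotted directly over the original series.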

We can see that the model did an excellent job of fitting both the training and the test datasets.

LSTM Trained on Regression Formulation of Passenger Prediction Problem

For completeness, below is the entire code example.

Running the example produces the following output.

We can see that the model has an average error of about 23 passengers (in thousands) on the training dataset, and about 52 passengers (in thousands) on the test dataset. Not that bad.

LSTM for Regression Using the Window Method

We can also phrase the problem so that multiple, recent time steps can be used to make the prediction for the next time step.

This is called a window, and the size of the window is a parameter that can be tuned for each problem.

For example, given the current time (t) we want to predict the value at the next time in the sequence (t+1), we can use the current time (t), as well as the two prior times (t-1 and t-2) as input variables.

When phrased as a regression problem, the input variables are t-2, t-1, t and the output variable is t+1.

The create_dataset() function we created in the previous section allows us to create this formulation of the time series problem by increasing the look_back argument from 1 to 3.

A sample of the dataset with this formulation looks as follows:
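The window framing can be reproduced in a few lines. This sketch swaps the explicit loop for NumPy's `sliding_window_view` (available from NumPy 1.20), which is a different implementation of the same idea, and uses a short illustrative series.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

series = np.array([112, 118, 132, 129, 121, 135, 148], dtype="float32")
look_back = 3

# Each window holds look_back inputs followed by the value to predict.
windows = sliding_window_view(series, look_back + 1)
X, y = windows[:, :-1], windows[:, -1]

for x_row, target in zip(X, y):
    print(x_row, "->", target)
```

Each row of `X` holds the values at t-2, t-1 and t, and the matching entry of `y` is the value at t+1.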

We can re-run the example in the previous section with the larger window size. The whole code listing with just the window size change is listed below for completeness.

Running the example provides the following output:

We can see that the error increased slightly compared to that of the previous section. The window size and the network architecture were not tuned: this is just a demonstration of how to frame a prediction problem.

LSTM Trained on Window Method Formulation of Passenger Prediction Problem

LSTM for Regression with Time Steps

You may have noticed that the data preparation for the LSTM network includes time steps.

Some sequence problems may have a varied number of time steps per sample. For example, you may have measurements of a physical machine leading up to a point of failure or a point of surge. Each incident would be a sample, the observations that lead up to the event would be the time steps, and the variables observed would be the features.

Time steps provide another way to phrase our time series problem. Like above in the window example, we can take prior time steps in our time series as inputs to predict the output at the next time step.

Instead of phrasing the past observations as separate input features, we can use them as time steps of the one input feature, which is indeed a more accurate framing of the problem.

We can do this using the same data representation as in the previous window-based example, except when we reshape the data, we set the columns to be the time steps dimension and change the features dimension back to 1. For example:
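The difference between the two framings is only where the look_back axis ends up. A small sketch with placeholder numbers:

```python
import numpy as np

# Placeholder inputs: 4 samples built with look_back = 3.
X = np.arange(12, dtype="float32").reshape(4, 3)

# Window method: 1 time step carrying 3 features.
window_form = np.reshape(X, (X.shape[0], 1, X.shape[1]))

# Time-step method: 3 time steps carrying 1 feature each.
time_step_form = np.reshape(X, (X.shape[0], X.shape[1], 1))

print(window_form.shape, time_step_form.shape)  # prints (4, 1, 3) (4, 3, 1)
```

The numbers are identical in both arrays; only the interpretation the LSTM gives to each axis changes, so the model's input shape has to change to match.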

The entire code listing is provided below for completeness.

Running the example provides the following output:

We can see that the results are slightly better than in the previous example, and the structure of the input data makes a lot more sense.

LSTM Trained on Time Step Formulation of Passenger Prediction Problem

LSTM with Memory Between Batches

The LSTM network has memory, which is capable of remembering across long sequences.

Normally, the state within the network is reset after each training batch when fitting the model, as well as each call to model.predict() or model.evaluate().

We can gain finer control over when the internal state of the LSTM network is cleared in Keras by making the LSTM layer “stateful”. This means that it can build state over the entire training sequence and even maintain that state if needed to make predictions.

It requires that the training data not be shuffled when fitting the network. It also requires explicit resetting of the network state after each exposure to the training data (epoch) by calls to model.reset_states(). This means that we must create our own outer loop of epochs and within each epoch call model.fit() and model.reset_states(). For example:

Finally, when the LSTM layer is constructed, the stateful parameter must be set to True and, instead of specifying the input dimensions, we must hard code the number of samples in a batch, the number of time steps in a sample and the number of features in a time step by setting the batch_input_shape parameter. For example:

This same batch size must then be used later when evaluating the model and making predictions. For example:
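The three pieces described above (the manual epoch loop, the fixed batch shape and the matching batch size at prediction time) might fit together as sketched below. Several things here are assumptions: the function names, the loss and optimizer, and the use of `keras.Input(..., batch_size=...)` to fix the batch shape in place of the `batch_input_shape` argument mentioned in the text. `model.reset_states()` is the call named in this post; newer Keras releases have changed parts of this API, so check the documentation for your installed version.

```python
def build_stateful_model(batch_size=1, look_back=3, units=4):
    """Stateful LSTM whose batch size is fixed when the model is built."""
    # Imported lazily so fit_stateful can be used or tested on its own.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(look_back, 1), batch_size=batch_size),
        keras.layers.LSTM(units, stateful=True),
        keras.layers.Dense(1),
    ])
    model.compile(loss="mean_squared_error", optimizer="adam")  # assumed settings
    return model


def fit_stateful(model, train_x, train_y, epochs, batch_size=1):
    """Run one pass over the ordered data per epoch, then clear the state."""
    for _ in range(epochs):
        model.fit(train_x, train_y, epochs=1, batch_size=batch_size,
                  shuffle=False, verbose=0)
        model.reset_states()


# Usage, assuming train_x has shape (samples, look_back, 1):
# batch_size = 1
# model = build_stateful_model(batch_size=batch_size)
# fit_stateful(model, train_x, train_y, epochs=100, batch_size=batch_size)
# train_predict = model.predict(train_x, batch_size=batch_size)
```

Keeping the loop in its own function makes the ordering explicit: state accumulates across the batches of one pass over the data and is cleared before the next pass begins.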

We can adapt the previous time step example to use a stateful LSTM. The full code listing is provided below.

Running the example provides the following output:

We do see that results are worse. The model may need more modules and may need to be trained for more epochs to internalize the structure of the problem.

Stateful LSTM Trained on Regression Formulation of Passenger Prediction Problem

Stacked LSTMs with Memory Between Batches

Finally, we will take a look at one of the big benefits of LSTMs: the fact that they can be successfully trained when stacked into deep network architectures.

LSTM networks can be stacked in Keras in the same way that other layer types can be stacked. One addition to the configuration that is required is that an LSTM layer prior to each subsequent LSTM layer must return the sequence. This can be done by setting the return_sequences parameter on the layer to True.

We can extend the stateful LSTM in the previous section to have two layers, as follows:
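A sketch of such a stacked model follows. The rule it encodes is that every LSTM layer except the last returns its full sequence so the next LSTM layer receives one input per time step. The function names, the loss and optimizer, and fixing the batch size through `keras.Input` are assumptions made for this illustration.

```python
def return_sequences_flags(n_layers):
    """True for every LSTM layer that feeds another LSTM layer, False for the last."""
    return [True] * (n_layers - 1) + [False]


def build_stacked_stateful(batch_size=1, look_back=3, units=(4, 4)):
    """Stack one stateful LSTM layer per entry in units, then a Dense output."""
    # Imported lazily so return_sequences_flags works without TensorFlow installed.
    from tensorflow import keras

    layers = [keras.Input(shape=(look_back, 1), batch_size=batch_size)]
    for n_units, flag in zip(units, return_sequences_flags(len(units))):
        layers.append(
            keras.layers.LSTM(n_units, stateful=True, return_sequences=flag)
        )
    layers.append(keras.layers.Dense(1))

    model = keras.Sequential(layers)
    model.compile(loss="mean_squared_error", optimizer="adam")  # assumed settings
    return model


print(return_sequences_flags(2))  # prints [True, False]
```

With the default `units=(4, 4)` this gives two LSTM layers of 4 blocks each; adding another entry to the tuple adds another layer.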

The entire code listing is provided below for completeness.

Running the example produces the following output.

The predictions on the test dataset are again worse. This is more evidence to suggest the need for additional training epochs.

Stacked Stateful LSTMs Trained on Regression Formulation of Passenger Prediction Problem


In this post, you discovered how to develop LSTM recurrent neural networks for time series prediction in Python with the Keras deep learning library.

Specifically, you learned:

  • About the international airline passenger time series prediction problem.
  • How to create an LSTM for a regression and a window formulation of the time series problem.
  • How to create an LSTM with a time step formulation of the time series problem.
  • How to create an LSTM with state and stacked LSTMs with state to learn long sequences.

Do you have any questions about LSTMs for time series prediction or about this post?
Ask your questions in the comments below and I will do my best to answer.

540 Responses to Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras

  1. Shilin Zhang July 21, 2016 at 12:49 pm #

    I like Keras, the example is excellent.

    • Jason Brownlee July 21, 2016 at 1:55 pm #


      • Robin Schäfer November 4, 2016 at 6:41 am #

        Hi, thanks for your awesome tutorial!

        I just don’t get one thing… If you’d like to predict 1 step in the future, why does the red line stop before the blue line does?

        So for example, we have the test set until the end of the year 1960. How can I predict the future year? Or passengers on 1/1/1961 (if the dataset ends at 12/31/1960).



        • Jason Brownlee November 4, 2016 at 11:16 am #

          Great question, there might be a small bug in how I am displaying the predictions in the plot.

          • Terrence November 15, 2016 at 12:06 am #

            Hey, great tutorial.

            I have the same question about the future prediction. The “testPredict” has two fewer rows than the “test” once the algorithm is done running. How would I obtain the values for a prediction 1 or 2 days ahead from the end date of the time series? Thanks.


          • zhang February 3, 2017 at 7:46 am #

            I think in the “create_dataset” function, the range should be “len(dataset)-look_back” but not “len(dataset)-look_back-1”. No “1” should be subtracted here.

          • Shiva May 11, 2017 at 5:35 am #

            Hi Jason,
            how to fix this bug? what modifications you need to make in the code to predict the values for 1/31/1961, if the dataset ends at 12/31/1960?

        • Shiva May 11, 2017 at 5:33 am #

          Hi Robin,
          Are you up with a solution for the bug? as you rightly said, the testpredict is smaller than test. How do you modify the code so that it predicts the value on 1/1/1961?

      • Shovon Sengupta February 16, 2017 at 1:00 am #

        Hello Jason,

        Thanks for sharing this great tutorial! Can you please also suggest the way to get the forward forecast based on the LSTM method. For example, if we want to forecast the value of the series for the next few weeks (ahead of current time–As we usually do for the any time series data), then what would be process to do that.


        • Jason Brownlee February 16, 2017 at 11:07 am #

          Hi Shovon,

          I would suggest reframing the problem to predict a long output sequence, then develop a model on this framing of the problem.

          • Andrew July 13, 2017 at 6:29 am #

            Hi Jason,

            Can you elaborate a bit on what is the use of modeling a time-series without being able to make predictions of future time? I ask because I’m learning LSTMs and I’m facing the same issue as the person above: I can model a time series and make accurate predictions for data that I already have, but have difficulty predicting future observations.

            Thanks a bunch.

          • Jason Brownlee July 13, 2017 at 10:02 am #

            Time series analysis is the study of time series without the interest in making predictions.

      • dal August 3, 2017 at 3:58 am #

        Can RNN be used on Input with multi-variables?

        • Jason Brownlee August 3, 2017 at 6:55 am #

          Yes. LSTMs can take multiple input features.

          • Rachid August 11, 2017 at 2:29 am #

            Thanks for this great tutorial Jason. I’m still having trouble figuring out what kind of graph do you get when you do this:
            # create and fit the LSTM network
            model = Sequential()
            model.add(LSTM(4, input_shape=(1, look_back)))

            for instance if your lookback=1: the input is one value xt, and the target output is xt+1. How is “LSTM(4, input_shape=(1, look_back))” linking your LSTM blocks with the input?
            Or do you have 1 input => 1 LSTM block which hidden value (output of the LSTM) is fed to a 4X1 dense MLP? So that the output of the LSTM is actually the input of a 1x4x1 MLP…

            And if your input is [xt-1, xt] with target xt+1 (lookback=2), you have two LSTMs blocks (fed with xt-1 and xt respectively) and the hidden value of the second block is the input of a 1x4x1 MLP.

            I hope I’m being clear, I really have troubles answering this question. Your tutorial helps though!

          • Jason Brownlee August 11, 2017 at 6:43 am #

            The input_shape define the input, the LSTM is the first hidden layer, the Dense is the output layer.

            Try this to get an idea of the graph:

  2. Alex July 21, 2016 at 1:04 pm #

    Hi, thanks for the walkthrough. I’ve tried modifying the code to run the network for online prediction, but I’m having some issues. Would you be willing to take a look at my SO question? http://stackoverflow.com/questions/38313673/lstm-with-keras-for-mini-batch-training-and-online-testing


    • Jason Brownlee July 23, 2016 at 2:15 pm #

      Sorry Alex, your question is a little vague.

      It’s of the order “I have data like this…, what is the best way to model this problem”. It’s a tough StackOverflow question because it’s an open question rather than a specific technical question.

      Generally, my answer would be “no idea, try lots of stuff and see what works best”.

      I think your notion of online might also be confused (or I’m confused). Have you seen online implementations of LSTM? Keras does not support it as far as I know. You train your model then you make predictions. Unless of course you mean the maintained state of the model – but this is not online learning, it is just static model with state, the weights are not updated in an online manner unless you re-train your model frequently.

      It might be worth stepping back from the code and taking some time to clearly define I/O of the problem and requirements to then figure out the right kind of algorithm/setup you need to solve it.

  3. Tommy Johnson July 26, 2016 at 2:26 am #

    Hello Dr. Brownlee,
    I have a question about the difference between the Time Steps and Windows method. Am I correct in understanding that the only difference is the shape of the data you feeding into the model? If so, can you give some intuition why the Time Steps method works better? If I have two sequences (For example, if I have 2 noisy signals, one noisier than the other), and I’m using them both to predict a sequence, which method do you think is better?


    • Jason Brownlee July 26, 2016 at 5:58 am #

      Hi Tommy,

      The window method creates new features, like new attributes for the model, whereas timesteps are a sequence within a batch for a given feature.

      I would not say one works better than another, the examples in this post are for demonstration only and are not tuned.

      I would advise you to try both methods and see what works best, or frame your problem in the way that best makes sense.

  4. Pedro Ferreira July 29, 2016 at 1:48 am #

    Hi Jason,

    What are the hyperparameters of your network?


    • Jason Brownlee July 29, 2016 at 6:30 am #

      Hi Pedro, the hyperparameters for each network are available right there in the code.

      • Evgeni Stavinov January 19, 2017 at 11:30 pm #

        Is it possible to perform hyperparameter optimization of the LSTM, for example using hyperopt?

        • Jason Brownlee January 20, 2017 at 10:21 am #

          I don’t see why not, Evgeni. Sorry, I don’t have an example.

  5. Jack Kinkade July 30, 2016 at 7:41 pm #

    Hi Jason,

    Interesting post and a very useful website! Can I use LSTMs for time series classification, for a binary supervised problem? My data is arranged as time steps of 1 hr sequences leading up to an event and the occurrence and non-occurrence of the event are labelled in each instance. I have done a bit of research and have not seen many use cases in the literature. Do you think a different recurrent neural net or simpler MLP might work better in this case? Most of the research done in my area has got OK results (70% accuracy) from feed forward neural networks and I thought to try out recurrent neural nets, specifically LSTMs, to improve my accuracy.

  6. Peter Ostrowski July 31, 2016 at 11:19 pm #

    Hi Jason,

    Thanks for this example. I ran the first code example (lookback=1) by just copying the code and can reproduce your train and test scores precisely, however my graph looks differently. Specifically for me the predicted graph (green and red lines) looks as if it is shifted by one to the right in comparison to what I see on this page. It also looks like the predicted graph starts at x=0 in your example, but my predicted graph starts at 1. So in my case it looks like the prediction is almost like predicting identity? Is there a way for me to verify what I could have done wrong?


    • Jason Brownlee August 1, 2016 at 6:26 am #

      Thanks Peter.

      I think you’re right, I need to update my graphs in the post.

      • Peter Ostrowski August 2, 2016 at 12:05 am #

        Hi Jason,

        when outputting the train and test score, you scale the output of the model.evaluate with the minmaxscaler to match into the original scale. I am not sure if I understand that correctly. The data values are between 104 and 622, the trainScore (which is the mean squared error) will be scaled into that range using a linear mapping, right? So your transformed trainScore can never be lower than the minimum of the dataset, i.e. 104. Shouldn’t the square root of the trainScore be transformed and then the minimum of the range be subtracted and squared again to get the mean square error in the original domain range? Like numpy.square(scaler.inverse_transform([[numpy.sqrt(trainScore)]])-scaler.data_min_)


        • Jason Brownlee August 3, 2016 at 8:33 am #

          Hi Peter, you may have found a bug, thanks.

          I believe I thought the default evaluation metric was RMSE rather than MSE and I was using the scaler to transform the RMSE score back into original units.

          I will update the examples ASAP.

          Update: All estimates of model error were updated to first convert the error score to RMSE and then invert scale transform back to original units.

  7. seiya.kumada August 2, 2016 at 3:17 pm #

    Thank you for your excellent post.

    I have one question.
    In your examples, you are discussing a predictor such as {x(t-2),x(t-1),x(t)} -> x(t+1).
    I want to know how to implement a predictor like {x(t-2),x(t-1),x(t)} -> {x(t+1), x(t+2)}.
    Could you tell me how to do so?

    • Jason Brownlee August 3, 2016 at 5:54 am #

      This is a sequence in and sequence out type problem.

      I believe you prepare the dataset in this form and model it directly with LSTMs and MLPs.

      I don’t have a worked example at this stage for you, but I believe it would be straight forward.

  8. Sachin August 2, 2016 at 6:08 pm #


    First of all thanks for the tutorial. An excellent one at that.

    However, I do have some questions regarding the underlying architecture that I’m trying to reconcile with what I’ve learnt about. I posted a question here: http://stackoverflow.com/questions/38714959/understanding-keras-lstms which I felt was too long to post in this forum.

    I would really appreciate your input, especially the question on time_steps vs features argument.


    • Jason Brownlee August 3, 2016 at 6:01 am #

      If I understand correctly, you want more elaboration on time steps vs features?

      Features are your input variables. In this airline example we only have one input variable, but we can contrive multiple input variables using past time steps in what is called the window method. Normally, multiple features would be a multivariate time series.

      Timesteps are the sequence through time for a given attribute. As we comment in the tutorial, this is the natural mapping of the problem onto LSTMs for this airline problem.

      You always need a 3D array as input for LSTMs [samples, timesteps, features], but you can reduce each dimension to one if needed. We explore this ability to reframe the problem in the tutorial above.

      You also ask about the point of stateful. It is helpful to have memory between batches over one training run. If we keep all of our time series samples in order, the method can learn the relationships between values across batches. If we did not enable the stateful parameter, the algorithm would have no knowledge beyond each batch, much like an MLP.

      I hope that helps, I’m happy to dig into a specific topic further if you have more questions.

      • Jack Dan August 1, 2017 at 4:49 am #

        Dr. Jason,
        I think this is a good place to bring this question. Suppose if I have X_train, X_test, y_train and y_test, should I transform all the values into a np.array? If I have in this format, should I still use ‘create_dataset’ function to create X and y?

        • Jason Brownlee August 1, 2017 at 8:12 am #

          Yes Jack.

          Generally, prepare your data consistently.

          • Jack Dan August 1, 2017 at 8:22 am #

            Dr Jason,
            I have an hourly time series with multiple predictor variables. I skipped create_dataset and just converted all my X_train, X_test, y_train and y_test into np arrays. The reason is, ex: I use past three months as my training and I would like to predict for next 7 days, which will be about 168 observations. If this is the case, if I happen to prepare consistent, would my ‘look_back = 168’ in create_dataset function?

          • Jason Brownlee August 2, 2017 at 7:40 am #

            I would recommend preparing data with the function in this post:

          • Jack Dan August 2, 2017 at 1:30 am #

            Dr. Jason,

            After a deep thought and research I am thinking to just use my X_train, y_train, X_test and y_test without doing a look back. The reason is, y_train is dependent on on my X_train features. Therefore, my gut feeling is not use look back or sliding window. I just wanted to confirm with you and please let me know if I am on right track. BTW, when are you planning on doing a multivariate time series analysis? if you can educate us on that, it will be great. Thank you sir!

          • Jason Brownlee August 2, 2017 at 7:55 am #

            You may not need an LSTM if there is no input sequence, consider an MLP.

  9. Sachin August 4, 2016 at 3:54 pm #

    So does that mean (in reference to the LSTM diagram in http://colah.github.io/posts/2015-08-Understanding-LSTMs/) that the cell memory is not passed between consecutive lstms if stateful=false (i.e. set to zero)? Or do you mean cell memory is reset to zero between consecutive batches (In this tutorial batch_size is 1). Although I guess I should point out that the hidden layer values are passed on, so it will still be different to a MLP (wouldn’t it?)

    On a side note, the fact that the output has to be a factor of batch_size seems to be confounding. Feels like it limits me to using a batch_size of one.

    • Jason Brownlee August 5, 2016 at 8:38 am #

      If stateful is set to false (the default), then I understand according to the Keras documentation that the state within each LSTM node is reset after each batch, either for prediction or training.

      This is useful if you do not want to use LSTMs in a stateful manner or you want to train with all of the required memory to learn from within each batch.

      This does tie into the limit on batch size for prediction. The TF/Theano structures created from this network definition are optimized for the batch size.

      • Mango Freezz October 16, 2016 at 6:18 am #

        I’m super confused here. If the LSTM node is reset after each batch (in this case batch_size 1), does that mean in each forward-backprop session, the LSTM starts with a fresh state without any memory of previous inputs, and its only input is a single value? If that’s the case, how could it possibly learn anything?

        E.g., let’s say on both time step 10 and 15 the input value is 150, how does the network predict step (10+1) to be 180 and step (15+1) to be 130 while the only input is 150 and the LSTM start with a fresh state?

        • ARandomPerson December 6, 2016 at 9:35 am #

          Hi Mango, I think you’re right. If the number of time-steps is one and the LSTM is not stateful, then I don’t think he is using the recurrent property of the LSTM at all.

  10. Nuno Fonseca August 4, 2016 at 8:52 pm #


    First of all, thank you for that great post

    I have just one small question: For some research work I am working on, I need to make a prediction, so I’ve been looking for the best possible solution and I am guessing it’s LSTM…

    The app that I am developing is used in a learning environment, so what I want to predict is the probability that a certain student will submit a solution for a certain assignment…

    I have data from previous years in this format:

    A1 A2 A3 A4 …
    Student 1 – Y Y Y Y N Y Y N
    Student 2 – N N N N N Y Y Y

    Where Y means that the student has submitted, and N otherwise…

    From what I understood, the best to achieve what I need is by using the solution described in the section “LSTM For Regression Using the Window Method” where my data will be something like

    I1 I2 I3 O
    N N N N
    Y Y Y Y

    And when I present a new case like Y N N the “LSTM” will make a prediction according to what has been learnt in the training moment.

    Did I understand it right? Do you suggest another way?

    Sorry for the eventually dumb question…

    Best regards

    • Jason Brownlee August 5, 2016 at 5:30 am #

      Looks fine Nuno, although I would suggest you try to frame the problem a few different ways and see what gives you the best results.

      Also compare results from LSTMs to MLPs with the window method and ensure it is worth the additional complexity.

  11. Dunhui Xiao August 7, 2016 at 6:59 am #

    Hi Jason,
    Very interesting. Is there a function to descale the scaled data (0-1)? You show the data from 0-1. I want to see the original scale data. This is an easy question. But, it is better to show the original scale data, I suppose.

    • Jason Brownlee August 7, 2016 at 8:46 am #

      Great point.

      Yes, you can save the MinMaxScaler used to scale the training data and use it later to scale new input data and in turn descale predictions. The call is scaler.inverse_transform() from memory.
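
For readers who want to see the arithmetic behind this reply, here is a minimal NumPy sketch of min-max scaling and its inverse. The values are hypothetical and the variable names are made up for illustration; scikit-learn's MinMaxScaler stores the fitted minimum and maximum for you, and its inverse_transform() applies the same kind of inverse mapping.

```python
import numpy as np

# Hypothetical passenger counts, for illustration only.
series = np.array([112.0, 118.0, 132.0, 129.0, 121.0])

# "Fit": remember the minimum and maximum of the training data.
data_min, data_max = series.min(), series.max()

# Scale into the 0-1 range.
scaled = (series - data_min) / (data_max - data_min)

# Invert: map scaled values (for example, predictions) back to original units.
restored = scaled * (data_max - data_min) + data_min

print(scaled)
print(restored)
```

The key point is that the same minimum and maximum learned from the training data must be reused when descaling predictions.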

  12. Pacchu August 9, 2016 at 5:09 am #

    Why is the shift necessary for plotting the output? Isn’t it unavailable information at time ‘t+1’?

    • Jason Brownlee August 15, 2016 at 9:46 am #

      Hi Pacchu, the framing of the problem is to predict t+1, given t, and possibly some subset of t-n.

      I hope that is clearer.

  13. Mat August 10, 2016 at 12:06 am #

    Does the output simply mimic the input? (The copy is shifted by one.)
    Just like here : https://github.com/fchollet/keras/issues/2856 ?

    • Jason Brownlee August 15, 2016 at 9:47 am #

      No, the output is a prediction of the next time step given prior time steps.

      • André C. Andersen May 14, 2017 at 1:45 am #

        Have you tried to use the input value as a prediction? It produces an RMSE similar to what you are getting, 48.66.

      • Jacky July 22, 2017 at 4:22 am #

        Hi Jason, thanks for the tutorial. Is it because the input features or hyperparameter are not tuned so the prediction (t+1) is only using the input (t)? Thanks
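
The baseline André describes (using the input value as the prediction) is often called a persistence forecast. A minimal sketch of how its RMSE could be computed is below; the function name and the example values are hypothetical, and this is not the code used to produce the 48.66 figure quoted above.

```python
import numpy as np

def persistence_rmse(series):
    """RMSE of predicting each value with the value before it."""
    series = np.asarray(series, dtype=float)
    predictions = series[:-1]   # forecast for t+1 is the observation at t
    actuals = series[1:]
    return float(np.sqrt(np.mean((actuals - predictions) ** 2)))

# Hypothetical values, for illustration only.
example = [100.0, 110.0, 105.0, 120.0]
print(persistence_rmse(example))
```

Comparing a model's test RMSE against this number on the same test split is one way to check whether the model has learned more than a one-step copy.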

  14. Shaifu August 10, 2016 at 6:35 pm #

    Hi sir

    I tried your code for time series prediction. On passing either univariate or multivariate data, the predictions of the target variable are the same. Shouldn’t there be a difference in the predicted values? I expect the predictions to improve with the multivariate data. Please shed some light on this.

    • Jason Brownlee August 15, 2016 at 9:47 am #

      The performance of the model is dependent on both the framing of your problem and how the model is configured.

  15. Madhav August 17, 2016 at 4:02 pm #

    Hi Jason,

    Thanks for the wonderful tutorial. It felt great following your code and implementing my first LSTM network. Looking forward to learning a lot more!!

    Can we extend time series forecasting problems to multiple time series? I have the following problem in my mind. Suppose we have stock prices of 100 companies (instead of one) and we wanna forecast what happens in the next month for all the companies? Is it possible to use LSTMs and RNNs for such multiple time series problems?

    • Jason Brownlee August 18, 2016 at 7:15 am #

      Forecasting stock prices is not my area of expertise. Nevertheless, LSTMs can be used for multiple regression as well as sequence prediction (if you want to predict multiple steps ahead). Give it a shot.

    • Mango December 18, 2016 at 10:08 pm #

      i guess i have the same idea in mind as Madhav..^^ i want to predict multiple time series, each one represent the flow of one grid in the city(since i assume that the neighboured grids influence each other to some extend).. have you done your stock prediction with LSTM?? will you share me some tricks or experience? Thankyou~

  16. Liu August 18, 2016 at 2:11 am #

    I guess the function learnt is only a one-step lag identity (mimic) prediction.

    If the code of your basic version runs, it will look like this:


    I change the csv (setting all the data points after some time to be 400 until the end) and run the same code, it will look like this:


    If it is truly learning the dynamics of the pattern, the prediction should not look like a strict line. At least the previous information before the 400 values will pull down the curve a little bit.

    • Liu August 18, 2016 at 3:44 am #

      Typo: a *straight line

      Clarification: Of course what I said may not be correct. But I think this is an alarming concern when interpreting what the LSTM is really learning behind the scenes.

    • Jason Brownlee August 18, 2016 at 8:01 am #

      A key to the examples is that the LSTM is capable of learning the sequence, not just the input-output pairs. From the sequence, it is able to predict the next step.

      • sevity August 3, 2017 at 2:06 am #

        I think Liu is right. because even when I change LSTM to Dense, result is almost the same.
        if you use time-step=1. it is actually not LSTM anymore.

    • Nicholas August 19, 2016 at 7:16 am #

      Hi Liu,

      After investigating a bit, I have concluded that the 1 time-step LSTM is indeed the trivial identity function (you can convince yourself by reducing the layer to 1 neuron, and adding ad-hoc data to the test set, as you have). But if we think about it, it makes a lot of sense that the ‘mimic’ function would minimize MSE for such a simple network – it doesn’t see enough time steps to learn the sequence, anyway.

      However, if you increase the number of timesteps, you will see that it can reach lower MSE on the test set by slowly moving away from the mimic function to actually learning the sequence, although for low #’s of neurons the approximation will be rougher-looking. I recommend experimenting with the look_back amount, and adding more layers to see how the MSE can be reduced further!

      • Liu August 20, 2016 at 8:47 am #

        Hi Nicholas,

        Thanks for the comment!

        I guess the problem (or feature you can say) in the first example is that ‘time-step’ is set to 1 if I understand the API correctly:

        trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))

        It means it is feeding sequence of length 1 to the network in each training. Therefore, the RNN/LSTM is not unrolled. The updated internal state is not used anywhere (as it is also resetting the states of the machine in each batch).

        I agree with what you said. But by setting timestep and look_back to be 1 as in the first example, it is not learning a sequence at all.

        For other readers, I think it is worth looking at http://stackoverflow.com/questions/38714959/understanding-keras-lstms

      • Gilles September 2, 2016 at 5:44 pm #

        Hi Nicholas,

        This is a very good point, thanks for mentioning it.

        I have implemented an LSTM on a 1,500 points sample size and indeed sometimes I was wondering whether there really was a big difference with a “mimic” function.

        A lot of work to predict the value t+1 while value at t would have been a good enough predictor!

        Will try more experiments as I have more data.

    • Logan April 6, 2017 at 4:44 am #

      Hey Liu, it’s a very good observation. I am still on the basics and I think this sort of information is really important if we want to build something with quality. Thanks.
      Thanks for the tutorial as well.

  17. Chris August 20, 2016 at 12:07 am #

    Hi Jason,

    Thanks for this amazing tut. Could you please tell me what the main role of batch_size in model.fit() is, and what the output shape parameter of the LSTM layer does?
    I read somewhere that the choice of batch_size depends on the dataset. Why did you choose batch_size = 1 for fitting the model, and what effect does its value have on calculating the gradient of the model?


    • Jason Brownlee August 20, 2016 at 6:09 am #

      Great question Chris.

      The batch_size is the number of samples from your train dataset shown to the model at a time. After batch_size samples are run through the network and error calculated, an update to the weights is performed. Too many and the updates are too big, too few, and the updates are too noisy. The hardware you use is also a factor for batch_size and you want to ensure you can fit the batch of samples in memory (e.g. so your GPU can get at them).

      I chose a batch_size of 1 because I want to explore and demonstrate LSTMs on time series working with only one sample at a time, but potentially vary the number of features or time steps.
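
To make the relationship between batch_size and weight updates concrete, here is a small sketch that only counts batches; it does not train anything. The helper name and the training-set size of 94 are assumptions for illustration.

```python
import math

def iterate_batches(n_samples, batch_size):
    """Yield (start, end) index pairs, one per weight update in an epoch."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples = 94  # hypothetical training-set size
for batch_size in (1, 8, 94):
    updates = len(list(iterate_batches(n_samples, batch_size)))
    # The count of updates per epoch is ceil(n_samples / batch_size).
    print(batch_size, updates, math.ceil(n_samples / batch_size))
```

With batch_size = 1 there is one update per sample, while a batch covering the whole training set gives a single update per epoch.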

  18. Hany El-Ghaish August 22, 2016 at 8:46 am #

    Hi Jason,

    Thanks for this series. I have a question for you.
    I want to apply a multi-classification problem for videos using LSTM. also, video samples have a different number of frames.
    Dataset: samples of videos for actions like boxing, jumping, hand waving, etc.. (Dataset like UCF1101) . each class of action has one label.

    so, each video has a label.
    Really, I do not know how to describe the data set to LSTM when a number of frames sequence are different from one action to another and I do not know how to use it when a number of frames are fixed as well.

    if you have any example of how to use:
    LSTM, stacked of LSTM, or CNN with LSTM with this problem this will help me too much.
    I wait for your recommendations

    • Harsha August 30, 2016 at 7:47 pm #

      Hi Jason. Thanks for such a wonderful tutorial. it helped me a lot to get an insight on LSTM’s. I too have a similar question. Can you please comment on this question.

  19. Alvin August 26, 2016 at 3:02 am #

    Hi Jason,

    Thanks for this great tutorial! I would like to ask, suppose I have a set of X points : X1, X2, .. Xn that contributes to the total sales of the day represented by Y, and I have 60 days data (Y1 until Y60), how do I do time series forecast using these data? Assuming that I would like to predict Y65. Do you have any sample or coding references?

    Thanks in advance

    • Jason Brownlee August 26, 2016 at 10:34 am #

      I believe you could adapt one of the examples in this post directly to your problem. Little effort required.

      Consider normalizing or standardizing your input and output values when working with neural networks.

      • Alvin August 30, 2016 at 8:31 am #

        Hi Jason,

        I just found out the question that I have is a multi step ahead prediction, where all the X contributes to Y, and I would like to predict ahead the value of Y n days ahead. Is the example that you gave in this tutorial still relevant?


        • Jason Brownlee August 31, 2016 at 8:42 am #

          Hi Alvin,

          Yes, you could trivially extend the example to do sequence-to-sequence prediction as it is called in recurrent neural network parlance.

          • Alvin August 31, 2016 at 1:03 pm #

            Hi Jason,

            Thanks for your reply. I still would like to clarify after looking at the sequence to sequence concept. Assuming I would like to predict the daily total sales (Y), given x1 such as the total number of customers, total item A sold as x2, total item B sold as x3 and so on for the next few items, is sequence to sequence suitable for this?


          • Alvin August 31, 2016 at 6:05 pm #

            Hi Jason,

            I have another question. Looking at your example for the Window method, on line 35:
            # reshape input to be [samples, time steps, features]
            trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
            testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

            what if I would like to change the time steps to more than 1? What other parts of codes I would need to change? Currently when I change it, it says
            ValueError: total size of new array must be unchanged.
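
A note on the error above: numpy.reshape can only rearrange the existing elements, so the product of the new dimensions has to equal the original element count. The sketch below uses a hypothetical 5x3 array standing in for trainX with look_back = 3 and shows two reshapes that keep the count unchanged, plus one that does not. Treating the look_back values as time steps (the second form) is an assumption about what is wanted here.

```python
import numpy as np

look_back = 3
# Hypothetical windowed inputs: 5 samples, each holding 3 past values.
trainX = np.arange(15, dtype=float).reshape(5, look_back)

# Window framing: one time step carrying look_back features.
as_window = trainX.reshape(trainX.shape[0], 1, trainX.shape[1])

# Time-step framing: look_back time steps carrying one feature each.
as_steps = trainX.reshape(trainX.shape[0], trainX.shape[1], 1)

print(as_window.shape, as_steps.shape)

# Changing only the middle number alters the element count, so NumPy refuses.
try:
    trainX.reshape(trainX.shape[0], 2, trainX.shape[1])
except ValueError as err:
    print("reshape failed:", err)
```

Both valid forms hold the same 15 numbers; they differ only in whether the model sees them as features of one step or as a sequence of steps.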


          • Alvin September 13, 2016 at 11:23 am #

            Hi Jason,

            For using stateful LSTM, to predict multiple steps, I came across suggestions to feed the output vector back into the model for the next timestep’s prediction. May I know how to get an output vector based on your LSTM stateful example?

          • Jason Brownlee September 14, 2016 at 10:04 am #

            Hi Alvin,

            The LSTM will maintain internal state so that you only need to provide the next input pattern. The LSTM implementation in Keras does require that you provide your data in consistent batch sizes, but when experimenting with this you could reduce the batch size down to 1.

            I hope that helps.
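
The idea Alvin raises, feeding each prediction back in as the next input, can be sketched independently of Keras. In the code below, predict_next is a hypothetical callable standing in for a trained model, and toy_model is a stand-in rule used only so the example runs; neither comes from the post.

```python
import numpy as np

def recursive_forecast(predict_next, seed_window, n_steps):
    """Forecast n_steps ahead by feeding each prediction back as input.

    predict_next is a hypothetical callable that maps a 1-D window of
    recent values to a single next value (for example, a wrapper around
    a trained model's predict call).
    """
    window = list(seed_window)
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_next(np.asarray(window)))
        forecasts.append(next_value)
        # Slide the window: drop the oldest value, append the prediction.
        window = window[1:] + [next_value]
    return forecasts

# Stand-in model for illustration: last value plus the most recent change.
def toy_model(window):
    return window[-1] + (window[-1] - window[-2])

print(recursive_forecast(toy_model, [1.0, 2.0, 3.0], 3))
```

Because every step after the first is built on earlier predictions rather than observations, errors can compound as the horizon grows.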

  20. DRD August 29, 2016 at 2:25 am #

    Apparently, when using the TensorFlow backend, you have to specify input_length in the LSTM constructor. Otherwise, you get an exception. I assume it would just be input_length=1

    • DRD August 29, 2016 at 2:29 am #

      So like this:
      model.add(LSTM(4, input_dim=look_back, input_length=1))

      This references the first example where number of features and timesteps is 1. Here input_length corresponds to timesteps.

      • Alvin August 30, 2016 at 8:32 am #

        Hi DRD,

        Is this the setting used to solve multi step ahead prediction?

        Thanks in advance!

        • DRD September 1, 2016 at 6:19 am #


          Haven’t tried it yet, but in the section titled “LSTM With Memory Between Batches”, input_length should be 3. Basically the same as look_back.

  21. Nick August 31, 2016 at 1:34 am #

    Hi Jason,
    I applied your technique on stock prediction:
    But, I am having some issues.
    I take all the historical prices of a stock and frame it the same way the airline passenger prices are in a .csv file.

    I use a look_back=20 and I get the following image:


    Then I try to predict the next stock price and the prediction is not accurate.

    Why is the model able to predict the airline passengers so precisely ?


    Thank you

    • Jason Brownlee August 31, 2016 at 9:47 am #

      I would suggest tuning the number of layers and number of blocks to suit your problem.

      • Nader September 1, 2016 at 7:25 am #

        Thank you.
        I will play around the network.

        In general, for input_dim (window size), is a smaller or larger number better?

  22. Marcel August 31, 2016 at 8:29 pm #

    Hi Jason,

    First off, thanks again for this great blog, without you I would be nowhere, with LSTM, and life!

    I am running a LSTM model that works, but when I make predictions with “model.predict” it spits out 4000 predictions, which look fine. However, when I run “model.predict” again and save those 4000 predictions again, they are different. From prediction 50 onward, they are all essentially the same, but the first few (that are very important to me) are very different. To give you an idea, the correlation between the first 10 predictions of both rounds is 0.11.

    Please help!

    • Marcel August 31, 2016 at 10:55 pm #

      The problem wasn’t with numpy.random.seed(0) as I originally thought. I’ve tested this over and over, and even if on the exact same data, predictions are always different/inconsistent for the first few predictions, and only “converge” to some consistent predictions after about 50 predictions have been made previously (on the same or different input data).

      • Marcel September 1, 2016 at 1:15 am #

        It seems like I have made an error by neglecting to include “model.reset_states()” after one line of calling model.predict()

        • Jason Brownlee September 1, 2016 at 8:05 am #

          I’m glad to hear you worked it out Marcel.

          A good lesson for all of us to remember our calls to reset state.
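
The behaviour Marcel ran into can be reproduced with a toy recurrent unit. The class below is not Keras' LSTM; it is a made-up stand-in that only imitates how state left over from one predict call leaks into the next, and how resetting removes the effect.

```python
import numpy as np

class ToyStatefulCell:
    """A toy recurrent unit that keeps state between predict calls.

    This is not Keras' LSTM; it only imitates the carry-over of state.
    """

    def __init__(self):
        self.state = 0.0

    def reset_states(self):
        self.state = 0.0

    def predict(self, inputs):
        outputs = []
        for x in inputs:
            # New state mixes the old state with the current input.
            self.state = np.tanh(0.5 * self.state + 0.5 * x)
            outputs.append(self.state)
        return np.array(outputs)

inputs = np.array([0.2, 0.4, 0.6])
cell = ToyStatefulCell()

first = cell.predict(inputs)
second = cell.predict(inputs)      # starts from leftover state
cell.reset_states()
third = cell.predict(inputs)       # starts from a clean state again

print(np.allclose(first, second), np.allclose(first, third))
```

In this toy, the gap between the first and second runs is largest on the first output and shrinks with each step, which is in the spirit of the predictions that only "converge" later in the sequence.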

  23. Nader September 1, 2016 at 9:30 pm #

    In the part “LSTM For Regression with Time Steps”,

    shouldn’t the reshaping be in the form:

    [Samples, Features, Time] = (trainX, (trainX.shape[0], trainX.shape[1], 1))

    Because in the previous two sections:
    “LSTM Network For Regression” and
    “LSTM For Regression Using the Window Method” we used:

    [Samples, Time Steps, Features] = (trainX, (trainX.shape[0], 1, trainX.shape[1]))

    Thank you

  24. sachin September 2, 2016 at 3:57 pm #

    Hi Jason,

    Correct me if I’m wrong, but you don’t want to reset_state in the last training iteration do you? Basically my logic is that you want to carry through the last ‘state’ onto the test set because they occur right after the other.


    • Jason Brownlee September 3, 2016 at 6:57 am #

      You do. The reason is that you can seed the network with state later when you are ready to use it to make a prediction.

  25. Megs September 3, 2016 at 10:04 pm #

    Hello Jason,

    Am I correct if I was to use Recurrent Neural Networks to predict Dengue Incidences against data on temperature, rainfall, humidity, and dengue incidences.. If so, how would I go about in the processing of my data. I already have the aforementioned data at hand and I have tried using a feed forward neural network using pybrain. It doesn’t seem to get the trend hence my trying of Recurrent Neural Network.

    Thank you!

  26. Christoph September 5, 2016 at 4:22 am #

    I am a little bit confused regarding the “statefulness”.

    If I use a Sequential Model with LSTM layers and stateful set to false. Will this still be a recurrent network that feeds back into my nodes? How would I compare it to the standard LSTM model proposed by Hochreiter et al. (1997)? Do I have to use the stateful layers to mimic the behaviour presented in the original paper?

    In essence, I have a simple time series of sales forecasts that show a weekly and partly a yearly pattern. It was easy to create a simple MLP with the Dense layer and the time window method. I put some sales values from the last week, the same week day a few weeks back and the sales of the days roughly a year before into my feature vector. Results are pretty good so far.

    I now want to compare it to an LSTM approach. I am however not sure how I can model the weekly and yearly pattern correctly and if I need to use the stateful LSTM or not. Basically I want to use the power of an LSTM to predict a sequence of a longer period of time and hope that the forecasts will be better than with a standard (and much faster) MLP.

  27. Nathan George September 7, 2016 at 3:27 pm #

    These lines don’t make sense to me:

    # reshape input to be [samples, time steps, features]
    trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

    Isn’t it [samples, features, timesteps]?
    When you switch to lookback = 3, you still use trainX.shape[0], 1, trainX.shape[1] as your reshape, and aren’t the timesteps the lookback? I noticed the Keras model doesn’t work unless you reshape in this way, which is really strange to me. Why do we have to have a matrix full of 1×1 vectors which hold another vector inside of them? Is this in case we have more features or something? Are there ever any cases where the ‘1’ in the reshape would be any other number?

    • Christoph September 8, 2016 at 10:30 pm #

      I think the example given here is wrong in the sense, that each data point represents a sample with one single timestep. This doesn’t make sense at all if you think about the strengths of recurrent networks and especially LSTM. How is the LSTM going to memorize if there’s only one timestep for each sample?

      Since we are working on a single time series, it should probably be the other way around, with one sample and n timesteps. One could also try and reduce the number of timesteps and split the time series up into samples of a week or a month depending on the data.

      Not sure if what I just said is all correct, but I get the feeling that this popular tutorial isn’t 100% correct and is therefore a bit misleading for the community.

      • Jason Brownlee September 9, 2016 at 7:20 am #

        I do provide multiple examples just so you can compare the methods, both learning by single sample, by window and by timestep.

        For this specific problem, the timestep approach is probably the best way to model it, but if you want to use some of these examples as a template for your own problem, you have a few options to choose from.

  28. Oswaldo September 9, 2016 at 10:35 am #

    If I want to predict t+1 in a test set with t, the prediction doesn’t make sense. If I shift the prediction, it makes sense, but I end up predicting t with t, using a next-step model that learnt the sequence. Nice exercise to get used to the implementation, but what’s the point in real life? I really want a t+1 prediction that matches t+1 (not t) in the test set. What am I missing?

  29. Easwar September 12, 2016 at 1:21 am #

    Hi Jason,

    This is an excellent tutorial. I have a question. You have one LSTM (hidden) layer with 4 neurons. What if I construct an LSTM layer with only 1 neuron? Why should I have 4 neurons? I suppose this is different from having two or more layers (depth)? Depth, in my understanding, is when you have more layers of LSTM.

    If you have 4 LSTM neurons in first layers, does input get fed to all the 4 neurons in a fully connected fashion ? Can you explain that ?

    Best Regards,

    • Jason Brownlee September 12, 2016 at 8:33 am #

      Great question Easwar.

      More neurons means more representational capacity at that layer. More layers means more opportunity to abstract hierarchical features.

      The number of neurons (LSTM call them blocks) and layers to use is a matter of trial and error (tuning).

      • Sam February 19, 2017 at 2:41 pm #

        If we reduce the number of neurons BELOW the number of features fed into an RNN, then does the model simply use as many features as the neuron number allows ?
        For example, if I have 10 features but define a model with only 5 neurons in the initial layer(s), would the model only use the FIRST 5 features ?


        • Jason Brownlee February 20, 2017 at 9:28 am #

          No, I expect it will cause an error. Try it and see.

          • Sam February 23, 2017 at 12:05 pm #

            NO, surprisingly it works very well and gives great prediction results.
            Is there a requirement that each feature have a neuron ?
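
On Sam's question: in a fully connected input, each unit has its own weight for every feature, so the number of units does not need to match the number of features. The sketch below illustrates this with a random weight matrix and counts the parameters of one LSTM layer, assuming the standard formulation with four gate blocks and no peephole connections. The function name and the 10-feature, 5-unit sizes are hypothetical.

```python
import numpy as np

def lstm_param_count(n_features, n_units):
    """Trainable parameters in one standard LSTM layer.

    Each of the 4 gate blocks has input weights (n_features x n_units),
    recurrent weights (n_units x n_units), and a bias (n_units).
    """
    return 4 * (n_features * n_units + n_units * n_units + n_units)

n_features, n_units = 10, 5

# Input weights of one gate: every unit has a weight for every feature.
rng = np.random.default_rng(0)
input_weights = rng.normal(size=(n_features, n_units))
sample = rng.normal(size=(1, n_features))
projected = sample @ input_weights          # shape (1, n_units)

print(projected.shape, lstm_param_count(n_features, n_units))
```

All 10 features contribute to each of the 5 unit inputs, so no features are dropped when the layer is narrower than the input.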

  30. Max Clayer September 13, 2016 at 4:38 am #

    Hi I have found when running your raw example on the data, the training data seems to be shifted to the right of the true plot and not the same as your graph in your first example, why could this be?

  31. Stijn September 19, 2016 at 1:29 am #

    Hi Jason,

    Nice blog post.

    I noticed however, that when you do not scale the input data and switch the activation of the LSTMs to ReLu, you are able to get performance comparable to the feedforward models in your other blog post (http://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/). The performance becomes: Train Score: 23.07 RMSE, Test Score: 48.59 RMSE

    Moreover, when you run the feedforward models in the other blog post with scaling of the input data their performance degrades.

    Any idea why scaling the dataset seems to worsen the performance?



    • Jason Brownlee September 19, 2016 at 7:44 am #

      Interesting finding Stijn, thanks for reporting it.

      I need to experiment more myself. Not scaling and using ReLu would go hand in hand.

      • V September 23, 2016 at 2:48 pm #

        Hi Jason – actually I was able to verify Stijn’s results (could you please delete my inquiry to him).

        But I am curious about this:

        Train Score: 22.51 RMSE
        Test Score: 49.75 RMSE

        The error is almost twice as large on the out of sample data, what does that mean about the quality of our model?

    • V September 23, 2016 at 2:35 pm #

      Hi Stijn – I wasn’t able to replicate your results, could you please post your code. Thanks!

  32. Jakob Aungiers September 22, 2016 at 10:52 pm #

    Hey Jason,

    As far as I can tell (and you’ll have to excuse me if I’m being naive) this isn’t predicting the next timestep at all? Merely doing a good job at mimicking the previous timestep?

    For example, with the first example, if we take the first timestep of trainX (trainX[0]) the prediction from the model doesn’t seem to be trying to predict what t+1 (trainX[1]) is, but merely mimics what it thinks fits the model at that particular timestep (trainX[0]) i.e. tries to copy the current timestep. Same for trainX[1], the prediction is not a prediction of trainX[2] but a guess at trainX[1]… Hence the graphs in the post (which, as you mentioned above, you need to update) look like they’re forward-looking, but running the code actually produces graphs which have the predictions shifted t+look_back.

    How would you make this a forward looking graph? Hence also, I tried to predict multiple future timesteps with your first model by initialising the first prediction with testX[0] and then feeding the next predictions with the prior predictions but the predictions just plummeted downwards into a downwards curve. Not predicting the next timesteps at all.

    Am I being naive to the purpose of this blog post here?

    All the best, love your work,

    • Jeremy Irvin September 24, 2016 at 11:08 am #

      Hi Jakob,

      I believe you are correct.

      I have tried these methods on many different time series and the same mimicking behavior occurs – the training loss is somehow minimized by outputting the previous timestep.

      A similar mimicking behavior occurs when predicting multiple time steps ahead as well (for example, if predicting two steps ahead, the model learns to output the previous two timesteps).

      There is a small discussion on this issue found here – https://github.com/fchollet/keras/issues/2856 – but besides that I haven’t discovered any ways to combat this (or if there is some underlying problem with Keras, my code, etc.).

      I am in the process of writing a blog to uncover this phenomenon in more detail. Will follow up when I am done.

      Any other advancements or suggestions would be greatly appreciated!


    • Max Clayer October 4, 2016 at 4:27 am #

      Are you simply using t-1 to predict t+1 in the time window, if so I don’t think there is enough data being fed into the neural network to learn effectively. with a bigger time window I notice that the model does start to fit better.

  33. Dominic September 23, 2016 at 2:55 am #

    Hi, Jason
    Thank you for your post.
    I am still confused about LSTM for regression with window method and time steps.
    Could you explain more about this point. Could you use some figures to show the difference between them?
    Many thanks!

    • Dominic September 23, 2016 at 6:36 am #

      As I understand it, the LSTM for regression with the window method is the same as a standard MLP method, which has 3 input features and one output as in the example. Is this correct? What’s the difference?

  34. Wolfgang September 24, 2016 at 4:20 pm #

    Thank you very much for the detailed article!

    Does anybody have a scientific source for the time window? I can’t seem to find one.

    • Jason Brownlee September 25, 2016 at 8:01 am #

      Great question.

      I don’t know of any, but there must be something written up about it back in the early days. I also expect it’s in old neural net texts.

  35. Brian September 25, 2016 at 7:14 am #

    Have you experimented with having predictors (multivariate time series) versus a univariate? Is that possible in Keras?

    • Jason Brownlee September 25, 2016 at 8:05 am #

      Yes, you can have multiple input features for multiple regression.

      • Brian September 26, 2016 at 12:37 am #

        Any chance you will add this type of example?

        • Jason Brownlee September 26, 2016 at 6:59 am #

          I will in coming weeks.

          • Jacques Rand September 27, 2016 at 11:40 pm #

            Me too will be interested in using multivariate(numerical) data !
            Been trying for a few days , but the “reshaping/shaping/data-format-blackmagic” always breaks
            Purely cause I don’t yet understand it !
            Otherwise great example !

          • Jason Brownlee September 28, 2016 at 7:41 am #

            Understood, I’ll prepare tutorials.

          • Richard Ely October 3, 2016 at 4:31 pm #

            Sir, Awesome work!!!

            I am very interested in cross-sectional time series estimation… How can that be done?

            I am starting your Python track, but will eventually target data with say 50 explanatory variables, with near infinite length of time series observations available on each one. Since the explanatory variables are not independent normal OLS is useless and wish to learn your methods.

            I would be most interested in your approach to deriving an optimal sampling temporal window and estimation procedure.

          • Jason Brownlee October 4, 2016 at 7:20 am #

            Sorry Richard, I don’t know about cross-sectional time series estimation.

            For good window sizes, I recommend designing a study and resolving the question empirically.

          • Zhang Wenjie December 25, 2016 at 8:04 am #

            Hi Jason

            As you mentioned before, you will prepare the tutorial for multiple input features for multiple regression. Could you provide us the link to that tutorial?

          • Jason Brownlee December 26, 2016 at 7:44 am #

            It will be part of a new book I am writing. I will put a version of it on the blog soon.

            Multiple regression is straightforward with LSTMs. Remember that input is defined as [samples, timesteps, features]; your multiple inputs are features.

            For multiple output multi-step regression you can recurse the LSTM or change the number of outputs in the output layer.
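
            As a minimal sketch of the [samples, timesteps, features] layout described above, the following builds windowed samples from a made-up multivariate series using NumPy only. The array sizes, the look_back value and the choice of feature 0 as the target are assumptions for illustration, not part of the tutorial.

```python
import numpy as np

# Hypothetical multivariate series: 10 time steps, 3 features per step.
n_steps_total, n_features = 10, 3
series = np.arange(n_steps_total * n_features, dtype="float32").reshape(n_steps_total, n_features)

look_back = 4  # number of past time steps fed to the network per sample (assumed)

# Build overlapping windows; the target is feature 0 at the step after each window.
X = np.array([series[i:i + look_back] for i in range(n_steps_total - look_back)])
y = series[look_back:, 0]

print(X.shape)  # laid out as (samples, timesteps, features)
print(y.shape)
```

            An array shaped like X can be passed to an LSTM layer whose input configuration matches the number of features.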

          • Zach May 2, 2017 at 2:15 am #

            Did you ever create a tutorial for multivariate LSTM? I can’t seem to find any!

  36. Bob September 27, 2016 at 12:22 pm #

    I get this error when I run your script with the Theano backend:

    ValueError: (‘The following error happened while compiling the node’, forall_inplace,cpu,scan_fn}(Elemwise{maximum,no_inplace}.0, Subtensor{int64:int64:int8}.0, IncSubtensor{InplaceSet;:int64:}.0, IncSubtensor{InplaceSet;:int64:}.0, Elemwise{Maximum}[(0, 0)].0, lstm_1_U_o, lstm_1_U_f, lstm_1_U_i, lstm_1_U_c), ‘\n’, ‘numpy.dtype has the wrong size, try recompiling’)

    Any idea what might be happening?

    • Jason Brownlee September 28, 2016 at 7:35 am #

      Hi Bob, it might be a problem with your backend. Try reinstalling Theano or TensorFlow, whichever you are using, or try switching to the other one.

  37. Philip September 28, 2016 at 9:32 pm #

    Excellent article, really insightful. Do you have an article which expands on this to forecast data? An approach that would mimic that of say arima.predict in statsmodels? So ideally we train/fit the model on historical data, but then attempt to predict future values?

    • Jason Brownlee September 29, 2016 at 8:36 am #

      Thanks Philip, I plan to prepare many more time series examples.

      • Waldemar October 19, 2016 at 12:39 pm #

        I don’t see future values on your plots. As I understand it, your model doesn’t predict the future, it only describes history. Can you give advice on how I can do this? And how can I print the predicted values?
        Thanks a lot!

  38. Tucker Siegel September 29, 2016 at 10:51 am #

    Great article Jason.
    I was just wondering if there was any way I could input more than 1 feature and have 1 output, which is what I am trying to predict? I am trying to build a stock market predictor. And yes, I know it is nearly impossible to predict the stock market, but I am just testing this, so let’s say we live in a perfect world and it can be predicted. How would I do this?

    • Jason Brownlee September 30, 2016 at 7:44 am #

      Hi Tucker,

      You can have multiple features as inputs.

      The input structure for LSTMs is [samples, time steps, features], as explained above. In fact, there are examples of what you are looking for above (see the section on the window method).

      Just specify the features (e.g. different indicators) in the third dimension. You may have 1 or more timesteps for each feature (second dimension).

      I hope that helps.

      • Tucker Siegel September 30, 2016 at 1:33 pm #

        I did what you said, but now it wants to output 2 sequences out of the activation layer, but I only wanted it to have a final output of 1. Basically what I am doing is trying to use open and close stock data, and use it to predict tomorrow’s close. So I need to input 2 sequences and have an output of 1. I hope I explained that right. What should I do?

  39. Joe October 1, 2016 at 11:24 pm #

    Jason, you mentioned that the LSTM input shape must be [samples, time steps, features]. What if my time series is sampled as (t, x), i.e. each sample has its own time stamp, and the time stamps are NOT evenly spaced? Do I have to generate another time series in which all samples are evenly spaced? Is there any way to handle the original time series?

    • Jason Brownlee October 2, 2016 at 8:19 am #

      Really good question Joe. Thanks. I have not thought about this.

      My instinct would be to pad the time series, fill in the spaces with zeros and make the time series steps equidistant. At least, I would try that against not doing it and evaluate the effect on performance.
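
      A rough sketch of the zero-padding idea above, using NumPy only. The time stamps, values and grid spacing are made up for illustration, and the linear interpolation shown alongside is an alternative that is not suggested in this thread; which one works better would need to be tested on the actual problem.

```python
import numpy as np

# Hypothetical irregular observations: time stamps (in seconds) and values.
t = np.array([0, 1, 4, 5, 9])
x = np.array([10.0, 11.0, 14.0, 15.0, 19.0], dtype="float32")

step = 1  # assumed spacing of the regular grid
grid = np.arange(t.min(), t.max() + step, step)

# Option 1: zero padding. Steps with no observation stay at 0.
padded = np.zeros(len(grid), dtype="float32")
padded[(t - t.min()) // step] = x

# Option 2 (an alternative, not from the thread): linear interpolation.
interpolated = np.interp(grid, t, x)

print(padded)
print(interpolated)
```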

    • Pho King November 5, 2016 at 3:38 pm #

      Take samples in blocks via Sklearn.model_selection.TimeSeriesSplit

  40. Rio October 6, 2016 at 4:07 am #

    What an excellent article!
    Recently I used an LSTM to predict a stock market index, where the data fluctuates and has no seasonal pattern like the air passenger data. I was just wondering how the LSTM (or each gate) decides when to forget or keep a certain value in this type of series data. Is there any explanation for this? Thank you.

    • Jason Brownlee October 6, 2016 at 9:40 am #

      Great question Rio. I would love to work through how the gates compute/output on a specific problem.

      I think this would be a new blog post in the future.

      • Rio October 7, 2016 at 1:15 pm #

        I’m looking forward to that post, Jason. Thank you

  41. SalemAmeen October 6, 2016 at 10:27 pm #

    I used “LSTM For Regression Using the Window Method” with the following parameters
    look_back = 20

    I got the following results

    Train_Score: 113.67 RMSE
    Test_Score: 122.88 RMSE

    I computed R-squared and I got 0.93466300136

    In addition, I tried changing the hyperparameters in the other two models, but R-squared was lower in both compared to this model.

  42. Randy October 7, 2016 at 11:03 pm #

    Hi, Jason
    First of all, this is really a fantastic post and thank you so much!
    I’m confused about “model.predict(x, batch_size)”.
    I can’t figure out what “predict in a batched way” means on the official Keras website.
    My situation is like:
    I have a test sample [x_1] \in R^{2}, and I put it into the function,
    [x_2] = model.predict([x_1],batch_size=batch_size)
    (Let’s skip the numpy form issue)
    Then, subsequently, I put [x_2] into it, similarly, and I get [x_3] = model.predict([x_2],batch_size=batch_size), and so on, till x_10.

    I don’t know if the function “predict” treats [x_1],[x_2],…[x_3] as in a batch ?
    I guess it does.(although I didn’t put them into the function at one time)

    Otherwise, I’ve tried another way to compute [x_2],…[x_10] and I got the same as above.
    Another strategy is like:
    [x_2] = model.predict([x_1],batch_size=batch_size)
    [x_3] = model.predict([[x_1],[x_2]],batch_size=batch_size)
    [x_4] = model.predict([[x_1],[x_2],[x_3]],batch_size=batch_size)

    What’s the difference between the two ways?

    • Randy October 7, 2016 at 11:55 pm #

      By the way, I am also confused about “batch”.
      If batch_size=1, does that mean there’s no relation between samples? I mean, the state s_t won’t be passed on to affect the next step s_(t+1).
      So why do we need an RNN?

      • Jason Brownlee October 8, 2016 at 10:40 am #

        It’s a good point.

        We need RNNs because the state they maintain gives better results than methods that do not maintain state.

        As for batch_size=1 during calls to model.predict() I have not tested whether indeed state is lost as in training, but I expect it may be. I expect one would need batch_size=n_samples and replay data each time a prediction is needed.

        I must experiment with this and get back to you.

    • Jason Brownlee October 8, 2016 at 10:37 am #

      I’m not sure I understand the “two ways” you’re comparing, sorry.

      The batch size is the number of records that the network will process at once – as in load into memory and perform computation upon.

      In training, this is the constraint on data before weight update. In test, it is data before computed predictions are returned.

      Does that help at all?

      • Randy October 13, 2016 at 3:11 pm #

        yes, it does!
        I appreciate it !!

  43. dubi dubi October 8, 2016 at 3:09 pm #

    This is a great post! Thanks for the guidance. I’m wondering about performance. I’ve set up my network very similarly to yours, but with a larger data set (about 2500 samples, each with 218 features). Up to about 20 epochs runs in a reasonable amount of time, but anything over that seems to take forever.

    I’ve set-up random forests and MLPs, and nothing has run so slowly. I can see all CPUs are being used, so am wondering whether Keras and/or LSTM has performance issues with larger data sets.

    • Jason Brownlee October 9, 2016 at 6:48 am #

      Great question.

      LSTMs do use more resources than classical networks because of all the internal gates. Not significantly more, but you will notice it at scale.

      I have not performed any specific studies on this, sorry.

  44. Jason Wills October 12, 2016 at 4:02 pm #

    Hi Jason,

    I am confused about deep learning and machine learning for the stock market and forex. There are a lot of models that analyse charts using AmiBroker or MetaStock, which redraw the price history and make predictions from it. Is that called machine learning or deep learning?
    If so, how could we go further to make better predictions via deep learning?

    • Jason Brownlee October 13, 2016 at 8:35 am #

      Hi Jason, it sounds like you are already using predictive models.

      It may be fair to call them machine learning. Deep learning is one group of specific techniques you may or may not be using.

      There are many ways to improve results, but it is trial and error. I offer some ideas here:

      Sorry, I don’t know the specifics of stock market data.

  45. Alexander October 13, 2016 at 12:46 pm #

    Hi Jason,

    These models do not predict; they extrapolate the current value 1 step ahead in a more or less obscured way. As seen in the pictures, the prediction is just the original data shifted. For this data one can achieve a much better RMSE of 33.7 without a neural net, in just one line of code:

    trainScore = math.sqrt(mean_squared_error(x[:-1], x[1:]))
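
    The one-liner above is a persistence (naive) baseline: each value is predicted by the value one step before it. A self-contained sketch using NumPy only follows; the function name persistence_rmse and the sample values are made up for illustration and are not the airline dataset.

```python
import numpy as np

def persistence_rmse(series):
    """RMSE of predicting each value with the value one step before it."""
    series = np.asarray(series, dtype="float64")
    errors = series[1:] - series[:-1]
    return float(np.sqrt(np.mean(errors ** 2)))

# Made-up values for illustration only.
example = [112, 118, 132, 129, 121]
print(persistence_rmse(example))
```

    Comparing a model’s RMSE against this number is a quick check of whether the model has learned anything beyond copying the last observation.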

    • Jason Brownlee October 14, 2016 at 8:56 am #

      Hi Alexander, thanks.

      This is good motivation for me/community to go beyond making LSTMs “just work” for time series and dive into how to train LSTMs effectively and even competitively on even very simple problems.

      It’s an exciting open challenge.

  46. Leftriver October 18, 2016 at 12:45 pm #

    This is a nice tutorial for starters. Thank you.

    However, I have some concerns about the create_dataset function. I think it just makes a simple problem complicated (or even wrong).

    When look_back=1, the function is simply equivalent to: dataX = dataset[:len(dataset)-look_back], dataY = dataset[look_back:].

    When look_back is larger than 1, the function is wrong: on each iteration, more than one value is appended to dataX, but just one value is appended to dataY. In the end, dataX will be look_back times larger than dataY.

    Is that what create_dataset is supposed to do?

    def create_dataset(dataset, look_back=1):
        dataX, dataY = [], []
        for i in range(len(dataset)-look_back-1):
            a = dataset[i:(i+look_back), 0]
            dataX.append(a)
            dataY.append(dataset[i + look_back, 0])
        return numpy.array(dataX), numpy.array(dataY)

  47. pang wenfeng October 21, 2016 at 3:23 pm #

    Thanks for your great article! I have a question: when I use the model “Stacked LSTMs With Memory Between Batches” with my own data, I find that the CPU is much faster than the GPU.
    My data consists of many files, each about 3M in size. I feed the files into the model to train one by one. I guess that the data is too small for the GPU to be useful, but I can’t be sure. I use the Theano backend and I am sure that the data type is float32. So I want to know what could cause this, or is the only reason that the data is too small? Thank you very much and best wishes to you.

  48. Rajesh October 22, 2016 at 4:00 pm #

    Hi Jason,

    Excellent tutorial. I am new to time series prediction. I have a basic question. In this example (international-airline-passengers), the model predicted values on the test data for the period 1957-01 to 1960-12.

    How can I predict the passengers for the next year, from 1961-01 to 1961-12?

    How do I pass input values to the model so that it predicts the passenger count for each month of the next year?

    • Rajesh October 23, 2016 at 5:53 pm #


      Any input on the question below?

      How can I predict the passengers for the next year, from 1961-01 to 1961-12?


  49. NicoAd October 22, 2016 at 8:33 pm #


    “The dataset is available for free from the DataMarket webpage as a CSV download with the filename “international-airline-passengers.csv“.”

    Not anymore I guess.

    Any other way to get the file?


  50. Brian October 25, 2016 at 3:51 am #


    Great article on LSTM and keras. I was really struggling with this, until I read through your examples. Now I have a much better understanding and can use LSTM on my own data.

    One thing I’d like to point out is the reuse of trainY and trainX on lines 55 & 57.
    Line 55 trainY = scaler.inverse_transform([trainY])

    This confused me a lot, because the model can’t run fit() or predict() again after this is done. I was struggling to understand why it could not do a second predict or fit, until I very carefully read each line of code.

    I think renaming the above variables would make the example code clearer.

    Unless I am missing something….. and being a novice programmer that’s very possible.

    Thanks again for the great work.

    • Jason Brownlee October 25, 2016 at 8:32 am #

      Thanks Brian, I’m glad the examples were useful to get you started.

      Great point about renaming variables.

  51. Joaco October 28, 2016 at 12:09 pm #

    Hi, Jason, thank you for the example.
    I have used the method on my own data. The data is about predicting the average temperature per month. I want to predict more than one month, but I can only predict one month now, because the inputs are X1 X2 X3 and the result is only y. I want to know how to modify the code to use, say, X1 X2 X3 X4 X5 X6 to predict Y1 Y2 Y3.
    I don’t know if I have made it clear. I hope you can help me.
    Thank you very much.

  52. Nida October 28, 2016 at 1:21 pm #

    Nice post Jason!
    I read that the GRU has a less complex architecture than the LSTM, but many people still use LSTMs for forecasting. I’d like to ask, what are the advantages of the LSTM compared to the GRU? Thank you

    • Jason Brownlee October 29, 2016 at 7:35 am #

      Hi Nida, I would defer to model skill in most circumstances, rather than concerns of computational complexity – unless that is a requirement of your project.

      Agreed, we do want the simplest and best performing model possible, so perhaps evaluate GRUs and then see if LSTMs can outperform them on your problem.

  53. Tim October 28, 2016 at 9:49 pm #

    I’m a total newbie to Keras and LSTM and most things NN, but if you’ll excuse that, I’d like to run this idea past you just to see if I’m talking the same language let alone on the same page… :

    I’m interested in time-series prediction, mostly stocks / commodities etc, and have encountered the same problem as others in these comments, namely, how is it prediction if it’s mostly constrained to within the time-span for which we already have data?

    With most ML algorithms I could train the model and implement a shuffle, ie get the previous day’s prediction for today and append it in the input-variable column, get another prediction, … repeat. The worst that would happen is a little fudge around the last day in the learning dataset.
    That seems rather laborious if we want to predict how expensive gold is going to be in 6 months’ time.
    (Doubly so, since in other worlds (R + RSNNS + elman or jordan), the prediction is bound-up with training so a prediction would involve rebuilding the entire NN for every day’s result, but we digress.)

    I saw somewhere Keras has a notion of “masking”, assigning a dummy value that tells the training the values are missing. Would it be possible to use this with LSTM, just append a bunch of 180 mask zeroes, let it train itself on this and then use the testing phase to impute the last values, thereby filling in the blanks for the next 6 months?

    It would also be possible to run an ensemble of these models and draw a pretty graph similar to arima.predict with varying degrees of confidence as to what might happen.

    Waffle ends.

    • Jason Brownlee October 29, 2016 at 7:45 am #

      Interesting idea.

      My thoughts go more towards updating the model. A great thing about neural nets is that they are updatable. This means that you can prepare just some additional training data for today/this week and update the weights with the new knowledge, rather than training them from scratch.

      Again, the devil is in the detail and often updating may require careful tuning and perhaps balance of old data to avoid overfitting.

  54. Lazaros October 30, 2016 at 6:39 pm #

    Dear Jason,

    I am trying to use your code to forecast a time series that I am receiving from a server. My only problem is that the length of my dataset is continuously increasing. Is there any way to read the last N rows from my CSV file? What changes do I have to make to the code below to achieve this?

    import numpy
    import pandas

    def create_dataset(dataset, look_back=1):
        dataX, dataY = [], []
        for i in range(len(dataset)-look_back-1):
            a = dataset[i:(i+look_back), 0]
            dataX.append(a)
            dataY.append(dataset[i + look_back, 0])
        return numpy.array(dataX), numpy.array(dataY)

    # fix random seed for reproducibility
    numpy.random.seed(7)
    # load the dataset
    dataframe = pandas.read_csv('timeseries.csv', usecols=[1], engine='python', skipfooter=3)
    dataset = dataframe.values
    dataset = dataset.astype('float32')

    • Jason Brownlee October 31, 2016 at 5:29 am #

      This post might help with loading your CSV into memory:

      If you load your data with Pandas, you can use DataFrame.tail() to select the last n records of your dataset, for example:
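
      A minimal sketch of that approach. The in-memory CSV text and the value of n are stand-ins for illustration; with a real file, the path would be passed to read_csv instead.

```python
import io
import pandas as pd

# Stand-in for a growing CSV file; in practice pass the file path to read_csv.
csv_text = "time,value\n1,10\n2,11\n3,12\n4,13\n5,14\n"
dataframe = pd.read_csv(io.StringIO(csv_text), usecols=[1])

n = 3  # number of most recent rows to keep (assumed)
dataset = dataframe.tail(n).values.astype("float32")
print(dataset.shape)
```

      Note that this still reads the whole file before selecting the last n rows, which may matter if the file grows very large.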

  55. Alisson Pereira November 3, 2016 at 9:12 am #

    Hello, I would like to use your model, but my problem is that I am using a sliding window size greater than one, of the form {[x-2], [x-1], [x]} ==> [x + 1]. I ran into several problems in training. For example, when I turn your trainX into {[x-2], [x-1], [x]} and trainY into [x + 1], Keras tells me that the input and the target must have the same size. Can you help me with this?

    • Jason Brownlee November 4, 2016 at 9:02 am #

      Hi Alisson, I think the error suggests that input and target do not have the same number of rows.

      Check your prepared data, maybe even save it to file and look in a text editor or excel.

      • Alisson Pereira November 9, 2016 at 3:55 am #

        Thanks, Jason. I was able to solve my problem. However, using the ReLU function in the memory cell and the sigmoid function on the output showed strange behavior. Do you have any experience with this setting?
        Congratulations on the work, this page has helped me a lot.

  56. Soren November 4, 2016 at 2:20 am #

    Hi Jason,

    Thanks for your great content.

    As you did, I upgraded to Keras 1.1.0 and scikit-learn v0.18. However, I run Theano v.0.9.0dev3 as I’m on Windows 10. Also, I’m on Anaconda 3.5 (installed from this article: http://ankivil.com/installing-keras-theano-and-dependencies-on-windows-10/).

    Your examples run fine on my setup, but I seem to be getting slightly different results.

    For example, in your first example (# LSTM for international airline passengers problem with window regression framing) I get:

    Train Score: 22.79 RMSE
    Test Score: 48.80 RMSE

    Should I be getting exactly the same results as in your tutorial? If yes, any idea what I should look at changing?

    Best regards

    • Jason Brownlee November 4, 2016 at 9:12 am #

      Great work Soren!

      Don’t worry about small differences. It is hard to get 100% reproducible results with Keras/Theano/TensorFlow at the moment. I hope the community can work something out soon.

  57. pemfir November 4, 2016 at 2:00 pm #

    Great post! Thank you so much. I was wondering how we can adapt the code to make multiple-step-ahead predictions. One of the commenters suggested defining the output like [x(t+1),x(t+2),x(t+3),…x(t+n)], but is there a way to make predictions recursively? More specifically, to build an LSTM with only one output: we first predict x(t+1), then use the predicted x(t+1) as the input for the next time step to predict x(t+2), and continue doing so ‘n’ times.
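
    The recursive scheme described here can be sketched independently of Keras. In the sketch below, recursive_forecast and predict_one are hypothetical names: predict_one stands in for any one-step model (with Keras it would wrap model.predict plus the required reshaping), and a toy mean-of-window function is used so that the example runs on its own.

```python
import numpy as np

def recursive_forecast(predict_one, history, look_back, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as input.

    predict_one is a hypothetical callable mapping a window of shape
    (look_back,) to a single number.
    """
    window = list(history[-look_back:])
    forecasts = []
    for _ in range(n_steps):
        next_value = float(predict_one(np.array(window)))
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # slide the window forward
    return forecasts

# Toy stand-in model: predicts the mean of the window.
result = recursive_forecast(lambda w: w.mean(), [1.0, 2.0, 3.0], look_back=2, n_steps=3)
print(result)
```

    Because each step consumes earlier predictions, errors can compound as the horizon grows.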

  58. Bill November 5, 2016 at 12:11 pm #

    Hi Jason,

    I am wondering how to apply LSTM to real time data. The first change I can see is the data normalisation. Concretely, a new sample could be well out of min max among previous observations. How would you go about this problem?


  59. Noque November 6, 2016 at 2:58 am #

    Could it be that in:

    # calculate root mean squared error
    math.sqrt(mean_squared_error(trainY[0], trainPredict[:,0]))

    you mean :

    # calculate root mean squared error
    trainScore = math.sqrt(mean_squared_error(trainY[:], trainPredict[:,0]))

    • Kit December 21, 2016 at 9:30 am #

      trainY is an array of (one) array. Compare:

      print len(trainYi[0]) # 720
      print len(trainYi[:]) # 1

      print len(trainPredict[:,0]) # 720

      So the original code is the right one.

  60. sherlockatszx November 6, 2016 at 6:05 am #

    Hi Jason, I have a question that has had me stuck for many days: how can I add a hidden layer to the LSTM model (using model.add(LSTM()))? What I tried: in your first code example, I assume the line ‘model.add(LSTM(4, input_dim=look_back))’ creates a hidden layer in the LSTM model. Since one hidden layer is so easy, I thought I would add another one. After the line ‘model.add(LSTM(4, input_dim=look_back))’, I tried many ways to insert a hidden layer, such as copying model.add(LSTM(4, input_dim=look_back)) and inserting it after the original. Whatever I try, I always get an error about the wrong input dimension. Can you show me how to add one hidden layer to the first example? Or have I misunderstood the LSTM model, and a layer can’t be inserted?

    • Jason Brownlee November 7, 2016 at 7:07 am #

      See the section titled “Stacked LSTMs With Memory Between Batches”.

      It gives an example of multiple hidden layers in an LSTM network.

      • sherlockatszx November 7, 2016 at 1:26 pm #

        Thanks . I got that.
        However, I have another question: compared to the other article you published, ‘time series prediction with deep learning’ (http://machinelearningmastery.com/time-series-prediction-with-deep-learning-in-python-with-keras/?utm_source=tuicool&utm_medium=referral), it seems that the LSTM model doesn’t predict as well as the simple neurons. Does that mean LSTM may not be a good choice for some specific time series structures?

        • Jason Brownlee November 8, 2016 at 9:49 am #

          I would not agree, these are just demonstration projects and were not optimized for top performance.

          These examples show how LSTMs could be used for time series projects (and how to use MLPs for time series projects), but not optimally tuned for the problem.

  61. Noque November 9, 2016 at 1:28 am #

    Hi, great post! Thanks

    How could I set the input if I have several observations (time series) with same length of the same feature and I want to predict t+1? Would I concatenate them all? In that case the last sample of one observation would predict the first one of the next.. Or should I explicitly assign the length of each time series to the batch_size?

  62. Mauro November 10, 2016 at 9:44 am #

    Hi, you’re predicting one day after your last entry. If I want to predict a day five days later, what should I do?

    • Jason Brownlee November 11, 2016 at 9:56 am #

      Hi Mauro, that would be a sequence to sequence prediction.

      Sorry, I don’t have an example just yet.

  63. Ron November 16, 2016 at 7:01 am #


    This is a great example. I am quite new in deep learning and keras. But this website has been very helpful. I want to learn more.

    Like many commenters, I would also like to find out how to predict future time periods. Is that possible? How can I achieve this using the example above? If there are multiple series or Ys and there are categorical predictors, how can I accommodate that?

    Please help, and am very keen to learn this via other channels in this website if required. Please let me know.

    Many thanks

    • Jason Brownlee November 16, 2016 at 9:34 am #

      The example does indeed predict future values.

      You can adapt the example and call model.predict to make a prediction for new data, rather than just evaluate the performance of predictions on new data as in the example.

      • Nico AD November 17, 2016 at 3:21 am #


        I tried to predict future values, but have trouble finding the right way to do it

        I work with the window method, so my current data at t-3, t-2, t-1, t looks like this


        If I try

        data = [[100,110,108]]


        I get the following error :

        Attribute error : ‘list’ object has no attribute “shape”

        I guess the format is not correct, and I need sort of reshape.

        but for me the line

        trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1])) is not clear.

        and if I try to apply the same transform on my new data I get this error :

        IndexError : tuple index out of range

        Could you provide an example dealing with new data?

        Thanks !

        • Nico AD November 17, 2016 at 3:54 am #

          I think I’m almost there but still have one last error (my window size is 8)

          todayData = numpy.array([[1086.22,827.62,702.94,779.5,711.8399999999999,1181.25,1908.69,2006.39]])
          todayData = todayData.astype('float32')
          todayData = scaler.fit_transform(todayData)
          print "todayData scaled " + str(todayData)

          todayData = scaler.inverse_transform(todayData)
          print "todayData inversed " + str(todayData)
          todayData = numpy.reshape(todayData, (todayData.shape[0], 1, todayData.shape[1]))

          predictTomorrow = model.predict(todayData)
          predictTomorrow = scaler.inverse_transform([predictTomorrow])
          print "prediction" + str(predictTomorrow)

          The inverse_transform line on predictTomorrow generates the following error:

          ValueError : Found array with dim 3 . Estimator expected <= 2

          again a reshape issue 🙁

        • Jason Brownlee November 17, 2016 at 9:55 am #

          I am working on a new example Nico, it may be a new blog post.

          • Nico AD November 22, 2016 at 8:08 pm #

            Thanks Jason. I tried various things with no luck. For me, some parts of the tutorial (like the reshape and scaling parts) are pure magic 🙂 I’m trying to get some help from the Keras community on Gitter 🙂

          • Nico AD November 23, 2016 at 2:05 am #

            Finally got it: I need to reshape to (1,1,8) (where 8 is the look_back size).
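
            A small sketch of that reshape, using NumPy only. The window values are made up, and the comments about model.predict assume a network ending in a single-unit Dense layer as in the tutorial; the scaler and model themselves are not created here.

```python
import numpy as np

look_back = 8
# One new window of the most recent look_back observations (assumed already scaled).
window = np.linspace(0.1, 0.8, look_back).astype("float32")

# Keras LSTMs expect [samples, timesteps, features]; for the window method
# that is 1 sample, 1 time step and look_back features.
sample = window.reshape(1, 1, look_back)
print(sample.shape)

# Assuming a single-unit output layer, model.predict(sample) returns a 2-D
# array of shape (1, 1). It can go to scaler.inverse_transform as-is;
# wrapping it in an extra list makes it 3-D, which matches the
# "Found array with dim 3" error reported above.
```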

          • Jason Brownlee November 23, 2016 at 9:00 am #

            Well done Nico.

  64. Ron November 16, 2016 at 7:15 am #

    Hi Jason

    Which book gives complete examples/codes with time series keras? I want to predict future time periods ahead and want add other predictor variables? Is that achievable?

    Please let me know if you have any resources / book that I can purchase.

    Many thanks

  65. Sarah November 19, 2016 at 7:54 am #

    Hi Jason

    Thank you for your great tutorial,
    I have a question about number of features. How could I have input with 5 variables?

    Thank you in advance


    • Jason Brownlee November 19, 2016 at 8:52 am #

      Hi Sarah, LSTMs take input in the form [samples, timesteps, features], e.g. [n, 1, 5].

      You can prepare your data in this way, then set the input configuration of your network appropriately, e.g. input_dim=5.

  66. Adam November 19, 2016 at 1:24 pm #

    Nice tutorial, thanks.
    I think the line
    for i in range(len(dataset)-look_back-1):
    should be
    for i in range(len(dataset)-(look_back-1)):

    • Adam November 19, 2016 at 1:52 pm #

      Actually, I think it’s
      for i in range(len(dataset)-look_back):
      testPredictPlot[train_size+(look_back-1):len(dataset)-1, :] = testPredict
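
      To check the loop bound suggested here, the sketch below is a variant of create_dataset using range(len(dataset) - look_back), run on a made-up 10-point series. It is an illustration of the suggestion, not the tutorial’s code; the plotting offsets in the tutorial would need the matching adjustment shown above.

```python
import numpy as np

def create_dataset(dataset, look_back=1):
    """Split an (n, 1) array into windows X and next-step targets y."""
    data_x, data_y = [], []
    # Exactly len(dataset) - look_back windows have a following value to predict.
    for i in range(len(dataset) - look_back):
        data_x.append(dataset[i:i + look_back, 0])
        data_y.append(dataset[i + look_back, 0])
    return np.array(data_x), np.array(data_y)

dataset = np.arange(10, dtype="float32").reshape(-1, 1)
X, y = create_dataset(dataset, look_back=3)
print(X.shape, y.shape)
print(X[-1], y[-1])  # the last window and the final observation as its target
```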

  67. Ben November 23, 2016 at 8:55 am #

    Hi Adam, nice blog ! I only have a small suggestion for shifting data: use the shift() method from pandas. Cheers

    • Jason Brownlee November 23, 2016 at 9:07 am #

      Great suggestion Ben. I have been using this myself recently to create a lagged dataset.
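
      A minimal sketch of a lagged dataset built with shift(). The series values and the column names t-1 and t are made up for illustration.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], name="t")

# Column t-1 holds the previous observation. The first row has no
# predecessor, so it is NaN and gets dropped.
lagged = pd.concat([series.shift(1).rename("t-1"), series], axis=1).dropna()
print(lagged)
```

      Each remaining row pairs an input (t-1) with its target (t), which is the same framing create_dataset produces for look_back=1.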

  68. Weixian November 23, 2016 at 3:08 pm #

    Hi Jason,

    As I am new to RNNs, I would like to ask about the difference between the stateful:

    for i in range(100):
        model.fit(trainX, trainY, nb_epoch=1, batch_size=batch_size, verbose=2, shuffle=False)

    and the stateless:

    model.fit(trainX, trainY, nb_epoch=100, batch_size=1, verbose=2)

    Is looping 100 times with nb_epoch=1 the same as nb_epoch=100?
    What is the difference between these two?

    • Jason Brownlee November 24, 2016 at 10:37 am #

      Good question Weixian.

      In the stateful case, we are running each epoch manually and resetting the state of the network at the end of each epoch.

      In the stateless case, we let the Keras infrastructure run the loop over epochs for us and we’re not concerned with resetting network state.

      I hope that is clear.

      • Weixian November 24, 2016 at 4:21 pm #

        Hi Jason,

        Thanks for the reply.

        In the stateful case:
        if I reset the network, would the next input come from the last trained epoch?

        In the stateless case:
        does it loop from the epochs that were previously trained?

        How do the two affect the data trained or tested?

        • Jason Brownlee November 25, 2016 at 9:32 am #

          Sorry, I don’t understand your questions. Perhaps you could provide more context?

          • Weixian November 28, 2016 at 7:36 pm #

            Hi Jason,

            I mean, suppose the training result of the last epoch, output [Y1], is A, for example.
            Would the [X2] input of the network be A from the last epoch?

            How would the above situation differ from epoch=2?

          • Jason Brownlee November 29, 2016 at 8:49 am #

            Yes, you need to have the same inputs in both cases. The difference is the LSTM is maintaining some internal state when stateful.

  69. Quinn November 23, 2016 at 5:08 pm #

    Hi Jason
    Thank you for your LSTM tutorial.

    But i found that an error always occurred, when i ran the first code in ‘model.add(LSTM(4, input_dim=look_back))’

    The error is : TypeError: super() argument 1 must be type, not None

    So, why?

    • Jason Brownlee November 24, 2016 at 10:38 am #

      Check your white space Quinn, it’s possible to let extra white space sneak in when doing the copy-paste.

  70. Vedhas November 23, 2016 at 9:57 pm #

    Many thanks for this article. I am trying to wrap my head around

    trainX = numpy.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))
    testX = numpy.reshape(testX, (testX.shape[0], testX.shape[1], 1))

    (from LSTM for Regression with Time Steps section), since this is exactly what I need.

    Let’s say I have 4 videos (prefix v) , of different lengths (say, 2,3,1 sec) (prefix t) , and for every 1 sec, I get a feature vector of length 3 (prefix f).

    So, as I understand my trainX would be like this, right? –>

    trainX=np.array (

    [ [v1_t1_f1, v1_t1_f2, v1_t1_f3],
    [v1_t2_f1, v1_t2_f2, v1_t2_f3] ],

    [ [v2_t1_f1, v2_t1_f2, v2_t1_f3],
    [v2_t2_f1, v2_t2_f2, v2_t2_f3],
    [v2_t3_f1, v2_t3_f2, v2_t3_f3], ],

    [ [v3_t1_f1, v3_t1_f2, v3_t1_f3] ] )
    (=[ [v1], [v2], [v3] ], and v is Nt x Nf python list?)

    If I have v1, v2,v3, how do I start with an **empty** xTrain and update them **recursively** to xTrain, so that xTrain can be used by Keras?

    I have tried np.append, np.insert, np.stack methods, but no success as yet, I always get some error. Kindly help!!!

  71. Vedhas November 23, 2016 at 10:18 pm #

    If I make my 'v1', 'v2', 'v3' .. 'v19' np arrays, and build trainX as a list = [v1, v2, v3 … v19] using trainX.append(vn), and eventually, outside of the for loop, call trainX=np.array(trainX), I get the following error.

    File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 100, in standardize_input_data
    Exception: Error when checking model input: expected lstm_input_1 to have 3 dimensions, but got array with shape (19, 1)

    Which makes sense since, Keras must be expecting input to have 3 dimensions = (sample,tstep, features).

    But how do I fix this???

  72. Ilias November 25, 2016 at 12:08 pm #

    Question about the stateful data representation.
    If I understood correctly, prepare_data repeats the previous look_back values in overlapping sequences.
    For example the original data

    will become
    1 2 3 -> 4
    2 3 4 ->5
    3 4 5 ->6

    Then when you reshape for the stateful LSTM don’t you feed these sequences like this ?
    batch 1 sequences [ 1, 2, 3] -> predict 4
    batch 2 sequences [ 2, 3, 4] -> predict 5
    batch 3 sequences [ 3, 4, 5] -> predict 6

    In the stateful RNN shouldn’t it be two batches only that continue one from the next:
    batch 1 sequences [ 1, 2, 3] -> predict 4
    batch 2 sequences [ 3, 4, 5] -> predict 6

    Or alternatively you can have it to return the full state and predict all of them
    batch 1 sequences [ 1, 2, 3] -> predict [2, 3, 4]
    batch 2 sequences [ 3, 4, 5] -> predict [4, 5, 6]


  73. Ilias November 25, 2016 at 12:09 pm #

    Sorry, I mean for the stateful RNN
    batch 1 sequences [ 1, 2, 3] -> predict 4
    batch 2 sequences [ 4, 5, 6] -> predict 7 (not 3,4,5)

  74. Luca November 25, 2016 at 8:27 pm #

    First of all, thanks for the tutorial. I’m trying to predict data that are very similar to the example data. I was playing with the code you gave, but then something very strange happened: if I fit a model using the flight data and use those hyperparameters to predict white noise, I get very accurate results. Example:

    #Data Generation:

    dataset = numpy.random.randint(500, size=(200,1))
    dataset = dataset.astype('float32')

    #Data Prediction:

    scaler = MinMaxScaler(feature_range=(0, 1))
    dataset = scaler.fit_transform(dataset)

    prediction in red:


    How could that be possible? White noise should not be predictable. What am I doing wrong?

    • Luca November 25, 2016 at 8:43 pm #

      Sorry, I was doing something very stupid, just ignore my latest post.


      • Luca November 25, 2016 at 10:09 pm #

        Ok, sorry again for the last correction, the result was obtained using:


        so I am actually predicting white noise. How could that be possible?

        • Jason Brownlee November 26, 2016 at 10:38 am #

          Hi Luca, glad you’re making progress.

          If results are too good to be true, they usually are. There will be a bug somewhere.

          • Vedhas November 28, 2016 at 10:03 pm #

            Kindly reply to my question above as well, please?

            How do I shape trainX for 4 videos (v1,..v4) , of different lengths (2,3,1 sec) and for every 1 sec, I get a feature vector [f1 f2 f3] ?

          • Jason Brownlee November 29, 2016 at 8:50 am #

            Sorry, I don’t have examples of working with video data. Hopefully soon.

          • Vedhas November 30, 2016 at 4:18 am #

            oh, it is not about videos.. Question is about ‘instances/samples’ in general…
            I am saying,
            Instance1 through instance4 correspond to 2,3,1,5 feature vectors in time respectively, each of dimension 3. How do I shape these to train LSTM?

            That is the whole idea behind the section “LSTM for Regression with Time Steps” above, right?

            Features of instance1 should not be considered when training LSTM on instance2! Just as your paragraph says:

            “Some sequence problems may have a varied number of time steps per sample. For example, you may have measurements of a physical machine leading up to a point of failure or a point of surge. Each incident would be a sample the observations that lead up to the event would be the time steps, and the variables observed would be the features.”

            I don’t need *examples of working with video data.* Kindly advise only on how to shape trainX I mentioned above.
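
            One common way to shape samples with different numbers of time steps is to zero-pad them to a shared length, giving a single [samples, time steps, features] array. The sketch below assumes that approach; pad_sequences_3d is a hypothetical helper name and the feature values are made up.

```python
import numpy as np


def pad_sequences_3d(sequences, pad_value=0.0):
    """Pad variable-length sequences (each a list of feature vectors) at the end."""
    n_features = len(sequences[0][0])
    max_steps = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), max_steps, n_features), pad_value, dtype="float32")
    for i, seq in enumerate(sequences):
        out[i, :len(seq), :] = np.asarray(seq, dtype="float32")
    return out


# Three samples with 2, 3 and 1 time steps, 3 features per time step.
v1 = [[1, 2, 3], [4, 5, 6]]
v2 = [[7, 8, 9], [10, 11, 12], [13, 14, 15]]
v3 = [[16, 17, 18]]

trainX = pad_sequences_3d([v1, v2, v3])
print(trainX.shape)  # (samples, time steps, features)
```

            Each sample stays in its own row, so features of one instance are never mixed into another. Keras also provides a Masking layer that can be told to skip time steps equal to the padding value; whether that is needed depends on the problem.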

        • Shu December 20, 2016 at 3:51 am #

          Look carefully: isn’t there a +1 shift in your white noise prediction? )))
          The same as in the charts in the tutorial?
          The best prediction for the weather tomorrow is that it will be exactly the same as today. See?
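
          The "same as today" idea can be turned into a baseline. The sketch below, in plain Python with made-up data, computes the RMSE of a persistence forecast (predict each value as the previous one). A model that only reproduces its input with a one-step shift will score about the same as this baseline, however good its plot looks.

```python
import math
import random


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def persistence_forecast(series):
    # The forecast for series[t] is simply series[t-1].
    return series[:-1]


random.seed(0)
noise = [random.uniform(0, 500) for _ in range(200)]  # white-noise-like series
trend = [float(t) for t in range(200)]                # smooth, slowly changing series

noise_rmse = rmse(noise[1:], persistence_forecast(noise))
trend_rmse = rmse(trend[1:], persistence_forecast(trend))
print(noise_rmse, trend_rmse)
```

          On the smooth series the shifted copy looks almost perfect; on noise the same shifted copy has a large error, which is what an honest RMSE should reveal.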

  75. Prakash November 27, 2016 at 1:24 pm #

    I see several factors in how you handle this time series prediction:

    -Number of LSTM blocks
    -Lookback number

    Can you show the order of importance for these in creating a prediction model? Also, you have chosen 4 LSTM blocks, any reason for this?

  76. C November 30, 2016 at 12:51 am #

    Hi Jason,
    When I try your “Stacked LSTMs with Memory Between Batches” example as it is, I found the following error. I wonder if you could help to explain what went wrong and how to rectify it please?
    Thank you.

    ValueError Traceback (most recent call last)
    in ()
    41 model = Sequential()
    42 model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True, return_sequences=True))
    ---> 43 model.add(LSTM(4, batch_input_shape=(batch_size, look_back, 1), stateful=True))
    44 model.add(Dense(1))
    45 model.compile(loss='mean_squared_error', optimizer='adam')

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/models.py in add(self, layer)
    322 output_shapes=[self.outputs[0]._keras_shape])
    323 else:
    --> 324 output_tensor = layer(self.outputs[0])
    325 if type(output_tensor) is list:
    326 raise Exception('All layers in a Sequential model '

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/engine/topology.py in __call__(self, x, mask)
    515 if inbound_layers:
    516 # This will call layer.build() if necessary.
    --> 517 self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
    518 # Outputs were already computed when calling self.add_inbound_node.
    519 outputs = self.inbound_nodes[-1].output_tensors

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/engine/topology.py in add_inbound_node(self, inbound_layers, node_indices, tensor_indices)
    569 # creating the node automatically updates self.inbound_nodes
    570 # as well as outbound_nodes on inbound layers.
    --> 571 Node.create_node(self, inbound_layers, node_indices, tensor_indices)
    573 def get_output_shape_for(self, input_shape):

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/engine/topology.py in create_node(cls, outbound_layer, inbound_layers, node_indices, tensor_indices)
    154 if len(input_tensors) == 1:
    --> 155 output_tensors = to_list(outbound_layer.call(input_tensors[0], mask=input_masks[0]))
    156 output_masks = to_list(outbound_layer.compute_mask(input_tensors[0], input_masks[0]))
    157 # TODO: try to auto-infer shape if exception is raised by get_output_shape_for.

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/layers/recurrent.py in call(self, x, mask)
    225 constants=constants,
    226 unroll=self.unroll,
    --> 227 input_length=input_shape[1])
    228 if self.stateful:
    229 updates = []

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/keras/backend/tensorflow_backend.py in rnn(step_function, inputs, initial_states, go_backwards, mask, constants, unroll, input_length)
    1304 loop_vars=(time, output_ta) + states,
    1305 parallel_iterations=32,
    -> 1306 swap_memory=True)
    1307 last_time = final_outputs[0]
    1308 output_ta = final_outputs[1]

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in while_loop(cond, body, loop_vars, shape_invariants, parallel_iterations, back_prop, swap_memory, name)
    2634 context = WhileContext(parallel_iterations, back_prop, swap_memory, name)
    2635 ops.add_to_collection(ops.GraphKeys.WHILE_CONTEXT, context)
    -> 2636 result = context.BuildLoop(cond, body, loop_vars, shape_invariants)
    2637 return result

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in BuildLoop(self, pred, body, loop_vars, shape_invariants)
    2467 self.Enter()
    2468 original_body_result, exit_vars = self._BuildLoop(
    -> 2469 pred, body, original_loop_vars, loop_vars, shape_invariants)
    2470 finally:
    2471 self.Exit()

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in _BuildLoop(self, pred, body, original_loop_vars, loop_vars, shape_invariants)
    2448 for m_var, n_var in zip(merge_vars, next_vars):
    2449 if isinstance(m_var, ops.Tensor):
    -> 2450 _EnforceShapeInvariant(m_var, n_var)
    2452 # Exit the loop.

    /home/nbuser/anaconda3_410/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py in _EnforceShapeInvariant(merge_var, next_var)
    584 "Provide shape invariants using either the shape_invariants "
    585 "argument of tf.while_loop or set_shape() on the loop variables."
    --> 586 % (merge_var.name, m_shape, n_shape))
    587 else:
    588 if not isinstance(var, (ops.IndexedSlices, sparse_tensor.SparseTensor)):

    ValueError: The shape for while_2/Merge_2:0 is not an invariant for the loop. It enters the loop with shape (1, 4), but has shape (?, 4) after one iteration. Provide shape invariants using either the shape_invariants argument of tf.while_loop or set_shape() on the loop variables.

  77. Icy December 1, 2016 at 8:14 pm #

    Hi, Jason.
    Thank you for your LSTM tutorial! I would like to use it to do some predictions. However, my input has two variables and the output has one, like this: input: x(t), y(t); output: y(t+1). I know that you answered a similar question with the window method, but I still have no idea how to do it. Any suggestions?
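
    One possible framing for two inputs and one output is to stack x and y as two features per time step and use the next value of y as the target. The sketch below assumes that framing; create_multivariate_dataset is a hypothetical helper and the data is made up.

```python
import numpy as np


def create_multivariate_dataset(x, y, look_back=1):
    """Build X with shape [samples, look_back, 2] and Y holding y at the next step."""
    features = np.column_stack([x, y]).astype("float32")
    data_x, data_y = [], []
    for i in range(len(features) - look_back):
        data_x.append(features[i:i + look_back])  # rows of (x(t), y(t))
        data_y.append(y[i + look_back])           # target y(t+1)
    return np.array(data_x), np.array(data_y, dtype="float32")


x = np.arange(10)
y = np.arange(10) * 10
X, Y = create_multivariate_dataset(x, y, look_back=1)
print(X.shape, Y.shape)
```

    With this layout the LSTM input would have one time step and two features per sample; a larger look_back adds more time steps without changing the number of features.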

  78. Benjamin S. Skrainka December 2, 2016 at 6:41 am #

    This is an informative and fun article. Thanks!

    However, for this application, ARIMA and exponential smoothing perform better out of the box without any tuning.

    # Compare ARIMA vs. NN
    library(forecast)  # provides auto.arima(), ets() and forecast()

    df.raw <- read.csv('international-airline-passengers.csv', stringsAsFactors=FALSE)
    df <- ts(df.raw[,2], start=c(1949,1), end=c(1960,12), frequency=12)

    # Same train/test split as example
    train.size <- floor(length(df) * 0.67)
    ts.train <- ts(df[1:train.size], start=c(1949,1), frequency=12)
    ts.test <- ts(df[(train.size+1):length(df)], end=c(1960,12), frequency=12)

    ts.fit <- auto.arima(ts.train)
    ets.fit <- ets(ts.train)

    fcast <- forecast(ts.fit, 4*12)
    y_hat <- fcast$mean

    # Simple ARIMA vs. NN has RMSE = 8.83/26.47 vs. 22.61/51.58 for train/test
    ModelMetrics::rmse(ts.train, fcast$fitted)
    ModelMetrics::rmse(ts.test, y_hat)

    # ETS is even better ... RMSE 7.25/23.05
    ets.fcast <- forecast(ets.fit, 4*12)
    ModelMetrics::rmse(ts.train, ets.fcast$fitted)
    ModelMetrics::rmse(ts.test, ets.fcast$mean)

    • Jason Brownlee December 2, 2016 at 8:21 am #

      Agreed Benjamin. The post does show how LSTMs can be used, just not a very good use on this dataset.

    • Hans April 25, 2017 at 6:04 am #

      Is this Python too?

  79. libra December 10, 2016 at 8:35 pm #

    My question is how to predict data outside the dataset.

    • Jason Brownlee December 11, 2016 at 5:27 am #

      Hi libra, train your model on your training data and make predictions by calling model.predict().

      The batch size/pattern dimensions must match what was used to train the network.
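
      As a sketch of what matching the pattern dimensions can look like, the snippet below prepares the last look_back observations as a single input pattern in the [samples, time steps, features] layout, with the lags as features. The helper names, the sample values and the scaling bounds are illustrative assumptions; the bounds must be whatever was used when the model was trained.

```python
import numpy as np


def make_forecast_input(series, look_back, scale_min, scale_max):
    """Scale the most recent look_back values and shape them as one input pattern."""
    window = np.asarray(series[-look_back:], dtype="float32")
    window = (window - scale_min) / (scale_max - scale_min)
    return window.reshape(1, 1, look_back)  # [samples, time steps, features]


def invert_scale(value, scale_min, scale_max):
    """Map a scaled value back to the original units."""
    return value * (scale_max - scale_min) + scale_min


history = [112, 118, 132, 129, 121, 135]  # illustrative observations
x_input = make_forecast_input(history, look_back=3, scale_min=104.0, scale_max=622.0)
print(x_input.shape)
```

      The array x_input is what would be passed to model.predict(), and the returned value would go through invert_scale to get back to the original units.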

  80. Nilavra Pathak December 15, 2016 at 1:38 am #

    Hi, does the dataset need to be continuous? If I have intermittent missing data, is it still supposed to work?

    • Jason Brownlee December 15, 2016 at 8:29 am #

      You can use 0 to pad and to mark missing values Nilavra.

      Also, consider imputing the missing values and see how that affects performance.
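
      Both options can be sketched with NumPy. The example assumes missing observations are stored as NaN and uses linear interpolation as one simple imputation choice among many; interpolate_missing is a hypothetical helper and the values are made up.

```python
import numpy as np


def interpolate_missing(series):
    """Fill NaN gaps by linear interpolation between the neighbouring observations."""
    values = np.asarray(series, dtype="float64")
    missing = np.isnan(values)
    idx = np.arange(len(values))
    filled = values.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled, missing


nan = float("nan")
raw = [112.0, nan, 132.0, 129.0, nan, nan, 148.0]

filled, missing_mask = interpolate_missing(raw)          # option 1: impute
zero_marked = np.where(missing_mask, 0.0, np.asarray(raw))  # option 2: mark with 0

print(filled)
print(zero_marked)
```

      Marking with 0 only makes sense if 0 cannot occur as a real observation after scaling; otherwise the model cannot tell a gap from a genuine value.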

  81. Aubrey Li December 16, 2016 at 6:02 pm #

    Hi Jason,

    This is a wonderful tutorial. As a beginner, just wondering, how do I know when I should add a layer and when I should add more neurons in a layer?

    • Jason Brownlee December 17, 2016 at 11:09 am #

      Great question.

      More layers offer more “levels of abstraction” or indirection, depending on how you want to conceptualize.

      More nodes/modules in a layer offers more “capacity” at one level of abstraction.

      Increasing the capacity of the network, whether in layers or in neurons per layer, will require either more learning (epochs) or faster learning (learning rate).

      What is the magic bullet for a given problem? There’s none. Find a big computer with lots of CPU/RAM and grind away a suite of ideas on a sample of the dataset to see what works well.

      • Aubrey Li December 17, 2016 at 10:25 pm #

        Thanks for the reply, another question is, is there a typical scenario we should use stacked LSTM instead of the normal one?

        • Jason Brownlee December 18, 2016 at 5:31 am #

          When you need more representation capacity.

          It’s a vague answer, because it’s a hard question to answer objectively, sorry.

  82. Je December 20, 2016 at 5:32 am #

    Hi Jason,
    Many thanks for the tutorial. Very useful indeed.

    Following up on the question from Aubrey Li and your response to it, does it mean that if I double the number of LSTM nodes (from four to eight), it will perform better? In other words, how did you decide on 4 LSTM nodes rather than 6 or 8?

    Thanks 🙂



    • Jason Brownlee December 20, 2016 at 7:26 am #

      It may perform better but may require a lot more training.

      It may also not converge or it may overfit the problem.

      Sadly, there is no magic bullet, just a ton of trial and error. This is why we must develop a strong test harness for a given problem and a strong baseline performance for models to out-perform.

      • Je December 21, 2016 at 6:02 am #

        Thanks Jason. Please keep throwing all these nice and very informative blogs / tutorials.


  83. nrcjea001 December 20, 2016 at 6:34 pm #

    Hi Jason

    I’ve been struggling with a particular problem and I am hoping you can assist. Basically, I’m running a stateful LSTM following the same logic and code as you’ve discussed above and in addition I’ve played around a bit by adding for example a convolutional layer. My issue is with the mean squared error given at the last epoch (where verbose=2 in model.fit) compared to the mean squared error calculated from trainPredict as in the formula you provide above. Please correct me if I am wrong, but my intuition tells me that these two mean square errors should be the same or at least approximately equal because we are predicting on the training set. However, in my case the mean square error calculated from trainPredict is nearly 50% larger than the mean square error at the last epoch of model.fit. Initially, I thought this had something to do with the resetting of states, but this seems not to be the case with only small differences noticed through my investigation. Does anything come to mind of why this may be? I feel like there is something obvious I’m missing here.

    model.compile(loss='mean_squared_error', optimizer=ada)
    for i in range(500):
        XXm = model.fit(trainX, trainY, nb_epoch=1, batch_size=bats, verbose=0,

    at epoch 500: {'loss': [0.004482088498778204]}

    trainPredict = model.predict(trainX, batch_size=bats)
    mean_squared_error(trainY, trainPredict[:,0])
    Out[68]: 0.0064886363673947768


    • Jason Brownlee December 21, 2016 at 8:36 am #

      I agree with your intuition, I would expect the last reported MSE to match a manually calculated MSE. Also, it is not obvious from a quick scan where you might be going wrong.

      Start off by confirming this expectation on a standalone small network with a contrived or well understood dataset. Say, a one-hidden-layer MLP on the normalized Boston house price dataset.

      This is a valuable exercise because it cuts out all of the problem specific and technique specific code and concerns and gets right to the heart of the matter.

      Once achieved, now come back to your project and cut it back to the bone until it achieves the same outcome.

      Let me know how you go.

      • nrcjea001 December 22, 2016 at 11:36 pm #

        Hi Jason

        Thanks for getting back to me.

        I followed your suggestion by running a simple MLP using the housing dataset but I’m still seeing differences. Here is my code as well as the output:

        %reset -f
        import numpy
        seed = 50
        import pandas
        from keras.models import Sequential
        from keras.layers import Dense
        from sklearn.metrics import mean_squared_error

        dataframe = pandas.read_csv("housing.csv", delim_whitespace=True,
        dataset = dataframe.values

        X = dataset[:,0:13]
        Y = dataset[:,13]

        model = Sequential()
        model.add(Dense(13, input_dim=13, init='normal', activation='relu'))
        model.add(Dense(1, init='normal'))
        model.compile(loss='mean_squared_error', optimizer='adam')

        mhist = model.fit(X, Y, nb_epoch=nep, batch_size=3, verbose=0)
        print 'MSE on last epoch:', mhist.history["loss"][nep-1]

        print 'Calculated MSE:', mean_squared_error(Y, PX)

        MSE on last epoch: 30.7131816067
        Calculated MSE: 28.8423397398

        Please advise. Thanks

      • nrcjea001 December 23, 2016 at 12:43 am #

        Apologies. I forgot to scale. Used a MinMaxScaler

        scalerX = MinMaxScaler(feature_range=(0, 1))
        scalerY = MinMaxScaler(feature_range=(0, 1))
        X = scalerX.fit_transform(dataset[:,0:13])
        Y = scalerY.fit_transform(dataset[:,13])

        MSE on last epoch: 0.00589414117318
        Calculated MSE: 0.00565485540125

        The difference is about 4%. Perhaps this is negligible?

        • Jason Brownlee December 23, 2016 at 5:32 am #

          Might be small differences due to random number generators and platform differences.
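
          One further source of such a gap, independent of random seeds: the loss reported for an epoch is an average of per-batch losses computed while the weights are still being updated, whereas a manual MSE uses only the weights at the end of the epoch. The toy NumPy sketch below illustrates that effect with a hand-written one-weight linear model and made-up data; it is not Keras code and does not claim to explain the larger gap seen with the stateful LSTM.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.uniform(-1.0, 1.0, size=60)
Y = 3.0 * X + 0.5  # noise-free toy target

w, b = 0.0, 0.0  # parameters of a one-weight linear model
learning_rate, batch_size = 0.1, 3
reported, evaluated = [], []

for epoch in range(3):
    batch_losses = []
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = Y[start:start + batch_size]
        error = (w * xb + b) - yb
        batch_losses.append(np.mean(error ** 2))  # loss before this batch's update
        w -= learning_rate * 2.0 * np.mean(error * xb)
        b -= learning_rate * 2.0 * np.mean(error)
    # Average over batches, as a training log would report it.
    reported.append(float(np.mean(batch_losses)))
    # MSE over all data using the weights left at the end of the epoch.
    evaluated.append(float(np.mean(((w * X + b) - Y) ** 2)))

print(reported[-1], evaluated[-1])
```

          While the model is still improving, the evaluated MSE comes out below the reported one, which is the same direction as the difference in the housing experiment above.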

  84. David Holmgren December 22, 2016 at 10:07 am #

    Hi Jason,

    Thank you for an excellent introduction to using LSTM networks for time series prediction; I learned a great deal from this article. One question I did have: if I wanted to plot the difference between the data and prediction, would it be correct to use something like (in the case of the training data):


    Once again, many thanks.


  85. unknnw0afa December 23, 2016 at 1:22 pm #

    For the code with stacked LSTMs, I’m getting the following error. Copying and pasting the whole thing doesn’t work either. Any help?

    The shape for while_1/Merge_2:0 is not an invariant for the loop. It enters the loop with shape (1, 4), but has shape (?, 4) after one iteration. Provide shape invariants using either the shape_invariants argument of tf.while_loop or set_shape() on the loop variables.

    • Jason Brownlee December 24, 2016 at 4:32 am #

      Ouch, I’ve not seen that before.

      Perhaps try StackOverflow or the google group for the backend that you’re using?

  86. Søren Pallesen December 25, 2016 at 9:18 pm #

    Hi Jason.

    Thanks for all your valuable advice here.

    I have trained a model for time series prediction on a quite big data set, which took 12 hours for 100 epochs.

    The results (validation accuracy) stayed flat for the first 90 epochs and then began to move up.

    Now I wonder how to add more training on top of a trained model in Keras without losing the training gained from the first 100 epochs?

    Best regards

  87. Je December 27, 2016 at 10:31 am #

    Hi Jason,
    Another question, about normalisation. Here, we are lucky to have all the data for training and testing, and this has enabled us to normalise the data (MinMaxScaler). However, in real life we may not have all the data in one go; in fact, it is very likely that we will be receiving data from streams. In such cases, we will never have the max or min or even the sum. How do we handle this case (so that we can feed the RNN with normalised values)?

    One obvious solution, perhaps, is calculating this over the running data. But that will be an expensive approach. Or is it something to do with a stochastic sampling strategy? Any help, Jason?

    Thanks in advance

    Kind Regards


    • Jason Brownlee December 28, 2016 at 7:03 am #

      Great question Je.

      For normalization we need to estimate the expected extremes of the data (min/max). For standardization we need to estimate the expected mean and standard deviation. These can be stored and used any time to validate and prepare data.
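
      A minimal sketch of storing the estimated extremes and reusing them on data that arrives later. StoredMinMaxScaler is a hypothetical class written for illustration, and the numbers are made up.

```python
class StoredMinMaxScaler:
    """Keeps the min/max estimated once so they can be applied to any later data."""

    def __init__(self, data_min, data_max):
        self.data_min = float(data_min)
        self.data_max = float(data_max)

    @classmethod
    def fit(cls, training_values):
        return cls(min(training_values), max(training_values))

    def transform(self, values):
        span = self.data_max - self.data_min
        return [(v - self.data_min) / span for v in values]

    def inverse_transform(self, scaled):
        span = self.data_max - self.data_min
        return [s * span + self.data_min for s in scaled]


scaler = StoredMinMaxScaler.fit([104, 200, 622])  # estimated from available history
stream_batch = [104, 363, 622, 700]               # values arriving later
scaled = scaler.transform(stream_batch)
print(scaled)
```

      A streamed value beyond the stored extremes, like the last one here, maps outside the 0 to 1 range, which is a useful signal that the stored estimates need revisiting.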

      For more on normalizing and standardizing time series data, see this post:

      • Je December 28, 2016 at 11:02 pm #

        Hi Jason,
        Thanks for the response and for the pointer. Useful – I have to say. 🙂

        Kind Regards


  88. Shaun L January 5, 2017 at 2:21 am #

    Hi Jason,

    Great article! I got a lot of benefits from your work.

    One question here, lots of LSTM code like yours use such

    trainX[1,2,3,4] to target trainY[5]
    trainX[2,3,4,5] to target trainY[6]

    Is it possible to make trainY a time series as well? Like

    trainX[1,2,3,4] to target trainY[5,6]
    trainX[2,3,4,5] to target trainY[6,7]

    So the prediction will be done at once rather than 5 and then 6.

    Best regards,

    • Jason Brownlee January 5, 2017 at 9:24 am #

      Yes, Shaun.

      Reform the dataset with two output variables. Then change the number of neurons in the output layer to 2.

      I will have an example of this on the blog in coming weeks.
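
      In the meantime, a sketch of what reforming the dataset with two output variables could look like. create_dataset_multi_output is a hypothetical variant of the tutorial’s windowing function, shown on the series 1..10.

```python
import numpy as np


def create_dataset_multi_output(series, look_back=1, n_out=2):
    """Each sample maps look_back inputs to the next n_out values."""
    data_x, data_y = [], []
    for i in range(len(series) - look_back - n_out + 1):
        data_x.append(series[i:i + look_back])
        data_y.append(series[i + look_back:i + look_back + n_out])
    return np.array(data_x), np.array(data_y)


series = np.arange(1, 11)
trainX, trainY = create_dataset_multi_output(series, look_back=4, n_out=2)
print(trainX[0], trainY[0])
print(trainX.shape, trainY.shape)
```

      With trainY shaped [samples, 2], the output layer would need 2 neurons, as described above.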

      • Shaun L January 7, 2017 at 1:27 am #

        Thanks, I look forward to your example! I really wonder about the advantages and disadvantages of doing so.

  89. Joaco January 9, 2017 at 6:49 pm #

    Hi Jason, I am here again. I have achieved my goal of predicting more than one day in this period of time. But now I have another question. I make X=[x1,x2…x30] and Y=[y1,y2…y7], which means I use 30 days to predict 7 days. When predicting y2, I actually used the real value. So here is the question: how can I put my predicted number, like y2, into the X sequence to predict y3? I am looking forward to your answer.
    Thank you very much

  90. Kavitha January 11, 2017 at 12:11 am #

    Hi Jason, a great tutorial. I’m a newbie, and trying to understand this code. My understanding of Keras is that time steps refers to the number of hidden nodes that the system back propagates to through time, and input dimensions refers to the number of ‘features’ for a given input datum (e.g. if we had 2 categorical values, the input dimensions would be 2). So what confuses me about the code is that it tries to model past values (look back) as the number of input dimensions. Timesteps is always set to 1. In that case isn’t the system not behaving like a recurrent network at all but more like an MLP? Thanks!

    • Jason Brownlee January 11, 2017 at 9:28 am #

      Hi Kavitha,

      The tutorial demonstrates a number of ways that you can use LSTMs, including using lag variables as input features and lag variables as time steps.
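
      The two framings differ only in how the same lagged values are reshaped into [samples, time steps, features]. A small NumPy sketch with made-up values:

```python
import numpy as np

look_back = 3
# Four samples, each holding look_back lagged observations.
trainX = np.arange(12, dtype="float32").reshape(4, look_back)

# Lags as input features: one time step with look_back features.
as_features = np.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))

# Lags as time steps: look_back time steps with one feature each.
as_time_steps = np.reshape(trainX, (trainX.shape[0], trainX.shape[1], 1))

print(as_features.shape)
print(as_time_steps.shape)
```

      Only the second layout lets the LSTM step through the lags one at a time; in the first, all lags arrive together in a single step.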

      • Kavitha January 16, 2017 at 10:21 am #

        Got it, thank you!

      • amal July 11, 2017 at 11:08 pm #

        Hi Jason,
        thank you for this great tutorial.

        With one time step, what is the difference between an MLP and an LSTM?

        • Jason Brownlee July 12, 2017 at 9:44 am #

          LSTMs are a very different architecture to MLP. The internal state and gates will result in a different mapping function being learned.

          Using a single time step input would not be a good use for an LSTM.

  91. Nishat January 12, 2017 at 2:56 pm #

    Hi Jason, I am looking for a machine learning algorithm that can learn the timing issues like debounce and flip flops in logic circuits and predict an output.

  92. sss January 13, 2017 at 5:43 pm #

    I think this is wrong :len(dataset)-look_back-1
    it should be len(dataset)-look_back
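
    The effect of the extra -1 can be checked on a tiny series. In this sketch, create_dataset mirrors the windowing logic in plain Python, and the drop_last flag is a hypothetical switch added only for the comparison.

```python
def create_dataset(values, look_back=1, drop_last=True):
    # drop_last=True reproduces range(len(dataset)-look_back-1)
    stop = len(values) - look_back - (1 if drop_last else 0)
    data_x, data_y = [], []
    for i in range(stop):
        data_x.append(values[i:i + look_back])
        data_y.append(values[i + look_back])
    return data_x, data_y


values = [1, 2, 3, 4, 5, 6]
x_with_minus_one, y_with_minus_one = create_dataset(values, look_back=3, drop_last=True)
x_full, y_full = create_dataset(values, look_back=3, drop_last=False)

print(y_with_minus_one)
print(y_full)
```

    With the -1, the final usable pair ([3, 4, 5] -> 6) is never generated; without it, every value after the first look_back observations is used as a target once.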

  93. Jakub January 17, 2017 at 9:04 pm #


    I would like to point out that the graphics

    LSTM Trained on Regression Formulation of Passenger Prediction Problem

    is the most confusing part of the article.

    The red line is NOT the actual prediction for 1,2,3, etc. steps ahead. As we can see from the data, you need to know the REAL value just at the time T to predict T+1, it is not based on your prediction in this setup.

    If you need to do a prediction for more steps ahead a different approach is needed.

    I am still grateful for the parts of the code you have provided, but this part led me way away from my goal.

    • Faezeh January 24, 2017 at 3:34 am #

      Hi Jakub, do you have any idea on what approach to take for multi-step ahead prediction?
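
      One approach to multi-step prediction is recursive forecasting: predict one step, append the prediction to the input window, and repeat. The sketch below assumes that approach; toy_model is a hypothetical stand-in for a trained one-step model and simply continues the last difference.

```python
def recursive_forecast(predict_one_step, history, look_back, n_steps):
    """Feed each prediction back into the window to forecast n_steps ahead."""
    window = list(history[-look_back:])
    forecasts = []
    for _ in range(n_steps):
        next_value = predict_one_step(window)
        forecasts.append(next_value)
        window = window[1:] + [next_value]  # drop the oldest value, add the prediction
    return forecasts


def toy_model(window):
    # Hypothetical stand-in for model.predict(): continue the last difference.
    return window[-1] + (window[-1] - window[-2])


forecasts = recursive_forecast(toy_model, [10.0, 12.0, 14.0], look_back=2, n_steps=3)
print(forecasts)
```

      Because later steps are built on earlier predictions rather than real values, errors can accumulate as the horizon grows. Predicting several outputs at once is an alternative.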

  94. Salvo January 19, 2017 at 11:09 pm #

    I would like to control the input of the internal gates of the memory cell. Is that possible?
    If yes, which functions allow it? Thanks!

    • Jason Brownlee January 20, 2017 at 10:20 am #

      I don’t believe this is possible in Keras, Salvo. I’m happy to be corrected though.

      • Salvo January 21, 2017 at 2:57 am #

        Thanks for your help! These articles are very useful for my studies!

  95. Nader January 20, 2017 at 4:23 am #

    in the “LSTM for Regression with Time Steps”
    how can we add more layers to the model ?

    model = Sequential()
    model.add(LSTM(4, input_dim=1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    model.fit(trainX, trainY, nb_epoch=100, batch_size=1, verbose=2)

    How can I add another Layer or more Layers ?

    • Jason Brownlee January 20, 2017 at 10:24 am #

      Hi Nader,

      Set the batch_input_shape on each layer and set return_sequences=True on every LSTM layer except the last one.

      I’d recommend carefully re-reading the words and code in the section titled “Stacked LSTMs with Memory Between Batches”.

      I hope that helps.

  96. Anthony January 21, 2017 at 1:25 am #

    Thanks for the nice blog. What Hardware configurations are required for running this program?

    • Jason Brownlee January 21, 2017 at 10:34 am #

      Hi Anthony, you’re welcome.

      A normal PC without a GPU is just fine for running small LSTMs like those in this tutorial.

    • Nikola Tanković January 21, 2017 at 9:34 pm #

      I have a small question. I don’t see how the look_back feature is relevant. If I set look_back to zero or one but increase the memory units to, let’s say, 20, I get much better results because the network itself “learns” to look back as much as is needed. Can you replicate that? Isn’t that the whole point of LSTM?

  97. Sam January 23, 2017 at 6:57 am #

    How do you recommend we include additional features, such as
    moving averages, standard deviation,etc.. ?

    Also, how would we tune the Stacked LSTMs with Memory Between Batches
    to achieve better accuracy ?

  98. Anthony January 23, 2017 at 4:04 pm #

    Thanks Jason for a wonderful post. Your code uses keras which has tensorflow working in the background. Tensorflow is not available under Windows platform. Is there any way one could run this code in windows?
    I am using Anaconda.

    • Jason Brownlee January 24, 2017 at 11:00 am #

      Hi Anthony, absolutely. Use the Theano backend instead:

    • Hans April 25, 2017 at 6:17 am #

      Correction 04.2017: It’s available on Windows/Anaconda

  99. Akhilesh Kumar January 23, 2017 at 7:27 pm #

    I think the way data is normalized in this tutorial is not correct. The series shows heteroscedasticity and hence needs a more advanced method of normalization.

    • Jason Brownlee January 24, 2017 at 11:01 am #

      I agree Akhilesh.

      The series really should have been made stationary first: a log or Box-Cox transform, followed by differencing.
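
      A sketch of the log-then-difference route with NumPy, including the inverse needed to bring predictions back to the original units. The helper names are hypothetical and the passenger counts are a short illustrative sample.

```python
import numpy as np


def make_stationary(series):
    """Log transform to stabilise the variance, then first-difference to remove trend."""
    logged = np.log(np.asarray(series, dtype="float64"))
    return np.diff(logged), logged[0]


def invert_stationary(differenced, first_log_value):
    """Undo the differencing with a cumulative sum, then undo the log."""
    logged = np.concatenate([[first_log_value], first_log_value + np.cumsum(differenced)])
    return np.exp(logged)


passengers = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0]
diffed, first_log = make_stationary(passengers)
restored = invert_stationary(diffed, first_log)

print(diffed)
print(restored)
```

      The differenced series is one value shorter than the original, and the first log value has to be kept so the transform can be inverted.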

  100. S Wollner January 24, 2017 at 2:06 am #

    I’m sorry to tell you that this is no prediction.
    Your LSTM network learned to save the value from t-1 and retrieve it at time t.

    Try one thing… train this model on that dataset… and test it on a whole different time series, e.g. a sine curve.

    You will get the input sine curve back out with an offset of 1 time step, maybe with some distortion in it.

    I can get the same results with a stupid ARIMA model…

    This is no prediction at all. It’s just a stupid system.

    Kind Regards,
    S. Wollner

    • Jason Brownlee January 24, 2017 at 11:06 am #

      Thanks S. Wollner,

      It is a trivial perhaps even terrible prediction example, but it does show how to use the LSTM features of Keras.

      I hope to provide some updated examples soon.

      • Hans April 25, 2017 at 4:42 pm #

        If the example is not predicting anything, is this article somehow misleading for those trying to predict with this code?

        • Hans April 25, 2017 at 5:01 pm #

          Now I have two adapted versions of the example, fed with my own data.

          One from Wollner and one from Jason. Both are running and plotting.

          With some additions, I’m even able to forecast unseen data- BUT…

          As a beginner, how can I decide whether I’m dealing with real predictions or not?

  101. S. Wollner January 26, 2017 at 8:24 am #

    Hi again,

    I’ve updated your example so that a real prediction is possible.

    What I did:
    set look_back to 25
    added a linear activation to the Dense layer
    and changed training settings like batch size

    Optionally, I added detrending and stationarizing of the signal (currently commented out).

    Here is the code:

    import numpy
    import math
    import matplotlib.pyplot as plt
    import pandas
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Activation
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import mean_squared_error

    # fix random seed for reproducibility
    # (the seed value is an assumption; any fixed integer works)
    numpy.random.seed(7)

    # load the dataset
    dataframe = pandas.read_csv('international-airline-passengers.csv', usecols=[1], engine='python', skipfooter=3)
    dataset = dataframe.values
    dataset = dataset.astype('float32')


    # normalize the dataset
    #dataset = numpy.log10(dataset) # stationary signal
    #dataset = numpy.diff(dataset, n=1, axis=0) # detrended signal
    dataset = (dataset - numpy.min(dataset)) / (numpy.max(dataset) - numpy.min(dataset)) # normalized signal


    # split into train and test sets
    train_size = int(len(dataset) * 0.67)
    test_size = len(dataset) - train_size
    train, test = dataset[:train_size,:], dataset[train_size:len(dataset),:]
    print(len(train), len(test))

    # convert an array of values into a dataset matrix
    def create_dataset(dataset, look_back=1):
        dataX, dataY = [], []
        for i in range(len(dataset)-look_back-1):
            a = dataset[i:(i+look_back), 0]
            dataX.append(a)  # restored: without this line dataX stays empty
            dataY.append(dataset[i + look_back, 0])
        return numpy.array(dataX), numpy.array(dataY)

    # reshape into X=t and Y=t+1
    look_back = 25
    trainX, trainY = create_dataset(train, look_back)
    testX, testY = create_dataset(test, look_back)

    # reshape input to be [samples, time steps, features]
    trainX = numpy.reshape(trainX, (trainX.shape[0], 1, trainX.shape[1]))
    testX = numpy.reshape(testX, (testX.shape[0], 1, testX.shape[1]))

    # create and fit the LSTM network
    model = Sequential()
    model.add(LSTM(100, input_dim=look_back))
    # output layer with linear activation, reconstructed from the description
    # above the code (these lines are missing from the posted code)
    model.add(Dense(1))
    model.add(Activation('linear'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    #model.compile(loss="mean_squared_error", optimizer="rmsprop")
    model.fit(trainX, trainY, nb_epoch=100, batch_size=25, validation_data=(testX, testY), verbose=1)
    score = model.evaluate(testX, testY, verbose=0)
    print('Test score:', score)

    # make predictions
    trainPredict = model.predict(trainX, verbose=0)
    testPredict = model.predict(testX, verbose=0)

    # shift train predictions for plotting
    trainPredictPlot = numpy.empty_like(dataset)
    trainPredictPlot[:, :] = numpy.nan
    trainPredictPlot[look_back:len(trainPredict)+look_back, :] = trainPredict

    # shift test predictions for plotting
    testPredictPlot = numpy.empty_like(dataset)
    testPredictPlot[:, :] = numpy.nan
    testPredictPlot[len(trainPredict)+(look_back*2)+1:len(dataset)-1, :] = testPredict

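    # plot baseline and predictions
    # (plotting calls reconstructed; they are missing from the posted code)
    plt.plot(dataset)
    plt.plot(trainPredictPlot)
    plt.plot(testPredictPlot)
    plt.show()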

    Kind regards
    S. Wollner

    • Kay February 2, 2017 at 10:52 am #

      Hello Wollner,
      I tried to follow your code, but I got the prediction as a straight line. Where do you think I went wrong?
      Thank you.

  102. Luis January 27, 2017 at 1:10 am #


    Thank you for this excellent post. I have reproduced the example and also applied it successfully to a real time-series data set. But I have a simple question:

    How can I generate a sequence of new predicted values? I mean future values (not the test values), for example the first six months of 1961: values for 1961-01, 1961-02, … 1961-06.


    • Jason Brownlee January 27, 2017 at 12:09 pm #

      Hi Luis, you can make predictions on new data by calling y = model.predict(X)
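
      To get several future values rather than one, a common approach is to call predict repeatedly and feed each prediction back in as the newest input. The sketch below assumes the window framing from the post, where the input has shape [samples, 1, look_back]. The names forecast_recursive and toy_predict are hypothetical, and toy_predict merely stands in for a trained model.

```python
import numpy as np


def forecast_recursive(predict_one, history, look_back, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as input."""
    window = list(history[-look_back:])
    forecasts = []
    for _ in range(n_steps):
        # shape [samples=1, time steps=1, features=look_back]
        x = np.array(window, dtype=float).reshape(1, 1, look_back)
        y_hat = float(predict_one(x))
        forecasts.append(y_hat)
        # slide the window: drop the oldest value, append the prediction
        window = window[1:] + [y_hat]
    return forecasts


def toy_predict(x):
    """Hypothetical stand-in for a trained model: mean of the window plus one."""
    return x.mean() + 1.0


future = forecast_recursive(toy_predict, [1.0, 2.0, 3.0], look_back=3, n_steps=2)
print(future)
```

      With a Keras model, predict_one could be something like `lambda x: model.predict(x)[0, 0]`. The history must be scaled the same way as the training data, and the forecasts inverted afterwards. Errors accumulate with each step because later inputs are themselves predictions.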

  103. shazz January 27, 2017 at 5:54 am #

    Hi Jason,
    I hope I’m not asking something already covered in the comments, but I did not see it. My apologies otherwise.

    Based on your dataset, let’s assume that we have more features than just the number of passengers per unit of time, for example the “current” weather, fuel price… whatever.
    If I want to use them in training, is the idea the same: for each feature, I “copy” the n look_back values into each sample?


    • Jason Brownlee January 27, 2017 at 12:24 pm #

      Hi Shazz,

      I would recommend creating a new dataset using DataFrame.shift() rather than the crude look-back function in this example.
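
      A minimal sketch of that idea follows, assuming a DataFrame with one column per feature. The helper name make_lagged_frame, the column names, and the fuel_price values are made up for illustration.

```python
import pandas as pd


def make_lagged_frame(df, target, look_back):
    """Build a supervised-learning frame with lagged copies of every column."""
    columns = {}
    for lag in range(look_back, 0, -1):
        for name in df.columns:
            # shift(lag) moves values down, so row t holds the value from t-lag
            columns[f"{name}(t-{lag})"] = df[name].shift(lag)
    columns[f"{target}(t)"] = df[target]
    framed = pd.DataFrame(columns)
    # the first look_back rows contain NaN because they have no full history
    return framed.dropna().reset_index(drop=True)


# Illustrative data: passenger counts plus a hypothetical second feature.
df = pd.DataFrame({
    "passengers": [112, 118, 132, 129, 121],
    "fuel_price": [1.0, 1.1, 1.2, 1.3, 1.4],
})
framed = make_lagged_frame(df, target="passengers", look_back=2)
print(framed)
```

      Every column except the last is an input and the last column is the target. The inputs can then be reshaped to [samples, time steps, features] for the LSTM.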

  104. Sam January 27, 2017 at 6:06 am #

    Hello S. Wollner:

    I too noticed that the predictions simply mimic the last known value.
    Thanks for posting your code.

    I had a couple of questions on your post:

    1. How did you set the batch_size? It appears to match the look_back.
    Is that intentional?

    2. Similarly, how did you know to set the number of neurons to 100 via this line:
    model.add(LSTM(100, input_dim = look_back))?
    That appears to be 4*look_back?

    • S. Wollner January 27, 2017 at 11:57 pm #

      Hi Sam,

      To 1.:
      No, it doesn’t have to match the look_back amount.
      Here is a similar question and a good answer

      In short, you divide your time series into pieces for training. In this case:
      Training_set_size = (data_size * train_size - forecast_amount) / look_back
      That is the set for training your network. Now you divide it by the batch_size.
      Each batch should have more or less the same size.
      In the link above you will see pros and cons about the size of each batch.

      To 2.:
      Trial and error, like almost everything with neural networks. That’s parameter optimization. There is no single correct configuration for all problems. The more neurons you have, the more powerful your network can be. The problem is that you also need a larger training set.

      Normally you iterate through the number of neurons, e.g. you start at 2 and go up to 100 in steps of 2 neurons. For each step you train at least 35 networks (so that the statistics are meaningful) and calculate the mean and variance of the error on the training and test sets.
      Plot all the results in a graph and take the network with the least complexity and the best TEST error (not train!). Consider both variance and mean!

      That’s a paper from our research group. In this paper you’ll see such a graphic for LVQ networks (Figure 4).
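
      The sweep described above can be sketched as follows. This is only an outline under assumptions: train_and_score is a hypothetical callback that trains one network with the given number of units and returns its test error, and the tolerance rule for "least complexity" is one possible reading, not a rule stated in this thread.

```python
import random
import statistics


def sweep_hidden_units(train_and_score, unit_counts, n_repeats=35):
    """For each candidate size, train n_repeats networks and summarise test error."""
    results = []
    for units in unit_counts:
        scores = [train_and_score(units, seed) for seed in range(n_repeats)]
        results.append({
            "units": units,
            "mean": statistics.mean(scores),
            "variance": statistics.variance(scores),
        })
    return results


def pick_simplest(results, tolerance):
    """Smallest network whose mean test error is within tolerance of the best."""
    best = min(r["mean"] for r in results)
    candidates = [r for r in results if r["mean"] <= best + tolerance]
    return min(candidates, key=lambda r: r["units"])


def fake_train_and_score(units, seed):
    """Hypothetical stand-in: pretend test error falls with size, plus noise."""
    rng = random.Random(seed)
    return 1.0 / units + rng.uniform(0.0, 0.01)


results = sweep_hidden_units(fake_train_and_score, range(2, 101, 2))
choice = pick_simplest(results, tolerance=0.02)
print(choice["units"], choice["mean"], choice["variance"])
```

      In practice train_and_score would build, fit and evaluate a Keras model, so 35 repeats over 50 sizes means 1750 training runs.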

      Kind regards,
      S. Wollner

  105. Sam January 29, 2017 at 6:40 am #

    Thanks S. Wollner for the guidance.

    I’m currently trying to use this LSTM RNN to predict monthly stock returns.
    Again, though, I cannot beat the naive benchmark of simply predicting
    t+1 = t, i.e. predicting that the future return is simply the last known return at time t.

    I’m wondering what else I can tune or change in the LSTM RNN to remove
    the “mimicking” effect?

  106. berkmeister January 29, 2017 at 11:17 pm #

    The major difficulty here is that the time series is non-stationary: the mean is trending and the variance is growing as well. It is very hard to forecast using this time series.

    You get around this by scaling using the entire dataset, therefore violating the in-sample out-of-sample separation. In other words, you are looking into the future, i.e. your test set, for scaling – which unfortunately is not possible in real life.
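
    One way to avoid that leak, sketched here with plain NumPy and hypothetical helper names, is to compute the scaling parameters from the training split only and reuse them for the test split. The numbers are illustrative.

```python
import numpy as np


def fit_minmax(train):
    """Return (min, max) computed from the training split only."""
    return float(np.min(train)), float(np.max(train))


def apply_minmax(values, lo, hi):
    """Scale values with parameters learned on the training split."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def invert_minmax(scaled, lo, hi):
    """Map scaled values back to the original units."""
    return np.asarray(scaled, dtype=float) * (hi - lo) + lo


# Illustrative trending series: the test values exceed the training range.
series = np.array([100.0, 120.0, 140.0, 160.0, 200.0, 240.0])
train, test = series[:4], series[4:]

lo, hi = fit_minmax(train)
train_scaled = apply_minmax(train, lo, hi)
test_scaled = apply_minmax(test, lo, hi)
print(train_scaled, test_scaled)
```

    With a trending series the scaled test values fall outside [0, 1]. That is the honest consequence of not looking into the future, and a good reason to difference the series first.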

    • Jason Brownlee February 1, 2017 at 10:17 am #

      Hi berkmeister,

      The level can be made stationary with order one differencing.

      The variance can be made stationary with log or box-cox transforms.

      Both methods can be used on test and training data.
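
      As a rough sketch of those two steps and their inverse (the function names are hypothetical and the values illustrative), a log transform followed by first-order differencing could look like this:

```python
import numpy as np


def to_stationary(series):
    """Log-transform, then take first differences.

    Returns the differences and the first log value needed to invert them.
    """
    logged = np.log(np.asarray(series, dtype=float))
    return np.diff(logged), logged[0]


def from_stationary(diffs, first_log):
    """Invert first-order differencing and the log transform."""
    logged = first_log + np.concatenate(([0.0], np.cumsum(diffs)))
    return np.exp(logged)


series = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])
diffs, first_log = to_stationary(series)
restored = from_stationary(diffs, first_log)
print(diffs)
print(restored)
```

      Neither transform has fitted parameters, so applying them to the test data does not leak information from the future.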

  107. Abdulaziz Almalaq January 31, 2017 at 10:14 am #

    Hi Jason,

    Many thanks for your post and tutorial. I got a lot of useful ideas for applying the LSTM to my problem.

  108. Sam February 2, 2017 at 6:21 am #

    I believe I have made the stock data in my dataset stationary by taking the first difference of the log of the prices.

    However, if I want to include additional features such as
    volatility, a moving average, etc… would those be computed
    on the ORIGINAL stock prices or on the newly calculated
    log differences, which are stationary ?

  109. Sam February 7, 2017 at 10:32 am #

    Another question I had was on performing forecasts two or more days ahead on a time series made stationary
    with first differences.
    For example, if we want to forecast 5 days ahead (instead of 1 day), would we instead use
    the differences between t and t-5?

    • Jason Brownlee February 8, 2017 at 9:32 am #

      Hi Sam,

      Forecasts would be made one time step at a time. The differences can then be inverted from the last known observation across each of the predicted time steps.

      I hope that answers your question.

      • Sam February 9, 2017 at 5:49 am #

        Unfortunately I’m not able to follow.

        Suppose we have the following stock price history

        Date        Price   Difference
        1/2/2017    100
        1/3/2017    102     2
        1/4/2017    104     2
        1/5/2017    105     1
        1/6/2017    106     1
        1/7/2017    107     1
        1/8/2017    108     1

        If we want to forecast what the price will be on January 8th STARTING from January 3rd (a 5-day horizon),
        how would we build the differences to make the series stationary? If we continue with first differences, then I believe we would only be forecasting the change from Jan 7th to Jan 8th, which is still
        a 1-day change, not a 5-day change?

        Thanks again.

        • Jason Brownlee February 9, 2017 at 7:30 am #

          Hi Sam,

          Off the cuff: The LSTM can forecast a 5-day horizon by having 5 neurons in the output layer and learning from differenced data. The difference inverse can be applied from the last known observation and propagated along the forecast to get back to domain values.

          • Sam February 10, 2017 at 4:53 am #

            Alright, so if I understand correctly, the 5 outputs from the output layer
            would correspond to the differences between days 0-1, 1-2, 2-3, 3-4 and 4-5 respectively?

            Thanks for your patience.

          • Jason Brownlee February 10, 2017 at 9:54 am #

            Correct Sam.
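
            Using the price table from this thread, the inversion can be sketched as a cumulative sum added to the last known price. The function name is hypothetical, and the five differences are the true ones from the table, used here as if the network had predicted them perfectly.

```python
import numpy as np


def invert_difference_forecast(last_observation, predicted_diffs):
    """Turn predicted one-step differences into price levels.

    Each forecast level is the last known price plus all differences so far.
    """
    diffs = np.asarray(predicted_diffs, dtype=float)
    return last_observation + np.cumsum(diffs)


# Last known price on 1/3/2017 is 102; differences for 1/4 through 1/8.
prices = invert_difference_forecast(102.0, [2, 1, 1, 1, 1])
print(prices.tolist())  # [104.0, 105.0, 106.0, 107.0, 108.0]
```

            The last element is the 5-day-ahead price, so the 5-day change is the sum of the five predicted one-day changes.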

          • Sam February 11, 2017 at 3:52 am #

            One more question on that:
            Would I also need to modify the target values (trainY) so they
            contain 5 targets per sample, instead of just one? That is, to match up the
            5 RNN outputs?
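
            One possible way to frame such targets is to extend the look-back helper so that each sample carries n_out consecutive future values. This is a sketch under assumptions: the name create_multistep_dataset and the n_out parameter are not from the post.

```python
import numpy as np


def create_multistep_dataset(series, look_back=1, n_out=5):
    """Pair each window of look_back values with the next n_out values."""
    series = np.asarray(series, dtype=float)
    data_x, data_y = [], []
    for i in range(len(series) - look_back - n_out + 1):
        data_x.append(series[i:i + look_back])
        data_y.append(series[i + look_back:i + look_back + n_out])
    return np.array(data_x), np.array(data_y)


# Illustrative series 0..9 so the windows are easy to read.
X, Y = create_multistep_dataset(np.arange(10), look_back=3, n_out=5)
print(X.shape, Y.shape)
print(X[0], Y[0])
```

            Y then has one column per forecast step, which matches an output layer with 5 units.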


          • Jason Brownlee February 11, 2017 at 5:06 am #


  110. Kim February 8, 2017 at 3:10 am #

    Hi, Jason
    I have some questions about using multiple variables.

    Did I understand correctly?
    For example, if I have three variables and a window of one (just one day, continuous data),
    the data structure looks like this:

    variable1 variable2 variable3 output1
    (input_shape=(1, 3))

    And if I have three variables and a window of two (two days, continuous data),
    the data structure looks like this:

    variable1(t-1) variable2(t-1) variable3(t-1) output1(t-1)
    variable1 variable2 variable3 output1
    (input_shape=(2, 3))

    Is this the right way? Thanks in advance.
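
    For reference, the shapes in this question can be checked with a small NumPy sketch. It assumes the three input variables sit in a [rows, 3] array and that the output is kept in a separate target array; the helper name to_samples is hypothetical.

```python
import numpy as np


def to_samples(data, time_steps):
    """Stack consecutive rows of a [rows, features] array into
    [samples, time steps, features]."""
    data = np.asarray(data, dtype=float)
    windows = [data[i:i + time_steps] for i in range(len(data) - time_steps + 1)]
    return np.stack(windows)


# Four days of three illustrative input variables.
inputs = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

one_day = to_samples(inputs, time_steps=1)
two_days = to_samples(inputs, time_steps=2)
print(one_day.shape)   # (4, 1, 3)
print(two_days.shape)  # (3, 2, 3)
```

    The per-sample shapes (1, 3) and (2, 3) are what would be passed as input_shape, with the time steps first and the features second.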

  111. YS_XIE February 11, 2017 at 1:52 am #

    Many thanks for your post and tutorial. I got a lot of useful ideas for applying the LSTM to my problem.

    I have some questions:
    1) How do I save the test data and the predicted data to a text file?
    2) How do I save the output image?

    Thanks a lot.

    • Jason Brownlee February 11, 2017 at 5:05 am #

      You can save data to a file using Python IO functions, use the NumPy functions for saving the matrix, or wrap it in a DataFrame and save that.

      You can save a plot using the matplotlib function savefig().
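
      A small sketch of both steps, assuming the actual and predicted values are one-dimensional arrays of equal length. The file names and the helper save_results are made up for illustration.

```python
import os
import tempfile

import numpy as np
import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt


def save_results(actual, predicted, out_dir):
    """Write actual/predicted values to a CSV file and a PNG plot."""
    table = np.column_stack([actual, predicted])
    txt_path = os.path.join(out_dir, "predictions.csv")
    np.savetxt(txt_path, table, delimiter=",",
               header="actual,predicted", comments="")

    png_path = os.path.join(out_dir, "predictions.png")
    fig, ax = plt.subplots()
    ax.plot(actual, label="actual")
    ax.plot(predicted, label="predicted")
    ax.legend()
    fig.savefig(png_path)
    plt.close(fig)
    return txt_path, png_path


out_dir = tempfile.mkdtemp()
txt_path, png_path = save_results([1.0, 2.0, 3.0], [1.1, 1.9, 3.2], out_dir)
print(txt_path, png_path)
```

      In the tutorial code the same calls would take testY and testPredict after they have been inverted back to the original units.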

      • YS_XIE February 11, 2017 at 11:56 am #

        Thanks for your quick reply. I have resolved the problem.

  112. Tony Zhang February 15, 2017 at 12:24 am #

    Hi, Jason
    It’s a great tutorial. I have learnt a lot from it. Thank you very much.
    By the way, is it possible to use the LSTM-RNN to obtain the predictions with a probability distribution? I think it will be even better if LSTM-RNN can do this.
    Please let me know if my thinking is wrong.

    • Jason Brownlee February 15, 2017 at 11:36 am #

      Sure Tony, you could use a sigmoid on the output layer and interpret it as a probability distribution.

      • Tony Zhang February 15, 2017 at 12:26 pm #

        Thank you for your quick reply.
        Maybe I asked in the wrong way. I mean, as in the example above, is it possible to get the probability distribution of the predicted future passengers at the same time? In other words, how confident can we be in the accuracy of the predictions?

  113. Amw 5G February 15, 2017 at 6:19 am #

    Thank you for this, it has been a great help in debugging my own keras RNN code. A suggestion for your root LSTM for Regression with Time Steps model, as examples of what else you could do:

    First, incorporate the month number as a predictor. This helps with the obvious seasonality in the time series. You can do this by creating an N-by-lookback shaped matrix where the value equals the month number (0 for January, …, 11 for December). I did it by adjusting the create_dataset function to look like
    def create_dataset(dataset, look_back=1):
    dataX, dataY, dataT = [], [], []
    for i in range(len(dataset)-look_back-1):
    a = dataset[i:(i+look_back), 0]
    dataY.append(dataset[i + look_back, 0])
    b = [x % (12) for x in range(i, i+(look_back))] #12 because tha