Multistep Time Series Forecasting with LSTMs in Python

The Long Short-Term Memory network or LSTM is a recurrent neural network that can learn and forecast long sequences.

In addition to learning long sequences, LSTMs can learn to make a one-shot multi-step forecast, which may be useful for time series forecasting.

A difficulty with LSTMs is that they can be tricky to configure, and getting the data into the right format for learning can require a lot of preparation.

In this tutorial, you will discover how you can develop an LSTM for multi-step time series forecasting in Python with Keras.

After completing this tutorial, you will know:

  • How to prepare data for multi-step time series forecasting.
  • How to develop an LSTM model for multi-step time series forecasting.
  • How to evaluate a multi-step time series forecast.

Let’s get started.

  • Updated Apr/2019: Updated the link to dataset.
Multi-step Time Series Forecasting with Long Short-Term Memory Networks in Python

Photo by Tom Babich, some rights reserved.

Tutorial Overview

This tutorial is broken down into 4 parts; they are:

  1. Shampoo Sales Dataset
  2. Data Preparation and Model Evaluation
  3. Persistence Model
  4. Multi-Step LSTM


This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this example.

This tutorial assumes you have Keras v2.0 or higher installed with either the TensorFlow or Theano backend.

This tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help setting up your Python environment, see this post:

Next, let’s take a look at a standard time series forecasting problem that we can use as context for this experiment.

Shampoo Sales Dataset

This dataset describes the monthly number of sales of shampoo over a 3-year period.

The units are a sales count and there are 36 observations. The original dataset is credited to Makridakis, Wheelwright, and Hyndman (1998).

The example below loads and creates a plot of the loaded dataset.

Running the example loads the dataset as a Pandas Series and prints the first 5 rows.

A line plot of the series is then created showing a clear increasing trend.

Line Plot of Shampoo Sales Dataset


Next, we will take a look at the model configuration and test harness used in the experiment.

Data Preparation and Model Evaluation

This section describes the data preparation and model evaluation used in this tutorial.

Data Split

We will split the Shampoo Sales dataset into two parts: a training and a test set.

The first two years of data will be taken for the training dataset and the remaining one year of data will be used for the test set.

Models will be developed using the training dataset and will make predictions on the test dataset.

For reference, the last 12 months of observations are as follows:

Multi-Step Forecast

We will contrive a multi-step forecast.

For a given month in the final 12 months of the dataset, we will be required to make a 3-month forecast.

That is, given historical observations (t-1, t-2, … t-n), forecast t, t+1 and t+2.

Specifically, from December in year 2, we must forecast January, February and March. From January, we must forecast February, March and April. This continues all the way to an October, November, December forecast from September in year 3.

A total of 10 3-month forecasts are required, as follows:
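As a rough sketch, the 10 forecast windows can be enumerated in a few lines of Python. The "year-month" labels (for example "3-01" for January of year 3) are an assumed naming used only for illustration.

```python
# Month labels for the held-out final year (assumed "year-month" labelling).
test_months = ["3-%02d" % m for m in range(1, 13)]

# Forecast origins: December of year 2 through September of year 3.
origins = ["2-12"] + test_months[:9]
n_seq = 3  # number of months forecast from each origin

windows = []
for i, origin in enumerate(origins):
    targets = test_months[i:i + n_seq]
    windows.append((origin, targets))
    print(origin, "->", ", ".join(targets))
```

Each origin is paired with the three months that follow it, which is why 12 held-out months yield only 10 complete 3-month forecasts.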

Model Evaluation

A rolling-forecast scenario will be used, also called walk-forward model validation.

Each time step of the test dataset will be walked one at a time. A model will be used to make a forecast for the time step, then the actual expected value for the next month from the test set will be taken and made available to the model for the forecast on the next time step.

This mimics a real-world scenario where new Shampoo Sales observations would be available each month and used in the forecasting of the following month.

This will be simulated by the structure of the train and test datasets.

All forecasts on the test dataset will be collected and an error score calculated to summarize the skill of the model for each of the forecast time steps. The root mean squared error (RMSE) will be used as it punishes large errors and results in a score that is in the same units as the forecast data, namely monthly shampoo sales.

Persistence Model

A good baseline for time series forecasting is the persistence model.

This is a forecasting model where the last observation is persisted forward. Because of its simplicity, it is often called the naive forecast.

You can learn more about the persistence model for time series forecasting in the post:

Prepare Data

The first step is to transform the data from a series into a supervised learning problem.

That is to go from a list of numbers to a list of input and output patterns. We can achieve this using a pre-prepared function called series_to_supervised().

For more on this function, see the post:

The function is listed below.

The function can be called by passing in the loaded series values, an n_in value of 1, and an n_out value of 3; for example:
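The idea behind the transform can be sketched in pure Python. The helper name to_supervised() below is hypothetical and is not the series_to_supervised() function referred to above; it only illustrates the sliding window with n_in lag inputs and n_out outputs per row.

```python
def to_supervised(values, n_in=1, n_out=1):
    """Frame a sequence as rows of n_in lag inputs followed by n_out outputs."""
    rows = []
    # i is the index of the first output value (time t) in each row
    for i in range(n_in, len(values) - n_out + 1):
        inputs = list(values[i - n_in:i])
        outputs = list(values[i:i + n_out])
        rows.append(inputs + outputs)
    return rows

# Placeholder series of 36 values standing in for the monthly sales counts.
values = list(range(36))
supervised = to_supervised(values, n_in=1, n_out=3)
print(len(supervised), supervised[0], supervised[-1])
```

Rows that would need values before the start or after the end of the series are simply not created, so 36 observations give 33 complete rows.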

Next, we can split the supervised learning dataset into training and test sets.

We know that in this form, the last 10 rows contain data for the final year. These rows comprise the test set and the rest of the data makes up the training dataset.
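A minimal sketch of the split, using placeholder rows rather than the real data. It assumes the supervised dataset has 33 rows, which is what 36 observations give with 1 input and 3 outputs.

```python
# Placeholder supervised rows: [t-1, t, t+1, t+2] for each sample.
rows = [[i, i + 1, i + 2, i + 3] for i in range(33)]

n_test = 10  # number of rows held back for testing
train, test = rows[:-n_test], rows[-n_test:]
print(len(train), len(test))
```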

We can put all of this together in a new function that takes the loaded series and some parameters and returns a train and test set ready for modeling.

We can test this with the Shampoo dataset. The complete example is listed below.

Running the example first prints the entire test dataset, which is the last 10 rows. The shape and size of the train and test datasets are also printed.

We can see the single input value (first column) on the first row of the test dataset matches the observation in the shampoo sales dataset for December in the 2nd year:

We can also see that each row contains 4 columns for the 1 input and 3 output values in each observation.

Make Forecasts

The next step is to make persistence forecasts.

We can implement the persistence forecast easily in a function named persistence() that takes the last observation and the number of forecast steps to persist. This function returns an array containing the forecast.

We can then call this function for each time step in the test dataset from December in year 2 to September in year 3.

Below is a function make_forecasts() that does this and takes the train, test, and configuration for the dataset as arguments and returns a list of forecasts.

We can call this function as follows:
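The two functions described above might be sketched as follows. The names and arguments follow the description in the text; the bodies are an assumption about one straightforward way to write them, and the train and test rows are placeholder values.

```python
def persistence(last_ob, n_seq):
    # repeat the last observation once per forecast step
    return [last_ob for _ in range(n_seq)]

def make_forecasts(train, test, n_lag, n_seq):
    # train is unused by persistence; it is kept so the signature
    # matches the description in the text
    forecasts = []
    for row in test:
        X = row[:n_lag]          # input part of the row
        forecasts.append(persistence(X[-1], n_seq))
    return forecasts

# Placeholder rows in [t-1, t, t+1, t+2] form.
train = [[1.0, 2.0, 3.0, 4.0]]
test = [[10.0, 11.0, 12.0, 13.0], [11.0, 12.0, 13.0, 14.0]]
forecasts = make_forecasts(train, test, n_lag=1, n_seq=3)
print(forecasts)
```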

Evaluate Forecasts

The final step is to evaluate the forecasts.

We can do that by calculating the RMSE for each time step of the multi-step forecast, in this case giving us 3 RMSE scores. The function below, evaluate_forecasts(), calculates and prints the RMSE for each forecasted time step.

We can call it as follows:
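A pure-Python sketch of such a function is given here. Returning the scores as a list is an addition for convenience and is not described in the text; the test rows and forecasts are placeholder values.

```python
from math import sqrt

def evaluate_forecasts(test, forecasts, n_lag, n_seq):
    scores = []
    for i in range(n_seq):
        # expected values for forecast step i sit after the n_lag input columns
        actual = [row[n_lag + i] for row in test]
        predicted = [forecast[i] for forecast in forecasts]
        mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
        rmse = sqrt(mse)
        scores.append(rmse)
        print("t+%d RMSE: %f" % (i + 1, rmse))
    return scores

# Placeholder data: two test rows and their persistence forecasts.
test = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
forecasts = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
scores = evaluate_forecasts(test, forecasts, n_lag=1, n_seq=3)
```

In this toy case the persistence error grows by one unit per step, so the three scores increase with the forecast horizon.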

It is also helpful to plot the forecasts in the context of the original dataset to get an idea of how the RMSE scores relate to the problem in context.

We can first plot the entire Shampoo dataset, then plot each forecast as a red line. The function plot_forecasts() below will create and show this plot.

We can call the function as follows. Note that the number of observations held back on the test set is 12 for the 12 months, as opposed to 10 for the 10 supervised learning input/output patterns as was used above.

We can make the plot better by connecting the persisted forecast to the actual persisted value in the original dataset.

This will require adding the last observed value to the front of the forecast. Below is an updated version of the plot_forecasts() function with this improvement.
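The index arithmetic behind this improvement can be sketched without any plotting code. The helper name forecast_line() is hypothetical; it returns the x and y values for one forecast line, starting at the observation the forecast was made from. The series and forecast values are placeholders.

```python
def forecast_line(series, forecast, n_test, i):
    # index of the last observed value before the i-th forecast
    off_s = len(series) - n_test + i - 1
    # one x position for that observation plus one per forecast step
    off_e = off_s + len(forecast) + 1
    xs = list(range(off_s, off_e))
    # prepend the last observed value so the line connects to the series
    ys = [series[off_s]] + list(forecast)
    return xs, ys

series = [float(v) for v in range(36)]   # placeholder for 36 monthly values
xs, ys = forecast_line(series, [1.0, 2.0, 3.0], n_test=12, i=0)
print(xs, ys)
```

With 12 observations held back, the first line starts at index 23 (December of year 2 when counting from 0). Each pair of lists could then be passed to a call such as pyplot.plot(xs, ys, color='red').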

Complete Example

We can put all of these pieces together.

The complete code example for the multi-step persistence forecast is listed below.

Running the example first prints the RMSE for each of the forecasted time steps.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

This gives us a baseline of performance on each time step that we would expect the LSTM to outperform.

The plot of the original time series with the multi-step persistence forecasts is also created. The lines connect to the appropriate input value for each forecast.

This context shows how naive the persistence forecasts actually are.

Line Plot of Shampoo Sales Dataset with Multi-Step Persistence Forecasts


Multi-Step LSTM Network

In this section, we will use the persistence example as a starting point and look at the changes needed to fit an LSTM to the training data and make multi-step forecasts for the test dataset.

Prepare Data

The data must be prepared before we can use it to train an LSTM.

Specifically, two additional changes are required:

  1. Stationary. The data shows an increasing trend that must be removed by differencing.
  2. Scale. The data must be rescaled to values between -1 and 1, the output range of the default (tanh) activation function of the LSTM units.

We can introduce a function to make the data stationary called difference(). This will transform the series of values into a series of differences, a simpler representation to work with.
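A minimal sketch of such a function, assuming a plain list of values and a lag of one time step by default:

```python
def difference(dataset, interval=1):
    # each output is the change from the value `interval` steps earlier
    return [dataset[i] - dataset[i - interval]
            for i in range(interval, len(dataset))]

diffs = difference([10, 12, 15, 11])
print(diffs)
```

The differenced series is one value shorter than the original for an interval of 1, because the first observation has nothing to be compared against.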

We can use the MinMaxScaler from the sklearn library to scale the data.
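To make the arithmetic concrete, here is a pure-Python sketch of min-max rescaling to the range [-1, 1] and its inverse. The helper names are hypothetical and this is a stand-in for illustration, not the scikit-learn implementation; MinMaxScaler with feature_range=(-1, 1) applies the same formula to each column.

```python
def fit_min_max(values):
    # remember the observed minimum and maximum
    return min(values), max(values)

def scale(values, lo, hi):
    # map [lo, hi] linearly onto [-1, 1]
    return [(v - lo) / (hi - lo) * 2.0 - 1.0 for v in values]

def invert_scale(scaled, lo, hi):
    # map [-1, 1] back onto [lo, hi]
    return [(s + 1.0) / 2.0 * (hi - lo) + lo for s in scaled]

values = [0.0, 5.0, 10.0]
lo, hi = fit_min_max(values)
scaled = scale(values, lo, hi)
restored = invert_scale(scaled, lo, hi)
print(scaled, restored)
```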

Putting this together, we can update the prepare_data() function to first difference the data and rescale it, then perform the transform into a supervised learning problem and train test sets as we did before with the persistence example.

The function now returns a scaler in addition to the train and test datasets.

We can call this function as follows:

Fit LSTM Network

Next, we need to fit an LSTM network model to the training data.

This first requires that the training dataset be transformed from a 2D array [samples, features] to a 3D array [samples, timesteps, features]. We will fix time steps at 1, so this change is straightforward.
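With one time step, the reshape simply wraps each sample's features in an extra level of nesting. The sketch below shows this with plain lists and a hypothetical helper named to_3d(); with NumPy arrays, a call of the form X.reshape(X.shape[0], 1, X.shape[1]) has the same effect.

```python
def to_3d(X):
    # [samples, features] -> [samples, 1 timestep, features]
    return [[list(row)] for row in X]

X = [[0.1], [0.2], [0.3]]      # 3 samples, 1 feature each (placeholder values)
X3 = to_3d(X)
shape = (len(X3), len(X3[0]), len(X3[0][0]))
print(shape)
```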

Next, we need to design an LSTM network. We will use a simple structure with 1 hidden layer with 1 LSTM unit, then an output layer with linear activation and 3 output values. The network will use a mean squared error loss function and the efficient ADAM optimization algorithm.

The LSTM is stateful; this means that we have to manually reset the state of the network at the end of each training epoch. The network will be fit for 1500 epochs.

The same batch size must be used for training and prediction, and we require predictions to be made at each time step of the test dataset. This means that a batch size of 1 must be used. A batch size of 1 is also called online learning as the network weights will be updated during training after each training pattern (as opposed to mini batch or batch updates).

We can put all of this together in a function called fit_lstm(). The function takes a number of key parameters that can be used to tune the network later and the function returns a fit LSTM model ready for forecasting.

The function can be called as follows:

The configuration of the network was not tuned; try different parameters if you like.

Report your findings in the comments below. I’d love to see what you can get.

Make LSTM Forecasts

The next step is to use the fit LSTM network to make forecasts.

A single forecast can be made with the fit LSTM network by calling model.predict(). Again, the data must be formatted into a 3D array with the format [samples, timesteps, features].

We can wrap this up into a function called forecast_lstm().

We can call this function from the make_forecasts() function and update it to accept the model as an argument. The updated version is listed below.

This updated version of the make_forecasts() function can be called as follows:

Invert Transforms

After the forecasts have been made, we need to invert the transforms to return the values back into the original scale.

This is needed so that we can calculate error scores and plots that are comparable with other models, like the persistence forecast above.

We can invert the scale of the forecasts directly using the MinMaxScaler object that offers an inverse_transform() function.

We can invert the differencing by adding the value of the last observation (prior months’ shampoo sales) to the first forecasted value, then propagating the value down the forecast.

This is a little fiddly; we can wrap up the behavior in a function named inverse_difference() that takes the last observed value prior to the forecast and the forecast as arguments and returns the inverted forecast.
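One way to write this, as a sketch that assumes the forecast is a list of one-step differences:

```python
def inverse_difference(last_ob, forecast):
    # the first forecast difference is added to the last observation
    inverted = [forecast[0] + last_ob]
    # each later difference is added to the previously inverted value
    for i in range(1, len(forecast)):
        inverted.append(forecast[i] + inverted[i - 1])
    return inverted

inverted = inverse_difference(100, [5, -2, 3])
print(inverted)
```

The result is a running sum of the differences starting from the last observation, which is what "propagating the value down the forecast" amounts to.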

Putting this together, we can create an inverse_transform() function that works through each forecast, first inverting the scale and then inverting the differences, returning forecasts to their original scale.

We can call this function with the forecasts as follows:

We can also invert the transforms on the output part of the test dataset so that we can correctly calculate the RMSE scores, as follows:

We can also simplify the calculation of RMSE scores to expect the test data to only contain the output values, as follows:

Complete Example

We can tie all of these pieces together and fit an LSTM network to the multi-step time series forecasting problem.

The complete code listing is provided below.

Running the example first prints the RMSE for each of the forecasted time steps.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the scores at each forecasted time step are better, in some cases much better, than the persistence forecast.

This shows that the configured LSTM does have skill on the problem.

It is interesting to note that the RMSE does not become progressively worse with the length of the forecast horizon, as would be expected. This is marked by the fact that the t+2 appears easier to forecast than t+1. This may be because the downward tick is easier to predict than the upward tick noted in the series (this could be confirmed with more in-depth analysis of the results).

A line plot of the series (blue) with the forecasts (red) is also created.

The plot shows that although the skill of the model is better, some of the forecasts are not very good and that there is plenty of room for improvement.

Line Plot of Shampoo Sales Dataset with Multi-Step LSTM Forecasts



There are some extensions you may consider if you are looking to push beyond this tutorial.

  • Update LSTM. Change the example to refit or update the LSTM as new data is made available. Tens of training epochs should be sufficient to retrain with a new observation.
  • Tune the LSTM. Grid search some of the LSTM parameters used in the tutorial, such as number of epochs, number of neurons, and number of layers to see if you can further lift performance.
  • Seq2Seq. Use the encoder-decoder paradigm for LSTMs to forecast each sequence to see if this offers any benefit.
  • Time Horizon. Experiment with forecasting different time horizons and see how the behavior of the network varies at different lead times.

Did you try any of these extensions?
Share your results in the comments; I’d love to hear about it.


In this tutorial, you discovered how to develop LSTM networks for multi-step time series forecasting.

Specifically, you learned:

  • How to develop a persistence model for multi-step time series forecasting.
  • How to develop an LSTM network for multi-step time series forecasting.
  • How to evaluate and plot the results from multi-step time series forecasting.

Do you have any questions about multi-step time series forecasting with LSTMs?
Ask your questions in the comments below and I will do my best to answer.

542 Responses to Multistep Time Series Forecasting with LSTMs in Python

  1. Avatar
    Masum May 10, 2017 at 6:48 am #


    you are the best

    I did not have to wait long. I asked for it on a different blog post a few days back.

    • Avatar
      Jason Brownlee May 10, 2017 at 8:53 am #

      I hope you find the post useful!

      • Avatar
        Masum May 10, 2017 at 9:59 am #

        I believe so. Things are getting deeper here.

        Will we get recursive LSTM MODEL for multi step forecasting soon?

        Will eagerly wait for that blog.


        • Avatar
          Jason Brownlee May 11, 2017 at 8:22 am #


          • Avatar
            Masum May 11, 2017 at 8:43 am #


            Hope to see that soon.

        • Avatar
          Xingying October 27, 2017 at 10:06 am #

          Hi Masum,
          I’m studying LSTMs on this website ( https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/ ) and found you on the message board. Do you have any ideas about multi-step forecasting? I ran the code of the tutorial, but always got over-fitting results using the history data.

          Thank you, and I look forward to your reply.

        • Avatar
          Lau Bourne August 11, 2022 at 11:53 am #

          when you predict by using the recursive LSTM model, can you get a relatively precise result?
          I find it’s hard to get satisfying outcomes, maybe I am not good at training the model like that.

    • Avatar
      Harjot Singh March 11, 2019 at 7:17 pm #

      Hi, I’m completely new to RNN and neural networks. I have a project in hand with 9 years of monthly sales data of a project. I want to apply LSTM to forecast into future 6-7 months.
      I’ve used ARIMA and got a decent accuracy. But I want to try LSTM after reading so many articles in its favour.

      it is a uni-variate (contains sales history for 9 years monthly data) consistent time series data.

      Can you suggest where I should start learning? Or should I apply this blog post directly to my data?

      Your earliest response will be deeply appreciated.
      And thanks for all your blogs. They really help.

    • Avatar
      Steve May 23, 2019 at 4:11 pm #

      I am not sure why you would call the following multiple times with the SAME parameters:
      model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
      Shouldn’t X and y be indexed by i at each epoch?

      • Avatar
        Jason Brownlee May 24, 2019 at 7:47 am #

        This is the standard process for training a neural net, e.g. showing the same dataset for multiple epochs. In this case we are doing so manually rather than automatically by the framework.

  2. Avatar
    jvr May 17, 2017 at 1:27 am #

    Thanks a lot for this post. I have been trying to do this for my thesis since September, with no good results. But I’m having trouble: I’m not able to compile. Maybe you or someone who reads this can tell me why this happens. I’m getting the following error when running the code:

    The TensorFlow library wasn’t compiled to use SSE instructions, but these are available on your machine and could speed up CPU computations.

    The TensorFlow library wasn’t compiled to use SSE2 instructions, but these are available on your machine and could speed up CPU computations.

    The TensorFlow library wasn’t compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
    The TensorFlow library wasn’t compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.

    The TensorFlow library wasn’t compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

    The TensorFlow library wasn’t compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.

    Obviously it has something to do with TensorFlow (I have read about this problem and I think it’s because it was not installed from source, but I have no idea how to fix it).

    Thank you in advance.

  3. Avatar
    Shamsul May 17, 2017 at 9:17 pm #


    Can we say that the multiple output strategy (rather than the 1. direct, 2. recursive, or 3. direct-recursive hybrid strategies) has been used here?

    Am I right ?

    • Avatar
      Jason Brownlee May 18, 2017 at 8:36 am #

      I think the LSTM has implemented a direct strategy.

      • Avatar
        shamsul January 14, 2018 at 12:09 am #

        What can be done to make it an iterative strategy? Any example code would be great.

      • Avatar
        antonio May 26, 2018 at 7:18 am #

        Isn’t this a multiple output strategy?

        From my understanding the number of outputs is built into the model. You feed it one sample and it returns the whole output based on that.

  4. Avatar
    jinhua zhang May 18, 2017 at 11:26 am #

    Your article is very useful! I have a problem: if the data series has three columns, where the 2nd column is the input data and the 3rd column is the data to forecast (both covering the train and test data), can the “difference” and “transform” steps still be run on them?
    Thank you very much!

    • Avatar
      Jason Brownlee May 19, 2017 at 8:11 am #

      Great question.

      You may want to only make the prediction variable stationary. Consider performing three tests:

      – Model as-is
      – Model with output variable stationary
      – Model with all variables stationary (if others are non-stationary)

    • Avatar
      jvr May 21, 2017 at 10:21 pm #

      I have discovered how to do it by asking some people. The object series is actually a Pandas Series. It’s a vector of information, with a named index. Your dataset, however, contains two fields of information, in addition to the time series index, which makes it a DataFrame. This is the reason why the tutorial code breaks with your data.

      To pass your entire dataset to MinMaxScaler, just run difference() on both columns and pass in the transformed vectors for scaling. MinMaxScaler accepts an n-dimensional DataFrame object:

      ncol = 2
      diff_df = pd.concat([difference(df[i], 1) for i in range(1,ncol+1)], axis=1)
      scaler = MinMaxScaler(feature_range=(0, 1))
      scaled_values = scaler.fit_transform(diff_df)

      So, with this, we can use as many variables as we want. But now I have a big doubt.

      When we transform our dataset into a supervised learning problem, we have a distribution in columns as shown in https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

      I mean, for a 2-variable dataset such as yours, we can set, for example, these values:


      so we will have a supervised dataset like this:

      var1(t-1) var2(t-1) var1(t) var2 (t) var1(t+1) var2 (t+1)

      so, if we want to train the ANN to forecast var2 (which is the target we want to predict) with the var1 as input and the previous values of var2 also as input, we have to separate them and here is where my doubt begins.

      In the part of the code:

      def fit_lstm(train, n_lag, n_seq, n_batch, nb_epoch, n_neurons):
          # reshape training into [samples, timesteps, features]
          X, y = train[:, 0:n_lag], train[:, n_lag:]
          X = X.reshape(X.shape[0], 1, X.shape[1])

      I think that if we want to define X, we should use:


      this means we are selecting this as X from the previous example:

      var1(t-1) var2(t-1)

      (number of lags*number of variables), so: X=train[:,0:1*2]=train[:,0:2]


      Y=train[:,n_lag*n_vars:] is the vector of ¿targets?

      the problem is that, this way, we are selecting these as targets:

      var1(t) var2(t) var1(t+1) var2(t+1)

      so we are including var1 (which we don’t have the aim to forecast, just use as input).

      I would like to know if there is any solution to solve this in order to use the variable 1,2…n-1 just as input but not forecasting it.

      Hope this is clear :/

  5. Avatar
    jvr May 19, 2017 at 3:16 am #

    Thanks for the previous clarification. I have a doubt about the “fit network” section of the code. I’m having some trouble trying to plot the training graph (validation vs. training) in order to see whether the network is overfitted, but because of the “model.reset_states()” statement, I can only save the last loss and val_loss from the history object. Is there any way to solve this?

    thank you in advance 🙂

    • Avatar
      jvr May 19, 2017 at 3:45 am #

      I reply to myself, if someone is also interested.

      Just create 2 lists (or 1, but I find it clearer this way) and return them from the function. Then, outside, just plot them. I’m sorry for the question, maybe the answer is obvious, but I’m just starting with Python and I’m not a programmer.

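      # fit network
      loss, val_loss = list(), list()  # assumed: the two lists described above
      for i in range(nb_epoch):
          history = model.fit(X, y, epochs=1, batch_size=n_batch, shuffle=True, validation_split=val_split)
          loss.append(history.history['loss'][0])  # assumed: step missing from the snippet as posted
          val_loss.append(history.history['val_loss'][0])  # assumed: step missing from the snippet as posted
          model.reset_states()
      return model, loss, val_loss

      # fit model
      model, loss, val_loss = fit_lstm(train, n_lag, n_seq, n_batch, n_epochs, n_neurons)

      pyplot.plot(loss)  # assumed: plotting calls missing from the snippet as posted
      pyplot.plot(val_loss)
      pyplot.title('cross validation')
      pyplot.legend(['training', 'test'], loc='upper left')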

      • Avatar
        Jason Brownlee May 19, 2017 at 8:23 am #

        Nice to see you got there jvr, well done.

      • Avatar
        Andrew February 5, 2019 at 10:50 am #

        Hi jrv,

        I know this is a lot later but I was wondering whether you still have the full code for when you implemented a multivariate solution for this?

        If anyone else has a solution for a multivariate and multi-lagged input to predict just one column I would be very happy to talk!

        Thanks in advance

    • Avatar
      Jason Brownlee May 19, 2017 at 8:22 am #

      History is returned when calling model.fit().

      We are only fitting one epoch at a time, so you can retrieve and accumulate performance each epoch in the epoch loop then do something with the data (save/graph/return it) at the end of the loop.

      Does that help?

      • Avatar
        jvr May 19, 2017 at 9:17 pm #

        It does help, thank you.

        Now I’m trying to find a way to make the training process faster and reduce RMSE, but it’s pretty difficult (the idea is to get better results than the NARX model implemented in the Matlab Neural Toolbox, but its results and computational time are hard to beat).

        • Avatar
          Jason Brownlee May 20, 2017 at 5:37 am #

          LSTMs often need to be trained longer than you think and can greatly benefit from regularization.

  6. Avatar
    DJ June 2, 2017 at 1:42 am #


    Thanks for the great tutorial, I’m wondering if you can help me clarify the reason you have


    (line 83)
    when fitting the model, I was able to achieve similar results without the line as well.


    • Avatar
      Jason Brownlee June 2, 2017 at 1:02 pm #

      It clears the internal state of the LSTM.

      • Avatar
        anurag August 30, 2017 at 3:41 pm #

        I have tried experimenting with and without model.reset_states(), using some other dataset.
        I am doing multi-step prediction for 6-10 steps, and I am able to get better results without model.reset_states().

        Am I doing something wrong, or does it depend entirely on the dataset?

        Thanks in advance.

        • Avatar
          Jason Brownlee August 30, 2017 at 4:20 pm #

          It completely depends on the dataset and the model.

          • Avatar
            anurag August 31, 2017 at 6:42 pm #

            Thank you so much. 🙂

  7. Avatar
    DJ June 2, 2017 at 4:11 pm #

    Thanks for the quick reply Jason :-). I’ve seen other places where reset is done by using callbacks parameter in model.fit.

    class ResetStatesCallback(Callback):
        def __init__(self):
            self.counter = 0

        def on_batch_begin(self, batch, logs={}):
            if self.counter % max_len == 0:
                self.model.reset_states()  # assumed: this line is missing from the snippet as posted
            self.counter += 1

    Then the callback is used as follows:

    model.fit(X, y, epochs=1, batch_size=1, verbose=2,
    shuffle=False, callbacks=[ResetStatesCallback()])

    The ResetStatesCallback snippet was obtained from:

    Please let me know what you think.


    • Avatar
      Jason Brownlee June 3, 2017 at 7:21 am #

      Yes, there are many ways to implement the reset. Use what works best for your application.

  8. Avatar
    QQ June 2, 2017 at 5:00 pm #

    Hi Jason, great post. I have some questions:

    1. In your fit_lstm function, you reset the state after each epoch. Why?
    2. Why do you iterate over the epochs yourself, instead of using model.fit(X, y, epochs)?

    thx Jason

    # fit an LSTM network to training data
    def fit_lstm(train, n_lag, n_seq, n_batch, nb_epoch, n_neurons):
        # reshape training into [samples, timesteps, features]
        X, y = train[:, 0:n_lag], train[:, n_lag:]
        X = X.reshape(X.shape[0], 1, X.shape[1])
        # design network
        model = Sequential()
        model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
        # output layer as described in the tutorial text (missing from the snippet as posted)
        model.add(Dense(y.shape[1]))
        model.compile(loss='mean_squared_error', optimizer='adam')
        # fit network
        for i in range(nb_epoch):
            model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False)
            # per-epoch reset that the question refers to (missing from the snippet as posted)
            model.reset_states()
        return model

    • Avatar
      Jason Brownlee June 3, 2017 at 7:23 am #

      The end of the epoch is the end of the sequence and the internal state should not carry over to the start of the sequence on the next epoch.

      I run the epochs manually to give fine grained control over when resets occur (by default they occur at the end of each batch).

  9. Avatar
    J June 7, 2017 at 12:48 am #

    I’d like to clarify line 99 in the LSTM example:

    plot_forecasts(series, forecasts, n_test+2)

    Is the n_test + 2 == n_test + n_lag - n_seq?


    • Avatar
      jvr June 15, 2017 at 11:49 pm #

      I’d also like to know why using n_test + 2

      • Avatar
        M August 8, 2017 at 3:07 am #

        I thought it should be n_test + 2 == n_test+n_seq-1 (regardless of n_seq). It would be great if someone could clarify that.

        • Avatar
          Mrtn October 4, 2017 at 8:36 pm #

          M, you are right. Otherwise the RMSE is incorrectly calculated and the plotting is not aligned.

    • Avatar
      Daniel July 8, 2022 at 2:22 am #

      I would also very much like to see why n_test + 2 is used

  10. Avatar
    Kao June 10, 2017 at 5:46 pm #

    Hi jason,
    When I applied your code to a 22-year daily time series, I found that the LSTM forecast result is similar to the persistence one, i.e. the red line is just a horizontal bar. I’m sure I did not mix up those two methods. I wonder what causes this?

    My key configuration is as follows:
    n_lag = 1
    n_seq = 3
    n_test = 365*3

    and my series length is 8035.

    • Avatar
      Jason Brownlee June 11, 2017 at 8:21 am #

      You will need to tune the model to your problem.

      • Avatar
        Kao June 25, 2017 at 6:55 pm #

        Thanks to your tutorial, I’ve been tuning parameters such as the number of epochs and neurons these days. However, I noticed that you mentioned the grid search method for finding appropriate parameters. Could you please explain how to apply it to an LSTM? I’m confused by the examples in one of your other tutorials, which use a model class that is unfamiliar to me.

  11. Avatar
    MM June 13, 2017 at 6:44 am #


    Thank you for these tutorials. These are the best tutorials on the web. One question: what is the best way to forecast the last two values?

    Thank you

    • Avatar
      Jason Brownlee June 13, 2017 at 8:31 am #

      Thanks MM.

      No one can tell you the “best” way to do anything in applied machine learning, you must discover it through trial and error on your specific problem.

      • Avatar
        MM June 13, 2017 at 9:29 am #


        Understood. Let me re-phrase the question. In a practical application, one would be interested in forecasting the last data point, i.e. in the shampoo dataset, “3-12”. How would you suggest doing that?

        • Avatar
          Jason Brownlee June 14, 2017 at 8:41 am #

          Fit your model to all of the data then call predict() passing whatever lag inputs your model requires.

      • Avatar
        MM June 13, 2017 at 10:24 am #


        Should the line that starts the offset point in plot_forecasts() be

        off_s = len(series) - n_test + i + 1

        instead of

        off_s = len(series) - n_test + i - 1

  12. Avatar
    Michael June 21, 2017 at 4:03 am #

    Hi Jason,

    Thanks for your excellent tutorials!

    I have followed a couple of your articles about LSTMs and learned a lot, but I have a question: can I introduce some interference elements into the model? For example, for the shampoo sales problem, there may be data about holiday sales, or sales data after an incident happens. If I want to make predictions for sales after those incidents, what can I do?

    What’s more, I noticed that you parse the date/time with a parser, but you did not really introduce a time feature into the model. For example, if I want to make a prediction for next Monday or next January, how can I feed in a time feature?


    • Avatar
      Jason Brownlee June 21, 2017 at 8:18 am #

      Yes, see this post for ideas on adding additional features:

      • Avatar
        Michael June 22, 2017 at 5:53 pm #

        Thanks for clarification.

        I have two more specific questions:
        1) In inverse_transform, why index = len(series) - n_test + i - 1?

        2) In fit_lstm, you said “reshape training into [samples, timesteps, features]”, but I think the code in line 74 is a little different from your format:

        73 X, y = train[:, 0:n_lag], train[:, n_lag:]
        74 X = X.reshape(X.shape[0], 1, X.shape[1])

        In line 74, I think it should be X = X.reshape(X.shape[0], X.shape[1], 1)

        • Avatar
          Jason Brownlee June 23, 2017 at 6:52 am #

          Hi Michael,

          Yes, the offset finds one step prior to the forecast in the original time series. I use this motif throughout the tutorial.

          In the very next line I say: “We will fix time steps at 1, so this change is straightforward.”

          • Avatar
            Mark March 6, 2020 at 12:50 am #

            Hi Jason,

            Firstly, thanks for all the excellent tutorials.

            I’m stepping through this example in detail and have hit the same question as Michael in (2) above. I’m afraid I don’t quite understand the comment “We will fix time steps at 1”.

            We need X to have dimensions [samples, timesteps, features]

            Therefore, should line 74 not read:

            X = X.reshape(X.shape[0], X.shape[1], 1) (as suggested by Michael)

            I’m expecting X.shape[1] to be the same as n_lag (i.e. timesteps) and in this example there is only 1 feature.

            If, as in your example, timesteps = n_lag = n_features = 1 this wouldn’t make a difference, however, I’m trying with n_lag = 2.

            For 1 feature with n_lag = 2 I’m expecting X.shape to be [n_samples, 2, 1] where as the code is giving me [n_samples, 1, 2]

            Thanks in advance, Mark.
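            The two layouts discussed here can be compared directly in NumPy. This is a minimal sketch with made-up values, not the tutorial's data; n_samples and the array contents are assumptions for illustration.

```python
import numpy as np

n_samples, n_lag = 5, 2
# Each row holds the n_lag lagged observations for one sample
X = np.arange(n_samples * n_lag, dtype=float).reshape(n_samples, n_lag)

# Layout used in the tutorial code: 1 time step, n_lag features
as_features = X.reshape(X.shape[0], 1, X.shape[1])

# Alternative layout: n_lag time steps, 1 feature
as_timesteps = X.reshape(X.shape[0], X.shape[1], 1)

print(as_features.shape)   # (5, 1, 2)
print(as_timesteps.shape)  # (5, 2, 1)
```

            Both arrays hold the same numbers; only the meaning the LSTM attaches to the axes differs. If the second layout is used, the input shape given to the network and any reshape done at prediction time would need to change to match.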

          • Avatar
            Jason Brownlee March 6, 2020 at 5:38 am #

            From memory, both the number of features and the number of time steps are 1, so the two layouts are equivalent.

            Also, perhaps this will help:

  13. Avatar
    Michael June 22, 2017 at 6:01 pm #

    Hi Jason,

    I would like to know how to do short-term and long-term prediction with the minimum number of models.

    For example, I have a 12-step input and 12-step output model A, and a 12-step input and 1-step output model B. Would model A give a better prediction for the first time step than model B?

    What’s more, a 1-step input and 1-step output model is more error prone for long-term prediction.
    A multi-step input and 1-step output model is still more error prone in the long term. So how should I think about long-term and short-term prediction?

    • Avatar
      Jason Brownlee June 23, 2017 at 6:53 am #

      I would recommend developing and evaluating each model for the different use cases. In practice, I find LSTMs are quite resistant to assumptions and rules of thumb.

  14. Avatar
    jzx June 25, 2017 at 1:17 pm #

    Hello, thanks for your tutorial
    If my data consists of three time series a, b, and c, and I would like to use a, b, and c to predict future values of a, how can I build my LSTM model?
    thank you very much!

    • Avatar
      Jason Brownlee June 26, 2017 at 6:05 am #

      Each of a, b, and c would be input features. Remember, the shape or dimensions of input data is [samples, timesteps, features].
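      One way to arrange three series as input features is sketched below. The function name frame_multivariate, the window length, and the toy values are assumptions for illustration; the target is taken to be the next value of the first series.

```python
import numpy as np

def frame_multivariate(series_list, n_steps):
    """Stack series as features and cut sliding windows of n_steps time steps."""
    data = np.column_stack(series_list)          # shape (n_obs, n_features)
    X, y = [], []
    for start in range(len(data) - n_steps):
        X.append(data[start:start + n_steps])    # window of all features
        y.append(data[start + n_steps, 0])       # next value of the first series
    return np.array(X), np.array(y)

a = np.arange(10.0)
b = a * 10
c = a * 100
X, y = frame_multivariate([a, b, c], n_steps=3)
print(X.shape, y.shape)  # (7, 3, 3) (7,)
```

      X then has the [samples, timesteps, features] shape, with a, b, and c along the last axis.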

  15. Avatar
    Kedar June 26, 2017 at 6:03 pm #

    Does stationarizing data really help the LSTM? If so, what is the intuition behind that? I mean, I can understand that for ARIMA-like methods, but why for LSTM’s?

    • Avatar
      Jason Brownlee June 27, 2017 at 8:27 am #

      Yes in my experience, namely because it is a simpler prediction problem.

      I would suggest trying a few different “views” of your sequence and see what is easiest to model / gets the best model skill.

  16. Avatar
    Michael June 28, 2017 at 5:47 pm #

    Hi Jason,

    I want to train a model with the following input size: [6000, 4, 2] ([samples, timesteps, features])

    For example, I want to predict shampoo sales for the next two years. If I have another feature, like an economic index for every year, can I concatenate the sales data and index data in the above format? My input will then be a 3D array. How should I modify the model to train on it?

    I always get this error: ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (6000, 2, 2).

    The error comes from this line: model.fit(X, y, epochs=1, batch_size=n_batch, verbose=0, shuffle=False). Can you provide some advice? Thanks!

    • Avatar
      Jason Brownlee June 29, 2017 at 6:32 am #

      Reshape your data to be [6000, 4, 2]

      Update the input shape of the network to be (4,2)

      Adjust the length of the output sequence you want to predict.
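      The error message in the question refers to the target, which is 3D while a Dense output layer produces 2D output. A minimal sketch of two possible ways to make the target 2D follows; the zero-filled arrays are placeholders, and treating feature 0 as the sales column is an assumption.

```python
import numpy as np

X = np.zeros((6000, 4, 2))   # [samples, timesteps, features], as in the question
y = np.zeros((6000, 2, 2))   # target shape reported in the error message

# Option 1: flatten the target so each sample is a single vector of 4 values
y_flat = y.reshape(y.shape[0], -1)

# Option 2: keep only the series to be forecast (feature 0 assumed to be sales)
y_sales = y[:, :, 0]

print(X.shape, y_flat.shape, y_sales.shape)  # (6000, 4, 2) (6000, 4) (6000, 2)
```

      With either option, the number of units in the final Dense layer would need to equal the second dimension of the chosen target.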

  17. Avatar
    shamsul July 11, 2017 at 11:31 am #


    To make one forecast with an LSTM, if we write

    oneforecast = forecast_lstm(model, X, n_batch)

    it says: undefined X

    what should be the value of X? we know the model and n_batch value?

    would you help?

    • Avatar
      Jason Brownlee July 12, 2017 at 9:38 am #

      X would be the input sequence required to make a prediction, e.g. lag obs.

  18. Avatar
    masum July 12, 2017 at 8:06 am #


    what if I want to tell the model to learn from train data (23 samples here) and want to forecast only 3 steps forward (Jan, Feb, Mar). I want to avoid persistence model in this case and only require 3 step direct strategy. hope you got that.

    any help would be grateful.

    train (past data) = forecast (Jan, Feb, Mar)

    • Avatar
      Jason Brownlee July 12, 2017 at 9:54 am #

      Perhaps I misunderstand, but this is the model presented in the tutorial. It predicts 3 time steps ahead.

      • Avatar
        masum July 12, 2017 at 11:00 am #

        # make a forecast for each test row with the fitted LSTM
        def make_forecasts(model, n_batch, train, test, n_lag, n_seq):
            forecasts = list()
            for i in range(len(test)):
                X, y = test[i, 0:n_lag], test[i, n_lag:]
                # make forecast
                forecast = forecast_lstm(model, X, n_batch)
                # store the forecast
                forecasts.append(forecast)
            return forecasts

        Here, if I would like to make only one forecast for 3 steps (Jan, Feb, March), what do I have to change? I do not need the rest of the months (April, May, June, July, Aug, …, Dec), just one prediction or forecast for 3 steps.

        hope you got me

        • Avatar
          Jason Brownlee July 13, 2017 at 9:47 am #

          Pass in only what is required to make the prediction for those 3 months.

          • Avatar
            masum July 13, 2017 at 10:16 am #


            Would you be kind enough to explain that in a bit more detail?

            I did not get it.

  19. Avatar
    Devakar Kumar Verma July 24, 2017 at 4:23 am #

    I am getting an error while parsing the date at time of loading the data from csv file.
    The error is:
    ValueError: time data ‘1901-Jan’ does not match format ‘%Y-%m’

    Anyone please help me to resolve this issue.

    • Avatar
      Jason Brownlee July 24, 2017 at 6:56 am #

      I’m sorry to hear that. Confirm you have copied the code exactly and the data file does not have any extra footer information.

    • Avatar
      p July 30, 2017 at 8:05 pm #

      I also have this problem.
      I downloaded the dataset from the link in the text.
      I think this error occurs because the data in our CSV file is not in the correct format!
      Can anyone give us the dataset, please?

      • Avatar
        Jason Brownlee July 31, 2017 at 8:15 am #

        Here is the raw data ready to go:

        • Avatar
          Dongchan October 9, 2017 at 9:26 am #


          I have the same issue. How can I fix the parser to resolve this error?

          • Avatar
            manuel December 1, 2017 at 5:57 am #

            You have to choose a CSV file separated with “,”; if the separator is “;” it will not work.

        • Avatar
          J. Berglund May 25, 2018 at 10:37 pm #

          This also occurred for me. The problem was that the first column in the .csv file (“m-y”) was by default set to “1-Jan, 1-Feb, …, 3-Dec”, which could not be matched against '%Y-%m'.

          However, by handcrafting the date column in excel, putting a ” ‘ ” before the date solved the problem. For example: ‘1-01, ‘2-01 .. etc.

          Hope this could help someone in the future. 🙂
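          An alternative to editing the file by hand is a parser that accepts both date styles. This is a sketch under the assumption that the tutorial's parser prefixes the value with '190', as the '1901-Jan' in the error message suggests. Note that %b matches abbreviated month names in the current locale, so it expects English names such as Jan only in an English or C locale.

```python
from datetime import datetime

def parser(x):
    """Parse '1-01' style or '1-Jan' style month labels into datetimes."""
    # Try the numeric month first, then the abbreviated month name
    for fmt in ('%Y-%m', '%Y-%b'):
        try:
            return datetime.strptime('190' + x, fmt)
        except ValueError:
            continue
    raise ValueError('unrecognised date label: %r' % x)

print(parser('1-01'))
print(parser('1-Jan'))
```

          The function can be passed to read_csv in place of the original parser.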

  20. Avatar
    Devakar Kumar Verma July 24, 2017 at 2:34 pm #

    The data file doesn’t have any footer, and I simply copied and pasted the code, but the date parser is throwing the error. I have no idea why it is behaving strangely.

    • Avatar
      Jason Brownlee July 25, 2017 at 9:27 am #

      Sorry, I don’t have any good ideas. It may be a Python environment issue?

  21. Avatar
    Josep July 31, 2017 at 8:15 pm #

    Hi Jason,
    Great explanation again. I have a doubt about this piece of code:

    # make a forecast for each test row with the fitted LSTM
    def make_forecasts(model, n_batch, train, test, n_lag, n_seq):
        forecasts = list()
        for i in range(len(test)):
            X, y = test[i, 0:n_lag], test[i, n_lag:]
            # make forecast
            forecast = forecast_lstm(model, X, n_batch)
            # store the forecast
            forecasts.append(forecast)
        return forecasts

    Why do you pass the parameter “n_seq” to the function if it has no use inside the function?

  22. Avatar
    Nara August 1, 2017 at 10:12 pm #

    How would I go about forecasting for a complete month. (Assuming I have daily data).
    Assuming I have around 5 years data 1.8k data points to train.

    I would like to use one year old data to forecast for the whole of next month?

    To do this should I change the way this model is trained?
    Is my understanding correct that this model tries to predict the next value by only using current value?

    • Avatar
      Jason Brownlee August 2, 2017 at 7:50 am #

      Yes, frame the data so that it predicts a month, then train the model.

      The model can take as input whatever you wish, e.g. a sequence of the last month or year.
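      Framing daily data so that a year of history predicts the next month could look like the sketch below. The helper name frame_series, the window sizes, and the synthetic series are assumptions for illustration.

```python
import numpy as np

def frame_series(values, n_in, n_out):
    """Build (X, y) pairs: n_in past observations -> next n_out observations."""
    values = np.asarray(values, dtype=float)
    n_rows = len(values) - n_in - n_out + 1
    X = np.array([values[i:i + n_in] for i in range(n_rows)])
    y = np.array([values[i + n_in:i + n_in + n_out] for i in range(n_rows)])
    return X, y

daily = np.arange(1800.0)                 # stand-in for about 5 years of daily data
X, y = frame_series(daily, n_in=365, n_out=30)
print(X.shape, y.shape)  # (1406, 365) (1406, 30)
```

      Each row of y is then a 30-day target, so the output layer would need 30 units for a one-shot monthly forecast.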

      • Avatar
        Nara August 3, 2017 at 3:12 am #

        Hey, thanks for the reply.

        This post really helped me.
        Now the next question is: how do we extend this to consider exogenous variables while forecasting?
        If I simply add exogenous variable values at this step:
        train, test = supervised_values[0:-n_test], supervised_values[-n_test:] (and obviously make appropriate changes to batch_input_shape in the model fit),
        would it help improve predictions?
        What is the correct way of adding independent variables?

        I have gone through this post of yours.
        It was helpful, but how do I do this using a neural network that has an LSTM?
        Can you please point me in the right direction?

  23. Avatar
    Kiran August 4, 2017 at 2:09 pm #

    Hi Jason, thanks for writing up such detailed explanations.
    I am using an LSTM layer for a time series prediction problem.
    Everything works fine except for when I try to use the inverse_transform to undo the scaling of my data. I get the following error:

    ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).

    Not really sure how I can get past this problem. Could you please help me with this ?

    • Avatar
      Jason Brownlee August 4, 2017 at 3:45 pm #

      It looks like you are trying to perform an inverse transform on NaN values.

      Perhaps try some print statements to help track down where the NaN values are coming from.

      • Avatar
        Kiran August 5, 2017 at 12:01 pm #

        Thank you for the reply. Yes, there are some NaN values in my predictions. Does that indicate a badly trained model ?

        • Avatar
          Jason Brownlee August 6, 2017 at 7:36 am #

          Your model might be receiving NaN as input, check that.

          It may be making NaN predictions with good input, in which case it might have had trouble during training. There are methods like gradient clipping that can address this.

          Figure out which case it is first though.
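          A small check like the following can tell the two cases apart. The helper name report_nans and the example arrays are hypothetical; in practice the arrays would be the model inputs and the model's predictions.

```python
import numpy as np

def report_nans(name, array):
    """Print and return the number of NaN or infinite entries in an array."""
    array = np.asarray(array, dtype=float)
    bad = ~np.isfinite(array)
    count = int(bad.sum())
    print("%s: %d non-finite value(s) out of %d" % (name, count, array.size))
    return count

train_X = np.array([[0.1, 0.2], [0.3, np.nan]])   # made-up input with one NaN
predictions = np.array([[0.5, 0.4, 0.3]])          # made-up clean predictions

report_nans("train_X", train_X)
report_nans("predictions", predictions)
```

          If the inputs are clean and only the predictions contain NaN, the problem is more likely in training than in data preparation.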

          • Avatar
            Kiran August 14, 2017 at 11:05 pm #

            Thanks ! My inputs do not have any NaN. Will check out gradient clipping.

          • Avatar
            Jason Brownlee August 15, 2017 at 6:37 am #

            Let me know how you go Kiran.

          • Avatar
            Ami Tabak January 22, 2018 at 6:59 pm #

            Hi Jason
            I encountered a data file format issue and NaN issues similar to those Kiran saw.
            The file I downloaded doesn’t have the 19 format
            Month,Sales of shampoo over a three year period

            So I changed the parser() just to return x , as is

            Then on the Multi-Step LSTM Network I got the following NaN

            ipdb> series
            01-Jan 266.0

            03-Nov 581.3
            03-Dec 646.9
            NaN NaN
            Sales of shampoo over a three year period NaN
            Name: Sales of shampoo over a three year period, dtype: float64

            I changed the call to use skipfooter , e.g.
            series = read_csv('shampoo-sales.csv', header=0, skipfooter=2, parse_dates=[0], index_col=0, squeeze=True, date_parser=parser)

            The net runs but achieved a slightly different training RMSE

            t+1 RMSE: 97.719515
            t+2 RMSE: 80.742075
            t+3 RMSE: 110.313295

          • Avatar
            Jason Brownlee January 23, 2018 at 7:51 am #

            Nice work!

            The differences are reasonably minor given the stochastic nature of the method:

          • Avatar
            Yasmine Sayed May 15, 2018 at 8:06 am #

            Hey Jason,
            I’m encountering a similar problem. None of the inputs in my train_x are NaN, but once I do the training and print train_predict, it gives me a whole array of NaN values. I also receive this error:
            ValueError: Input contains NaN, infinity or a value too large for dtype(‘float32’).

            Please help…

            Note: I am using a dataset of dates, value in this format(which is daily instead of monthly) because i want to forecast daily values: not sure if this is affecting anything in the code:


            Ive got about 1500 records.

          • Avatar
            Jason Brownlee May 15, 2018 at 8:09 am #

            You must scale your data prior to modeling.

          • Avatar
            Yasmine Sayed May 15, 2018 at 9:17 am #

            I did normalize the data before modeling. I did exactly what you did here in this code for the LSTM forecast. the only difference is mine is daily not monthly.
            this is how my train_x looks before building the model
            [[[0.939626 ]
            [0.9441713 ]

            [0.5557002 ]
            [0.5948241 ]
            [0.5920827 ]]

            [[0.9441713 ]
            [0.9214866 ]

            [0.5948241 ]
            [0.5920827 ]
            [0.5772988 ]]

          • Avatar
            Jason Brownlee May 15, 2018 at 2:43 pm #

            Interesting that you are getting NaNs. Perhaps the model requires further tuning, experiment and see if you can learn more about why it is happening.

          • Avatar
            Yasmine Sayed May 16, 2018 at 4:19 am #

            Hmm, well alternatively,
            I just used the same model & dataframe preparation from the other example with the airline passengers, and then i just took the make_forecast function from here, called it there and i passed the testX set as input ( so i guess its using the last value from testX to forecast into the future…?) and I called the model we built in that example as well.
            It made predictions… but for some reason , the predictions were just constantly increasing, even though this data is very cyclical, it goes up and down. – its weird because when we did the validating of the model – the accuracy was extremely impressive. but now when i try to predict a few time steps into the future – its not even nearly as accurate. and its just going upwards ….
            How can I solve this? Am I approaching this wrong?

            Thank you so much for your responses – it is really helpful for me

          • Avatar
            Jason Brownlee May 16, 2018 at 6:08 am #

            I would recommend tuning the model to the problem.

          • Avatar
            Yasmine Sayed May 16, 2018 at 4:54 am #

            also my predictions become nearly constant after about 25-30 steps

  24. Avatar
    Nara August 8, 2017 at 9:34 pm #

    Hi Jason,

    When I try step by step forecast. i.e. forecast 1 point and then use this back as data and forecast the next point, my predictions become constant after just 2 steps, sometimes from the beginning itself.

    In detail there. Can you say why this is happening? And which forecast method is usually better. Step by step or window type forecasts?

    Also, can you comment on when ARIMA/linear models can perform better than networks/RNNs?

    • Avatar
      Jason Brownlee August 9, 2017 at 6:30 am #

      Using predictions as input is bad as the errors will compound. Only do this if you cannot get access to the real observations.

      If your problem has a linear relationship, it will be better to model it with a linear model like ARIMA; the model will train faster and be simpler.

      • Avatar
        Nara August 11, 2017 at 10:09 pm #

        But that is how ARIMA models predict right?
        They do point by point forecast. And from my results ARIMA(or STL ARIMA or even XGBOOST) is doing pretty well when compared to RNN. 🙁

        But i haven’t considered stationarity and outlier treatment and I see that RNN performs pathetically when the data is non stationary/has outliers.

        Is this expected? I have read that RNN should take care of stationarity automatically?

        Also, will our results be bad if we do first order differencing even when there is no stationarity in the data?

        And as for normalization, is it possible that for some cases RNN does well without normalizing?
        When is normalization usually recommended? When standard deviation is huge?

        • Avatar
          Jason Brownlee August 12, 2017 at 6:49 am #

          I have found RNNs to not perform well on autoregression problems, and they do better with more data prep (e.g. removing anything systematic). See this post:

          Generally, don’t difference if you don’t need to, but test everything to be sure.

          Use standardization if the distribution is Gaussian, normalization otherwise. RNNs like LSTMs need good data scaling; MLPs less so in this age of relu.
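          The two kinds of scaling can be written in a few lines of NumPy. This is a minimal sketch; the function names, the target range of -1 to 1, and the sample values are assumptions for illustration.

```python
import numpy as np

def standardize(train, values):
    """Rescale to zero mean and unit variance using training statistics."""
    mean, std = train.mean(), train.std()
    return (values - mean) / std

def normalize(train, values, low=-1.0, high=1.0):
    """Rescale linearly so the training minimum and maximum map to low and high."""
    lo, hi = train.min(), train.max()
    return low + (values - lo) * (high - low) / (hi - lo)

train = np.array([10.0, 20.0, 30.0, 40.0])
print(standardize(train, train))
print(normalize(train, train))
```

          Both functions take the statistics from the training values only, so new observations can be transformed later without refitting.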

          • Avatar
            Nara August 13, 2017 at 1:34 am #

            Oh then a hybrid model using residuals from ARIMA for RNN should work well 🙂 ?
            The residuals will not have any seasonal components.(even scaling should be well taken care of)
            Or here also do you expect MLPs to work better?

          • Avatar
            Jason Brownlee August 13, 2017 at 9:55 am #

            It is hard to know for sure. I recommend running experiments to collect data rather than guessing.

  25. Avatar
    Nights August 13, 2017 at 5:37 am #

    I think there is an issue with inverse differencing while forecasting multiple steps (to deal with non-stationary data).
    This example adds the previously forecasted (and inverse differenced) value to the currently forecasted value. Isn’t this method wrong when we have 30 points to forecast, as it keeps adding up the results and hence the output will continuously increase?

    Below is the output I got.

    Instead, should I just add the last known real observation to all the forecasted values? I don’t suppose that would work either.

    • Avatar
      Jason Brownlee August 13, 2017 at 9:58 am #

      It could be an issue for long lead times, as the errors will compound.

      If real obs are available to use for inverse differencing, you won’t need to make a forecast for such a long lead time and the issue is moot.

      Consider contrasting model skill with and without differencing, at least as a starting point.
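      For reference, inverting first-order differences over several steps amounts to a cumulative sum starting from the last real observation. The sketch below uses made-up numbers and a simplified version of the function, not the tutorial's exact code.

```python
import numpy as np

def inverse_difference(last_ob, forecast_diffs):
    """Turn forecast step-to-step differences back into levels."""
    # step 1 = last_ob + d1, step 2 = step 1 + d2, and so on
    return last_ob + np.cumsum(forecast_diffs)

last_ob = 100.0
diffs = np.array([5.0, -2.0, 3.0])
print(inverse_difference(last_ob, diffs))  # [105. 103. 106.]
```

      The running sum is the correct inverse when each forecast value is a change relative to the previous step, which is why an error in an early step carries into every later step. Adding the last observation to every value would only be right if the model had been trained to predict changes relative to that single observation.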

  26. Avatar
    Sandra August 14, 2017 at 5:46 pm #

    Hi, thank you for your helpful tutorial.

    I have a question regarding a seq-to-seq time series forecasting problem with a multi-step LSTM.

    I have created a supervised dataset of (t-1), (t-2), (t-3)…, (t-look_back) and (t+1), (t+2), (t+3)…, (t+look_ahead), and our goal is to forecast look_ahead timesteps.

    We have tried your complete example code of doing a dense(look_ahead) last layer but received not so good results. This was done using both a stateful and non-stateful network.

    We then tried using Dense(1) and then repeatvector(look_ahead), and we get the same (around average) value for all the look_ahead timesteps. This was done using a non-stateful network.

    Then I created a stepwise prediction where look_ahead = 1 always. The prediction for t+2 is then based on the history of (t+1)(t)(t-1)… This has given me better results, but only tried for non-stateful network.

    My questions are:
    – Is it possible to use RepeatVector with non-stateful networks? Or must the network be stateful? Do you have any idea why my predictions are all the same value?
    – What network do you recommend for this type of problem? Stateful or non-stateful, seq-to-seq or stepwise prediction?

    Thanks in advance!

    • Avatar
      Jason Brownlee August 15, 2017 at 6:32 am #

      Very nice work Sandra, thanks for sharing.

      The RepeatVector is only for the Encoder-Decoder architecture, to ensure that each time step in the output sequence has access to the entire fixed-width encoding vector from the Encoder. It is not related to stateful or stateless models.

      I would develop a simple MLP baseline with a vector output and challenge all LSTM architectures to beat it. I would look at a vector output on a simple LSTM and a seq2seq model. I would also try the recursive model (feed outputs as inputs for repeating a one step forecast).

      It sounds like you’re trying all the right things.

      Now, with all of that being said, LSTMs may not be very good at simple autoregression problems. I often find MLPs outperform LSTMs on autoregression. See this post:

      I hope that helps, let me know how you go.

  27. Avatar
    Oscar August 16, 2017 at 1:28 am #

    Hi Jason,
    Thanks for your tutorials. I’m trying to learn ML and your webpage is very useful!

    I’m a bit confused by the inverse_difference function, specifically by the last_ob that I need to pass.

    Let’s say I have the following:

    Raw data        Difference   Scaled   Forecasted values
    raw_val2=.35    -.05         -.045    [0.80048585, 0.59788215, -0.13518856]
    raw_val3=.29    -.06         -.054    [0.65341175, 0.37566081, -0.14706305]
    raw_val4=.28    -.01         -.009    [0.563694, -0.09381149, 0.03976132]

    When passing the last_ob to the inverse_difference function which observation do I need to pass to the function, raw_val2 or raw_val1?

    My hunch is that I need to pass raw_val2. Is that correct?

    Also, in your example, in the line:

    forecasts = inverse_transform(series, forecasts, scaler, n_test+2)

    What’s the reason for this n_test+2?

    Thanks in advance!

  28. Avatar
    Jaskaran August 17, 2017 at 10:57 am #

    Hi Jason,
    Great work.

    I had a question. When reshaping X for lstm (samples,timesteps,features) why did you model the problem as timesteps=1 and features=X.shape[1]. Shouldn’t it be timesteps = lag window size
    and the output dense layer have the size of horizon_window. This will give much better results in my opinion.

    Here is a link which will make my question more clear:

  29. Avatar
    hanoun August 18, 2017 at 11:37 am #

    Hi, I am trying to use this example to identify the shape switch an angle. Is this tutorial useful for that, and how can I test the model after I train it?

  30. Avatar
    A August 19, 2017 at 7:53 am #

    Hi there – I love your blog and these tutorials! They’re really helpful.

    I have been studying both this tutorial and this one: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/.
    I have applied both codes to a simple dataset I’m working with (date, ROI%). Both codes run fine with my data, but I’m having a problem that has me completely stumped:

    With this code, I’m able to actually forecast the future ROI%. With the other, it does a lot better at modeling the past data, but I can’t figure out how to get it to forecast the future. Both codes have elements I need, but I can’t seem to figure out how to bring them together.

    Any insight would be awesome! Thank you!

  31. Avatar
    Ankit August 22, 2017 at 11:34 pm #

    Jason, first of all, I would like to thank you for the work you’ve done. It has been tremendously helpful.

    I have a question and seeking your expert opinion.

    How to handle a time series data set with multiple and variable granularity input of each time step. for instance, consider the dataset like below:

    Date | Area | Product category | Orders | Revenue | Cost

    so, in this case, there would be multiple records for a single day aggregated on date and this is the granularity I want.

    How should this kind of data be handled, since these features will contribute to the Revenue and Orders?

    • Avatar
      Jason Brownlee August 23, 2017 at 6:53 am #

      You could standardize the data and feed it into one model or build separate models and combine their predictions.

      Try a few methods and see what works best for your problem.

  32. Avatar
    Daniel August 24, 2017 at 2:07 am #

    I am using this framework for my first shot at an LSTM network for monitoring network response times. The data I’m working with currently is randomly generated by simulating API calls. What I’m seeing is the LSTM seems to always predict a return to what looks like the mean of the data. Is this a function of the data being stochastic?

    Separate question: since LSTM’s have a memory component built into the neurons, what are the advantages/disadvantages of using a larger n_in/n_lag than 1?

    • Avatar
      Jason Brownlee August 24, 2017 at 6:48 am #

      The problem might be too hard for your model. Perhaps tune the LSTM or try another algorithm?

      A key benefit of LSTMs is that the lag can extend much longer than with other methods, e.g. hundreds of time steps. This means you are modeling something like:

      yhat = f(t-1, …, t-500)

      And the model can reproduce something it saw 500 time steps ago if needed.

      • Avatar
        Daniel August 26, 2017 at 3:34 am #

        Thanks. I am playing with some toy data now just to make sure I’m understanding how this works.

        I am able to model a cosine wave very nicely with a 5 neuron, 100 epoch training run against np.cos(range(100)) split into 80/20 training set. This is with the scaling, but without the difference. I feed in 10 inputs, and get 30 outputs.

        Does calling model.predict change the model? I am calling repeatedly with the same 10 inputs and am seeing a different result each time. It looks like the predicted wave cycles through different amplitudes.

        • Avatar
          Daniel August 26, 2017 at 4:09 am #

          Ah ok, I got it. Since stateful is on, I would need to do an explicit reset_states between predictions. Makes sense, I think! Stateful was useful for training, but since I won’t be “online learning” and since I feed the network lag in the features, I should not rely on state for predictions.

        • Avatar
          Jason Brownlee August 26, 2017 at 6:48 am #

          Nice work!

          Yes, generally scaling is important, but if your cosine wave values are in [0,1] then you’re good.

      • Avatar
        Daniel August 26, 2017 at 6:03 am #

        I have a simple question. I am trying to set up a different toy problem, with data generated as y=x over 800 points (holding out the next 200 as validation). No matter how many layers, neurons, or epochs I train with, the predictions start out fairly close to the line for lower values, but diverge quickly and approach some fixed y=400 for higher values.

        Do you have any ideas why this would happen?

        • Avatar
          Jason Brownlee August 26, 2017 at 6:51 am #

          May be error accumulating. You’re giving the LSTM a hard time.

  33. Avatar
    Daniel September 1, 2017 at 2:47 am #

    Can I get your input on this issue I’m having? I would really like to make sure that I’m not implementing this incorrectly. If there are network parameters I need to tune, I can go through that exercise. But I am not feeling confident that I am on the right path with this problem. https://stackoverflow.com/questions/45982445/keras-lstm-time-series-multi-step-predictions-has-same-output-for-any-input

  34. Avatar
    lucius September 1, 2017 at 6:14 pm #

    Hi, there is a problem with the code. When doing the data processing, i.e. calculating the difference and min-max scaling, you should not use all of the data. In a more realistic situation, you can only do this on the training data, since you have no knowledge of the test data.

    So I changed the code and cut the last 12 months as the test set. I then used only 24 months of data for differencing, min-max scaling, and fitting the model, and predicted months 25, 26, 27.

    Then I continued with 25 months of data for differencing, min-max scaling, and fitting the model, and predicted months 26, 27, 28.

    The final result is worse than the baseline!

    • Avatar
      Jason Brownlee September 2, 2017 at 6:04 am #

      Correct, this is a simplification I implemented to keep the tutorial short and understandable.
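The point raised in this thread, fitting the scaling on the training portion only, can be sketched in plain Python. This is a minimal illustration, not the tutorial's code: the helper names `fit_min_max` and `scale` and the data values are hypothetical, and the tutorial itself uses scikit-learn's `MinMaxScaler` on the whole differenced series.

```python
def fit_min_max(train_values):
    # Learn the scaling range from the training portion only.
    return min(train_values), max(train_values)


def scale(values, lo, hi, feature_range=(-1.0, 1.0)):
    # Map values linearly so that lo -> a and hi -> b.
    a, b = feature_range
    return [a + (v - lo) * (b - a) / (hi - lo) for v in values]


series = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical data
n_test = 2
train, test = series[:-n_test], series[-n_test:]

lo, hi = fit_min_max(train)          # the test values are never inspected here
train_scaled = scale(train, lo, hi)
test_scaled = scale(test, lo, hi)    # may fall outside [-1, 1]
```

With a trending series the scaled test values can land outside the training range, which may be one reason results look worse once the leak is removed.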

  35. Avatar
    Eldar M. September 17, 2017 at 1:47 am #

    Hi Jason, I was able to get slightly better results with a custom loss function (weighted mse)

    from keras import backend as K

    def weighted_mse(yTrue, yPred):
        # weights 1, 1/2, 1/3, ... across the forecast steps
        ones = K.ones_like(yTrue[0,:])
        idx = K.cumsum(ones)
        return K.mean((1/idx)*K.square(yTrue-yPred))

    Credit goes to Daniel Möller on Stack Overflow, as I was not able to figure out the tensor modification steps on my own and he responded to my question there.
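To make the weighting in the loss above concrete, here is a plain-Python sketch of the same arithmetic, assuming the cumulative sum runs over the forecast steps so that step k (1-based) receives weight 1/k. The function names are illustrative and this is not a Keras loss.

```python
def step_weights(n_seq):
    # Weight for forecast step k (1-based) is 1/k: 1, 1/2, 1/3, ...
    return [1.0 / k for k in range(1, n_seq + 1)]


def weighted_mse(y_true, y_pred):
    """Mean of (1/k) * squared error over all samples and forecast steps."""
    weights = step_weights(len(y_true[0]))
    total, count = 0.0, 0
    for true_row, pred_row in zip(y_true, y_pred):
        for w, t, p in zip(weights, true_row, pred_row):
            total += w * (t - p) ** 2
            count += 1
    return total / count


# One sample, three forecast steps, each off by exactly 1.
loss = weighted_mse([[0.0, 0.0, 0.0]], [[1.0, 1.0, 1.0]])
```

Errors at later steps count for less, so the optimizer is nudged toward getting the nearest steps right first.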

  36. Avatar
    Alex September 23, 2017 at 1:53 am #

    def make_forecasts(model, n_batch, train, test, n_lag, n_seq):
        forecasts = list()
        for i in range(len(test)):
            X, y = test[i, 0:n_lag], test[i, n_lag:]
            # make forecast
            forecast = forecast_lstm(model, X, n_batch)
            # store the forecast
            forecasts.append(forecast)
        return forecasts

    What is the point of the “train” data set as parameter in this function if it is not used?

    • Avatar
      Jason Brownlee September 23, 2017 at 5:43 am #

      Yep, looks like it’s not used. You can probably remove it.

  37. Avatar
    Fei September 24, 2017 at 1:51 am #

    Hello, this is a very useful tutorial. I am a beginner in Python and programming. May I change the input of the model to 4 variables, or to more than one variable? And can I change n_batch to a number other than 1?

  38. Avatar
    Fei September 26, 2017 at 4:33 am #

    But when I change the n_batch size, the model does not work. By the way, you said to manually run the epochs of the model; would you tell me how to do it?

  39. Avatar
    Fabian September 29, 2017 at 7:41 pm #

    Hi Jason,
    thanks a lot for your tutorials on LSTMs.
    Do you have a suggestion how to model the network for a multivariate multi-step forecast? I read your articles about multivariate and multi-step forecast, but combining both seems to be more tricky as the output of the dense layer gets a higher dimension.

    In words of your example here: if I want to forecast not only shampoo but also toothpaste sales T time steps ahead, how can I achieve the forecast to have the dimension 2xT? Is there an alternative to the dense layer?

    • Avatar
      Jason Brownlee September 30, 2017 at 7:38 am #

      I see. You could have two neurons in the output layer of your network, as easy as that.

  40. Avatar
    Camille September 30, 2017 at 9:07 am #

    Thanks for this great tutorial. Do you think this technique is applicable on the case of a many-to-many prediction?

    A toy scenario: Imagine a machine which has 5 tuning knobs [x1, x2, x3, x4, x5], and as a result we can read 2 values [y, z] as a response to a change of any of the knobs.

    I am wondering if I can use an LSTM to predict y and z with a single model instead of building one model for y and another for z? I am planning to follow this tutorial, but I would love to hear what you think about it.

  41. Avatar
    Jean-Marc September 30, 2017 at 12:08 pm #

    Hi Jason, thank you very much for this tutorial. I am just starting with LSTM and your series on LSTM is greatly valuable.
    A question about multi-output forecasting: how to deal with a multi-output when plotting the true data versus the predicted data.
    Let’s say I have a model to forecast the next 10 steps (t, t+1…,t+9).
    Using the observation at time:
    –> t=0, the model will give a forecast for t =1,2,3,4,5,6,7,8,9,10
    and similarly, at
    –> t=1, a forecast will be output for t=2,3,4,5,6,7,8,9,10,11
    There is overlap in the timesteps for the forecast from t=0 and from t=1. For example, if I want to know the value at t=2, should I use the forecast from t=1 or from t=0, or a weighted average of the forecasts?

    Maybe using only the forecast from t=1 is enough, because it already includes the history of the time series (i.e. it already includes the observation at t=0).

    • Avatar
      Jason Brownlee October 1, 2017 at 9:06 am #

      I’m not sure I follow. Perhaps you might be better off starting with linear models then move to an LSTM to lift skill on a framing/problem that is already working:

    • Avatar
      Kai Ding February 15, 2019 at 2:09 am #

      Hello Jean-Marc

      “For example, if I want to know the value at t=2, should I use the forecast from t=1 or from t=0, or a weighted average of the forecast?”

      I have the same question, do you know how to fix this “overlap” problem?

      • Avatar
        Jason Brownlee February 15, 2019 at 8:10 am #

        I’m not sure I follow, can you elaborate what you are trying to achieve with an example, e.g. an input and output?
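The overlap described in this thread can be made explicit by grouping forecasts by the time step they target. The sketch below is a hypothetical illustration with made-up numbers; `collect_overlapping`, `latest_origin`, and `simple_average` are not part of the tutorial, and which combination rule works better is something to test on your own data.

```python
from collections import defaultdict


def collect_overlapping(forecasts):
    """Group multi-step forecasts by the time step they target.

    forecasts maps an origin time t to the values predicted
    for t+1, t+2, ..., t+n_seq.
    """
    by_target = defaultdict(list)
    for origin in sorted(forecasts):
        for step, value in enumerate(forecasts[origin], start=1):
            by_target[origin + step].append((origin, value))
    return dict(by_target)


def latest_origin(candidates):
    # Keep the forecast made from the most recent origin (shortest horizon).
    return max(candidates)[1]


def simple_average(candidates):
    # Combine every available forecast for the target with equal weight.
    return sum(value for _, value in candidates) / len(candidates)


forecasts = {0: [10.0, 12.0, 14.0], 1: [11.0, 13.0, 15.0]}  # made-up values
by_target = collect_overlapping(forecasts)
value_latest = latest_origin(by_target[2])
value_mean = simple_average(by_target[2])
```

Here the value at t=2 has two candidates, one from origin t=0 (two steps ahead) and one from origin t=1 (one step ahead).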

  42. Avatar
    mr October 1, 2017 at 9:53 pm #

    return datetime.strptime('190'+x, '%Y-%m')

    gives me:

    ValueError: time data '1901/1' does not match format '%Y-%m'

    Thanks in advance

    • Avatar
      Jason Brownlee October 2, 2017 at 9:38 am #

      Perhaps confirm that you downloaded the dataset in CSV format.
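The error message shows a slash where the format string expects a hyphen. If the downloaded file really does contain values such as `1/1`, one possible workaround is to normalise the separator before parsing. This is a sketch under that assumption; check the actual contents of your CSV first.

```python
from datetime import datetime


def parser(x):
    # Normalise the separator so that both '1-01' and '1/1' parse as year-month.
    return datetime.strptime('190' + x.replace('/', '-'), '%Y-%m')


first = parser('1-01')   # form expected by the tutorial's format string
second = parser('1/1')   # form seen in the error message above
```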

  43. Avatar
    wmbm October 4, 2017 at 10:29 pm #

    So you don’t actually need to split the data into test and training sets because you don’t use the training set in this code. So this then becomes an unsupervised problem?

  44. Avatar
    Noah yao October 16, 2017 at 2:33 pm #

    Sorry, I am confused about the function inverse_transform: why do you use n_test+2 in the function and not n_test?

  45. Avatar
    RRighart October 20, 2017 at 9:12 pm #

    Hi Jason,

    Thank you very much for a very nice post!

    You explained that “A rolling-forecast scenario” will be used, also called walk-forward model validation. You said “Each time step of the test dataset will be walked one at a time. A model will be used to make a forecast for the time step, then the actual expected value for the next month from the test set will be taken and made available to the model for the forecast on the next time step”.

    What method / algorithm would you suggest in the scenario where no such test/validation data are available? In other words, I have a collection of time-series data that stops at a certain point, and I need to forecast the next points.

    Thank you very much in advance for your advice!

  46. Avatar
    Prakash Anand October 21, 2017 at 10:57 pm #

    Hi Jason,

    Thanks for this wonderful tutorial. I’m trying to solve a problem and wanted your input. I have 2 years of daily sales data with some other predictor variables such as holidays, promotions, etc., let’s say Jan 2015 to Jan 2017, and I want to forecast the month of Feb. I was thinking that the data preparation would take the last 60 days of data as the input sequence and predict the next 30 time steps. Since the dataset is very small, do you think it will work? What is your suggestion on this?

    • Avatar
      Jason Brownlee October 22, 2017 at 5:21 am #

      Try it.

      Generally, predicting 30 days ahead is very hard unless you have a ton of data or the problem is relatively simple.

      • Avatar
        Prakash Anand October 22, 2017 at 6:24 am #

        yeah. that’s my concern too. because the dataset is very small.

  47. Avatar
    Bryant October 24, 2017 at 8:12 pm #

    Mr Jason
    I have two questions:
    1. In this example, three rmses are exported. What should I do if I want to output the three predictions for each time step and integrate all the output into a data box(Easy to observe)?
    2. What if I need to do 6- months, 12-month predictions? How do I change it?
    I’m sorry that my python is not very good.
    thank you so much!

  48. Avatar
    Derrick October 25, 2017 at 1:29 am #

    Hi Jason,

    I’m working through your tutorial but I’m running into an issue during the reshape in the ‘prepare_data’ function.

    My current shape of the data that I use is as follows:
    (156960, 3)

    But the reshape in the prepare_data function tells me this:

    ValueError Traceback (most recent call last)
    in ()
    —-> 1 train, test = prepare_data(X, 15696, 2, 4)

    in prepare_data(series, n_test, n_lag, n_seq)
    3 # extract raw values
    4 raw_values = series.values
    —-> 5 raw_values = raw_values.reshape(len(raw_values), 1)
    6 # transform into supervised learning problem X, y
    7 supervised = series_to_supervised(raw_values, n_lag, n_seq)

    ValueError: cannot reshape array of size 470880 into shape (156960,1)

    This array size of 470880 is three times 156960, which is the len(size of my data).

    Would you have any advice on how I could solve this issue?
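The numbers in the traceback suggest that the data has three columns, while the reshape in `prepare_data` assumes a single univariate series. The NumPy sketch below reproduces the same mismatch on a small hypothetical array and shows one possible fix, which is to select a single column first. Which column is the right target depends on your data.

```python
import numpy as np

# Stand-in for series.values with 3 columns (hypothetical data).
raw_values = np.arange(12, dtype=float).reshape(4, 3)

# reshape(len(raw_values), 1) asks for 4 elements, but the array holds 12.
# This is the same kind of mismatch as 470880 vs 156960 in the traceback.
try:
    raw_values.reshape(len(raw_values), 1)
    reshape_failed = False
except ValueError:
    reshape_failed = True

# Selecting one column first gives a univariate series that fits.
target = raw_values[:, 0].reshape(len(raw_values), 1)
```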

  49. Avatar
    Kishore Kumar November 11, 2017 at 8:17 pm #

    Hi Jason,

    I am a beginner in machine learning. These tutorials are helping me so much to learn and improve. Thanks a ton for posting all your explorations.

    Now I have a question to ask you,

    We have 36 months of data in this example. Now I need to know the 37th-month forecast. How would I predict it with this model?

    Should I reshape the new value before I predict, or pass the new data directly to the model’s predict function?

    new_data = 145
    predicted_output = model.predict(new_data, verbose = 0)


    new_data = 145
    x = x.reshape(1,1,1)
    predicted_output = model.predict(x, verbose = 0)


    Do we need any other method to do so?

    Note: Based on your answer, I would like to make the 4-month prediction.

    Thanks in advance for your time and help

    • Avatar
      Jason Brownlee November 12, 2017 at 9:04 am #

      This post has more advice on how to reshape input data:

      This post shows how to make predictions for final LSTM models:
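As a small illustration of the reshaping question in this thread: an LSTM in Keras expects 3D input of the form [samples, timesteps, features], so a single new observation needs to be wrapped in an array first. The value 145 is taken from the question above; everything else is a sketch.

```python
import numpy as np

new_data = 145  # latest observation, as in the question above

# A bare number has no shape; wrap it in an array and reshape it to
# [samples, timesteps, features] = [1, 1, 1].
x = np.array([new_data], dtype=float).reshape(1, 1, 1)
```

Note that the model in this tutorial is trained on differenced and scaled values, so a raw observation would need the same transforms applied before being passed to `model.predict`, and the output would need to be inverted afterwards.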

      • Avatar
        Kishore Kumar November 12, 2017 at 8:24 pm #

        Thanks for your reply.

        I see two different prediction results when I save the model and then predict with the loaded model.

        The forecast results are the same however many times I run the model before saving it.

        The saved and loaded model also gives the same prediction output every time I run it.

        The problem is that the results given before saving the model do not match those from the loaded model.

        It looks like something changes inside the trained model when saving it.
        Before saving, the model provides 98% accuracy. After saving and loading, it gives 90% accuracy.

        Can you help me clarify this doubt? I have provided the code snippet with the output below. The saving and loading both happen in one single Python program, not in multiple Python scripts.

        Note: I am experimenting with a different dataset, that contains prices in decimals and similar to this tutorial dataset.

        Program Code:
        value = [ 0.0568]
        value = array(value)
        value = value.reshape(1, 1, len(value))
        predicted_example = model.predict(value, batch_size=1, verbose = 0)
        print("predicted example %s" % predicted_example)


        model_storage_1 = load_model('saved_keras_model_1.h5')

        predicted_example_1 = model_storage_1.predict(value, batch_size=1, verbose = 0)

        print("predicted example_1 %s" % predicted_example_1)


        output received:

        predicted example [[-0.0193442 0.01113211 -0.00196517 0.00191608 -0.00315076 0.0080449]]

        predicted example_1 [[-0.02511037 0.01445036 -0.00255096 0.00248715 -0.00408998 0.0104428]]

        • Avatar
          Jason Brownlee November 13, 2017 at 10:15 am #

          That is very interesting.

          I don’t have any good ideas. If it is mission critical, I would suggest designing experiments to further tease out the cause and limits of the effect.

          • Avatar
            Kishore Kumar November 13, 2017 at 9:32 pm #

            That’s fine. By the way, why are these predicted values both negative and positive? What does it mean? Do we need to transform them further with some other function or operation?

  50. Avatar
    jiawenqi November 13, 2017 at 7:47 pm #

    model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
    When X.shape[1] = 1, the number of time steps is 1. The LSTM then loses its meaning, because it becomes a regression model.

  51. Avatar
    Abdur Rehman Nadeem December 1, 2017 at 11:06 am #

    Hi Jason,

    Your blogs are really great. I have learned, and am still learning, a lot from them.

    I am trying to feed tweet sentiments to an LSTM along with some numeric features (e.g. price, volume), but I have not succeeded yet. I have read some blogs and papers, but everywhere tweets and numeric features are fed separately, whereas I want to feed both of them as one feature vector.
    Any good suggestions?

    Best Regards,

  52. Avatar
    ktr December 1, 2017 at 9:39 pm #

    Thank you Jason
    I’ve been working through your tutorials, which are quite useful and
    clear, even to a non-Python programmer. In this one, though, I lost the thread around
    “Fit LSTM Network”. I’m concerned about “fix time steps at 1”.

    What about when the timesteps are not a constant size? A specific example: I am
    driving, recording my position, acceleration, direction and time every five minutes.
    For various reasons the five minutes is approximate. Also, sometimes I lose the
    GPS, so I miss one or several records.

    Obviously position depends on time. Should I resample all my records so the time periods are equal? Should I interpolate to provide the missing ones? What if I stop overnight? Can I somehow stitch the two days’ data together?

    Second question: where in this tutorial are you providing the punishment feedback to the model? I want to use an asymmetric function. (If I want to drive up to the edge of a precipice, it is much worse to go too far than not quite far enough.)
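On the second question: the penalty the model learns from is its loss function, which in this tutorial is mean squared error and therefore symmetric. The plain-Python sketch below illustrates one possible asymmetric alternative, where overshooting costs more than undershooting. The function name and the `over_weight` parameter are hypothetical, and an actual Keras version would have to be written with backend tensor operations.

```python
def asymmetric_squared_error(y_true, y_pred, over_weight=4.0):
    """Mean squared error that penalises overshooting more heavily.

    over_weight is an illustrative knob: predictions above the true
    value cost over_weight times as much as predictions below it.
    """
    total = 0.0
    for t, p in zip(y_true, y_pred):
        err = p - t
        weight = over_weight if err > 0 else 1.0
        total += weight * err ** 2
    return total / len(y_true)


too_far = asymmetric_squared_error([10.0], [11.0])    # overshoot by 1
too_short = asymmetric_squared_error([10.0], [9.0])   # undershoot by 1
```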


  53. Avatar
    Vino Jose December 5, 2017 at 1:33 am #

    Thank you Jason for the wonderful blog post. Could you please give a hint about how to predict multi-steps for this multivariate input?

  54. Avatar
    Vino Jose December 10, 2017 at 5:14 pm #

    I have to predict the performance of an application. The inputs will be time series of past performance data of the application, CPU usage data of the server where the application is hosted, memory usage data, network bandwidth usage, etc. I’m trying to build a solution using an LSTM which will take these input data and predict the performance of the application for the next week. I have followed your blog ‘https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/’ and understood how to work with multivariate data. I’m currently stuck at the part about predicting multiple steps into the future, i.e. the next week of application performance. Even though multi-step prediction is working for me with univariate time series examples, here it is not working. Not sure what I’m missing. Could you please give me some guidance on doing that?

    • Avatar
      Jason Brownlee December 11, 2017 at 5:24 am #

      What is the problem exactly? Where are you getting stuck?

      • Avatar
        Vino Jose December 12, 2017 at 9:51 pm #

        I’m getting only one data point in the predicted result, while I’m expecting one weeks data points.

  55. Avatar
    G Flash December 12, 2017 at 7:57 pm #

    Hi Jason,

    thanks for that great blog! I have a general question about multi-step predictions. Your prediction of t+3 is – as I understand it – independent from the prediction of t+2, which itself is independent of t+1.

    Is it meaningful to consider feeding the former predictions back into the network? If yes, what is such a model called?

  56. Avatar
    Yang December 27, 2017 at 6:15 pm #

    Hi Jason,
    Thanks for the great tutorial! I have several questions about the predictions. If I try to deal with a dataset which contains about 6000 observations, is it meaningful to make predictions from t+1 to t+500 (if n_test=1)?
    By the way, when plotting the predictions, there is a small shift from the last data point. Is it the result of the transform from series to supervised? Maybe I mistook something.


  57. Avatar
    Andreas January 16, 2018 at 6:10 am #


    Would it be beneficial to also use which time step (t+k) we are predicting as input to the model? Right now we are considering all data points in the span specified by n_seq as “the same time step away from where we are predicting from”.

    Best Regards & Thanks,

  58. Avatar
    Martin January 29, 2018 at 8:35 am #

    Hi Jason
    Many thanks for your very helpful tutorials. I would be very happy to get some help regarding this problem:
    Given is a time series with 20 input variables and one output variable.
    The series length is about 500 samples. For 5 of the 20 variables, there are also future samples available (50 samples). I wonder how I can use the future values of these 5 variables in order to improve the prediction.
    Many thanks for a helpful hint.
    Best Regards

    • Avatar
      Jason Brownlee January 30, 2018 at 9:44 am #

      What do you mean by “future samples”?

      • Avatar
        Martin February 6, 2018 at 4:46 am #

        Hi Jason

        For 5 of the 20 input variables (x1..x5), I already have the values for the next 50 timesteps. (These values are given.) So I don’t need to predict them, but I want to use them to improve the prediction for the (one) output variable y. (There is no need to predict the other 15 input values x6–x20 either.)

        x1….x5, x6..x20, y
        t0 1, .. 2, 4, .. 7, 10
        t1 1, .. 3, 4, .. 5, 11
        t500 2, … 5, 5, … 8, 14
        t501 2, … 4, ?????? ?
        t550 2, … 3, ?????? ?

        Many thanks in advance

  59. Avatar
    Mohammad February 6, 2018 at 5:59 am #

    Dear Jason thanks for awesome codes and explanation, I have one question for you. In this case, one wants to estimate multi-step in future, right? for example 10 steps ahead. But all of the 10 steps are unknown. The model should find them without using the actual value. But what I see here in test sets or train sets is that the model estimates data points considering actual values not predicted.
    Let’s see some of data together:
    [[ 342.3 339.7 440.4 315.9]
    [ 339.7 440.4 315.9 439.3]
    [ 440.4 315.9 439.3 401.3]]

    let’s imagine model predicts that for first row [ 342.3 339.7 440.4 315.9] the predicted value is 439.4 but actually the correct and actual value is 439.3 (which we don’t know!). So in the second row we should consider [ 339.7 440.4 315.9 439.4] instead of [ 339.7 440.4 315.9 439.3].

    Please elaborate this for me more.

    • Avatar
      Jason Brownlee February 6, 2018 at 9:23 am #

      Sure, what is the question exactly?

      • Avatar
        Mohammad February 6, 2018 at 11:36 am #

        The question is this, when you say this method is capable of multiple step ahead forecasting, you mean which of these two:
        1) the one which uses no information of future (no actual value ) and just use its own predictions
        2) the one that predicts a point for the next step and calculates the error, but then forgets about the prediction and uses the realization of that point (the actual value) for the steps after that.

        I believe the model here is the second one, right?
        I want to make sure.

        I am concerned that the good result shown here is due to the model seeing the actual values in the test set.

        In other words, the model predicts the shampoo price for January as 1000, but the actual price is 1200. For the February prediction the model uses 1200 (the correct price) instead of what it predicted (1000).

        The difference after periods of time would become significant.
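For contrast with the tutorial's model, which emits all n_seq steps at once from a single input, the first option in this comment (feeding predictions back in) is usually called a recursive forecast. A minimal sketch follows, with a hypothetical stand-in `one_step_model` in place of a fitted network.

```python
def one_step_model(history):
    # Hypothetical stand-in for a fitted one-step model:
    # predicts the last value plus the last observed change.
    return history[-1] + (history[-1] - history[-2])


def recursive_forecast(history, n_steps):
    """Forecast n_steps ahead, feeding each prediction back as input."""
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        yhat = one_step_model(window)
        predictions.append(yhat)
        window.append(yhat)  # the prediction, not an actual value
    return predictions


preds = recursive_forecast([100.0, 110.0], 3)
```

Because each step builds on earlier predictions rather than on actual values, errors can compound as the horizon grows, which is the effect this comment is worried about.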

  60. Avatar
    Lak February 22, 2018 at 4:45 am #

    Hi Jason,

    Thanks for posting this nice tutorial. Can you check whether your use of (n_test + 2) in lines 172 and 174 of the complete code is correct?

    I think that should be (n_test-n_lag+2). That would be 11 instead of 12.

    So for example:

    d: difference where d[i] = s[i+1] – s[i]
    f: forecast
    s: original series

    The training data is
    d0 : d1,d2,d3
    d1: d2,d3,d4
    d21: d22,d23,d24

    Test data:
    d22: d23,d24,d25

    forecast[0] = f_d23,f_d24,f_d25

    f_d23 should be s24-s23 => s24 = f_d23 + s23

    So the last_ob value is s23, but your code gives s22.

    That can be corrected by using (n_test – n_lag + 2).

    Let me know if I misunderstand something.

    Thanks for your time!


    • Avatar
      Lak February 22, 2018 at 7:38 am #

      Actually the generic form should be (n_test+n_seq) for inverse_transform and (n_test+n_seq-n_lag) for plotting.

  61. Avatar
    Monty Shaw March 7, 2018 at 12:29 pm #

    Can you show how to add another LSTM layer? I tried just duplicating the model.add(LSTM line, but I get an error about expecting 3 dims but only getting 2.

    Also I am taking your 7 days course (although a bit slower than 7 days)


      • Avatar
        Sebastian Olbrich June 5, 2018 at 8:19 am #

        Jason, thank you, really, for the great work! It helped me a lot within the last months.
        However, I managed to add layers in other LSTM models I used. Still, I am not able to add layers in the code above, where the LSTM fit is wrapped into a separate function. Whenever I add LSTM layers to the code, I get this error:

        IndentationError: unindent does not match any outer indentation level

        Any ideas? I could rewrite the code and resolve your “def fit_lstm”, although this would make the code so ugly. So how do I implement more layers without that?

        Thanks in advance…
        and keep it up, it is a great thing you are doing!


  62. Avatar
    char March 10, 2018 at 8:25 am #

    This example only uses one timestep to predict the next 3 steps? To use more timesteps to predict, the series_to_supervised should have the n_in argument to be more than 1? Also, do n_in and n_out arguments correspond to the lag and seq parameters in the same function in your other articles on LSTM forecasting? Thanks.

  63. Avatar
    MLT March 11, 2018 at 7:49 pm #

    Hi Jason,

    I tried tuning parameters in your code to optimize the result. First, I checked whether there is underfitting or overfitting.

    I added the code below to your program.

    history = model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False, validation_data=(X_test, y_test))

    22/22 [==============================] – 0s 2ms/step – loss: 0.0988 – val_loss: 0.2584
    t+1 RMSE: 90.210739
    t+2 RMSE: 79.713680
    t+3 RMSE: 107.812684

    It seems the validation loss is much higher than the training loss. I did one test, rescaling the data to (0, 1) with a linear activation function.

    scaler = MinMaxScaler(feature_range=(0, 1))
    model.add(LSTM(n_neurons, activation='linear', batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dense(y.shape[1], activation='linear'))

    I have run it twice. The results are quite different. May I ask two questions here please?
    1. Why is the result so unstable with the same code?
    Run 1 t+2 RMSE: 123.765729 is almost double to Run 2 t+2 RMSE: 69.944902

    2. Metric shows better improvement( changed version loss: 0.0248 – val_loss: 0.0709 vs loss: 0.0988 – val_loss: 0.2584), but rmse does not show much improvement ( changed version t+2 RMSE: 69.944902 vs t+2 RMSE: 79.713680).

    Run 1:
    22/22 [==============================] – 0s 2ms/step – loss: 0.0241 – val_loss: 0.0651
    t+1 RMSE: 158.873657
    t+2 RMSE: 123.765729
    t+3 RMSE: 186.785670

    Run 2:
    22/22 [==============================] – 0s 2ms/step – loss: 0.0248 – val_loss: 0.0709
    t+1 RMSE: 93.477638
    t+2 RMSE: 69.944902
    t+3 RMSE: 113.995648

    Thanks in advance.

    • Avatar
      Jason Brownlee March 12, 2018 at 6:29 am #

      Re the high variance of model skill, perhaps the model is under specified for the problem. Perhaps the model is a bad fit for the problem.

  64. Avatar
    char March 13, 2018 at 5:34 am #

    Will inverting the difference cause the data to be short by one? For example differencing [5,4,3,2,1] will produce [1,1,1,1] but inverting only produces [4,3,2,1].

    • Avatar
      Jason Brownlee March 13, 2018 at 6:32 am #

      Yes, the first observation is lost (I think).
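A small sketch of the point in this exchange: differencing a series of length n yields n-1 values, and the original series can be rebuilt only if the first observation is kept. The convention below (each value minus the previous one) follows the tutorial's differencing, so a falling series gives negative differences. The helper `invert_difference` is an illustrative name, not the tutorial's function.

```python
def difference(values, interval=1):
    # Each value minus the one `interval` steps earlier.
    return [values[i] - values[i - interval] for i in range(interval, len(values))]


def invert_difference(first_value, diffs):
    # Rebuild the series by cumulatively adding differences to a known start.
    restored = [first_value]
    for d in diffs:
        restored.append(restored[-1] + d)
    return restored


series = [5, 4, 3, 2, 1]
diffs = difference(series)                      # one element shorter than series
restored = invert_difference(series[0], diffs)  # needs the first observation
```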

      • Avatar
        char March 14, 2018 at 12:52 am #

        How do I predict only the last timestep? It seems like you are only predicting up to t-2 timesteps (looking at the plot). Thanks!

        • Avatar
          char March 14, 2018 at 3:35 am #

          From reading some of the comments above, it seems like n_test+2 should be n_test+n_seq-1 (regardless of n_seq) instead. With that, the predictions appear to start from the last step. Could you confirm this?

  65. Avatar
    MLT March 14, 2018 at 9:01 pm #

    Hi Jason,

    For online training, how can I update the model with the latest data please?

    May I input new_X and new_y from the latest month’s data to fit the model and never call reset_states on the model? Or is there a better way to do it? Thanks.

    For example, the model was trained with data from one year ago until May.
    In July, I have the sales data of the June. New_X is May sales and new_y is June sales.

    model.fit(new_X, new_y, epochs=1, batch_size=1, verbose=0, shuffle=False)

    July_sales = model.predict(new_y, 1) #new_y is June sales.

  66. Avatar
    Mark Stevenson March 17, 2018 at 3:08 am #

    Hi Jason,

    Thanks so much for posting this. I have a quick question. I’m using this model on some market data. When I use n_seq = 3, the “actual” values reconcile with my data. When I change n_seq to 5, the output for “actual” doesn’t correspond to anything in my dataset, although it is similar. What could be causing this?

    Thanks again,

    Mark Stevenson

    • Avatar
      Jason Brownlee March 17, 2018 at 8:44 am #

      The model will need to be tuned for your specific problem.

  67. Avatar
    Haylee Ham March 27, 2018 at 5:07 am #

    I also want to apply this to multivariate time series forecasting and have read through your multivariate post (https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/).

    I am interested in predicting gas prices. So the output I am interested in is only one variable; however, I am inputting about 15 variables. In order to predict more than one time period into the future, do I need to train the LSTM to predict all of the variables (input and output) rather than just my output variable of gas price?

    Thanks so much.

    • Avatar
      Jason Brownlee March 27, 2018 at 6:41 am #

      No, you can frame the problem any way that you wish.

      In the other post we take multiple inputs and predict one output, you can extend that to predict a sequence for that single output feature.

      • Avatar
        Haylee Ham March 27, 2018 at 8:51 am #

        Thanks for the reply!

        In order to do that would I set up the problem as each row of data being t, t+1, t+2, etc. for the gas prices and then t-1 of all of the input variables?

        Do you have a post that details this method of outputting a sequence?

        • Avatar
          Jason Brownlee March 27, 2018 at 4:16 pm #

          Yes, this very post (above) shows you how to output a sequence.

  68. Avatar
    Jenny April 3, 2018 at 1:10 pm #

    Hi Jason! Thank you for the great post!
    I’m wondering if we need to remove seasonality before using LSTM.

    • Avatar
      Jason Brownlee April 4, 2018 at 6:04 am #

      I would recommend it. Anything to make the problem easier to model is a good idea.

  69. Avatar
    Marco April 6, 2018 at 1:27 am #

    Hi Jason, in your code you use a batch size of 1 since you have just a few data points. In my case I have a much bigger dataset, so I want to use a bigger batch size. I just want to understand one thing: if I use a batch size of 72, for example, do I also have to change the make_forecasts function? In your example you use a for loop to forecast one example at a time, while in my case I should forecast 72 examples at a time. Is this correct?

    • Avatar
      Jason Brownlee April 6, 2018 at 6:32 am #

      The batch is the collection of samples.

      Perhaps you mean time steps for a given sample/sequence?

  70. Avatar
    Eric April 26, 2018 at 3:48 am #

    Hi Jason,

    Thank you for all the great content – extremely helpful and thorough.

    I’m trying to understand how to generalize the input shaping for varying 1) number of features and 2) lags.

    In the example above, you do
    X = X.reshape(X.shape[0], 1, X.shape[1])

    Where X.shape[0] represents the number of rows in X (samples), 1 is hardcoded as we’re only looking at the prior timestep for prediction, and X.shape[1] represents the number of columns in X (which represents number of features *only* when we are looking at 1 prior timestep)

    If we are considering a lag of more than one timestep, we’ll have to change the second and third components of the reshaping, right? For instance, say we are considering a lag of 3 in your example above. Then our supervised X dataset will have 3 columns. But this is still technically one original feature (shampoo sales), just spread out over 3 timesteps. So our required reshaping would then be X.reshape(X.shape[0],3,1), correct?
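The two reshaping options discussed in this comment can be compared directly in NumPy. The array below is hypothetical data with 4 samples and a lag of 3; the variable names are illustrative.

```python
import numpy as np

n_lag = 3
# Hypothetical supervised inputs: 4 samples, each holding 3 lagged
# values of the same single feature.
X = np.array([[1, 2, 3],
              [2, 3, 4],
              [3, 4, 5],
              [4, 5, 6]], dtype=float)

# Option A (the reshape used in the tutorial): one time step,
# with the lag columns treated as features.
X_features = X.reshape(X.shape[0], 1, X.shape[1])

# Option B: the lags become time steps of a single feature.
X_timesteps = X.reshape(X.shape[0], n_lag, 1)
```

Both are valid 3D inputs of the form [samples, timesteps, features]. They differ in whether the LSTM sees the lags all at once or steps through them in order, and the first LSTM layer's input shape has to match whichever one is used.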


  71. Avatar
    Han Yi May 7, 2018 at 12:52 am #

    Hi, Dr.Brownlee!
    Thanks for sharing. It’s very helpful.
    I ran into a problem recently when I tried to use a multi-step LSTM for forecasting.
    The time series I have as a training set is about 3000 days long. However, I need to predict the next 600 days. Additionally, another 8 useful features for each day need to be considered.
    I used the Recursive Multi-step Forecast (t-3, t-2, t-1 for t+1) you’ve introduced, but the results are very bad.
    Can you give me some advice for this problem?

  72. Avatar
    Mo May 11, 2018 at 9:39 am #

    Hello! I think you have made the best, most readable and extensible LSTM RNN example that I have ever seen (and I have seen a few!).

    Just one note: I think it would be better to change the following line in the code:

    plot_forecasts(mid_prices, forecasts, n_test+2)

    to

    plot_forecasts(mid_prices, forecasts, n_test + (n_seq - 1))

    As it now accounts for the number of observations held back for any number of forecasts (n_seq).

    Thanks again!
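The offset suggested in this comment can be checked with index arithmetic. The sketch below assumes 0-based indexing, differences defined as d[k] = s[k+1] - s[k], supervised rows built from the differenced series with incomplete rows dropped, and an inversion that reads the last observation at `len(series) - offset + i - 1`. Both function names are illustrative, and the check should be repeated against your own copy of the code.

```python
def index_from_offset(series_len, n_test, n_seq, i):
    # Index read by the inversion when offset = n_test + (n_seq - 1).
    offset = n_test + (n_seq - 1)
    return series_len - offset + i - 1


def index_from_first_principles(series_len, n_test, n_lag, n_seq, i):
    """Index of the observation that the first forecast of test row i builds on.

    Supervised row r has inputs d[r .. r+n_lag-1] and outputs
    d[r+n_lag .. r+n_lag+n_seq-1]. Adding d[k] to s[k] gives s[k+1].
    """
    n_diff = series_len - 1                  # differencing drops one value
    n_rows = n_diff - (n_lag + n_seq) + 1    # rows with no missing values
    row = n_rows - n_test + i                # test rows are the last n_test rows
    return row + n_lag                       # k of the first forecast difference


# Tutorial settings: 36 observations, n_test=10, n_lag=1, n_seq=3.
tutorial_index = index_from_offset(36, 10, 3, 0)
derived_index = index_from_first_principles(36, 10, 1, 3, 0)
```

Under these assumptions the two indices agree for other values of n_seq and n_lag as well, and for n_seq = 3 the offset reduces to the n_test + 2 used in the tutorial.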

  73. Avatar
    Andrea May 21, 2018 at 2:57 am #

    Hi Jason,

    Thanks a lot for your tutorials.
    They are incredibly useful and educational.

    I have a question that might be silly, but i don’t quite get how the predictions are actually evaluated by the LSTM.

    I can see you set n_lag=1, and that such value is used to split the test set in the make_forecasts method.

    You wrote:
    >X, y = test[i, 0:n_lag], test[i, n_lag:]
    >forecast = forecast_lstm(model, X, n_batch)

    Does this mean that the lstm is able to predict three months in the future with only one single value to start predicting from?

    Thanks in advance for your time

  74. Avatar
    Abhinandan Nabera May 21, 2018 at 3:57 pm #

    Hello Jason,

    I have a data sample like this one:

    Sample  Time  w          d         ywn
    1       0     -0.10056   0.18784   -0.032737
    1       1     -0.039381  0.97014   -0.049748
    1       2     0.12412    -0.77848  0.029185
    1       3     0.019026   0.13856   0.013822
    1       4     -0.23032   0.84811   0.058235
    1       5     0.97489    0.24698   0.01231

    2       0     -0.59973   0.34736   -0.013221
    2       1     0.32069    0.11464   0.074709
    2       2     -0.12189   0.75243   -0.022599
    2       3     -0.63586   0.04404   0.056563
    2       4     -0.84312   0.17943   0.051038
    2       5     -0.28347   -0.34718  0.01531

    … and so on. I have 500 samples like these, where w and d are the inputs and ywn is the output. How can I train and test a model on this data? Please help, I am very confused. By the way, I need to use an RNN with Keras and TensorFlow.

  75. Avatar
    MLT May 25, 2018 at 7:27 pm #

    Hi Jason,

    May I ask why the shape of data scaling and reverse scaling is different please? In scaling, it uses (len(diff_values), 1). In reverse scaling, it becomes (1, len(forecast)). Thanks in advance

    def prepare():
        diff_values = diff_values.reshape(len(diff_values), 1)
        # rescale values to -1, 1
        scaler = MinMaxScaler(feature_range=(-1, 1))
        scaled_values = scaler.fit_transform(diff_values)

    def inverse_transform():
        inverted = list()
        for i in range(len(forecasts)):
            # create array from forecast
            forecast = array(forecasts[i])
            forecast = forecast.reshape(1, len(forecast))
            # invert scaling
            inv_scale = scaler.inverse_transform(forecast)

  76. Avatar
    Siddharth May 25, 2018 at 7:56 pm #

    Hi Jason,

    Thank you for this tutorial, it’s very helpful! I ran the model code above and have a few questions. (Pertaining to this dataset)

    1) The RMSE largely varies after each run. Is this normal?

    2) I removed reset_states() and seem to get lower RMSE scores on every run. Shouldn’t it be the opposite?

    3) What changes do I need to make to exploit the fact that LSTMs don’t require a fixed sampling window to learn and can continually incorporate larger windows with time while learning?

  77. Avatar
    Jack May 29, 2018 at 6:28 pm #

    Hi, Jason,
    Thank you for this tutorial! My question here is about the batch size. Why is it fixed at 1? Is it because we have to make predictions every time step? If I just want to make a multi-step prediction at the end of the data, do I have to change the batch size? My understanding is that batch size is the number of samples being put into the network, is that correct?
    I’m trying to solve a multivariate multi-step prediction problem. I have 7 variables, one of which is the target. I’m confused about how to set the batch size here. If I want to predict at every time step, should it still be set to 1?

  78. Avatar
    Nimish Verma June 2, 2018 at 1:40 am #

    Hi Jason,
    I am trying to build an LSTM network for predicting a time series of price changes. Right now I am trying a multi-step LSTM with the latest 3 inputs, but I wish to create a network where the input for the ith layer is the whole series up to the (i-1)th layer. For example, if the series is 10, 9, 5, 2, 6, 7, … and I am training my model,
    I’ll input 10 for the first layer, 10, 9 for the 2nd, 10, 9, 5 for the 3rd, and so on.

    Is it logically possible to create such network?

  79. Avatar
    Sarra June 4, 2018 at 7:47 pm #

    It is a nice tutorial. Is there any code for the multivariate case, please?

  80. Avatar
    MLT June 14, 2018 at 1:03 am #

    Hi Jason,

    I found that the validation loss is smaller than the training loss in my LSTM model. May I ask if you have a link or article that discusses this, please? Thanks in advance.

  81. Avatar
    Y.Ran June 16, 2018 at 9:21 pm #

    Hi, Jason,
    Thanks for your great tutorial.
    Shamsul asked how we can do MIMO (multiple variables as input and multiple variables as output). You suggested using https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/ as a template. As far as I understand, the tutorial you suggested shows how to predict t+1, t+2, t+3 given t. It is not suitable for my MIMO use case.
    Let me take the example you wrote in the https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/. For instance, at time t, I have an input PM2.5 concentration, Dew Point and Temperature (multiple variables as an input). I want to predict PM2.5 concentration, Dew Point and Temperature (multiple variables as an output) at time t+1. How can we do that?

    • Avatar
      Jason Brownlee June 17, 2018 at 5:40 am #

      You could change the model to be seq2seq, such as an encoder-decoder model or an RNN autoencoder.

  82. Avatar
    Kapil K June 18, 2018 at 9:57 pm #

    Hi Jason – First of all great article. I have tried using it on a different dataset.

    It seems to work with n_seq = 1. However, the moment I change n_seq to 3 or a higher number, I get an error such as the one below:

    ValueError: cannot reshape array of size 3 into shape (1,1).

    I assume that the code inherently takes care of this that’s why it worked fine on the shampoo dataset. I have tried to modify the code specifically this part below but to no effect:

    # reshape training into [samples, timesteps, features]
    X, y = train[:, 0:n_lag], train[:, n_lag:]
    X = X.reshape(X.shape[0], 1, X.shape[1])

    Could you please guide me?

    Full Error here:
    /opt/conda/lib/python3.6/site-packages/sklearn/utils/validation.py:560: DataConversionWarning: Data with input dtype object was converted to float64 by MinMaxScaler.
    warnings.warn(msg, DataConversionWarning)
    ValueError Traceback (most recent call last)
    in ()
    34 #forecasts = forecasts.reshape((len(forecasts), 1))
    ---> 36 forecasts = inverse_transform(series, forecasts, scaler, n_test+2)

    in inverse_transform(series, forecasts, scaler, n_test)
    115 # create array from forecast
    116 forecast = numpy.array(forecasts[i])
    --> 117 forecast = forecast.reshape(1, len(forecast))
    118 # invert scaling
    119 inv_scale = scaler.inverse_transform(forecast)

    ValueError: cannot reshape array of size 3 into shape (1,1)

  83. Avatar
    Alex June 19, 2018 at 7:44 am #

    Hi Jason-

    Thanks for another great article. I’ve been learning a lot from these this year. I am still having trouble conceptually wrapping my head around multi-variate time series data and how it is fed into a neural network.

    Here is a very simplified example of my data (formatted for ease of interpretation), where I am trying to predict the electrical load for different houses (thousands of them) two hours from now based on: current weather observations, the average load for the prior three hour periods, and info about the house:

    house  time  temp  sun  load(t-2)  load(t-1)  load(t)  y_load(t+2)

    1      1     28    610  5          6          5        3
    1      2     28    599  6          5          4        3
    1      3     27    587  5          4          3        2
    1      4     26    576  4          3          3        1
    1      5     26    565  3          3          2        1

    2      1     23    587  7          7          6        5
    2      2     23    576  7          6          5        4
    2      3     22    565  6          5          5        3
    2      4     22    576  5          5          4        1
    2      5     22    565  5          4          3        1

    3      1     33    565  4          4          4        2
    3      2     34    503  4          4          3        1
    3      3     34    492  4          3          2        1
    3      4     35    481  3          2          1        1
    3      5     35    469  2          1          1        1


    I’ve had a hard time even relating to examples such as complex multivariate stock predictions, because using that analogy I am trying to use multivariate time series data to make prediction on a suite of many stocks (or houses here), instead of just one.

    Using train_test_split(), I would like to train on complete sets of data for X_num of houses, and then test on completely unseen data for y_num houses.

    I know I want shuffle = False, so that time is sequential, but how do models differentiate between houses? Would using batch_size = 5 (corresponding to the 5 time intervals per house) be useful? Would doing so mean that one house’s complete daily profile is fed in at a time and trained on together as a time series?

    After doing ML involving non-time-series dependent data, I suppose I am most confused on how models capture that sequential time element, and then in my case, how they can learn different time series corresponding to unique elements (houses)?

    Thank you so much for ANY suggestions or explanations you might have.


  84. Avatar
    MLT June 27, 2018 at 6:15 am #

    Hi Jason,

    I need to predict y(t+1) .. y(t+n) from feature x1 and x2.
    x1 is historical data
    x2 is future data provided by external source.

    f(x1(t) … x1(t-m), x2(t+1) … x2(t+n)) = y(t+1) .. y(t+n)

    Do you have any suggestion as to which algorithm would be suitable for this case, please? May I use this multi-step LSTM implementation as a reference? Thanks a lot in advance.

    • Avatar
      Jason Brownlee June 27, 2018 at 8:24 am #

      Try a suite of methods and discover what works best for your specific dataset.

  85. Avatar
    Mohammad Abuzar June 29, 2018 at 5:09 am #

    I have a question:
    In your example the prediction depends on only one previous time step with various features.

    If I am right you are trying to predict 1 variable (1 feature), for many future steps, based on many past time steps.

    If “[samples, timesteps, features]” is the meaning of the 3D input shape for the LSTM model,
    I would like to understand why the number of time steps is 1 and the number of features is > 1.

    • Avatar
      Jason Brownlee June 29, 2018 at 6:14 am #

      It is just an example on a simple univariate problem. You can change the model to be anything you wish.

  86. Avatar
    Ray li July 2, 2018 at 3:33 am #

    Hi Jason,

    Thanks for this article.

    I have a problem based on this article. Let’s say we have multiple shampoos rather than just one, and we have the sales records for each shampoo and information about each shampoo.
    What model should we use to solve this problem?



    • Avatar
      Jason Brownlee July 2, 2018 at 6:26 am #

      Try a suite and see what works best.

      • Avatar
        Ray li July 2, 2018 at 8:21 pm #

        Could you please give more information? What do you mean by suit?


  87. Avatar
    zijin July 9, 2018 at 7:32 am #

    Hi Jason
    Thank you very much for your very helpful tutorials. I have read all your LSTM forecasting tutorials, but I am confused by the batch_size used in prediction. I know that when training a model, batch_size is the number of samples the model processes before updating the weights. But why, after the model is trained, do we still need a batch_size when forecasting, and why must it be the same batch_size used in training? Could you please explain what role batch_size plays in the forecast after the model is trained? Thanks again.

    • Avatar
      Jason Brownlee July 10, 2018 at 6:36 am #

      Often, the model is defined with a fixed batch size, meaning that it expects to process that many records at a time. It is an efficiency of the implementation, not something inherent in the algorithm.

      • Avatar
        zijin July 10, 2018 at 11:20 am #

        Yes. When training the model, it expects to process batch-size records at a time. Say we have the time series 1 to 8, the number of time steps is 2, we forecast just one step ahead, and the batch size is 3. Then we reformat the data as:
        X1 X2 Y
        1 2 3
        2 3 4
        3 4 5
        4 5 6
        5 6 7
        6 7 8
        The model calculates the loss for the first 3 Y estimates (Y = 3, 4, 5) and updates the weights, then calculates the loss for the last 3 Y estimates (Y = 6, 7, 8) and updates the weights again. This is one epoch. After a certain number of epochs, the model is trained and the weights and architecture are fixed. Now, given X1 = 7 and X2 = 8, we can use the model to make a one-step forecast. We only need X1 and X2 (the 2 time steps), the weights, and the model architecture, so we should be able to forecast without a batch. But why, in Keras, when I use your code “forecast = model.predict(X, batch_size=n_batch)”, do we have to pass the same batch_size to model.predict? I know some people save the weights and build another model with the same architecture so that they can use a different batch size to work around the issue. I just don’t understand the theory behind why the batch size matters when we use model.predict. Could you please explain it or direct me to a paper or tutorial? Thank you very much for your time and help.
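
        The framing described in this comment can be sketched in plain Python. This is only an illustration of the windowing and batching arithmetic, not Keras code; the names n_steps, rows and batches are made up for the example.

```python
# Frame the series 1..8 as supervised rows: 2 input time steps, 1 output.
series = list(range(1, 9))
n_steps = 2

rows = []
for i in range(len(series) - n_steps):
    x = series[i:i + n_steps]   # input window, e.g. [1, 2]
    y = series[i + n_steps]     # value to predict, e.g. 3
    rows.append((x, y))

# Group the rows into batches of 3; the weights are updated once per batch.
batch_size = 3
batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

for n, batch in enumerate(batches, start=1):
    print('batch %d targets: %s' % (n, [y for _, y in batch]))
```

        With 6 rows and a batch size of 3 there are two weight updates per epoch, one for the targets 3, 4, 5 and one for 6, 7, 8.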

  88. Avatar
    zijin chen July 10, 2018 at 9:59 pm #

    Got it. thank you very much for your answers.

  89. Avatar
    Eric Gou July 13, 2018 at 5:11 pm #

    Hi Jason,
    Thank you for sharing these articles about LSTMs.
    I have a problem when trying to predict future data.
    When making the prediction, I use only the first actual value as input and then use each output as the input for the next prediction. The predicted value becomes almost constant after several steps.
    Do you have any ideas about this kind of prediction?

    Thank you!


    • Avatar
      Jason Brownlee July 14, 2018 at 6:13 am #

      You might need to further tune the model to your specific problem.

    • Avatar
      Hao Chen January 14, 2019 at 11:45 pm #

      Hi Gou, I have the same problem. Have you solved it yet?

  90. Avatar
    Trung Anh July 16, 2018 at 1:03 pm #

    Hi Jason,

    I’ve been following your tutorial for a while. I’m doing a time series classification problem using LSTM with a softmax classifier.
    My data shapes are as follows: (3154, 30, 6) (3154, 30) (1352, 30, 6) (1352, 30).
    My model includes an LSTM layer and a dense(30).
    However when I run the model, I got the error: “ValueError: Error when checking target: expected dense_2 to have shape (1,) but got array with shape (30,)”
    Is it because of my model? how do I fix this error?
    Thank you very much!

    • Avatar
      Jason Brownlee July 16, 2018 at 2:13 pm #

      Perhaps the output shape needs to be [n, 30, 1]?

      • Avatar
        Sundeep Nayakanti July 17, 2018 at 5:39 am #

        HI Dr.Jason,

        Thanks for your wonderful blog post.

        However, I am still not able to figure out how I can forecast into the future (e.g. sales of a product for the upcoming three months) where my input variables are historical sales of that product, the number of quotes received for that product, price points, and other numerical variables. Is it fair to say an LSTM can be used for this kind of forecasting problem (considering all inputs)? Thanks in advance.

  91. Avatar
    ezgi August 2, 2018 at 7:25 pm #

    Hi, thank you for the tutorial it made LSTM much more clear for me now. But I have a confusion regarding the number of sequence and number of lags. Currently, I have a univariate time series dataset with 547 daily sales data. I want to predict the next 3 months(91 days) by using LSTM. I have set the n_lags as 3, 5 and 7. As I understand, this is the number of data that we look back while doing prediction. However, I could not understand what is the number of sequences and how should I set it. I would be so glad if you can answer my question. Thank you!

  92. Avatar
    xiaowanzi August 6, 2018 at 10:20 pm #

    sir Jason:
    Thank you very much for your article, which has helped me a lot. My data is a periodic, complex sequence that is a combination of sin(x) and cos(x), and I want to predict one cycle or more ahead. I have 100,000 data points, with 500 points per cycle. How should I go about forecasting a sequence of this type?

    • Avatar
      Jason Brownlee August 7, 2018 at 6:27 am #

      Perhaps start with some classical methods like SARIMA and ETS, then try some ML methods, then try MLP, CNN and eventually an LSTM.

  93. Avatar
    Darkwind August 23, 2018 at 11:24 pm #

    Hi Jason,

    Thank you for the nice article.

    May I ask in the following function:

    # make one forecast with an LSTM,
    def forecast_lstm(model, X, n_batch):
        # reshape input pattern to [samples, timesteps, features]
        X = X.reshape(1, 1, len(X))
        # make forecast
        forecast = model.predict(X, batch_size=n_batch)
        # convert to array
        return [x for x in forecast[0, :]]

    Why is it X = X.reshape(1, 1, len(X)) instead of X = X.reshape(X.shape(0), 1, X.shape(1))

    Though the result does not change in the article, I cannot understand the logic.

    Thank you in advance for your time
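
    A small NumPy sketch may help with the reshape question above. It assumes, as in the make_forecasts excerpt quoted earlier in this thread (X, y = test[i, 0:n_lag], test[i, n_lag:]), that forecast_lstm receives a single 1D row, whereas the training inputs are a 2D array. The values are made up.

```python
import numpy as np

# One test row with n_lag=1 and n_seq=3: [input, t+1, t+2, t+3] (made-up values).
row = np.array([0.1, 0.2, 0.3, 0.4])
n_lag = 1

X = row[0:n_lag]      # 1D slice of a single row
print(X.shape)        # a 1-tuple, so there is no X.shape[1] to index

# One sample, one time step, len(X) features.
X3d = X.reshape(1, 1, len(X))
print(X3d.shape)

# The training inputs are 2D, so both dimensions are available there.
train = np.array([[0.1, 0.2, 0.3, 0.4],
                  [0.2, 0.3, 0.4, 0.5]])
X_train = train[:, 0:n_lag]
X_train3d = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
print(X_train3d.shape)
```

    Note also that shape is a tuple, so it is indexed with square brackets (X.shape[0]) rather than called like a function.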

  94. Avatar
    Loong August 30, 2018 at 12:25 am #

    Hello Dr Jason,

    I would like to thank you for your wonderful tutorial.

    I am not sure why I am getting the wrong prediction


    whereas I should be getting


    The source code and dataset came from this website.

    I am using tensorflow 1.10.0 and keras 2.2.2.


    • Avatar
      Jason Brownlee August 30, 2018 at 6:30 am #

      You may need to run the example a few times?

      • Avatar
        Loong September 7, 2018 at 10:29 am #

        Hello Dr Jason,

        I apologize because it was my mistake.

        I have copied the wrong part of the code.


  95. Avatar
    summer August 30, 2018 at 12:44 pm #

    Hi Jason,

    Thanks very much for the nice article.

    May I ask in the following function:
    # evaluate the RMSE for each forecast time step
    def evaluate_forecasts(test, forecasts, n_lag, n_seq):
        for i in range(n_seq):
            actual = test[:,(n_lag+i)]
            predicted = [forecast[i] for forecast in forecasts]
            rmse = sqrt(mean_squared_error(actual, predicted))
            print('t+%d RMSE: %f' % ((i+1), rmse))

    The function outputs the t+1, t+2, t+3, ... RMSE for the test data:
    [[ 342.3 339.7 440.4 315.9]
    [ 339.7 440.4 315.9 439.3]
    [ 440.4 315.9 439.3 401.3]
    [ 315.9 439.3 401.3 437.4]
    [ 439.3 401.3 437.4 575.5]
    [ 401.3 437.4 575.5 407.6]
    [ 437.4 575.5 407.6 682. ]
    [ 575.5 407.6 682. 475.3]
    [ 407.6 682. 475.3 581.3]
    [ 682. 475.3 581.3 646.9]]
    But how can I evaluate a single RMSE over all of the test values and predicted values?

    • Avatar
      Jason Brownlee August 30, 2018 at 4:52 pm #

      Make predictions for the entire test set, then calculate the RMSE for the predictions.
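
      As a minimal sketch of that suggestion, the per-step errors can be pooled into one score by flattening all forecast steps together. The helper name overall_rmse is made up for this example, and the forecast values below are illustrative placeholders rather than output from the tutorial's model.

```python
from math import sqrt

def overall_rmse(actual, forecasts):
    # Pool the squared errors of every forecast step of every test row.
    errors = [(a - f) ** 2
              for a_row, f_row in zip(actual, forecasts)
              for a, f in zip(a_row, f_row)]
    return sqrt(sum(errors) / len(errors))

# Expected t+1..t+3 values for the first two test rows shown above.
actual = [[339.7, 440.4, 315.9],
          [440.4, 315.9, 439.3]]
# Placeholder forecasts (the last known value repeated three times).
forecasts = [[342.3, 342.3, 342.3],
             [339.7, 339.7, 339.7]]

print('Overall RMSE: %f' % overall_rmse(actual, forecasts))
```

      Both lists should be in the original units (after inverting the scaling and differencing) so that the pooled score is comparable with the per-step scores.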

  96. Avatar
    Mike C August 30, 2018 at 10:42 pm #

    Hi Jason,

    I’ve been trying to follow this guide as well as your one linked here: https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/ , but have run into some issues.

    To begin, my end goal is to have a multivariate multi-step forecasting time series LSTM. Specifically, I’m using a dataset indexed/sorted by date similar to your pollution.csv and it has 9 other fields per row that I’d like to use in training. Through training, my goal is to be able to give the model data from the target day as well as 2 prior days (so 3 lag days total) and then have it make predictions on the following 7 days. If the size/# of rows in the dataset matters at all, this particular one has 6375 entries.

    I’m unfortunately unable to figure out how to convert your example that I linked above to work in a multi-step fashion and I’m also unable to get the example in this article to work in a multivariate environment. Would you please be able to show me how to convert one of these two examples?

    Thank you! And as an aside, I think it’s awesome of you to be consistently replying to new questions posted to your article despite it being a year+ in age 🙂

  97. Avatar
    segun September 9, 2018 at 6:39 am #

    Thanks for this informative tutorial. I have a question. How can the LSTM be updated, as described in the extension below from your article?

    Update LSTM. Change the example to refit or update the LSTM as new data is made available. A 10s of training epochs should be sufficient to retrain with a new observation.

    Basically, I want each new observation to be fed into the model for the next prediction. Does your article cover this anywhere?

  98. Avatar
    Al September 13, 2018 at 6:12 am #

    Hi Jason,

    Thank you for posting all of this. I have created a model using a compilation of several of your tutorials, wherein I forecast the high temperature for the next 3 days based on several decades of daily high temperature values, daily low temp, month of the year and precipitation. For the models I am generating, when I try to predict for t+1 (the next day), the value ends up very closely mimicking the value from the previous day (the graph basically looks like the same graph duplicated, with a time lag of 1 step introduced). What parameters can I tune to help deal with this issue?

    Thank you!

  99. Avatar
    Pranay September 17, 2018 at 11:42 pm #

    Hey! How can I predict a week into the future? The above procedure seems to work exclusively on test data. I mean, the function “make_forecasts” takes the test data into account, as is evident from (X, y = test[i, 0:n_lag], test[i, n_lag:]). My point is that I have no test data; all I have is training data. So how do I forecast a week into the future?

    • Avatar
      Jason Brownlee September 18, 2018 at 6:16 am #

      Call model.predict() and pass in the last n observations.

      • Avatar
        Pranay September 18, 2018 at 3:35 pm #

        But that leads to one-step forecast and I’m concerned about multi-step forecast.

        • Avatar
          Jason Brownlee September 19, 2018 at 6:14 am #

          If your model predicts multiple time steps, it will be a multi-step forecast.
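
          A rough sketch of that idea follows. StandInModel is a hypothetical placeholder used only so the example runs without Keras; in practice it would be the fitted LSTM from the tutorial. The helper name forecast_beyond_data is also made up, and the input values are assumed to be already differenced and scaled in the same way as the training data.

```python
import numpy as np

class StandInModel:
    """Hypothetical placeholder for a fitted model with 3 outputs."""
    def predict(self, X, batch_size=1):
        # Repeat the last input value 3 times; output shape is (samples, 3).
        return np.repeat(X[:, -1, -1:], 3, axis=1)

def forecast_beyond_data(model, prepared_values, n_lag, n_batch=1):
    # Use the last n_lag prepared observations as the input pattern.
    X = np.array(prepared_values[-n_lag:], dtype=float)
    # Reshape to [samples, timesteps, features], as in forecast_lstm().
    X = X.reshape(1, 1, n_lag)
    yhat = model.predict(X, batch_size=n_batch)
    return [value for value in yhat[0, :]]

prepared = [0.1, 0.2, 0.3]   # made-up differenced and scaled history
print(forecast_beyond_data(StandInModel(), prepared, n_lag=1))
```

          The returned values are still in the prepared (scaled, differenced) space, so they would need the same inverse transforms as the test forecasts before being read as real quantities.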

  100. Avatar
    Monte September 27, 2018 at 12:36 am #

    Hi Jason. I’m new to this, and I still don’t know how to do multivariate multi-step time series forecasting with an LSTM. Can you help me?

  101. Avatar
    Mohammad Ali Bagheri October 18, 2018 at 11:33 am #

    Thanks for all your nice tutorials. For this one, however, I don’t understand why some parts are written in a difficult way!
    For example, instead of writing the “difference” function, why didn’t you use:
    numpy.diff(dataset, n= interval)?

    • Avatar
      Jason Brownlee October 18, 2018 at 2:33 pm #

      Thanks for the feedback.

      There are many ways to solve a given problem and I try not to assume too much about what the reader knows.
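
      One detail worth checking before swapping the two: in NumPy, the n argument of numpy.diff is the number of times the differencing is repeated, not the lag. The sketch below assumes the tutorial's difference function subtracts the value interval steps earlier; under that assumption the two agree only when interval is 1.

```python
import numpy as np

def difference(dataset, interval=1):
    # Lag difference: value[i] - value[i - interval].
    return [dataset[i] - dataset[i - interval]
            for i in range(interval, len(dataset))]

data = [1, 4, 9, 16, 25]

print(difference(data, 1))            # [3, 5, 7, 9]
print(np.diff(data, n=1).tolist())    # [3, 5, 7, 9]

print(difference(data, 2))            # [8, 12, 16]  (lag-2 difference)
print(np.diff(data, n=2).tolist())    # [2, 2, 2]    (differenced twice)

# A NumPy equivalent of the lag-2 difference uses slicing instead.
arr = np.array(data)
print((arr[2:] - arr[:-2]).tolist())  # [8, 12, 16]
```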

  102. Avatar
    Francis Kim October 24, 2018 at 12:51 pm #

    Hi Jason,

    Thanks for sending me to this page. The code runs well.

    Is changing the forecast length (eg. from 3 months to 12 months) as easy as changing the n_seq value to 12?

    • Avatar
      Jason Brownlee October 24, 2018 at 2:47 pm #

      It may be, it’s been a while. Perhaps try it and see.

  103. Avatar
    Kartheek October 24, 2018 at 11:56 pm #

    How come we get RMSE values for future values? RMSE is based on our predicted values and the actual values, but in this case we are predicting the future and we don’t know the actual values.

    • Avatar
      Jason Brownlee October 25, 2018 at 7:56 am #

      You can only calculate the error of the model if you have ground truth.

      You can estimate how well the model is expected to perform by evaluating it on historical data.

  104. Avatar
    Jing Li October 25, 2018 at 3:57 pm #


    Why do we need to invert the scale of the test data? I think the second line is not required.

    actual = [row[n_lag:] for row in test]
    actual = inverse_transform(series, actual, scaler, n_test+2)

    Best regards,

    • Avatar
      Jason Brownlee October 26, 2018 at 5:31 am #

      We invert the scale so that we can evaluate the error of the model in the original units of the dataset.
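
      A small worked example of why the units matter, using a hand-written min-max scaling to the range -1 to 1. The minimum, maximum and data values are made up, and the tutorial's inverse transform also undoes the differencing, which is left out here for brevity.

```python
from math import sqrt

def scale(value, lo, hi):
    # Map [lo, hi] onto [-1, 1].
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def invert_scale(scaled, lo, hi):
    # Map [-1, 1] back onto [lo, hi].
    return (scaled + 1.0) * (hi - lo) / 2.0 + lo

def rmse(actual, predicted):
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

lo, hi = 100.0, 700.0          # hypothetical min and max of the data
actual = [400.0, 550.0]        # made-up observations
predicted = [430.0, 520.0]     # made-up forecasts

scaled_actual = [scale(v, lo, hi) for v in actual]
scaled_predicted = [scale(v, lo, hi) for v in predicted]

rmse_scaled = rmse(scaled_actual, scaled_predicted)
rmse_original = rmse([invert_scale(v, lo, hi) for v in scaled_actual],
                     [invert_scale(v, lo, hi) for v in scaled_predicted])

print('RMSE in scaled units: %.3f' % rmse_scaled)
print('RMSE in original units: %.3f' % rmse_original)
```

      The same forecasts give an error of 0.1 in scaled units and 30 in original units; only the second number can be compared with a baseline computed on the raw data.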

  105. Avatar
    saravana October 31, 2018 at 9:47 pm #

    Hi Jason,

    Can anyone explain this line to me?
    n_vars = 1 if type(data) is list else data.shape[1]


    • Avatar
      Jason Brownlee November 1, 2018 at 6:09 am #

      It sets the number of variables to 1 if the input is a list otherwise it sets the number of variables to the shape of the second dimension (columns) in the case of a numpy array.
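
      A quick illustration of that expression on its own. The wrapper name count_variables is made up for this example.

```python
import numpy as np

def count_variables(data):
    # 1 for a plain list, otherwise the number of columns of a 2D array.
    return 1 if type(data) is list else data.shape[1]

print(count_variables([1, 2, 3]))          # a univariate list
print(count_variables(np.zeros((5, 3))))   # 5 rows, 3 variables
```

      A 1D NumPy array has no second dimension, so it would need to be reshaped to (n, 1) first; otherwise data.shape[1] raises an IndexError.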

  106. Avatar
    Harry November 2, 2018 at 9:36 pm #

    Hi Jason,

    “A model will be used to make a forecast for the time step, then the actual expected value for the next month from the test set will be taken and made available to the model for the forecast on the next time step”

    Can you point to where in the method the model is updated (retrained) at the next step using the previous data point, which was in the test dataset?

    I would expect every time a datapoint in the test dataset being available to be used for retraining.

    • Avatar
      Jason Brownlee November 3, 2018 at 7:05 am #

      The model is not retrained each step of the walk forward validation, often it is too computationally expensive.

      Instead, the data is added to the history to be used as input to make the next forecast. E.g. we are simulating the fact that a real observation was made after we predicted, and we use the observation instead of the prediction to make the subsequent prediction.
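
      The loop described above can be sketched without any model at all. Here forecast_fn is a stand-in for the fitted model (a persistence forecast is used so the example runs), and the names walk_forward and persistence are made up for illustration.

```python
def walk_forward(train, test, forecast_fn):
    # The model is fit once beforehand; only the history grows.
    history = list(train)
    predictions = []
    for observed in test:
        # Forecast the next step from everything seen so far.
        predictions.append(forecast_fn(history))
        # Add the real observation (not the prediction) to the history.
        history.append(observed)
    return predictions

def persistence(history):
    # Stand-in forecast: repeat the last known value.
    return history[-1]

print(walk_forward([10, 12], [13, 15, 14], persistence))   # [12, 13, 15]
```

      Retraining inside the loop would mean refitting the model on history at each iteration, which is the expensive step the reply refers to.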

  107. Avatar
    Junzhi Xue November 7, 2018 at 1:00 pm #

    Thanks a lot!
    I am confused about some aspects. Is time_steps equal to batch_size? I have seen that some of your LSTM posts use 1 as the number of time steps in the reshape call; if I change time_steps to another number, what would happen to the samples?
    I am just unclear about time_steps and samples in [samples, time_steps, features]. Thanks for your help.

    • Avatar
      Jason Brownlee November 7, 2018 at 2:48 pm #

      No, the number of time steps is different from the batch size.

      A batch is 1 or more samples, a sample is one or more time steps, a time step is one or more features.
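
      The nesting can be seen directly in the array shapes. This is a NumPy-only sketch with made-up sizes; the variable names are illustrative.

```python
import numpy as np

n_samples, n_timesteps, n_features = 6, 2, 1

# Input in the [samples, timesteps, features] layout expected by an LSTM.
X = np.arange(n_samples * n_timesteps * n_features, dtype=float)
X = X.reshape(n_samples, n_timesteps, n_features)

# Splitting into batches only groups samples; it does not change their shape.
batch_size = 3
batches = [X[i:i + batch_size] for i in range(0, n_samples, batch_size)]

print(len(batches))              # number of batches
print(batches[0].shape)          # one batch: (samples in batch, timesteps, features)
print(batches[0][0].shape)       # one sample: (timesteps, features)
print(batches[0][0][0].shape)    # one time step: (features,)
```

      Changing n_timesteps changes the shape of each sample, while changing batch_size only changes how many samples are grouped together.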

      • Avatar
        Junzhi Xue November 9, 2018 at 12:16 am #

        In my mind, time_steps decides the memory of LSTM, so does taking 1 as time_steps make sense? in other words, how can we choose a better time_steps?
        Thanks for your help!

        • Avatar
          Jason Brownlee November 9, 2018 at 5:24 am #

          The LSTMs have memory that is reset between batches, or manually if you choose.

          Conceptually, this memory is separate from the number of time steps in one sample.

  108. Avatar
    Kiko November 18, 2018 at 9:56 pm #

    Hi Jason,

    Thanks for the blog.
    I have a question regarding your code. I got the following error after running “prepare_data(series, n_test, n_lag, n_seq)”:
    TypeError Traceback (most recent call last)
    8 n_test = 10
    9 # prepare data
    ---> 10 train, test = prepare_data(series, n_test, n_lag, n_seq)
    11 print(test)
    12 print(‘Train: %s, Test: %s’ % (train.shape, test.shape))

    TypeError: ‘NoneType’ object is not iterable.

    One thing to mention is that I did not use the “parser” function that you provided as it throws another error regarding the %Y-%M format. So I just removed the last parameter in the parser function.
    ValueError: time data ‘190Sales of shampoo over a three year period’ does not match format ‘%Y-%m’

    Appreciate your help in advance!

  109. Avatar
    Leon November 29, 2018 at 4:19 am #


    Thanks for posting this tutorial.

    How easily could this be adapted for a “within multiple subjects” design? That is, having 100 separate brands of shampoo at each monthly measurement point.

  110. Avatar
    Mudassar December 18, 2018 at 9:41 pm #

    Hi jason
    I have a question. Which one is better for power prediction or estimation using time series data: a CNN or an LSTM?

    • Avatar
      Jason Brownlee December 19, 2018 at 6:33 am #

      Try both and discover which works best for your specific dataset.

      • Avatar
        Mudassar December 19, 2018 at 12:43 pm #

        Would you suggest any link for power forecast by both techniques?

        • Avatar
          Jason Brownlee December 19, 2018 at 2:29 pm #

          Yes, I have many examples on the blog, try the search box.

  111. Avatar
    mk January 3, 2019 at 4:40 pm #

    About the number of layers: how do I set up multiple LSTM layers? Could you point me to some of
    your posts on this?

  112. Avatar
    mk January 4, 2019 at 1:13 pm #

    I have an idea to use instead of an LSTM:
    Step 1: randomly replace values in the sequence with 0 in each layer.
    Step 2: use a ResNet to keep the information complete.
    Please point out anything unreasonable.

  113. Avatar
    Hao Chen January 15, 2019 at 12:13 am #

    Hi, Jason.
    Recently, I have been trying to use an LSTM to make recursive predictions, but the results are very bad. In fact, the series I am predicting is very simple: an exponential function. Do you have any relevant suggestions or guidance?

  114. Avatar
    Murali February 18, 2019 at 7:50 am #

    How can the code be modified to forecast the future? Here the forecast stops at “Dec”. How do I get a forecast for the next three months?

  115. Avatar
    Doosun Hong February 19, 2019 at 7:31 pm #

    HI, Thanks for your awesome tutorials.

    I have some questions about multi-step LSTM compare to normal LSTM which I followed at: https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/

    I guess the main differences between two models are the number of output values. In this tutorial 3, previous tutorial 1.

    1. What is the main purpose (advantage) of this multi-step LSTM compared to a normal LSTM? For example, is it better accuracy, or the advantage of predicting the t+2 and t+3 values earlier than before?

    2. In this example 3-step LSTM, do three output values affect the memory’s weight each time step when training a model?

    3. Is multi-step LSTM’s t+1 RMSE better than normal LSTM’s t+1 RMSE usually?

    • Avatar
      Jason Brownlee February 20, 2019 at 7:58 am #

      If the other variates are predictive for the target variable, then a multivariate model can be useful.

      Difference in performance really depends on the specifics of the prediction problem and choice of model.

      • Avatar
        Doosun Hong February 20, 2019 at 3:41 pm #

        So you mean performance depends not only on how many outputs the model gives but also on the specifics of the data (the prediction problem).

        1. Does that mean I have to fit both a normal (1-step) LSTM and a 3-step LSTM, compare the evaluation of the two models, and choose the better one?

        2. In addition, I am confused about validation and evaluation. The RMSE score that you calculated is a validation approach, not an evaluation. Did I understand that right?

        If possible please answer each question 1 and 2. Thanks!!

  116. Avatar
    Bross February 22, 2019 at 1:15 am #

    Dear professor:
    After learning from other posts, I found that in the following code we can make