Multivariate Time Series Forecasting with LSTMs in Keras

Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables.

This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In this tutorial, you will discover how you can develop an LSTM model for multivariate time series forecasting in the Keras deep learning library.

After completing this tutorial, you will know:

  • How to transform a raw dataset into something we can use for time series forecasting.
  • How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
  • How to make a forecast and rescale the result back into the original units.

Let’s get started.

  • Update Aug/2017: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.
  • Update Oct/2017: Added a new example showing how to train on multiple prior time steps, due to popular demand.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

  1. Air Pollution Forecasting
  2. Basic Data Preparation
  3. Multivariate LSTM Forecast Model

Python Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

If you need help with your environment, see this post:

1. Air Pollution Forecasting

In this tutorial, we are going to use the Air Quality dataset.

This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.

The data includes the date-time, the pollution level measured as PM2.5 concentration, and weather information including dew point, temperature, pressure, wind direction, wind speed and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:

  1. No: row number
  2. year: year of data in this row
  3. month: month of data in this row
  4. day: day of data in this row
  5. hour: hour of data in this row
  6. pm2.5: PM2.5 concentration
  7. DEWP: Dew Point
  8. TEMP: Temperature
  9. PRES: Pressure
  10. cbwd: Combined wind direction
  11. Iws: Cumulated wind speed
  12. Is: Cumulated hours of snow
  13. Ir: Cumulated hours of rain

We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.

This dataset can be used to frame other forecasting problems.
Do you have good ideas? Let me know in the comments below.

You can download the dataset from the UCI Machine Learning Repository.

Download the dataset and place it in your current working directory with the filename “raw.csv”.

2. Basic Data Preparation

The data is not ready to use. We must prepare it first.

Below are the first few rows of the raw dataset.

The first step is to consolidate the date-time information into a single date-time so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.
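As a rough illustration of these steps, here is a minimal sketch that runs on a tiny made-up frame instead of the real “raw.csv”. The helper name prepare(), the sample values and the use of pd.to_datetime() to build the index are my assumptions for this sketch; the original listing may have parsed the dates differently while loading the file.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for "raw.csv": 30 hourly rows with made-up values.
# The first 24 pm2.5 readings are missing, plus one scattered NA later on.
n = 30
raw = pd.DataFrame({
    "No": np.arange(1, n + 1),
    "year": 2010,
    "month": 1,
    "day": 1 + np.arange(n) // 24,
    "hour": np.arange(n) % 24,
    "pm2.5": [np.nan] * 24 + [129.0, 148.0, np.nan, 181.0, 138.0, 109.0],
    "DEWP": -16,
    "TEMP": -4.0,
    "PRES": 1020.0,
    "cbwd": "SE",
    "Iws": 1.79,
    "Is": 0,
    "Ir": 0,
})


def prepare(raw):
    """Hypothetical helper: index by date-time, rename columns, handle NAs."""
    df = raw.copy()
    # consolidate year/month/day/hour into a single date-time index
    stamps = pd.to_datetime(df[["year", "month", "day", "hour"]])
    df.index = pd.DatetimeIndex(stamps, name="date")
    # drop the row number and the now-redundant date parts
    df = df.drop(columns=["No", "year", "month", "day", "hour"])
    # clearer column names
    df.columns = ["pollution", "dew", "temp", "press",
                  "wnd_dir", "wnd_spd", "snow", "rain"]
    # mark the remaining NA pollution readings with 0
    df["pollution"] = df["pollution"].fillna(0)
    # drop the first 24 hours, which have no pollution readings at all
    return df.iloc[24:]


dataset = prepare(raw)
print(dataset.head(5))
```

With the real data you would load the file with pandas.read_csv() first and finish with dataset.to_csv("pollution.csv") to save the result.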

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “pollution.csv”.

Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.

The code below loads the new “pollution.csv” file and plots each series as a separate subplot, except wind direction, which is categorical.

Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

Line Plots of Air Pollution Time Series

3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.

This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:

  • Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
  • Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.

We can transform the dataset using the series_to_supervised() function developed in the blog post:
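For readers who do not have that function to hand, the following is a minimal sketch of what a function with this name might look like. The signature, the default arguments and the var1(t-1) style column names are assumptions for this sketch (the column names follow the naming readers describe in the comments); it is not necessarily identical to the version in the referenced post.

```python
import numpy as np
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a (rows x variables) array as lagged inputs plus outputs.

    Sketch only: n_in is the number of lag steps used as input,
    n_out the number of steps (starting at t) used as output.
    """
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    # input sequence: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += ["var%d(t-%d)" % (j + 1, i) for j in range(n_vars)]
    # output sequence: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        suffix = "(t)" if i == 0 else "(t+%d)" % i
        names += ["var%d%s" % (j + 1, suffix) for j in range(n_vars)]
    framed = pd.concat(cols, axis=1)
    framed.columns = names
    # rows at the edges contain NaN because of the shifting
    if dropnan:
        framed = framed.dropna()
    return framed


# two variables observed over five time steps
data = np.arange(10).reshape(5, 2)
framed = series_to_supervised(data, n_in=1, n_out=1)
print(framed)
```

Each row of the result pairs the observations at t-1 with the observations at t, so one row is lost at the start of the series for each lag step.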

First, the “pollution.csv” dataset is loaded. The wind direction feature is label encoded (integer encoded). This could further be one-hot encoded in the future if you are interested in exploring it.

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.
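A minimal sketch of the encoding and normalization steps is shown below, using five made-up rows in place of the real file. The sample values and variable names are assumptions for illustration. Instead of framing all columns and then dropping the weather variables at t, the sketch uses a NumPy shortcut that should give the same result for a single lag: it stacks the 8 features at t-1 next to the pollution value at t.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Made-up sample rows. Numeric columns, in order:
# pollution, dew, temp, press, wnd_spd, snow, rain
numeric = np.array([
    [129.0, -16.0, -4.0, 1020.0, 1.79, 0.0, 0.0],
    [148.0, -15.0, -4.0, 1020.0, 2.68, 0.0, 0.0],
    [159.0, -11.0, -5.0, 1021.0, 0.89, 0.0, 0.0],
    [181.0, -7.0, -5.0, 1022.0, 4.92, 1.0, 0.0],
    [138.0, -7.0, -5.0, 1022.0, 6.71, 2.0, 0.0],
])
# categorical wind direction for the same five hours
wnd_dir = ["SE", "SE", "cv", "NW", "NE"]

# integer encode the wind direction
encoded = LabelEncoder().fit_transform(wnd_dir)

# rebuild the 8-column layout: pollution, dew, temp, press,
# wnd_dir, wnd_spd, snow, rain
values = np.column_stack(
    [numeric[:, :4], encoded, numeric[:, 4:]]).astype("float32")

# normalize every feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)

# 8 features at t-1 followed by pollution at t
framed = np.hstack([scaled[:-1, :], scaled[1:, :1]])
print(framed.shape)
```

Keep the fitted scaler object around: it is needed later to turn forecasts back into the original units.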

The complete code listing is provided below.

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

  • One-hot encoding wind direction.
  • Making all series stationary with differencing and seasonal adjustment.
  • Providing more than 1 hour of input time steps.

This last point is perhaps the most important given the use of Backpropagation through time by LSTMs when learning sequence prediction problems.

Define and Fit Model

In this section, we will fit an LSTM on the multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up the training of the model for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits the train and test sets into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, namely [samples, timesteps, features].
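The split and reshape can be sketched with NumPy alone. The array below is a zero-filled stand-in for the framed dataset; its row count of 43,799 is an assumption (five years of hourly rows, minus the first 24 hours and the one row lost to lagging), so check it against your own prepared data.

```python
import numpy as np

# Stand-in for the framed dataset: 8 inputs at t-1 plus pollution at t.
# The row count is an assumption, see the note above.
values = np.zeros((43799, 9), dtype="float32")

# first year for training, the rest for testing
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]

# split into inputs (all but the last column) and output (last column)
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# reshape inputs to 3D: [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```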

Running this example prints the shape of the train and test input and output sets with about 9K hours of data for training and about 35K hours for testing.

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 72. Remember that the internal state of the LSTM in Keras is reset at the end of each batch, so an internal state that is a function of a number of days may be helpful (try testing this).

Finally, we keep track of both the training and test loss during training by setting the validation_data argument in the fit() function. At the end of the run both the training and test loss are plotted.

Evaluate Model

After the model is fit, we can forecast for the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert scaling on the test dataset with the expected pollution numbers.

With forecasts and actual values in their original scale, we can then calculate an error score for the model. In this case, we calculate the Root Mean Squared Error (RMSE) that gives error in the same units as the variable itself.
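The inversion step is the part that most often trips people up, so here is a minimal sketch of it on random stand-in data. The helper name invert_first_column() is hypothetical, and the "forecast" is simply the previous hour's value (persistence) rather than the output of an LSTM. The point is only to show why the forecast column is padded back out to 8 columns before calling inverse_transform().

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for the 8 prepared features, pollution in column 0.
rng = np.random.RandomState(0)
raw = rng.uniform(0, 200, size=(50, 8))

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)

test_X = scaled[:-1, :]   # all 8 features at t-1
test_y = scaled[1:, 0]    # scaled pollution at t
# stand-in forecast (persistence), NOT a trained model
yhat = test_X[:, 0].copy()


def invert_first_column(column, other_features, scaler):
    """Hypothetical helper: undo the scaling of one column.

    The scaler was fit on 8 columns, so the column is padded with the
    7 remaining features before calling inverse_transform().
    """
    padded = np.concatenate((column.reshape(-1, 1), other_features), axis=1)
    return scaler.inverse_transform(padded)[:, 0]


inv_yhat = invert_first_column(yhat, test_X[:, 1:], scaler)
inv_y = invert_first_column(test_y, test_X[:, 1:], scaler)

# RMSE in the original units of the pollution variable
rmse = np.sqrt(np.mean((inv_y - inv_yhat) ** 2))
print("Test RMSE: %.3f" % rmse)
```

Because the scaler treats each column independently, the values used to pad the other 7 columns do not affect the recovered pollution values.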

Complete Example

The complete example is listed below.

NOTE: This example assumes you have prepared the data correctly, e.g. converted the downloaded “raw.csv” to the prepared “pollution.csv”. See the first part of this tutorial.

Running the example first creates a plot showing the train and test loss during training.

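Interestingly, we can see that test loss drops below training loss. This is not the usual signature of overfitting, which would show training loss well below test loss; it may instead reflect differences between the single year used for training and the four years used for testing. Measuring and plotting RMSE during training may shed more light on this.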

Line Plot of Train and Test Loss from the Multivariate LSTM During Training

The train and test loss are printed at the end of each training epoch. At the end of the run, the final RMSE of the model on the test dataset is printed.

We can see that the model achieves a respectable RMSE of 26.496, which is lower than an RMSE of 30 found with a persistence model.

This model is not tuned. Can you do better?
Let me know your problem framing, model configuration, and RMSE in the comments below.

Update: Train On Multiple Lag Timesteps Example

There have been many requests for advice on how to adapt the above example to train the model on multiple previous time steps.

I had tried this and a myriad of other configurations when writing the original post and decided not to include them because they did not lift model skill.

Nevertheless, I have included this example below as a reference template that you could adapt for your own problems.

The changes needed to train the model on multiple previous time steps are quite minimal, as follows:

First, you must frame the problem suitably when calling series_to_supervised(). We will use 3 hours of data as input. Also note, we no longer explicitly drop the columns from all of the other fields at ob(t).

Next, we need to be more careful in specifying the column for input and output.

We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

Next, we can reshape our input data correctly to reflect the time steps and features.
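The column selection and reshape can be checked on a small synthetic array where every value encodes its own time step and feature. This is only a sketch: the stand-in data and the NumPy-based framing are my assumptions, used here so the example runs on its own.

```python
import numpy as np

n_hours, n_features = 3, 8
n_obs = n_hours * n_features  # 24 input columns

# Stand-in data where values[t, j] = 10 * t + j, so rows are easy to trace.
values = 10.0 * np.arange(12).reshape(-1, 1) + np.arange(n_features)

# Frame as [obs(t-3), obs(t-2), obs(t-1), obs(t)]: 3 * 8 + 8 = 32 columns.
framed = np.hstack(
    [values[i:len(values) - n_hours + i] for i in range(n_hours + 1)])

# inputs: every feature over the previous 3 hours
# output: the first feature (pollution) at time t, i.e. column -n_features
X, y = framed[:, :n_obs], framed[:, -n_features]

# reshape inputs to 3D: [samples, timesteps, features]
X = X.reshape((X.shape[0], n_hours, n_features))
print(framed.shape, X.shape, y.shape)
```

Note that the output column is selected with -n_features rather than -1, because the pollution variable is the first of the 8 columns at time t, not the last.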

Fitting the model is the same.

The only other small change is in how we evaluate the model. Specifically, we must reconstruct rows with 8 columns that are suitable for reversing the scaling operation, so that y and yhat are back in the original scale before we calculate the RMSE.

The gist of the change is that we concatenate the y or yhat column with the last 7 features of the test dataset in order to invert the scaling, as follows:
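Here is a minimal sketch of that reconstruction on random stand-in data. The data, the NumPy-based framing and the persistence-style stand-in forecast are assumptions for illustration; only the reshape, the [:, -7:] slice and the concatenation are the point.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

n_hours, n_features = 3, 8

# Random stand-in for the 8 prepared features, pollution in column 0.
rng = np.random.RandomState(1)
raw = rng.uniform(0, 100, size=(40, n_features))
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)

# Frame as [obs(t-3), obs(t-2), obs(t-1), obs(t)] and split as before.
framed = np.hstack(
    [scaled[i:len(scaled) - n_hours + i] for i in range(n_hours + 1)])
test_X = framed[:, :n_hours * n_features].reshape((-1, n_hours, n_features))
test_y = framed[:, -n_features]

# stand-in forecast: pollution at t-1 (persistence), NOT a trained model
yhat = test_X[:, -1, 0].reshape((-1, 1))

# flatten the inputs back to 2D, then keep the last 7 columns
# (the non-pollution features at t-1) to pad y and yhat out to 8 columns
flat_X = test_X.reshape((test_X.shape[0], n_hours * n_features))
last7 = flat_X[:, -7:]

# invert scaling for the forecast
inv_yhat = scaler.inverse_transform(
    np.concatenate((yhat, last7), axis=1))[:, 0]

# invert scaling for the actual values
inv_y = scaler.inverse_transform(
    np.concatenate((test_y.reshape((-1, 1)), last7), axis=1))[:, 0]

rmse = np.sqrt(np.mean((inv_y - inv_yhat) ** 2))
print("Test RMSE: %.3f" % rmse)
```

Reshaping to (samples, n_hours * n_features) rather than (samples, n_features) is what avoids the "cannot reshape array" error that several readers report in the comments.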

We can tie all of these modifications to the above example together. The complete example of multivariate time series forecasting with multiple lag inputs is listed below:

The model is fit as before in a minute or two.

A plot of train and test loss over the epochs is created.

Plot of Loss on the Train and Test Datasets

Finally, the Test RMSE is printed, not really showing any advantage in skill, at least on this problem.

I would add that the LSTM does not appear to be suitable for autoregression type problems and that you may be better off exploring an MLP with a large window.

I hope this example helps you with your own time series forecasting experiments.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to fit an LSTM to a multivariate time series forecasting problem.

Specifically, you learned:

  • How to transform a raw dataset into something we can use for time series forecasting.
  • How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
  • How to make a forecast and rescale the result back into the original units.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

287 Responses to Multivariate Time Series Forecasting with LSTMs in Keras

  1. zorg August 14, 2017 at 7:08 pm #

    except wind *dir*, which is categorical.

  2. Francois AKOA August 15, 2017 at 7:16 am #

    Great post Jason. Thank you so much for making this material available for the community..

  3. yao August 15, 2017 at 2:02 pm #

    hi, jason. There were some problems under my environment which were keras2.0.4and tensorflow-GPU0.12.0rc0.

    And Bug was that “TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.”

    The sentence that “model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))” was located.

    Could you please help me with that?



    • Jason Brownlee August 15, 2017 at 4:54 pm #

      I would recommend this tutorial for setting up your environment:

      • yao August 16, 2017 at 7:18 pm #

        Thx a lot, doctor, it works! fabulous! 🙂

        • Jason Brownlee August 17, 2017 at 6:40 am #

          I’m glad to hear that.

          • Shirley Yang August 18, 2017 at 12:00 pm #

            Dr.Jason, I update TensorFlow then it works!
            Sorry to bother you.
            Thank you very much !
            Best wishes !

          • Jason Brownlee August 18, 2017 at 4:40 pm #

            I’m glad to hear that!

        • Shirley Yang August 17, 2017 at 8:54 pm #

          I met the same problem .

          Did you uninstall all the programs previously installed or just set up the environment again?

          Thx a lot!

      • Shirley Yang August 18, 2017 at 11:43 am #

        Hi Jason,I set up my environment as the your tutorial.

        scipy: 0.19.0
        numoy: 1.12.1
        matplotlib: 2.0.2
        pandas: 0.20.1
        statsmodels: 0.8.0
        sklearn: 0.18.1

        tensorflow: 0.12.1
        Using TensorFlow backend.
        keras: 2.0.5

        But the bug still existed.Is the version of tensorFlow too odd?How could I do?

        • Jason Brownlee August 18, 2017 at 4:39 pm #

          It might be, I am running v1.2.1.

          Perhaps try running Keras off Theano instead (e.g. change the backend in the ~/.keras/keras.json config)

  4. Songbin Xu August 15, 2017 at 10:42 pm #

    It seems that inv_y = scaler.inverse_transform(test_X)[:,0] is not the actual, should inv_yhat be compared with test_y but not pollution(t-1)? Because I think this inv_y here means pollution(t-1). Is this prediction equals to only making a time shifting from the current known pollution value (which means the models just take pollution(t) as the prediction of pollution(t+1))?

    • Jason Brownlee August 16, 2017 at 6:35 am #

      Sorry, I’m not sure I follow. Can you please restate your question, perhaps with an example?

      • Songbin Xu August 16, 2017 at 7:36 pm #

        Sorry for the confusing expression. In fact, the series_to_supervised() function would create a DataFrame whose columns are: [ var1(t-1), var2(t-1), …, var1(t) ] where ‘var1’ represents ‘pollution’, therefore, the first dimension in test_X (that is, test_X[:,0]) would be ‘pollution(t-1)’. However, in the code you calculate the rmse between inv_yhat and test_X[:,0], even though the rmse is low, it could only shows that the model’s prediction for t+1 is close to what it has known at t.
        I am asking this question because I’ve ran through the codes and saw the models prediction pollution(t+1) looks just like pollution(t). I’ve also tried to use t-1, t-2 and so on for training, but still changed nothing.
        Do you think the model tends to learn to just take the pollution value at current moment as the prediction for the next moment?

        thanks 🙂

        • Jason Brownlee August 17, 2017 at 6:42 am #

          If we predict t for t+1 that is called persistence, and we show in the tutorial that the LSTM does a lot better than persistence.

          Perhaps I don’t understand your question? Can you give me an example of what you are asking?

          • Songbin Xu August 17, 2017 at 10:53 am #

            Hmm, it’s difficult to explain without a graph.

            In a word, and also it’s an example, I want to ask two questions:

            1. In the “make a prediction” part of your codes, why it computes rmse between predicted t+1 and real t, but not between predicted t+1 and real t+1?

            2. After the “make a prediction” part of your codes run, it turns out that rmse between predicted t+1 and real t is small, is it an evidence that LSTM is making persistence?

          • Jason Brownlee August 17, 2017 at 4:52 pm #

            RMSE is calculated for y and yhat for the same time periods (well, that was the intent), why do you think they are not?

            Is there a bug?

          • David Righart August 18, 2017 at 5:30 am #

            I think Songbin Xu is right. By executing the statement at line 90: inv_y = inv_y[:,0], you compare the inv_yhat with inv_y. inv_y is the polution(t-1) and inv_yhat is the predicted polution(t).

            On line 50 the second parameter the function series_to_supervised can be changed to 3 or 5, so more days of history are used. If you do so, an error occurs in the scaler.inverse_transform (line 89).

            No worries, great tutorial and I learned a lot so far!

          • Jason Brownlee August 18, 2017 at 6:54 am #

            I see now, you guys are 100% correct. Thank you!

            I have updated the calculation of RMSE and the final score reported in the post.

            Note, I ran a ton of experiments on AWS with many different lag values > 1 and none achieved better results than a simple lag=1 model (e.g. an LSTM model with no BPTT). I see this as a bad sign for the use of LSTMs for autoregression problems.

  5. Simone August 16, 2017 at 1:11 am #

    Hi Jason, great post!

    Is it necessary remove seasonality (by seasonal differentiation) when we are using LSTM?

  6. Slavenya August 16, 2017 at 5:18 am #

    Good article, thank.

    Two questions:
    What changes will be required if your data is sporadic? Meaning sometimes it could be 5 hours without the report.

    And how do you add more timesteps into your model? Obviously you have to reshape it properly but you also have to calculate it properly.

    • Jason Brownlee August 16, 2017 at 6:41 am #

      You could fill in the missing data by imputing or ignore the gaps using masking.

      What do you mean by “add more timesteps”?

      • Slavenya August 16, 2017 at 7:00 pm #

        But what should I do if all data is stochastic time sequence?

        For example predicting time till the next event – when events frequency is stochastically distributed on the timeline.

  7. Jack Dan August 16, 2017 at 5:48 am #


    Thank you for an awesome post.
    (I was practicing on load forecast using MLP and SVR (You also suggested on a comment in your other LSTM tutorials). I also tried with LSTM and it did almost perform like SVR. However, in LSTM, I did not consider time lags because I have predicted future predictor variables that I was feeding as test set. I will try this method with time lags to cross validate the models)

  8. Adam August 16, 2017 at 1:03 pm #

    Hi Jason,

    Can I use ‘look back'(Using t-2 , t-1 steps data to predict t step air pollution) in this case?
    If it’s available,that my input data shape will be [samples , look back , features] isn’t it?

    • Jason Brownlee August 16, 2017 at 5:00 pm #

      You can Adam, see the series_to_supervised() function and its usage in the tutorial.

      • Adam August 18, 2017 at 6:07 pm #

        Hi Jason,

        If I used n_in=5 in series_to_supervised() function,in your tutorial the input shape will be [samples, 1 , features*5].Can I reshape it to [samples, 5 , features]?If I can, what is the difference between these two shape?

        • Jason Brownlee August 19, 2017 at 6:09 am #

          The second dimension is time steps (e.g. BPTT) and the third dimension are the features (e.g. observations at each time step). You can use features as time steps, but it would not really make sense and I expect performance to be poor.

          Here’s how to build a model multiple time steps for multiple features:

          And that’s it. I just tested and it looks good. The RMSE calculation will blow up, but you guys can fix that up I figure.

          • George Khoury August 19, 2017 at 11:55 pm #

            Jason, great post, very clear, and very useful!! I’m about 90% with you and think a few folks may be stuck on this final point if they try to implement multi-feature, multi-hour-lookback LSTM.

            Seems like by making adjustments above, I’m able to make a prediction, but the scaling inversion doesn’t want to cooperate. The reshape step now that we have multiple features and multiple timesteps has a mismatch in the shape, and even if I make the shape work, the concatenation and inversion still don’t work. Could you share what else you changed in this section to make it work? I’m not so concerned about the RMSE as much as that I can extract useful predictions. Thank you for any insight since you’ve been able to do it successfully.

            # make a prediction
            yhat = model.predict(test_X)
            test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
            # invert scaling for forecast
            inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
            inv_yhat = scaler.inverse_transform(inv_yhat)
            inv_yhat = inv_yhat[:,0]

          • Lg September 2, 2017 at 12:40 am #

            Hi Jason,

            Great and useful article.

            I am somewhat puzzled by the number of features you specify to forecast the pollution rate based on data from the previous 24 hours.

            Do not we have 8 features for each time-step and not 7?

            After generating data to supervise with the function series_to_supervised(scaled,24, 1), the resulting array has a shape of (43800, 200) which is 25 * 8.

            To invert the scaling for forecast I made few modifications. I used scaled.shape[1] below but in my opinion it could be n_features. Moreover, I don’t know if the values concatenated to yhat and test_y really matter, as long as they have been scaled with fit_transform and the array has the right shape.

            yhat = model.predict(test_X)
            test_X = test_X.reshape((test_X.shape[0], n_obs))

            # invert scaling for forecast
            inv_yhat = concatenate((yhat, test_X[:, 1:scaled.shape[1]]), axis=1)
            inv_yhat = scaler.inverse_transform(inv_yhat)
            inv_yhat = inv_yhat[:,0]

            # invert scaling for actual
            test_y = test_y.reshape((len(test_y), 1))
            inv_y = concatenate((test_y, test_X[:, 1:scaled.shape[1]]), axis=1)
            inv_y = scaler.inverse_transform(inv_y)
            inv_y = inv_y[:,0]

            The model has 4 layers with dropout.
            After 200 epochs I have got
            loss: 0.0169 – val_loss: 0.0162
            And a rmse = 29.173


          • Jason Brownlee September 2, 2017 at 6:13 am #

            We have 7 features because we drop one in section “2. Basic Data Preparation”.

          • lg September 2, 2017 at 5:59 pm #

            Hi Jason,

            It’s really weird to me :(, as I used your code to prepare the data (pollution.csv) and I have 9 fields in the resulting file.

            [date, pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain]


          • Jason Brownlee September 3, 2017 at 5:40 am #

            Date and wind direction are dropped during data preparation, perhaps you accidentally skipped a step or are reviewing a different file from the output file?

          • Lg September 3, 2017 at 6:22 pm #

            Hi Jason,

            So that’s fine, in my case I have 8 features.

            When reading the file, the field ‘date’ becomes the index of the dataframe and the field ‘wnd_dir’ is later label encoded, as you do above in “The complete example” lines 42-43.

            It is now much clearer for me. I am not puzzled anymore. 😉

            Thanks a lot for all the information contained in your articles and your e-books.

            They are really very informative.


          • Jason Brownlee September 4, 2017 at 4:26 am #

            I’m glad to hear that!

          • Cloud September 20, 2017 at 8:06 pm #

            Hi Jason,
            I think the output is column var1(t), that means:
            train_X, train_y = train[:, 0:n_obs], train[:, -(n_features+1)]
            am I right?
            In case the “pollution” is in the last column, it is easy to get train[:, -1]
            am i right?
            I just want to verify that I understand your post.
            Thank you, Jason

          • Hesam October 11, 2017 at 9:39 pm #

            I have some confusion for this problem.

            I want to use a bigger windows (I want to go back in time more, for example t-5 to include more data to make a prediction of the time t) and use all of this to predict one variable (such as just the pollution), like you did. I think predicting one variable will be more accurate than predicting many. Such as pollution and temperature.

            What should I do to apply more shift?

          • Jason Brownlee October 12, 2017 at 5:29 am #

            I show in another comment how to update the example to use lag obs as input.

            I will update the post and add an example to make it clearer.

          • Kentor October 19, 2017 at 10:01 pm #

            First of all, thanks for your work and the effort you put in!

            I tried to implement your suggestion for increasing the timesteps (BPTT). I have intergrated your code but I keep getting this error in when reshaping test_X in the prediction step:

            test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
            ValueError: cannot reshape array of size 490532 into shape (35038,7)

            Do you have any tips on how to proceed?

          • Jason Brownlee October 20, 2017 at 5:34 am #

            I will update the post with a worked example. Adding to trello now…

  9. Arun August 18, 2017 at 12:45 am #

    Hi Jason, I get the following error from line # 82 of your ‘Complete Example’ code.

    ValueError: Error when checking : expected lstm_1_input to have 3 dimensions, but got array with shape (34895, 8)

    I think LSTM() is looking for (sequences, timesteps, dimensions). In your code, line # 70, I believe 50 is timesteps while input_shape (1,8) represents the dimensions. May be it’s missing ‘sequences’ ?

    Appreciate your response.

    • Jason Brownlee August 18, 2017 at 6:25 am #

      Ensure that you first prepare the data (e.g. convert “raw.csv” to “pollution.csv”).

  10. Neal Valiant August 18, 2017 at 2:35 am #

    Hi Jason, I am wondering what the issue that I’m getting is caused by, maybe a different type of dataset then the example one. basically when I run the history into the model, When i check the History.history.keys() I only get back ‘loss’ as my only key.

    • Jason Brownlee August 18, 2017 at 6:27 am #

      You must specify the metrics to collect when you compile the model.

      For example, in classification:

  11. Aman Garg August 18, 2017 at 4:18 pm #

    Hello Jason,

    Thank you for such a nice tutorial.

    Since you have published a similar topic and few other related topics in one of your paid books (LSTM networks), should the reader also expect some different topics covered in it?

    I’m an ardent fan of your blogs since it covers most of the learning material and therefore, it makes me wonder that will be different in your book?

    • Jason Brownlee August 18, 2017 at 4:42 pm #

      Thanks Aman.

      The book does not cover time series, instead it focuses on teaching you how to implement a suite of different LSTM architectures, as well as prepare data for your problems.

      Some ideas were tested on the blog first, most are only in the book.

      You can see the full table of contents here:

      The book provides all the content in one place, code as well, more access to me, updates as I fix bugs and adapt to new APIs, and it is a great way to support my site so I can keep doing this.

  12. Songbin Xu August 18, 2017 at 6:54 pm #

    Thank you for accepting my opinions, such a pleasure!

    Running the codes u modified, still something puzzles me here,

    1. Have u drawn the waveforms of inv_y and inv_yhat in the same plot? I think they looks quite like persistence.

    2. Curiously, I computed the rmse between pollution(t) and pollution(t-1) in test_X, it’s 4.629, much lower than your final score 26.496, does it mean LSTM performs even worse than persistence?

    3. I’ve tried to remove var1 at t-1, t-2, … , and I’ve also tried to use lag values>1, and also assign different weights to the inputs at different timesteps, but none of them improved, they performed even worse.

    Do you have any other ideas to avoid the whole model to learn persistence?

    Looking forward to your advices 🙂

  13. Varuna Jayasiri August 19, 2017 at 2:51 pm #

    Why are you only training with a single timestep (or sequence length)? Shouldn’t you use more timesteps for better training/prediction? For instance in they use 40 (maxlen) timesteps

    • Jason Brownlee August 20, 2017 at 6:05 am #

      Yes, it is just an example to help you get started. I do recommend using multiple time steps in order to get the full BPTT.

      • Long.Ye August 23, 2017 at 11:06 am #

        Hi Jason and Varuna,

        When the timesteps = 1 as you mentioned, does it mean the value of t-1 time was used to predict the value of t time? Is moving window a method to use multiple time steps? Is there any other way? Has Keras any functions of moving window?

        Thank you very much.

        • Jason Brownlee August 23, 2017 at 4:23 pm #

          Keras treats the “time steps” of a sequence as the window, kind of. It is the closest match I can think of.

  14. lymlin August 20, 2017 at 4:28 pm #

    Hi Jason,
    I met some problem when learning your codes.

    dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)
    Traceback (most recent call last):
    File “”, line 1, in
    dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)
    NameError: name ‘parse’ is not defined

    • Jason Brownlee August 21, 2017 at 6:04 am #

      It looks like you have specified a function “parse” but not defined it.
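
      For reference, a date-parsing function along the lines of the one used in the tutorial could be defined before calling read_csv. This is a minimal sketch that assumes the year, month, day and hour columns are joined with single spaces before being parsed.

```python
from datetime import datetime


def parse(x):
    # x is expected to look like "2010 1 1 0" (year month day hour)
    return datetime.strptime(x, '%Y %m %d %H')


print(parse('2010 1 1 0'))
```

      Note that the tutorial passes this function through the date_parser argument, whereas the call above spells it data_parser.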

  15. guntama August 21, 2017 at 11:30 am #

    Hi Jason,
    Can I use “keras.layers.normalization.BatchNormalization” as a substitute for “sklearn.preprocessing.MinMaxScaler”?

  16. Naveen Koneti August 21, 2017 at 10:56 pm #

    Hi Jason, it’s a very informative article. Thanks. I have a question regarding forecasting in time series. After the variable transformations, you used all the columns of the training data for learning, and did the same for the test data. The test data, with all of its variables, was then used during prediction. For instance, if I want to predict the pollution for a future date, do I need to know the other inputs (dew, pressure, wind direction, etc.) for that future date, which I am not aware of? Another question: suppose we have the same data for multiple regions (let us assume the pollution in each of these regions is not negligible). How can we build a model so that the inputs at prediction time are the region name and the time to forecast, for just that one region?

    • Jason Brownlee August 22, 2017 at 6:43 am #

      It depends on how you define your model.

      The model defined above uses the variables from the prior time step as inputs to predict the next pollution value.

      In your case, maybe you want to build a separate model per region, perhaps a model that improves performance by combining models across regions. You must experiment to see what works best for your data.
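
      To make the "prior time step as input" framing concrete, here is a small sketch using pandas shift. The column names and values are made up for illustration. Each row pairs the variables at t-1 with the pollution at t, so predicting the next hour only needs the last observed row.

```python
import pandas as pd

# toy frame with two variables (hypothetical values)
df = pd.DataFrame({
    'pollution': [10.0, 20.0, 30.0, 40.0],
    'dew': [1.0, 2.0, 3.0, 4.0],
})

# inputs: every variable at t-1; output: pollution at t
framed = pd.DataFrame({
    'pollution(t-1)': df['pollution'].shift(1),
    'dew(t-1)': df['dew'].shift(1),
    'pollution(t)': df['pollution'],
}).dropna()  # the first row has no t-1 values, so it is dropped

print(framed)
```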

      • Naveen Koneti August 24, 2017 at 4:12 pm #

        Thanks! I missed the trick of converting the time-series to supervised learning problem. That alone is sufficient even for multiple regions I guess. We just have to submit the input parameters of the previous time stamp for the specific region during prediction. We may also try one-hot encoding on the region variable too during data preprocessing.

      • LY September 7, 2017 at 8:12 pm #

        Thank you for your excellent blog, Jason. I’ve really learnt a lot from your nice work recently. After this post, I now know how to transform data into the format an LSTM expects and how to construct an LSTM model.

        Like Naveen Koneti, who asked the question above, I have the same puzzle.
        Recently I’ve worked on some clinical data. The data is not like the one we used in this demo. It consists of hundreds of patients, and each patient has several vital sign records. If it were one individual’s records over many years, I could process the data the way you showed us. I wonder how I can handle this kind of data. Could you give me some advice, or tell me where I can find solutions for it?
        If I didn’t state my question clearly and you’re interested in it, please let me know.
        Thanks in advance.

        PS. the data set in my situation is like this
        [ID date feature1 feature2 feature3 ]
        [patient1 date1 value11 value12 value13 ]
        [patient1 date2 value21 value22 value23 ]
        [patient2 date1 value31 value32 value33 ]
        [patient2 date2……………………………………..]
        [patient3 ……………………………………………..]

  17. Chris August 21, 2017 at 11:23 pm #

    again a nice post for the use of lstm’s!

    I had the following idea when reading.

    I would like to build a network, in which each feature has its own LSTM neuron/layer, so that the input is not fully connected.
    My idea is adding a lstm layer for each feature and merge it with the merge layer and feed these results to the output neurons.

    Is there a better way to do this? Or would you recommend to avoid this because the features are poorly abstracted? On the other hand, this might also be interesting.

    Thank you!

    • Jason Brownlee August 22, 2017 at 6:44 am #

      Try it and see if it can out-perform a model that learns all features together.

      Also, compare it to an MLP with a window – that often does better than LSTMs on autoregression problems.

  18. Tryfon August 22, 2017 at 5:20 am #

    Hi Jason,

    I have two questions:

    1) I have a question/ notice regarding the scaling of the Y variable (pollution). The way you implement the rescaling between [0-1] you consider the entire length of the array (all of the 43799 observations -after the dropna-).

    Is it right to rescale it that way? By doing so we are incorporating information from the future (test set) into the past (train set), because the scaler is “exposed” to both of them, and therefore we introduce bias.

    If you agree with my point what could be a fix?

    2) Also the activation function of the output (Y variable) is sigmoid, that’s why we rescale it within the [0,1] range. Am I correct?

    Thanks for sharing the article!

    • Jason Brownlee August 22, 2017 at 6:49 am #

      No, ideally you would develop a scaling procedure on the training data and use it on test and when making predictions on new data.

      I tried to keep the tutorial simple by scaling all data together.

      The activation on the output layer is ‘linear’, the default. This must be the case because we are predicting a real-value.
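
      As a sketch of what "develop the scaling on the training data and reuse it on test" could look like, the example below learns per-column min and max from the training rows only and applies them to both splits. It is written by hand in NumPy with toy values. With scikit-learn, the equivalent would be to call fit on the training rows and transform on the test rows.

```python
import numpy as np

# toy data: rows are time steps, columns are features (hypothetical values)
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 50.0]])
train, test = data[:3], data[3:]

# learn the min and max from the training rows only
col_min = train.min(axis=0)
col_max = train.max(axis=0)


def scale(x):
    # min-max scaling to [0, 1] using the training statistics
    return (x - col_min) / (col_max - col_min)


train_scaled = scale(train)
test_scaled = scale(test)  # may fall outside [0, 1]; that is expected
print(test_scaled)
```

      Test values outside the training range simply map outside [0, 1], which is the price of not peeking at the test set.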

  19. WCH August 22, 2017 at 5:25 pm #

    Thank you very much for your tutorial.

    I have one question.

    I failed to read the NW value in pollution.csv (cbwd column).

    values = values.astype(‘float32’)
    ValueError: could not convert string to float: NW

    How do you fix it?

    • WCH August 22, 2017 at 5:30 pm #

      sorry, I saw the text above and solved it.

  20. Dmitry August 22, 2017 at 5:58 pm #

    Hi Jason!
    I think there is a small mistake when you calculate RMSE on the test data.
    You must add this code before calculating RMSE:

    inv_y = inv_y[:-1]
    inv_yhat = inv_yhat[1:]

    Thus, RMSE equals 10.6 (on the same data, in my case), that is much less than 26.5 in your case.

    • Jason Brownlee August 23, 2017 at 6:44 am #

      Sorry, I don’t understand your comment and snippet of code, can you spell out the bug you see?

      • Tommy November 12, 2017 at 2:50 pm #

        This bears further exploration

  21. jan August 22, 2017 at 11:01 pm #

    Hi Jason,

    great post! I was waiting for meteo problems to infiltrate the machinelearningmastery world.

    Could you write something about the changed scenario where, given the weather conditions and pollution for some time, we can predict the pollution for another time or place with given weather conditions?

    For example: We have the weather conditions and pollution given for Beijing in 2016, and we have the weather conditions given for Chengde (a city close to Beijing) also in 2016. Now we want to know what the pollution was in Chengde in 2016.

    Would be great to learn about that!

    • Jason Brownlee August 23, 2017 at 6:52 am #

      Great suggestion, I like it. An approach would be to train the model to generalize across geographical domains based only on weather conditions.

      I have tried not to use too many weather examples – I came from 6 years of work in severe weather, it’s too close to home 🙂

  22. Simone August 23, 2017 at 9:43 am #

    Hi Jason,
    I have read many of your posts about LSTM. I am not completely clear on the difference between the parameters batch_size and time_steps. Batch_size determines when the memory is reset (right?), but shouldn’t this have the same value as time_steps, which, if I have understood correctly, means how often the system makes a prediction?

    • Jason Brownlee August 23, 2017 at 4:22 pm #

      Great question!

      Batch size is the number of samples (e.g. sequences) that are used to estimate the gradient before the weights are updated. The internal state is reset at the end of each batch after the weights are updated.

      One sample is comprised of 1 or more time steps that are stepped over during backpropagation through time. Each time step may have one or more features (e.g. observations recorded at that time).

      Time steps and batch size are generally not related.

      You can split up a sequence to have one time step per sequence. In that case you will not get the benefit of learning across time (e.g. BPTT), but you can reset state at the end of the time steps for one sequence. This is an odd config though, and really only good for showing off the LSTM’s memory capability.
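
      A small NumPy sketch may help separate the two ideas. The sizes below are arbitrary assumptions. The input is laid out as [samples, timesteps, features], and batching only slices along the first (samples) axis.

```python
import numpy as np

n_samples, n_timesteps, n_features = 12, 3, 8  # hypothetical sizes
X = np.zeros((n_samples, n_timesteps, n_features))

batch_size = 4
# each batch groups whole samples; the time-step dimension is untouched
batches = [X[i:i + batch_size] for i in range(0, n_samples, batch_size)]
print(len(batches), batches[0].shape)
```

      Changing batch_size alters how many samples feed each weight update, while n_timesteps controls how far back each individual sample reaches.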

      Does that help?

      • Simone August 24, 2017 at 6:26 am #

        Thanks, now it’s more clear!

  23. Pedro August 23, 2017 at 8:58 pm #

    Hi, I get this error at this step. Could you help me please?

    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    TypeError Traceback (most recent call last)
    in ()
    —-> 1 model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    C:\Anaconda3\lib\site-packages\keras\ in add(self, layer)
    431 # and create the node connecting the current layer
    432 # to the input layer we just created.
    –> 433 layer(x)
    435 if len(layer.inbound_nodes) != 1:

    C:\Anaconda3\lib\site-packages\keras\layers\ in __call__(self, inputs, initial_state, **kwargs)
    241 # modify the input spec to include the state.
    242 if initial_state is None:
    –> 243 return super(Recurrent, self).__call__(inputs, **kwargs)
    245 if not isinstance(initial_state, (list, tuple)):

    C:\Anaconda3\lib\site-packages\keras\engine\ in __call__(self, inputs, **kwargs)
    556 ‘‘)
    557 if len(input_shapes) == 1:
    –> 558[0])
    559 else:

    C:\Anaconda3\lib\site-packages\keras\layers\ in build(self, input_shape)
    1010 initializer=bias_initializer,
    1011 regularizer=self.bias_regularizer,
    -> 1012 constraint=self.bias_constraint)
    1013 else:
    1014 self.bias = None

    C:\Anaconda3\lib\site-packages\keras\legacy\ in wrapper(*args, **kwargs)
    86 warnings.warn(‘Update your ' + object_name +
    87 ' call to the Keras 2 API: ‘ + signature, stacklevel=2)
    —> 88 return func(*args, **kwargs)
    89 wrapper._legacy_support_signature = inspect.getargspec(func)
    90 return wrapper

    C:\Anaconda3\lib\site-packages\keras\engine\ in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint)
    389 if dtype is None:
    390 dtype = K.floatx()
    –> 391 weight = K.variable(initializer(shape), dtype=dtype, name=name)
    392 if regularizer is not None:
    393 self.add_loss(regularizer(weight))

    C:\Anaconda3\lib\site-packages\keras\layers\ in bias_initializer(shape, *args, **kwargs)
    1002 self.bias_initializer((self.units,), *args, **kwargs),
    1003 initializers.Ones()((self.units,), *args, **kwargs),
    -> 1004 self.bias_initializer((self.units * 2,), *args, **kwargs),
    1005 ])
    1006 else:

    C:\Anaconda3\lib\site-packages\keras\backend\ in concatenate(tensors, axis)
    1679 return tf.sparse_concat(axis, tensors)
    1680 else:
    -> 1681 return tf.concat([to_dense(x) for x in tensors], axis)

    C:\Anaconda3\lib\site-packages\tensorflow\python\ops\ in concat(concat_dim, values, name)
    998 ops.convert_to_tensor(concat_dim,
    999 name=”concat_dim”,
    -> 1000 dtype=dtypes.int32).get_shape(
    1001 ).assert_is_compatible_with(tensor_shape.scalar())
    1002 return identity(values[0], name=scope)

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
    668 if ret is None:
    –> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    671 if ret is NotImplemented:

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    174 as_ref=False):
    175 _ = as_ref
    –> 176 return constant(v, dtype=dtype, name=name)

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in constant(value, dtype, shape, name, verify_shape)
    163 tensor_value = attr_value_pb2.AttrValue()
    164 tensor_value.tensor.CopyFrom(
    –> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
    166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
    167 const_tensor = g.create_op(

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in make_tensor_proto(values, dtype, shape, verify_shape)
    365 nparray = np.empty(shape, dtype=np_dt)
    366 else:
    –> 367 _AssertCompatible(values, dtype)
    368 nparray = np.array(values, dtype=np_dt)
    369 # check to them.

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in _AssertCompatible(values, dtype)
    300 else:
    301 raise TypeError(“Expected %s, got %s of type ‘%s’ instead.” %
    –> 302 (, repr(mismatch), type(mismatch).__name__))

    TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.

  24. Neal Valiant August 24, 2017 at 2:49 am #

    Hi Jason,
    I was curious if you can point me in the right direction for converting data back to the actual values instead of scaled.

    • Jason Brownlee August 24, 2017 at 6:48 am #

      Yes, you can invert the scaling.

      This tutorial demonstrates how to do that Neal.

      • Neal Valiant August 25, 2017 at 7:34 am #

        Hi Jason, I did have an issue converting back to actual values, but I was able to get past it by dropping columns on the reframed data.

        When looking at my predicted values vs actual values, I’m noticing that my first column has a prediction and a true value, but for every other variable I only see what I assume is a prediction. Does this make a prediction for every column, or just one particular one?

        I’m sorry for asking a question such as this, I just think I’m confusing myself looking at my results.

        • Jason Brownlee August 25, 2017 at 3:56 pm #

          The code in the tutorial only predicts pollution.

  25. Jack Dan August 24, 2017 at 3:24 am #

    Dr. Jason,
    I have been trying with my own dataset and I am getting an error “ValueError: operands could not be broadcast together with shapes (168,39) (41,) (168,39)” when I try to do inv_yhat = scaler.inverse_transform(inv_yhat) as you have in line 86 in your script. I still can not figure out where my issue is. I have yhat.shape as (168,1) and test_X.shape as (168,38). When I do this, inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1), my inv_yhat.shape is (168,39). I still can not figure why inverse_transform gives that error.

    • Jason Brownlee August 24, 2017 at 6:50 am #

      The shape of the data must be the same when inverting the scale as when it was originally scaled.

      This means, if you scaled with the entire test dataset (all columns), then you need to tack the yhat onto the test dataset for the inverse. We jump through these exact hoops at the end of the example when calculating RMSE.
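
      A minimal sketch of that idea with scikit-learn’s MinMaxScaler, using toy data (three columns, with column 0 standing in for pollution). The point is that the array passed to inverse_transform has the same number of columns as the array the scaler was fit on.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data with 3 columns; column 0 plays the role of pollution
data = np.array([[10.0, 1.0, 5.0], [20.0, 2.0, 6.0], [30.0, 3.0, 7.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# pretend these are scaled predictions for column 0, shape (rows, 1)
yhat = scaled[:, :1]

# rebuild an array as wide as the one the scaler was fit on
padded = np.concatenate((yhat, scaled[:, 1:]), axis=1)
assert padded.shape[1] == data.shape[1]

# invert the scaling, then keep only the pollution column
inv_yhat = scaler.inverse_transform(padded)[:, 0]
print(inv_yhat)
```

      If the widths differ (for example 39 columns against a scaler fit on 41), the broadcast error above is what to expect.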

      • Jay Regalia August 24, 2017 at 7:29 am #

        This seems to be the same issue I am having at the moment. I concatenate my inv_yhat with my test_X like you said, but the shape of inv_yhat afterwards still does not match the second shape in the error (in the post’s case, (41,)).

        • Jack Dan August 26, 2017 at 6:00 am #

          Ask a question on Stack Overflow and post the link, and I should be able to help. I spent lots of time on this and have a decent idea now.

      • Jack Dan August 24, 2017 at 7:39 am #

        Yes, you’re right! I did that and it worked, nice! Thank you for your comment!

      • John Regilina August 24, 2017 at 8:38 am #

        I am having the same problem, but cannot solve the issue. Every time I try to concatenate them together, there is no change to my inv_yhat variable. I am still unable to understand this issue; if you can expand a bit more, that would be amazing.

        • Jack Dan August 26, 2017 at 6:08 am #

          @John Regilina,
          Check the shape of the data after you scale it, and then check the shape again after you do the concatenation. Remember, your yhat shape will be (rowlength,1), and after concatenation inv_yhat should have the same shape as the data you scaled. Look at Dr. Jason’s answer to my comment/question. Hope that will help. (Thanks to Dr. Jason, who saved a lot of my time.)

    • Shan September 19, 2017 at 1:59 pm #

      I am also stuck with same thing. How did you fix it?

  26. Lizzie August 24, 2017 at 4:23 am #

    Hi Jason, In dataset.drop(‘No’, axis =1, inplace = True), what is the purpose of ‘axis’ and ‘inplace’?

    • Jason Brownlee August 24, 2017 at 6:50 am #

      Great question.

      We specify to remove the column with axis=1 and to do it on the array in memory with inplace rather than return a copy of the array with the column removed.
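
      A tiny illustration of both arguments, using a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2], 'pm2.5': [129.0, 148.0]})

# without inplace, drop returns a new frame and leaves df unchanged
copy = df.drop('No', axis=1)
print(list(df.columns), list(copy.columns))

# with inplace=True, df itself is modified and the call returns None
result = df.drop('No', axis=1, inplace=True)
print(list(df.columns), result)
```

      axis=0 would instead look for a row labelled ‘No’ in the index.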

  27. Lizzie August 24, 2017 at 4:44 am #

    Fabulous tutorials Jason!

  28. Jaskaran August 24, 2017 at 5:19 am #

    Can you show what the multivariate forecast looks like?
    Looks like you missed it in the article.

    • Jason Brownlee August 24, 2017 at 6:56 am #


      You can plot all predictions as follows:

      You get:

      It’s a mess, you can plot the last 100 time steps as follows:

      You get:

      The predictions look like persistence.

      • BEN BECKER August 29, 2017 at 1:33 pm #

        Jason, what am I missing, looking at your plot of the most recent 100 time steps, it looks like the predicted value is always 1 time period after the actual? If on step 90 the actual is 17, but the predicted value shows 17 for step 91, we are one time period off, that is if we shifted the predicted values back a day, it would overlap with the actual which doesn’t really buy us much since the next hour prediction seems to really align with the prior actual. Am I missing something looking at this chart?

        • Jason Brownlee August 29, 2017 at 5:16 pm #

          This is what a persistence forecast looks like, that value(t) = value(t-1).
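
          A persistence baseline is easy to compute directly, which gives a reference score to compare a model against. The readings below are made-up values for illustration.

```python
import numpy as np

# toy hourly pollution readings (hypothetical values)
obs = np.array([10.0, 12.0, 15.0, 14.0, 18.0])

# persistence: the forecast for time t is the observation at t-1
persistence = obs[:-1]
actual = obs[1:]

rmse = np.sqrt(np.mean((actual - persistence) ** 2))
print(rmse)
```

          A model whose RMSE is no better than this number has not learned more than value(t) = value(t-1).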

          • BECKER August 29, 2017 at 9:22 pm #

            So how would you get the true predicted value(t)? I am thinking of the last record in the time series where we are trying to predict the value for the next hour.

          • Jason Brownlee August 30, 2017 at 6:15 am #

            Sorry, I don’t follow. Perhaps you can restate your question?

          • Anna October 2, 2017 at 4:38 pm #

            Hello Jason Brownlee

            Thank you for your great posts. I ran the model above on my data and it works perfectly; however, when I plot the real data (the blue one – inv_y) and the prediction (the orange one – inv_yhat), the result shows the prediction is delayed by 1 step. It should predict one step earlier, as in your graph. Your model is the same as the matlab tool:

            After running the model, I applied it in real time to my problem to compute inv_yhat at every step. The result was really bad, since I never have the real inv_y; I fed the prediction back in as the input (instead of the real data inv_y).

            My problem is: I receive some signals as inputs, then label them offline to get the output (real data inv_y, or the first column in train_X).

            Do you have a model that trains without the real data in the first column? Thank you.

          • Jason Brownlee October 3, 2017 at 5:40 am #

            Your model may have low skill and be simply predicting the input as the output (e.g. persistence).

            You may need to continue to develop your model, I list some ideas for lifting model skill here:

      • Tyler Byers October 26, 2017 at 3:40 am #

        It’s definitely similar to a persistence model since we trained the model using the var1(t-1) feature (i.e. the lagged pollution feature). The model certainly found that to be the strongest predictor. This would be ok if we were doing predictions later on an hour-by-hour basis. But, if, say we want to predict the pollution 20 hours from now, we aren’t yet going to know what the hour-19 pollution is. So it seems like cheating to include this variable in the training and prediction sets.

        I removed this variable to train the model, leaving other parameters about the same, and was then only able to get a minimum validation loss of 0.55 and test RMSE of 87.02

  29. gammarayburst August 24, 2017 at 11:32 pm #

    Wind dir is label encoded not wind speed!!!

  30. Filipe August 27, 2017 at 4:16 am #

    First of all, thanks. All of this material on the blog is super interesting, and helpful and making me learn a lot.

    Of course… I have a question.

    I’m surprised by the use of LSTMs here. The property of them being “stateful” I guess is being used. But is there “sequence” information flowing?

    So when I used LSTMs in Keras for text classification tasks (sentence, outcome), each “sentence” is a sequence. Each observation is a sequence. It’s an ordered array of the words in the sentence (and its outcome).
    In this example, I could not see a sense in which var1(t-1) is linked to var1(t-2). Aren’t they being treated as independent Xs in a regression problem? (predicting var8(t))

  31. STYLIANOS IORDANIS August 27, 2017 at 5:23 am #

    Awesome article, as always.
    Btw, what is your view on using an autoencoder/restricted Boltzmann layer to compress the features before feeding an LSTM network? For example, if one has a financial time series to forecast, e.g. a classifier trying to predict an increase or decrease in a look-ahead time window, via numerous technical indicators and/or other candidate exogenous leading indicators…
    Could you write an article based on that idea?

    • Jason Brownlee August 27, 2017 at 5:53 am #

      I have seen better results from large MLPs, nevertheless, try it and see how you go.

      • STYLIANOS IORDANIS August 27, 2017 at 7:25 am #

        Autoencoder/restricted Boltzmann layers also deal with multicollinearity issues… do MLPs also deal with multicollinearity in the features?

        • Jason Brownlee August 28, 2017 at 6:46 am #

          MLPs are more robust to multicollinearity than linear models.

  32. Hee Un August 29, 2017 at 12:28 am #

    Hi, I am always amazed at your article. Thank you.
    I have a question.
    Does this LSTM code weight each feature?
    These days I’m predicting precipitation; the trend is correct, but the amount is not right.
    What’s wrong with that? :(

    • Jason Brownlee August 29, 2017 at 5:06 pm #


      Sorry, I’m not sure I understand the question, perhaps you could rephrase it?

      I can say that I would expect better skill if the data was further prepared – e.g. made stationary.

  33. Vipul August 30, 2017 at 7:53 pm #

    Hi Jason,

    Thanks for wonderful explanation!
    Could you please help me understand the concept of dimensionality reduction? Should PCA or a statistical approach be used before feeding the data to the LSTM, or will the LSTM learn the correlations among the inputs on its own? How should a regression problem be approached with an LSTM when we have a large set of features?

    Your reply is greatly appreciated!

    • Jason Brownlee August 31, 2017 at 6:18 am #

      Generally, if you make the problem simpler using data preparation, the LSTM or any model will perform better.

  34. Nader August 31, 2017 at 2:42 am #

    How can I predict a single input ?
    for example :

    [0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001]

    how do i reshape and do a model.predict () ?

    Thank you

    • Jason Brownlee August 31, 2017 at 6:23 am #

      Perhaps this post will make it clearer:

      • Nader August 31, 2017 at 12:48 pm #

        Thank you, Jason.
        I applied:

        my_x = np.array([0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001])
        print(my_x.shape) # (8,)
        my_x = my_x.reshape((1, 1, 8))
        my_pred = model.predict(my_x)

        The answer is the “scaled” answer which is 0.03436

        I tried applying the scaler.inverse_transform(my_pred) to GET the actual number

        But I get the following error:

        non-broadcastable output operand with shape (1,1) doesn’t match the broadcast shape (1,8)

        Thank you

        • Jason Brownlee September 1, 2017 at 6:40 am #

          Yes, the transform requires data in the same form as when you “fit” it.
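
          One way around this for a single prediction is sketched below. It is an illustration rather than the tutorial’s code. A hand-written min-max inverse stands in for scaler.inverse_transform so the example needs only NumPy, and the raw array is random stand-in data with 8 columns and pollution first. Either pad the (1,1) prediction into a full-width row before inverting, or invert the pollution column on its own using that column’s min and max.

```python
import numpy as np

# stand-in for the data the scaler was fit on: 8 columns, pollution first
rng = np.random.RandomState(1)
raw = rng.uniform(0.0, 100.0, size=(50, 8))
col_min, col_max = raw.min(axis=0), raw.max(axis=0)

my_pred = np.array([[0.03436]])  # scaled prediction, shape (1, 1)

# option 1: pad the prediction into a full-width row, then invert every column
row = np.zeros((1, raw.shape[1]))
row[:, 0] = my_pred[:, 0]
inverted_row = row * (col_max - col_min) + col_min
pollution_a = inverted_row[0, 0]

# option 2: invert only the first column using its own min and max
pollution_b = my_pred[0, 0] * (col_max[0] - col_min[0]) + col_min[0]

print(pollution_a, pollution_b)
```

          Both options give the same value, because min-max scaling treats each column independently.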

          • David September 23, 2017 at 3:27 pm #

            Then what if I use multi-time step prediction? (use several lags for prediction)
            The y_hat and X_test can not have the same dimension.

          • Jason Brownlee September 24, 2017 at 5:13 am #

            If the size of X or y must vary, you can use padding.

  35. Fejwin August 31, 2017 at 3:52 am #

    Hi Jason,
    Thanks for the tutorial!
    Maybe I missed something, but it seems that you provided the model with all of remaining data as ‘testdata’ and then tried predicting it? Isn’t that kind of pointless, since we should be interested in predicting unknown data in the future, instead of data that the model has already seen? Wouldn’t it make more sense to try the model to predict a first timestep into the future that neither the training nor the test data knew anything about? (Perhaps only give the model training data, but no test data, and afterwards ask it to predict first time step after training data?) How would I have to change the code to achieve that?

    • Jason Brownlee August 31, 2017 at 6:25 am #

      The model is fit on the training data, then makes a prediction for each step in the test data. The model did not “know” the answer to the test data prior to making each prediction.

      Normally we would use walk-forward validation:

      I did use walk forward validation on other LSTM examples (use the blog search) but it confuses readers more than helps it seems.
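
      For readers who have not seen it, walk-forward validation can be sketched roughly as follows. The "model" here is only a placeholder (the mean of the last three known observations), and the series values are invented. In practice, the placeholder would be replaced by refitting or updating a real model at each step.

```python
import numpy as np

series = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 17.0, 20.0, 22.0])  # toy data
n_train = 5
history = list(series[:n_train])
predictions = []

for t in range(n_train, len(series)):
    # placeholder model: mean of the last 3 known observations
    yhat = float(np.mean(history[-3:]))
    predictions.append(yhat)
    # the true value becomes available before the next forecast is made
    history.append(series[t])

errors = series[n_train:] - np.array(predictions)
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(predictions, rmse)
```

      Each forecast uses only observations that would have been available at that point in time.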

      • Guillermo November 8, 2017 at 9:19 pm #

        Hi Jason.

        I am digging into your example and maybe missing something because I agree with Fejwin.

        I mean, as long as the real Pollution at t-1 is included in the test_X set, instead of the predicted Pollution at t-1, then when you run model.predict(test_X) each output is not used for later predictions.

        That is, with all the features, including the real Pollution(t-1), the model predicts an output: predicted Pollution(t). But on the next step, when the model predicts Pollution(t+1), it doesn’t take the predicted Pollution(t); it takes the real Pollution(t) instead.

        Can you clarify this point please?

        Thank you.

        • Jason Brownlee November 9, 2017 at 9:58 am #

          Yes, the assumption in the setup of the problem is that each prior hour’s pollution is available when predicting t+1.

          You could change the framing of the problem if you wish.

    • David September 24, 2017 at 1:01 pm #

      Can I use part of trainX to predict testY ? (lags needed to predict testY is in trainX) Not sure if it is a logical way to do it.

  36. hadi September 1, 2017 at 12:08 pm #

    Dear Jason Brownlee,

    I have a slightly different question. I have a sequence of characters as input and I want to project it into a multidimensional space.
    I mean I want to project each sequence of chars (let’s say a word) in my corpus to a vector of 100 real numbers, so my input is a sequence of chars (any char embedding is welcome) and my output is a vector for each sequence (which is a word), and I’m really confused about how to define the model.
    I would appreciate any clue, help or sample code for defining my model.

    Thanks a lot in advance.

  37. Sai k September 2, 2017 at 12:12 am #

    Hi Jason,

    Thanks for the wonderful tutorial!
    Could you please explain how to deal with the problem when the situation is “Predict the pollution for the complete month (assume month has 30 days. t+1…t+30) and given the “expected” weather features for that month…assuming we have been provided historic data of pollution and weather data on daily basis”

    How should the data be prepared and how should it be fed into the LSTM?

    As I am new to LSTM models, I have trouble understanding the data preparation and how to feed the data to the LSTM.

    Thanks in advance for your response

  38. Adrian September 5, 2017 at 5:29 am #

    Hi Jason,

    Thanks for sharing. I added accuracy info to model while training using ‘ metrics=[‘accuracy’] ‘.

    So model.compile(loss=’mae’, optimizer=’adam’) becomes :

    model.compile(loss=’mae’, optimizer=’adam’, metrics=[‘accuracy’])

    This adds acc & val_acc to output. After 100 epochs the acc value appears quite low : (0.0761) :
    Epoch 100/100
    1s – loss: 0.0143 – acc: 0.0761 – val_loss: 0.0132 – val_acc: 0.0393

    The accuracy of the model appears very low ? Is this expected ?

    Further info on acc & val_acc values : “acc is the accuracy of a batch of training data and val_acc is the accuracy of a batch of testing data.”

    • Jason Brownlee September 7, 2017 at 12:38 pm #

      This is a regression problem. Accuracy does not make sense.
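
      Error measures such as MAE or RMSE are the usual alternatives for a real-valued target. A short sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([30.0, 45.0, 60.0])  # hypothetical observed pollution
y_pred = np.array([28.0, 50.0, 57.0])  # hypothetical forecasts

mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(mae, rmse)
```

      Accuracy counts exact matches, which almost never occur with continuous values, so it stays near zero even when the forecasts are close.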

  39. Eric H September 5, 2017 at 6:33 am #

    Hi Jason, I’ve recently discovered your site and have been so pleased with your information – thank you. I’ve been trying to model data which is much like the air quality data described here, but every few time steps there will be a change in the number of features present.
    Example: in my data a time step = 1 day and a sequence can be 800 – 1200 days long. Normally the data consists of features
    – pm2.5: PM2.5 concentration
    – DEWP: Dew Point
    – TEMP: Temperature
    – PRES: Pressure
    – cbwd: Combined wind direction
    – Iws: Cumulated wind speed
    – Is: Cumulated hours of snow
    – Ir: Cumulated hours of rain

    But then every (random-ish amount of time) there will be an additional number of features for a day and then back to the baseline number of features.

    I’ve no idea on how to handle variable feature length. I’ve seen and played with plenty of variable sequence length examples, but I have both variable sequenceS and features. I’d love your input!

    • Jason Brownlee September 7, 2017 at 12:40 pm #

      You will need to normalize the number of features to be consistent for all time.

      • Eric Hiller September 10, 2017 at 5:21 am #

        Is it possible to use (what in TensorFlow – land is called) SparseFeatures or SparseTensors to represent sparse datasets, or is there a fundamental issue with handling sparse datasets within RNNs?

        • Jason Brownlee September 11, 2017 at 12:04 pm #

          Good question, I’m not sure off the cuff. Keras may support sparse numpy arrays – try it and see?

  40. Ali Haidar September 8, 2017 at 1:56 am #

    Hi Jason,

    Thanks for the amazing articles. They are really helpful.

    Let’s say I want to forecast with lead 2. By that I mean forecasting values at time t using t-2 values, without using the t-1 elements. I have to remove columns from reframed after running the function series_to_supervised, right? That is, remove all columns with t-1 values?


  41. Inna September 11, 2017 at 7:53 pm #

    Thanks for articles.

    I have a question related with time series. Is it possible to forecast all variables? For example, I have ‘pollution’, ‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’ and want to predict all of them for the next hour. We know about trends and common rules (because of data amount: few years), so we can do forecasting. Where can I find more info about it?

    • Jason Brownlee September 13, 2017 at 12:22 pm #

      Yes, this example can be modified to predict each variable.

  42. appreciator September 12, 2017 at 10:59 am #

    Thank you Jason for the great tutorial! I’m adapting it for different data, and I’m trying to use >1 time step. However, I noticed something strange in series_to_supervised: since the first loop ends at 0 and the last loop starts at 0, won’t there be two columns that are the same?

  43. Eric September 12, 2017 at 11:49 am #

    Hi Jason,

    Thanks for the tutorial. I had just one question though.
    I’ve seen tutorials using multivariate time series to train on many datasets at the same time (all correlated with each other) that were able to predict for each dataset used.

    For the sake of argument, let’s say that one of the datasets is broken: the sensor that feeds it is out of service (say that, at some point, one of the columns of data only has 0 instead of the real value). Do you think that we could use the other spots to continue to predict the broken one? (There is correlation between them and there would be a lot of non-broken data from before the bug.)

    Best regards,

    • Jason Brownlee September 13, 2017 at 12:27 pm #

      Yes, you could try it and see. Or impute the missing data and see if that is better.

      • Eric September 14, 2017 at 2:22 pm #

        Thank you Jason,

        I shall try that as soon as possible. I guess that the overall accuracy will be lower for every set’s prediction (since my goal is to use a multivariate model, feed it every spot’s dataset and predict each of them, with the possibility of predicting a broken one), so one spot being fed “wrong” data should lower each spot’s accuracy, no?

        Best regards,

  44. Shan September 13, 2017 at 3:46 am #

    Is there any time parser like date parser? I am working with data which is in milliseconds.

    • Jason Brownlee September 13, 2017 at 12:33 pm #

      It can handle parsing dates and times I believe.

  45. kumar September 13, 2017 at 10:00 pm #

    I got this error when I tried to run the program:

    pyplot.plot(history.history['val_loss'], label='test')
    KeyError: 'val_loss'
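A likely cause: Keras only records `val_loss` in the training history when validation data is passed to `fit()` (for example via `validation_data`). The sketch below uses plain dictionaries standing in for `history.history`; `curves_to_plot` is a hypothetical helper showing how the plotting code could guard against the missing key.

```python
def curves_to_plot(history_dict):
    """Hypothetical helper: return (label, values) pairs for the loss curves that exist."""
    curves = [('train', history_dict['loss'])]
    # 'val_loss' is only present when validation data was supplied to fit()
    if 'val_loss' in history_dict:
        curves.append(('test', history_dict['val_loss']))
    return curves

# stand-ins for history.history with and without validation data
no_val = {'loss': [0.05, 0.03]}
with_val = {'loss': [0.05, 0.03], 'val_loss': [0.06, 0.04]}

print([label for label, _ in curves_to_plot(no_val)])
print([label for label, _ in curves_to_plot(with_val)])
```

If the test curve is wanted, the simpler fix is to keep the `validation_data=(test_X, test_y)` argument in the call to `fit()`.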

  46. Simon September 15, 2017 at 9:55 pm #

    Hi Jason,

    Wouldn’t it be better to scale the data after you run the series_to_supervised function? As it stands now, the inverse scaling doesn’t work if n_in > 1 since the dimensions don’t line up anymore.

    • Jason Brownlee September 16, 2017 at 8:41 am #

      It would, but the scaling would be column-wise and incorrect.

      • Simon September 17, 2017 at 11:26 am #

        Could you expand more on this and how the code might be modified to incorporate multi-step? I’m also playing around with turning this into a classification problem, would it still work if the feature we are trying to predict is a classifier?

        • Jason Brownlee September 18, 2017 at 5:42 am #

          I give the code to do this in another comment.

          For classification, you will need to change the number of neurons in the output layer, the activation function in the output layer and the loss function.

  47. Agrippa Sulla September 16, 2017 at 5:18 am #

    I have a little question. I’ve successfully built my own LSTM multivariate NN using your code as a basis (thanks!). It forecasts export growth for the UK using past export growth and GDP. It performs decently but the financial crisis kinda messes things up.

    Now I want to add data to this model, but I can’t go further back than 1980 for the time-series (not for now at least). So what I want to do is add the GDP growth rate of all the UK’s major trading partners. Should I be worried about adding another 20 input neurons (e.g. countries)? Do you have a post talking about the risks of using data that is low in rows (e.g. years) but high in columns (e.g. inputs)?

    I hope my question makes sense.


    • Jason Brownlee September 16, 2017 at 8:46 am #

      I don’t have posts on the topic of more columns than rows. It does require careful handling.

      As a start, I would recommend developing a strong test harness, then try adding data and see how it impacts the model skill. Experiment.

  48. Ed September 16, 2017 at 6:00 am #

    Thanks a lot for your tutorial!
    Is there a feature importance plot for cases like this?
    Sometimes it is very important to know.

    • Jason Brownlee September 16, 2017 at 8:47 am #

      Good question. I’m not sure about feature importance plots for LSTMs. I would expect that if feature importance can be calculated for MLPs, then it could be calculated for LSTMs, but this is not something I have looked into sorry.

  49. Kuldeep September 20, 2017 at 12:53 am #

    Hi Jason,

    Great post as always!

    I have a question regarding scaling. My problem is quite different as I have to apply series to supervised function first on the data coming from different source and then combine the data… my question is, can I apply scaling at the end? Should scaling be applied column wise or on complete matrix/array?

    • Jason Brownlee September 20, 2017 at 5:58 am #

      The key is being able to scale the data consistently. The place in the pipeline is less important.

  50. Nejra September 21, 2017 at 1:25 am #

    Hi Jason thank you very much for your tutorials!
    I’m trying to develop an LSTM for time prediction having as input 3 features (2 measurements and a third one is a sort of control of the system) and the output (value to predict) is not a single value but a vector of 6 values. So, at every time step my network should be able to predict this entire vector. Two questions:
    1. Since my inputs are not correlated between them, their order in the input array will not influence my predictions?
    2. How can I shape my output in order to estimate all the 6 values of the vector for each time step?
    Thanks for any kind of help!

  51. Mitchel Myers September 22, 2017 at 5:34 am #

    I replicated the example described on this page, and saved my test_y and yhat vectors to csv so that I could manually check how my prediction compared with the true values. However, when I did this, I discovered that every yhat value in my array is the exact same value (~34). I was expecting a unique yhat value for each input vector. Do you have any suggestions to help fix this?

  52. Mitchel Myers September 23, 2017 at 3:25 am #

    Follow up on this — when this error arose, I was using my own data set that I want to perform time series forecasting on. When I duplicated the guide exactly as described above, the issue goes away. Do you have any idea why this issue comes up (where every predicted yhat value is the exact same) when I use a different data set?

    • Jason Brownlee September 23, 2017 at 5:44 am #

      Perhaps the model needs to be tuned to your specific dataset?

  53. zwj September 25, 2017 at 1:10 pm #

    Hi Jason, thank you very much for your tutorials! I tried deleting the columns [‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’] from the train_X data, and I get almost the same test RMSE. It is 26.461. It seems to show that the 8 weather conditions have no effect on the prediction result. The code is below.

    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    # fit an LSTM network to training data
    def fit_lstm(train, test, batch_size, neurons):
        # split into input and outputs, keeping only the first column as input
        train_X, train_y = train[:, 0:1], train[:, -1]
        test_X, test_y = test[:, 0:1], test[:, -1]

        # reshape input to be 3D [samples, timesteps, features]
        train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

        # design network
        model = Sequential()
        model.add(LSTM(neurons, input_shape=(train_X.shape[1], train_X.shape[2])))
        model.add(Dense(1))  # single output value
        model.compile(loss='mae', optimizer='adam')

        # fit network
        history = model.fit(train_X, train_y, epochs=50, batch_size=batch_size, validation_data=(test_X, test_y), verbose=2, shuffle=False)
        # history = model.fit(train_X, train_y, epochs=50, batch_size=72, verbose=2, shuffle=False)

        return model

    # make a prediction
    def make_forecasts(model, test_X):
        # keep only the first column, matching the training input
        test_X = test_X[:, 0:1]
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        forecasts = model.predict(test_X)

        return forecasts

    • Jason Brownlee September 25, 2017 at 3:26 pm #

      Nice one!

      The real motivation for me writing this post was to help the 100s of people asking how to develop a multivariate LSTM.

      • Tommy November 13, 2017 at 4:07 am #

        This is more substantial than I think is being acknowledged. What is the point of creating a multivariate lstm if all of the other variables don’t have an impact on the outcome? Has this been attempted with other data sets?

        • Jason Brownlee November 13, 2017 at 10:19 am #

          It is an example for those who want to explore the approach.

          I don’t have more examples because it turns out the method is outperformed by MLPs for autoregression problems. At least in my experience.

  54. Mitchel September 27, 2017 at 1:39 am #

    Can you explain why the train_X and test_X data sets are reshaped to this?

    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
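On the reshape: Keras LSTM layers expect 3D input of shape [samples, timesteps, features], while the framed data is 2D [samples, features]. The reshape inserts a timesteps axis of length 1 without changing any values. A minimal NumPy sketch with toy sizes (6 samples, 8 features; the numbers are illustrative only):

```python
import numpy as np

n_samples, n_features = 6, 8
train_X = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# 2D [samples, features] -> 3D [samples, timesteps=1, features]
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

print(train_X.shape, train_X_3d.shape)
# the single time step of every sample holds exactly the original row
assert np.array_equal(train_X_3d[:, 0, :], train_X)
```

With this shape, each sample is a sequence of one time step carrying all eight input features.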

  55. Lino September 28, 2017 at 12:59 pm #

    Hi Jason

    Great post.
    Suppose I want to predict the next 24h using the previous year of data. How can we do it?

  56. Nels September 29, 2017 at 5:56 am #

    I think I’m missing something fundamental in my understanding of LSTM/s and BPTT. I’ve read through many of your posts and have come to understand RNN’s and LSTM in particular much better because of them, so thank you for that!

    My question that I hope you can shed some light on is what is the difference between passing the past information, i.e. var(t-n)…var(t-1) in the input vector for a single sample, and passing multiple sequences, of length n as a single sample?

    To help clarify, using timesteps of length N, I have a configuration that looks like this:

    Input to LSTM is [samples, timesteps, features].
    Each sample/observation consists of a vector of timestamps (of size N+1) where each of these vector’s values corresponds to the input feature’s values I.e.

    Observations for each time t, with features f and r
    time t
    [ f(t-N) r(t-N) ]
    [ f(t-N+1) r(t-N+1) ]
    [ f(t-N+2) r(t-N+2) ]
    . .
    . .
    . .
    [ f(t) r(t) ]
    And for each observation/sequence the target is Y(t).

    Or, as many of your examples do, you can include the past information in the form of a windowed input, with a single time step, so something like:

    Input is [samples, 1, features]. So for every observation, we include previous time values as features

    Observations for each time t, with features f and r
    time t
    [ f(t-N), r(t-N), f(t-N+1), r(t-N+1), f(t-N+2), r(t-N+2), f(t), r(t) ]
    And again, for each observation, the target is Y(t).

    I understand that having sequences longer than 1 allows BPTT to work over the length of those sequences, but I don’t think I really understand the difference in these two methods.

    I have tried the two options described, and I find that the latter performs better based on preliminary tests. I can use a window size of 3 and a sequence length of 1 and get good results, but if I use the first approach and a window size of 12, the model actually fails to learn within the same amount of time.

    Hence, I wonder if I don’t have a fundamental misconception. If you have some time, I would like to hear your explanation on this difference and how the LSTM responds in terms of “memory” based on these two different types of input setup. (I have read a lot of articles, blogs, git hub issues, and stack overflow posts trying to wrap my head around this, but I haven’t found anything that address this directly.)


  57. Paul September 29, 2017 at 12:28 pm #

    With this line…

    # drop columns we don’t want to predict
    reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

    I don’t understand the numbers used here. The data doesn’t even have that many columns, does it? There are 8 feature columns and 1 index column.

    I’m adapting this code for my own use and have very different features but I’m not sure I’m getting that line adapted right.

    Thanks for the great post!

    • Paul September 29, 2017 at 1:29 pm #

      Nevermind! I figured it out.

    • Jason Brownlee September 30, 2017 at 7:33 am #

      It does have that many columns after we reshape it to be a supervised learning problem.
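To see where indices 9 to 15 come from, the column layout after framing can be listed explicitly. This sketch assumes 8 features, one lag step, and column names of the form `var1(t-1)` and `var1(t)`, with var1 as the variable to predict; adapt `n_features` and `target` for other data.

```python
n_features = 8

# assumed layout after framing: 8 lagged inputs followed by 8 current-time columns
names = ['var%d(t-1)' % (j + 1) for j in range(n_features)]
names += ['var%d(t)' % (j + 1) for j in range(n_features)]

target = 'var1(t)'  # the variable to predict
# drop every current-time column except the target
drop_idx = [i for i, name in enumerate(names) if name.endswith('(t)') and name != target]

print(len(names), drop_idx)
```

Computing `drop_idx` from the names rather than hard-coding it makes it easier to adapt the drop step to a different number of features.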

  58. Wenhan Wang September 30, 2017 at 2:05 pm #

    This is awesome!
    Helping me a lot in my real work!

  59. Vilmara Sanchez October 4, 2017 at 3:54 pm #

    Hi Dr. Jason, I am working on a project for sleep stage classification where the number of timesteps (observations) in the input series (ECG signal) is different than the number of timesteps in the output series (sleep stage scores).

    The issue here is that the input and output time series are not equal in terms of timesteps as the examples you have shown in your problems.

    I have tried to frame the problem in different ways without getting results that make sense. Could you please provide guidance on how to approach this problem?



  60. Devakar Verma October 6, 2017 at 6:06 pm #

    Hi Jason,
    If we want to predict multiple features as output while having multiple features as input, how can we solve this problem? For example, the input variables are temperature and humidity and we want to predict both temperature and humidity. Can we solve this with a single LSTM model?

    Thanks for your anticipated response.

    • Jason Brownlee October 7, 2017 at 5:50 am #

      Yes you can. Change the multivariate input model to output more than one value in the output layer.

  61. Brent October 7, 2017 at 5:55 am #

    Hi Jason,

    Thank you for taking the time to write such an excellent post and follow up with questions. The mechanics of the data conversion & training work great.

    However, my first reaction is that the LSTM doesn’t seem to have learned anything more than to copy the previous value. As BECKER states:

    > it looks like the predicted value is always 1 time period after the actual?

    These are the same results as in your Shampoo example: the predicted value appears to be equal to the previous value (possibly with some constant offset).

    Have you found a different network architecture that performs better than a DNN without LSTM layers?

  62. sathvik October 9, 2017 at 1:34 pm #

    Thank you so much Jason for the wonderful article, learnt a lot… I wanted to show a comparison between multivariate statistical methods and neural networks, and I was looking for a post/article on multivariate time series modeling using ARIMA. I would be glad to hear of anything you know on this.

    Thank you

    • Jason Brownlee October 9, 2017 at 4:46 pm #

      You will need to look into using SARIMAX, sorry I do not have an example at this stage.

  63. Shan October 12, 2017 at 4:34 am #

    Hi Jason, is there any library available to perform feature extraction/dimensionality reduction for a sequential LSTM model?

    • Jason Brownlee October 12, 2017 at 5:37 am #

      Often an embedding layer is used to project observations at each time step prior to feeding them into the LSTM.

  64. Terry October 12, 2017 at 6:15 pm #

    How does multivariate LSTM compare to Multivariate ARIMAX? Are there use cases where one model outperforms the other?

    • Jason Brownlee October 13, 2017 at 5:45 am #

      I would recommend using a linear model first and only moving to a neural net if it delivers better results on your specific problem.

  65. Hesam October 13, 2017 at 4:27 am #


    There are some problems with scaling back when we use more than one shift in time, I mean something like this:

    reframed = series_to_supervised(scaled, 6, 1)

    I can train and test the model, but some errors appear in the scaling back section which I couldn’t fix.

    Please have a look. I really appreciate it.
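When the framed input has more columns than the scaler was fitted on, one option is to invert the scaling for the target column alone rather than padding the forecast back to the scaler's width. This is a different technique from the concatenate step in the tutorial. The sketch below assumes min-max scaling to the range (0, 1) and does the arithmetic by hand in NumPy; `invert_target` is a hypothetical helper. With scikit-learn's MinMaxScaler, the per-column minimum and maximum should be available as `data_min_` and `data_max_`.

```python
import numpy as np

def invert_target(scaled_values, col_min, col_max):
    """Hypothetical helper: undo (0, 1) min-max scaling for a single column."""
    return np.asarray(scaled_values, dtype=float) * (col_max - col_min) + col_min

# toy data: column 0 is the target
raw = np.array([[10.0, 1.0], [20.0, 3.0], [40.0, 5.0]])
col_min, col_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

yhat = scaled[:, 0]  # stand-in for model predictions in scaled units
restored = invert_target(yhat, col_min[0], col_max[0])
print(restored)
```

Because only the target's own minimum and maximum are used, the number of lag steps in the input no longer matters for this step.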

  66. Anil Maddala October 13, 2017 at 9:59 am #

    Hi Jason, thanks for the great series of articles. How should I modify the LSTM code to change it from prediction to classification?

    One sample input is 60 time steps over 2 features and I want to classify the 60-step input sequence into 3 classes. To start with, is LSTM the right approach?

    Hoping that you would take requests, I would definitely love to see an article on multivariate classification in Keras using LSTM/GRU; it would be really helpful for analyzing sensor data. You could look at the Human Activity Recognition dataset.

    • Jason Brownlee October 13, 2017 at 2:55 pm #

      Change the loss function and the activation function of the output layer to categorical_crossentropy and softmax respectively.

  67. heeun October 13, 2017 at 6:31 pm #

    Hi Jason, thanks for your nice article.

    I have a question!

    That algorithm is many-to-one, right?

    How can I solve many-to-many? For example, I want to predict pollution and rain.

    • Jason Brownlee October 14, 2017 at 5:42 am #

      It is many-to-one in terms of features.

      You can change it to be many-to-many by outputting multiple features.

  68. Pau October 14, 2017 at 1:13 pm #

    3 Things:
    1) Thanks so much for this. I’ve used this as a basis for some code I’m writing and it gave me a great head start.
    2) One thing that would be great to help with understanding the meanings of variables you’re using is to first put them into variables rather than using the integers. For example,

    x_size = 1
    train_X, train_y = train[:, :-x_size], train[:, -x_size:]
    test_X, test_y = test[:, :-x_size], test[:, -x_size:]

    This way, as people are reading the code they understand why it’s “-1” in case their adapted usage has different dimensions, they can change one variable and have it used everywhere it’s needed.

    3) For instance, I’m trying to make this code output multiple predictions and am having a bit of trouble figuring out all the variables I need to change.

    I have 368 columns of data, the first 168 are what will be predicted based on the other 200 points.

    x_size = 200
    # split into input and outputs
    train_X, train_y = train[:, :-x_size], train[:, -x_size:]
    test_X, test_y = test[:, :-x_size], test[:, -x_size:]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    I get the error:
    ValueError: Error when checking target: expected dense_1 to have shape (None, 1) but got array with shape (659, 200)

    Should the Dense(1) be Dense(x_size) where for me that is 200? (this is why it would be great to use variables so I know what that 1 means). When I try it as 168 (which is what it seems like it should be), I get an error.

    When I switch to x_size, it actually runs without errors, but I’m not sure if that means I’m correct or not.

    I’m so confused.
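A note on the slicing above: with 368 columns and x_size = 200, `train[:, :-x_size]` selects the first 168 columns as input and `train[:, -x_size:]` selects the last 200 as the target, which is the reverse of the stated goal. That would explain why an output layer of 200 units runs and one of 168 units raises an error. A minimal NumPy sketch of the intended split, assuming the first 168 columns are the targets and the remaining 200 are inputs (the array is a zero-filled stand-in for the real data):

```python
import numpy as np

n_targets, n_inputs = 168, 200
data = np.zeros((659, n_targets + n_inputs))  # stand-in for the real dataset

# first 168 columns are what we predict, remaining 200 are the inputs
y = data[:, :n_targets]
X = data[:, n_targets:]

# reshape input to be 3D [samples, timesteps, features]
X = X.reshape((X.shape[0], 1, X.shape[1]))
print(X.shape, y.shape)
```

With this split, the output layer would need one unit per target column, i.e. `n_targets` units.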


    • Jason Brownlee October 15, 2017 at 5:18 am #

      I have an example of multiple timestep outputs here that you could use as a starting point:

      • Paul October 16, 2017 at 4:35 pm #

        Rather than trying to predict many timestep outputs, I’m looking to output multiple predicted values per timestep.

        One thing I don’t understand is this section:

        # invert scaling for forecast
        inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
        inv_yhat = scaler.inverse_transform(inv_yhat)
        inv_yhat = inv_yhat[:,0]

        Why is it inserting the yhat values as the *first* column? The scaler has a different scale per column so positioning is important, and the Y data had been the last column in the row, hadn’t it? So won’t it get scaled incorrectly?

        • Jason Brownlee October 17, 2017 at 5:38 am #

          The first column is the pollution value, we remove it from the test data, concat our prediction so we have enough columns for the transform’s expectations, then invert the transform and get the predicted pollution values in the correct scale.

          Does that help?
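The padding step described in this reply can be illustrated with plain NumPy. The sketch assumes min-max scaling to (0, 1) fitted on three columns with the target first, and does the inverse transform by hand instead of calling the scaler; the data values are made up for illustration.

```python
import numpy as np

# toy data: column 0 is the target (e.g. pollution)
raw = np.array([[30.0, -5.0, 1010.0],
                [60.0,  0.0, 1020.0],
                [90.0,  5.0, 1030.0]])
col_min, col_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

test_X = scaled                              # scaled inputs, target in column 0
yhat = np.array([[0.25], [0.5], [0.75]])     # stand-in predictions in scaled units

# put yhat in column 0 so each column lines up with its own min and max
padded = np.concatenate((yhat, test_X[:, 1:]), axis=1)
restored = padded * (col_max - col_min) + col_min
inv_yhat = restored[:, 0]
print(inv_yhat)
```

The prediction goes in the first column because that is the position the target occupied when the scaler was fitted; the other columns are only there to give the array the expected width.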

  69. Rui October 14, 2017 at 9:35 pm #

    First of all ,thanks a lot for the great tutorial Jason.

    I just have one question regarding the achieved predictions using the LSTM network.

    I just don’t understand why are you making “trainPredict = model.predict(trainX)” .

    I get the predict method using the testset testX, but using this method for trainX is not like if you were in some way cheating? I say this because we train the network using the trainX and trainY and trainY corresponds to the labels you are trying to predict in the predict method using trainX.

    Is it performed for validation purposes only?

    I’m still learning to work with the Keras API so I might be confused with the syntax of it

    Many thanks

  70. Kai Li October 17, 2017 at 1:05 pm #

    Thanks a lot for your tutorial!
    I still have a question and look forward to your answer.
    If I want to use feature(t), feature(t-1) and pollution(t-1) to predict pollution(t), how should I reshape my input?

  71. DC October 17, 2017 at 8:21 pm #

    Hi Jason, Thank you very much for the wonderful post. I have a few questions.

    1. You did not de-trend by using diff for above example. Diff from multi step only works for series. Can you please share how can we de-trend of multivariate time series?

    2. I’d like to use past 3 days of above data to predict 3 time steps for multivariate data as above. Can you please let me know how I can do that with the example above?

    Thanks for your help.
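On the first question, differencing can be applied to every column at once by subtracting along the time axis. A minimal NumPy sketch, assuming a 2D array of shape [time steps, features]; `difference` and `invert_difference` are hypothetical helpers, and the inverse shown here only covers a lag of 1.

```python
import numpy as np

def difference(values, interval=1):
    """Hypothetical helper: difference every column at the given lag."""
    values = np.asarray(values, dtype=float)
    return values[interval:] - values[:-interval]

def invert_difference(first_row, diffs):
    """Hypothetical helper: rebuild the series from its first row and lag-1 differences."""
    first_row = np.asarray(first_row, dtype=float).reshape(1, -1)
    return np.vstack([first_row, first_row + np.cumsum(diffs, axis=0)])

data = np.array([[1.0, 10.0], [3.0, 9.0], [6.0, 7.0]])
diffs = difference(data)
restored = invert_difference(data[0], diffs)
print(diffs)
print(restored)
```

Forecasts made on the differenced scale would need the same inversion, starting from the last known observation, before computing an error in the original units.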

  72. Xie October 19, 2017 at 12:30 am #

    Hi, Jason. First of all, many thanks for your post. I have some problems.

    1. I don’t really get the meaning of hidden_units. Can you please explain a little bit?
    2. I am building an LSTM network as you do. I just followed your approach and built the network but got an error, as described here. Could you please help me?


    • Jason Brownlee October 19, 2017 at 5:37 am #

      A hidden unit is a neuron or cell in a hidden layer.

      A hidden layer is a layer that is not the output or the input layer.

      Change your code to set “return_sequences” to be “False”.

  73. Argie October 19, 2017 at 3:16 am #

    So in your example you are using the data this way:

    No,year, month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
    1, 2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0

    Is it possible to use the data in a way that, let’s say, we could have multiple input numbers in one of the columns? For example, having
    No, year, month, day, hour, pm2.5, newVariable
    and in the newVariable position, instead of having just one integer like 20,
    having a sequence of integers like (5,10,3,50,23).

    Would that be possible in the same context, or is there any scenario in which we could
    use the data the way I mentioned?

    • Jason Brownlee October 19, 2017 at 5:40 am #

      If you mean, can you predict a sequence output, then yes. Here is an example:

      • Argie October 19, 2017 at 7:31 am #

        I might have not been clear enough, and sorry for that.

        What I mean is that as an input I will have 4 different categories of data lets call them A, B, C, and D, that each one of them will have more than one integer, to be exact they will have 10 integers
        so for example:

        A = {3,4,6,8,34,65,43,1,54} and so on with the other three categories.

        The sequence of numbers within the four categories belong on different time stamps, for example 3 -> t0 , 4-> t1 and so on.

        So what I need is to classify them for different data samples.

        • Jason Brownlee October 19, 2017 at 3:55 pm #

          These would be parallel series (columns) that could be all fed to one LSTM model like the example in the above tutorial.

          The model will process the parallel series one time step at a time.

          If the series extends beyond 200-400 time steps, then they could be split into multiple samples (e.g. multiple sub-parallel series).

          Does that help?

          • Argie October 20, 2017 at 11:31 am #

            So so helpful, I tried it and worked like a charm.

            Great job, and so helpful all the material you provide, and the way you do it !!

            Thanks a lot Jason !!

          • Jason Brownlee October 21, 2017 at 5:23 am #

            I’m glad to hear that, well done!

  74. Tim October 19, 2017 at 4:59 am #

    Really appreciate all the work you have done!

  75. Abhinav October 19, 2017 at 6:36 am #

    Hi Dr Brownlee. Thank you for this tutorial.

    inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)

    inv_yhat = scaler.inverse_transform(inv_yhat)

    What do these steps do?

    Because I am getting a ValueError: operands could not be broadcast together with shapes (1822,11) (6,) (1822,11) on this step.
    I am applying on my own dataset

    • Jason Brownlee October 19, 2017 at 3:52 pm #

      These steps add the prediction to the test input data so that we can inverse the transform and get the prediction back into the scale we care about.

  76. TvT October 19, 2017 at 8:08 pm #

    Hi Jason,

    Thanks for sharing your awesome work, I’ve been learning a lot from you!

    I have been struggling with increasing the second dimension to fully benefit from the BPTT though. I keep getting lost in the shapes. Would you mind sharing your code for multiple time steps as well?
    That would be awesome!

    Keep up the good work!

  77. Dirk October 20, 2017 at 7:42 pm #

    Awesome work, thanks for sharing it!

    Could it be possible that you switched up the chronological order of your predictions?
    It looks to me that you predict the pollution of the previous hour, instead of predicting the future.

    • Jason Brownlee October 21, 2017 at 5:33 am #

      That is what a persistence model looks like exactly.

  78. Craig October 21, 2017 at 3:22 am #

    Hi Jason, I’m new to Deep Learning, so sorry if this is a fundamental question. I am trying to use an LSTM NN to create a super fast surrogate for a coastal circulation model (something sort of similar to this, but with time dependency:

    My training set looks something like this:

    -samples: 2000 – (I modeled a year with hourly output)
    -timesteps: 7 – (t-6, t-5, …, t)
    -features: 4 – (offshore boundary tide, 1st derivative of offshore boundary tide, boundary river discharge for river-1, and boundary river discharge for river-2)

    Currently, my target is velocity magnitude for one node in my model domain ([2000,1]).

    My question is: When you do this tutorial, you assign the time steps as additional features (i.e. for my problem, our train_X = [2000,1,28]). I did this and it works fine, but eventually I’d like to scale this, and I thought I’d try to reshape my data to its intended shape for the model (i.e. [2000,7,4]). However, when I do this, my training time goes way up (it’s probably 3-4x slower).

    Does the model treat these two shapes differently? If not, why does it take so much longer to train with the latter shape?
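The two shapes are processed differently: with [samples, 1, 28] the LSTM takes a single recurrent step per sample, while with [samples, 7, 4] it steps through seven time steps in sequence and backpropagates through them, so slower training is to be expected. When converting between the two layouts, the column order of the flattened data matters. The NumPy sketch below uses toy sizes and shows the reshape for both possible orderings; which one applies depends on how the lag columns were laid out.

```python
import numpy as np

n_samples, n_steps, n_features = 5, 7, 4
# reference sequences, shape [samples, timesteps, features]
seq = np.arange(n_samples * n_steps * n_features, dtype=float)
seq = seq.reshape(n_samples, n_steps, n_features)

# layout A: columns grouped by time step (all features at t-6, then t-5, ...)
flat_by_step = seq.reshape(n_samples, n_steps * n_features)
# layout B: columns grouped by feature (all lags of feature 1, then feature 2, ...)
flat_by_feature = seq.transpose(0, 2, 1).reshape(n_samples, n_features * n_steps)

# layout A can be reshaped directly
back_from_step = flat_by_step.reshape(n_samples, n_steps, n_features)
# layout B needs a reshape to [samples, features, timesteps] and then an axis swap
back_from_feature = flat_by_feature.reshape(n_samples, n_features, n_steps).transpose(0, 2, 1)

print(np.array_equal(back_from_step, seq), np.array_equal(back_from_feature, seq))
```

Reshaping layout B directly to [samples, 7, 4] would run without error but would mix features and time steps, which can look like a model that fails to learn.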

  79. Amir Aaron October 22, 2017 at 5:58 pm #

    Hi Jason,
    Great article.
    I have a small question:
    In previous article you pointed out that we need to make the data stationary,
    Do we need to do it for multivariate data as well?

  80. Andriy October 24, 2017 at 12:39 pm #

    Nice article! I think one question remains unanswered. Why use RNNs if we only use one previous step to predict the next step? Why not SVM for example?

    • Jason Brownlee October 24, 2017 at 4:00 pm #

      No reason at all; we cannot know what will work best for a given problem.

      Try it and compare the results!

  81. Ali Abdul October 25, 2017 at 7:39 pm #

    Hi Jason,

    Thanks for this very informative post! Before applying to my financial dataset, I would like to consult you about my case. The type of my data is almost the same. I have financial risk factors like equity values, interest rates, foreign exchanges etc. values on daily basis and their corresponding dependent variable which is profit or loss of a portfolio. My goal is to detect the patterns and features (if any) responsible for the highest profits or lowest losses. So my question is can I convert your code above to a classification problem if I label my classes as 0 for the lowest losses and 1 for the highest profits?

    Thanks in advance!

    • Jason Brownlee October 26, 2017 at 5:25 am #


      • Ali Abdul October 27, 2017 at 1:28 am #

        Great! One more small thing. When dealing with tails (let’s say 0 for lower, 1 for other than tail, 2 for upper tail), the classes and the features of course will be highly imbalanced. What would your approach be?

        • Jason Brownlee October 27, 2017 at 5:23 am #

          You might need to adjust the distribution via rescaling to make the least represented classes better represented.

  82. Mehmet Abd October 26, 2017 at 8:28 pm #

    Hi Jason,

    Thanks for this very informative post! Before applying to my financial dataset, I would like to consult you about my case. The type of my data is almost the same. I have financial risk factors like equity values, interest rates, foreign exchanges etc. values on daily basis and their corresponding dependent variable which is profit or loss of a portfolio. My goal is to detect the patterns and features (if any) responsible for the highest profits or lowest losses. So my question is can I convert your code above to a classification problem if I label my classes as 0 for the lowest losses and 1 for the highest profits?

    Thanks in advance!

  83. Hesam October 29, 2017 at 8:22 pm #


    What should we do if the time itself is the value that we must predict, such as predicting the time and date of the next rainfall?

    • Jason Brownlee October 30, 2017 at 5:37 am #

      You could predict the likelihood of rainfall for each hour and then use code (an if statement) to interpret those predictions and only output the predictions with a probability above a given threshold.
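The post-processing described in this reply could look like the sketch below. It assumes the model has already produced one rainfall probability per future hour; `first_rain_hour`, the probabilities and the 0.5 threshold are all hypothetical.

```python
def first_rain_hour(hourly_probabilities, threshold=0.5):
    """Hypothetical helper: return the first hour ahead (1-based) whose probability meets the threshold."""
    for hour, probability in enumerate(hourly_probabilities, start=1):
        if probability >= threshold:
            return hour
    return None  # no hour in the forecast window is likely enough

# made-up probabilities for the next five hours
probs = [0.05, 0.10, 0.40, 0.72, 0.65]
print(first_rain_hour(probs))
print(first_rain_hour(probs, threshold=0.9))
```

The returned offset can then be added to the timestamp of the last observation to obtain a date and time.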

  84. Thabet October 30, 2017 at 3:33 am #

    Hello Jason,

    Could you perhaps show me exactly where to change as to predict the temperature instead of pollution?

    • Jason Brownlee October 30, 2017 at 5:42 am #

      You can change the column used as the output variable when fitting the model.

      Around line 52 in the full example where we drop columns we don’t care about. Change it to drop the pollution as well and not drop temperature.

      • Thabet October 31, 2017 at 10:14 am #

        Can you please help me further, as I can’t manage to find where to change the code to predict the temperature instead of pollution?

        Next, we need to be more careful in specifying the column for input and output.
        We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

        # split into input and outputs
        n_obs = n_hours * n_features
        train_X, train_y = train[:, :n_obs], train[:, -n_features]
        test_X, test_y = test[:, :n_obs], test[:, -n_features]
        print(train_X.shape, len(train_X), train_y.shape)

        Where and how should I change this to choose the temperature column?
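In the snippet quoted above, `train[:, -n_features]` picks the first of the last eight columns, which is pollution at time t. To target temperature instead, the index has to move to the temperature column within that last block. The sketch below assumes the feature order pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain (as listed earlier in this thread) and uses a toy array in place of the framed data.

```python
import numpy as np

n_hours, n_features = 3, 8
# assumed column order of the original features
feature_names = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
target_idx = feature_names.index('temp')

n_obs = n_hours * n_features
# toy stand-in for the framed data: 3 * 8 lag columns + 8 current-time columns
train = np.arange(4 * (n_obs + n_features), dtype=float).reshape(4, n_obs + n_features)

train_X = train[:, :n_obs]
train_y = train[:, n_obs + target_idx]  # temperature at time t
print(train_X.shape, train_y.shape, n_obs + target_idx)
```

The inverse-scaling step would need the matching change: the forecast has to be placed in the temperature column, not the first column, before inverting the transform.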

  85. Allen November 1, 2017 at 7:03 pm #

    Hi Jason,

    Thanks for sharing your awesome work, I’ve been learning a lot from you!

    I have a small question:

    In previous article you pointed out that “Predict the pollution for the next hour as above and
    given the “expected” weather conditions for the next hour.” , eg “pollution,dew,temp”.

    What would your approach be?

    • Jason Brownlee November 2, 2017 at 5:11 am #

      For the case: “Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.”

      You would not need to transform the dataset; you would simply pretend that the actual weather conditions for the next hour are a forecast and predict the pollution value at that time.
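As a rough sketch of that framing: each input row combines the previous hour's pollution with the current hour's weather (standing in for a forecast), and the output is the current hour's pollution. The tiny array and its three-column layout (pollution, dew, temp) are assumptions for illustration.

```python
import numpy as np

# Illustrative data; assumed column order: pollution, dew, temp.
data = np.array([
    [129.0, -16.0, -4.0],
    [148.0, -15.0, -4.0],
    [159.0, -11.0, -5.0],
    [181.0,  -7.0, -5.0],
])

prev_pollution = data[:-1, 0:1]   # pollution at t-1
weather_now = data[1:, 1:]        # dew and temp at t, treated as the "forecast"
X = np.hstack((prev_pollution, weather_now))
y = data[1:, 0]                   # pollution at t

print(X)
print(y)
```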

  86. Ali November 2, 2017 at 3:42 am #

    First, thanks for the post; I learned a lot. I have a fundamental question about LSTMs. Let’s say I have 3 variables X, Y, and Z, and I want to predict Z.

    Suppose I make the input (train_X in the example above) time lagged, so I pass it x(t), x(t-1), x(t-2), x(t-3), etc. Will the time component of the LSTM then matter or not? For example, we have:

    t, x, y,  x-1, x-2, y-1, y-2, z-1, z-2, z
    1, 1, 2,  0,   0,   0,   0,   0,   0,   3
    2, 2, 4,  1,   0,   2,   0,   3,   0,   3
    3, 3, 6,  2,   1,   4,   2,   3,   3,   6
    4, 4, 8,  3,   2,   6,   4,   6,   3,   6
    5, 5, 10, 4,   3,   8,   6,   6,   6,   9

    Traditionally we would train on the variables (x, y, x-1, x-2, y-1, y-2, z-1, z-2) for the first 4 time steps and then evaluate on the 5th.

    My question is: if I train on time steps (1, 2, 4, 5) and evaluate on step 5, will I get the same result? Mainly, if I add the time lags as inputs, can I reshuffle the data?

  87. Ali November 2, 2017 at 4:40 am #

    Hi Jason,

    If we pass in the previous time lags as inputs, can we shuffle the data around in the model? In other words, can we make the input timeless?

    • Ali November 2, 2017 at 4:41 am #

      Sorry, when I refreshed, my question didn’t appear, so I thought it did not go through. I did not mean to impatiently spam. Apologies.

      • Jason Brownlee November 2, 2017 at 5:14 am #

        No problem, I moderate comments so there is some delay before they appear.

  88. Gus C November 3, 2017 at 3:41 am #

    Thanks for this great post.
    So how do you graphically compare your forecast with the actual values?

  89. Num November 3, 2017 at 4:44 am #

    Hello, I have a problem that’s highly related to this guide.

    I have a time series where the predicted variable is (allegedly) in part dependent on some features from that time step, and these features are known before it (they are “planned prices” and “expected value” for different features). I would like to include them as input into the LSTM.
    For one output, this turned out to be easy (just keep them in), but if I try to predict several outputs, I am having trouble formatting the input correctly.

    For better understanding, the desired input would be features x1 through x8 for t-1,t-2…etc and then x1 through x7 for t,t+1,t+2…etc.

    Is this even possible with the example given here?

  90. Geoffrey Anderson November 3, 2017 at 4:58 am #

    PM2.5 is clearly just one time series to predict. Predicting, say, 3 (or even 100,000) time series would be nice to look at too. A real-life example where this is useful is inventory management in retail businesses: how many units of eggs, mascara, paper plates, frozen corn, 2% milk, skim milk, etc. will be sold the next day? Many of these time series will be correlated, so multi-task neural network outputs might be needed. An LSTM would offer more automatic feature engineering than, say, a boosted tree, a traditional machine learning algorithm that is natively unaware of time series. The latter needs the data scientist to manually create features from time-windowed aggregates. By contrast, the LSTM takes the raw time series values directly and finds its own features. A bonus of using the LSTM is that it may find time-window or other features the human didn’t know about in advance. Another bonus is the multiple-output (multi-task) capability that neural networks naturally provide, unlike boosted trees, for example. I’d suggest starting with only 2 or 3 time series, because a whole grocery store’s worth of items, for even just one day, is far too cumbersome to look at and manipulate easily on one small monitor screen. Just a warning: this may be frontier research, believe it or not.

    • Jason Brownlee November 3, 2017 at 5:23 am #

      Thanks for the suggestion Geoffrey. I hope to spend more time on this soon.

  91. Lu November 6, 2017 at 8:35 pm #

    I plotted inv_yhat and inv_y in the same figure and found an interesting fact: the predicted result is shifted to the right by one hour compared with the ground truth. That is to say, the predicted result is almost the data from one hour ago, or X_t = X_{t-1} approximately.
    In effect, the best estimate for the RNN is to output the latest observation, without doing any real prediction. What do you think about this?
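A quick way to check this, as a sketch: score a persistence forecast (predict the previous hour's value) on the same test series and compare its RMSE with the model's. The arrays below are made-up stand-ins for inv_y and inv_yhat, and the rmse helper is a hypothetical name defined here.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

inv_y = np.array([30.0, 32.0, 35.0, 33.0, 40.0])     # made-up observations
inv_yhat = np.array([29.0, 31.0, 33.0, 34.0, 34.0])  # made-up model forecasts

# Persistence: the forecast for hour t is the observation at hour t-1.
persistence = inv_y[:-1]
baseline_rmse = rmse(inv_y[1:], persistence)

# Score the model on the same hours so the two numbers are comparable.
model_rmse = rmse(inv_y[1:], inv_yhat[1:])

print(baseline_rmse, model_rmse)
```

If the model's RMSE is not clearly below the persistence baseline, the model has probably learned little beyond copying the last value.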

  92. Rafael November 7, 2017 at 6:32 am #

    I’m using my own dataset and I’m not using the series_to_supervised method because I already have the dataset prepared in 2 files, train and test files. I still have the error:

    Traceback (most recent call last):
    File “”, line 64, in
    inv_yhat = scaler.inverse_transform(inv_yhat)
    File “C:\Users\rafae\AppData\Local\Programs\Python\Python35\lib\site-packages\sklearn\preprocessing\”, line 385, in inverse_transform
    X -= self.min_
    ValueError: operands could not be broadcast together with shapes (52,12585) (12586,) (52,12585)

    • Rafael November 7, 2017 at 6:34 am #

      To load the datasets

      #Train dataset
      dataset = read_csv('trainning_small.csv', header=None, index_col=None)
      dataset.drop(dataset.columns[[0]], axis=1, inplace=True)
      train = dataset.values

      encoder = LabelEncoder()
      train[:,-1] = encoder.fit_transform(train[:,-1])
      train = train.astype('float32')

      scaler = MinMaxScaler(feature_range=(0, 1))
      train = scaler.fit_transform(train)

      #Test dataset
      dataset_test = read_csv('test_passare.csv', header=None, index_col=None)
      dataset_test.drop(dataset_test.columns[[0]], axis=1, inplace=True)
      test = dataset_test.values

      encoder = LabelEncoder()
      test[:,-1] = encoder.fit_transform(test[:,-1])
      test = test.astype('float32')

      test = scaler.fit_transform(test)

      train_x, train_y = train[:, :-1], train[:, -1]
      test_x, test_y = test[:, :-1], test[:, -1]

      train_x = train_x.reshape((train_x.shape[0], 1, train_x.shape[1]))
      test_x = test_x.reshape((test_x.shape[0], 1, test_x.shape[1]))
      print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)

      (838, 1, 12585) (838,) (52, 1, 12585) (52,)
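The shapes in the error suggest the scaler was fit on 12586 columns (12585 inputs plus the label in the last column) while the array passed to inverse_transform has only 12585. One possible fix, sketched here on small dummy data, is to rebuild the full width with the prediction in the label's column before inverting. The sizes and the fake predictions are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
n_features = 4  # stands in for the 12585 input columns

# Dummy data: inputs in the first columns, target in the last column.
train = rng.rand(20, n_features + 1) * 100.0
scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)

test_x = train_scaled[:5, :-1]
yhat = train_scaled[:5, -1]  # pretend predictions (here simply the scaled truth)

# Rebuild the width the scaler was fit on: inputs first, prediction last.
padded = np.concatenate((test_x, yhat.reshape(-1, 1)), axis=1)
inv_yhat = scaler.inverse_transform(padded)[:, -1]

print(padded.shape, inv_yhat.shape)
```

An alternative under the same assumptions is to fit one scaler on the inputs and a separate one on the target, so the target can be inverted on its own.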

  93. Fred November 7, 2017 at 4:30 pm #

    Dr. Brownlee,

    First of all, thanks for this wonderful post. I have applied your code with the following parameters:
    lags=8, features=8, epochs=50, batch=104, neurons=150

    I got an almost perfect match between train and test. The test RMSE is 26.526.

    My question is: what does this result mean?

    • Jason Brownlee November 8, 2017 at 9:18 am #

      Well done. The result is a summary of the error between predicted and expected values.
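For reference, RMSE (root mean squared error) can be written as below, where y_i is the expected value, \hat{y}_i the predicted value, and n the number of test samples. The symbols are notation chosen here, not taken from the post.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared, averaged, and then square-rooted, the score is in the same units as the predicted variable. Assuming it was calculated after inverting the scaling, as in the tutorial, an RMSE of 26.526 means a typical error of roughly 26.5 pollution units.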

  94. Vlad November 12, 2017 at 5:37 am #

    I launched this example on my notebook (AMD FX-8800P Radeon R7, 8GB RAM). It has already been running for 4 hours, and I can’t even see what is going on with the model training or how long it will run. Is it possible to include some monitoring and visualization of the training process in the example, e.g. using callbacks.RemoteMonitor?

    P.S. Previously I worked with Matlab, and it was so nice to see the number of epochs, accuracy, error, and many other parameters during the training process. It helped a lot in deciding whether I should continue training or change the model.

    • Jason Brownlee November 12, 2017 at 9:08 am #

      You should see the progress for each epoch and across epochs as output on the command line.

  95. Vlad November 12, 2017 at 7:56 am #

    Hm, I relaunched the example step by step and found that it’s stuck not at training but at model compilation. It has been working for hours at 100% CPU load on this block:
    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.compile(loss='mae', optimizer='adam')
    What’s wrong?
    Ubuntu 16.4, Keras 2.0.6, Theano 0.9.0, Python 3.6.2, Anaconda custom

    • Jason Brownlee November 12, 2017 at 9:09 am #

      Are you running on the command line? If you run in a notebook, error or verbose messages may be hidden.

  96. Vlad November 12, 2017 at 9:57 am #

    I updated all libraries, Anaconda, and Python, and now it works! Sorry for the disturbance 🙂 BTW, a monitoring tool that can be used with callbacks.RemoteMonitor is hualos-master

  97. Tommy November 13, 2017 at 5:20 am #

    Thanks for the very well written article. I really appreciate the detailed walkthrough.

    I have been looking for a way to apply multivariate input to a machine learning prediction model of any sort. I’m doing this in order to predict the growth of compute systems in excess of hundreds of thousands of nodes, based on 6 years of daily samples. Simply looking at the Y growth over time and feeding that into something like Facebook Prophet has proved somewhat insufficient because it only looks at the problem as a function of past behavior.

    In reality there are more variables at play that control or affect that line of growth. As such, simple univariate approaches fall short and the predictions can be very good or very bad.

    When I found this article I thought to myself, Eureka! I will be able to use this approach in order to feed in multivariate data along with the growth of my systems in order to get better predictions. However I was somewhat crestfallen at the revelation of 2 key problems discussed over the last several months here in the comments…

    One problem you acknowledged as a potential/known issue and linked to another article explaining why autoregression time series problems may not be best solved with LSTM neural networks. The article posits that better results might be obtained by stacking or using more layers. Have you tried this? If so, what did it look like and what results did you get?

    The second and more concerning problem was when one commenter performed the same exercise as laid out in this article, but removed all of the multivariate data and still obtained the same RMSE as you did. It was as if none of the other variables had any bearing on the prediction. This is deeply concerning, because as I see it, either this event was anomalous and driven by the input data, or the overall approach itself may be flawed, or the implementation thereof is broken. I’m not sufficiently versed in the technology to make a value statement on any of those points.

    I’m hoping that you would be willing to share your thoughts on possible answers to these questions.

    • Jason Brownlee November 13, 2017 at 10:22 am #

      The tutorial is a demonstration of a method, not the best way of solving or even framing the presented problem.

      I should have made that clearer, but that is the philosophy behind every single blog post on my site. I show how to use the methods, not how to get the best results (for a specific problem). The former problem is tractable; the latter is not.

      • Tommy November 13, 2017 at 12:14 pm #

        Thanks for the clarity and candor! As a long-time comp-sci person, I find it very strange to run these TensorFlow sessions and get different results for the same inputs (I’ve been putting your code through its paces) … I found I needed to add this, or every subsequent run would result in predictions that seemed to augment each previous run:


        For what it’s worth, I zeroed out all the other variables (instead of eliminating them) and it /did/ have bearing on the output. I don’t think this methodology can be dismissed as ineffective. It seems to be approximating a workable solution. More exploration is necessary.

        Thank you for setting me on the path!

        • Jason Brownlee November 14, 2017 at 10:06 am #


          Well, these are stochastic algorithms in general, but a single trained model should be deterministic and when it’s not, we’re in trouble.

          • Tommy November 14, 2017 at 11:48 am #

            Have you tried running multiple iterations and examining yhat_inv?

            I keep getting different output, and I didn’t expect that. Am I looking in the wrong place?

            I can send a catalog of my results if that helps…

          • Jason Brownlee November 15, 2017 at 9:45 am #

            I have not.

            In general, we do expect different results across different runs given the stochastic nature of neural networks (forgive me if I am missing the point):

  98. sam November 15, 2017 at 10:23 pm #

    Hi Jason,

    Is multivariate time series forecasting possible for multi-step forecasts?

    • Jason Brownlee November 16, 2017 at 10:30 am #


      • sam November 16, 2017 at 6:23 pm #


        Jason, can you please explain how to prepare a dataset for training models? Let’s suppose I have 5 features and I want to predict the t + 5 value.

        For example..

        x1 = (2,3,4,3,1,6,8,9,4,1)
        x2 = (5,2,5,7,9,9,6,3,1,3)
        x3 = (2,3,4,8,1,6,8,9,1,1)
        x4 = (5,1,5,7,9,9,6,3,1,7)
        x5 = (2,3,4,6,8,3,1,3,5,7)
        y = (8,7,6,5,4,3,2,8,9,7)
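One possible way to frame this, as a sketch: it interprets “predict t + 5” as pairing the features at time t with y five steps later, which is an assumption (the question could also mean predicting every step from t+1 to t+5). The variable names horizon, y_future, and X_lstm are hypothetical.

```python
import numpy as np

# The five input series and the target from the example above.
x1 = [2, 3, 4, 3, 1, 6, 8, 9, 4, 1]
x2 = [5, 2, 5, 7, 9, 9, 6, 3, 1, 3]
x3 = [2, 3, 4, 8, 1, 6, 8, 9, 1, 1]
x4 = [5, 1, 5, 7, 9, 9, 6, 3, 1, 7]
x5 = [2, 3, 4, 6, 8, 3, 1, 3, 5, 7]
y = [8, 7, 6, 5, 4, 3, 2, 8, 9, 7]

horizon = 5  # assumed meaning: predict y at t+5 from the features at t

features = np.column_stack((x1, x2, x3, x4, x5)).astype("float32")  # shape (10, 5)
target = np.asarray(y, dtype="float32")

# Row t of X is paired with y at t+5, so the last 5 rows have no target and are dropped.
X = features[:-horizon]
y_future = target[horizon:]

# Keras LSTMs expect [samples, timesteps, features]; one time step per sample here.
X_lstm = X.reshape((X.shape[0], 1, X.shape[1]))

print(X_lstm.shape, y_future)
```

With only 10 observations this leaves 5 usable samples, so a real dataset would need to be much longer.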


  99. Tommy November 18, 2017 at 3:54 pm #

    What do you think about putting a dropout layer between the LSTM and Dense layers to address the overfitting phenomenon?
