Multivariate Time Series Forecasting with LSTMs in Keras

Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables.

This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In this tutorial, you will discover how you can develop an LSTM model for multivariate time series forecasting with the Keras deep learning library.

After completing this tutorial, you will know:

  • How to transform a raw dataset into something we can use for time series forecasting.
  • How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
  • How to make a forecast and rescale the result back into the original units.

Kick-start your project with my new book Deep Learning for Time Series Forecasting, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • Update Aug/2017: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.
  • Update Oct/2017: Added a new example showing how to train on multiple prior time steps due to popular demand.
  • Update Sep/2018: Updated link to dataset.
  • Update Jun/2020: Fixed missing imports for LSTM data prep example.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Air Pollution Forecasting
  2. Basic Data Preparation
  3. Multivariate LSTM Forecast Model
    1. LSTM Data Preparation
    2. Define and Fit Model
    3. Evaluate Model
    4. Complete Example
  4. Train On Multiple Lag Timesteps Example

Python Environment

This tutorial assumes you have a Python SciPy environment installed. I recommend that you use Python 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend, ideally Keras 2.3 and TensorFlow 2.2 or higher.

The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

If you need help with your environment, see this post:

1. Air Pollution Forecasting

In this tutorial, we are going to use the Air Quality dataset.

This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.

The data includes the date-time, the pollution level (the PM2.5 concentration), and weather information including dew point, temperature, pressure, wind direction, wind speed and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:

  1. No: row number
  2. year: year of data in this row
  3. month: month of data in this row
  4. day: day of data in this row
  5. hour: hour of data in this row
  6. pm2.5: PM2.5 concentration
  7. DEWP: Dew Point
  8. TEMP: Temperature
  9. PRES: Pressure
  10. cbwd: Combined wind direction
  11. Iws: Cumulated wind speed
  12. Is: Cumulated hours of snow
  13. Ir: Cumulated hours of rain

We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.

This dataset can be used to frame other forecasting problems.
Do you have good ideas? Let me know in the comments below.

You can download the dataset from the UCI Machine Learning Repository.

Update, I have mirrored the dataset here because UCI has become unreliable:

Download the dataset and place it in your current working directory with the filename “raw.csv”.

2. Basic Data Preparation

The data is not ready to use. We must prepare it first.

Below are the first few rows of the raw dataset.

The first step is to consolidate the date-time information into a single date-time so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.
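As a rough sketch of the steps just described (not the original listing), the function below does the same work on an already-loaded DataFrame. The helper name prepare_pollution is hypothetical, the new column names follow the header quoted in the comments below, and the raw file is assumed to have exactly the 13 columns listed earlier, in that order.

```python
import pandas as pd

# Clearer column names for the 8 remaining series (a naming convention only).
NEW_NAMES = ["pollution", "dew", "temp", "press", "wnd_dir", "wnd_spd", "snow", "rain"]


def prepare_pollution(raw: pd.DataFrame) -> pd.DataFrame:
    """Index by date-time, drop 'No', rename columns, fill NA, skip the first day."""
    df = raw.copy()
    # Combine the four date/time columns into a single date-time index.
    stamps = pd.to_datetime(df[["year", "month", "day", "hour"]])
    df.index = pd.DatetimeIndex(stamps, name="date")
    df = df.drop(columns=["No", "year", "month", "day", "hour"])
    df.columns = NEW_NAMES
    # Mark the scattered missing pollution readings with 0 for now.
    df["pollution"] = df["pollution"].fillna(0)
    # Drop the first 24 hours, which have no pollution readings.
    return df.iloc[24:]


# Typical use, assuming "raw.csv" is in the current working directory:
# dataset = prepare_pollution(pd.read_csv("raw.csv"))
# print(dataset.head(5))
# dataset.to_csv("pollution.csv")
```

Working on a DataFrame rather than a file path keeps the function easy to try on a few hand-made rows before pointing it at the full dataset.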

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “pollution.csv”.

Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.

The code below loads the new “pollution.csv” file and plots each series as a separate subplot, except wind direction, which is categorical.

Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

Line Plots of Air Pollution Time Series

3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.

This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:

  • Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
  • Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.

We can transform the dataset using the series_to_supervised() function developed in the blog post:
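That function is not reproduced in this post, so here is a minimal sketch of what such a framing helper could look like, written with pandas. The signature and the var1(t-1) style column names follow the way the function is called and described in this post and its comments; the body is an assumption rather than the original code.

```python
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a (rows x variables) array as lagged inputs followed by outputs."""
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    # Input sequence: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f"var{j + 1}(t-{i})" for j in range(n_vars)]
    # Forecast sequence: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        suffix = "(t)" if i == 0 else f"(t+{i})"
        names += [f"var{j + 1}{suffix}" for j in range(n_vars)]
    framed = pd.concat(cols, axis=1)
    framed.columns = names
    # Rows at the edges have no lagged (or future) values; drop them.
    if dropnan:
        framed = framed.dropna()
    return framed


# Toy example: 2 variables observed over 5 time steps.
values = [[10, 1], [20, 2], [30, 3], [40, 4], [50, 5]]
framed = series_to_supervised(values, n_in=1, n_out=1)
# Keep only the first variable at time t as the value to predict.
framed = framed.drop(columns=["var2(t)"])
print(framed)
```

On the pollution data the same idea applies with 8 variables: frame with one lag, then drop every time-t column except the pollution column.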

First, the “pollution.csv” dataset is loaded. The wind direction feature is label encoded (integer encoded). This could further be one-hot encoded in the future if you are interested in exploring it.

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.
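The encoding and normalization steps might look something like the sketch below. It uses scikit-learn's LabelEncoder and MinMaxScaler on a tiny made-up frame; the three columns and their values are placeholders for illustration, not rows from the real dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Toy stand-in for a few rows of "pollution.csv" (values are invented).
dataset = pd.DataFrame({
    "pollution": [100.0, 120.0, 150.0, 200.0],
    "temp": [-4.0, -4.0, -5.0, -6.0],
    "wnd_dir": ["SE", "SE", "NW", "cv"],
})

values = dataset.values.copy()
# Integer-encode the categorical wind direction (column 2 in this toy frame).
encoder = LabelEncoder()
values[:, 2] = encoder.fit_transform(values[:, 2])
# Make sure every column is numeric before scaling.
values = values.astype("float32")

# Rescale every feature to the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(scaled)
```

Keeping a reference to the fitted scaler matters, because the same object is needed later to convert forecasts back into the original units.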

The complete code listing is provided below.

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

  • One-hot encoding wind direction.
  • Making all series stationary with differencing and seasonal adjustment.
  • Providing more than 1 hour of input time steps.

This last point is perhaps the most important given the use of Backpropagation through time by LSTMs when learning sequence prediction problems.
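The first two ideas in the list above can be tried with a few lines of pandas. This is only a sketch on an invented hourly frame: the column names mirror “pollution.csv”, and the choice of a 24-step difference as the seasonal adjustment is an assumption about a daily cycle, not something established in this tutorial.

```python
import pandas as pd

# Toy hourly frame: three days with a repeating daily shape plus a slow drift.
index = pd.date_range("2010-01-02", periods=72, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({
    "pollution": [float(i % 24) + i // 24 for i in range(72)],
    "wnd_dir": ["SE", "NW", "cv"] * 24,
}, index=index)

# One-hot encode wind direction instead of integer-encoding it.
one_hot = pd.get_dummies(df["wnd_dir"], prefix="wnd_dir")
df = pd.concat([df.drop(columns=["wnd_dir"]), one_hot], axis=1)

# A first-order difference removes a trend; a 24-step difference removes
# a pattern that repeats every day in hourly data.
df["pollution_diff1"] = df["pollution"].diff(1)
df["pollution_diff24"] = df["pollution"].diff(24)
print(df.tail())
```

Note that differencing leaves missing values at the start of the series, and that forecasts made on differenced data have to be un-differenced before they can be compared with the raw observations.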

Define and Fit Model

In this section, we will fit an LSTM on the multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up the training of the model for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits the train and test sets into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, namely [samples, timesteps, features].
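A sketch of that split and reshape is shown here, using NumPy only. The random array is a stand-in for the framed dataset (8 input columns followed by 1 output column), and its row count is an assumption chosen to resemble five years of hourly data.

```python
import numpy as np

# Stand-in for the framed dataset: 8 input columns plus 1 output column.
rng = np.random.default_rng(0)
values = rng.random((43799, 9)).astype("float32")

# First year of hourly data for training, the rest for testing.
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]

# The last column is the output; everything before it is input.
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# Reshape inputs to [samples, timesteps, features] with a single time step.
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```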

Running this example prints the shape of the train and test input and output sets with about 9K hours of data for training and about 35K hours for testing.

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 72. Remember that the internal state of the LSTM in Keras is reset at the end of each batch, so an internal state that is a function of a number of days may be helpful (try testing this).

Finally, we keep track of both the training and test loss during training by setting the validation_data argument in the fit() function. At the end of the run both the training and test loss are plotted.
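The configuration described above could be written roughly as follows. The function names build_lstm and fit_lstm are hypothetical, and the choices of shuffle=False (to keep samples in time order) and verbose=2 are assumptions rather than details given in the text. The Keras imports sit inside the builder so that the sketch can be loaded without TensorFlow installed.

```python
def build_lstm(n_timesteps, n_features, n_units=50):
    """One LSTM hidden layer followed by a single output neuron."""
    # Imported here so this sketch can be loaded without TensorFlow installed.
    from tensorflow.keras.layers import LSTM, Dense, Input
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Input(shape=(n_timesteps, n_features)),
        LSTM(n_units),
        Dense(1),
    ])
    # Mean Absolute Error loss with the Adam optimizer.
    model.compile(loss="mae", optimizer="adam")
    return model


def fit_lstm(model, train_X, train_y, test_X, test_y):
    """Fit for 50 epochs with a batch size of 72, tracking test loss."""
    return model.fit(
        train_X, train_y,
        epochs=50,
        batch_size=72,
        validation_data=(test_X, test_y),
        verbose=2,
        shuffle=False,
    )


# Typical use, with inputs shaped [samples, 1, 8]:
# model = build_lstm(train_X.shape[1], train_X.shape[2])
# history = fit_lstm(model, train_X, train_y, test_X, test_y)
# history.history["loss"] and history.history["val_loss"] hold the two curves.
```

The "val_loss" entry only exists in the history when validation_data is supplied to fit(), which is why it is passed explicitly here.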

Evaluate Model

After the model is fit, we can forecast for the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert scaling on the test dataset with the expected pollution numbers.

With forecasts and actual values in their original scale, we can then calculate an error score for the model. In this case, we calculate the Root Mean Squared Error (RMSE) that gives error in the same units as the variable itself.
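To make the inversion concrete, here is a small sketch on toy numbers. The helper names invert_first_column and rmse are hypothetical, and the "forecast" is just the true value plus a fixed offset so the result is easy to check by hand. The key idea is that the scaler was fitted on full-width rows, so the pollution column has to be padded with the other feature columns before inverse_transform() can be called.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def invert_first_column(scaler, first_col, other_cols):
    """Rebuild full-width rows so the fitted scaler can be inverted,
    then return only the first (pollution) column in original units."""
    first_col = np.asarray(first_col).reshape(-1, 1)
    rows = np.concatenate((first_col, other_cols), axis=1)
    return scaler.inverse_transform(rows)[:, 0]


def rmse(actual, predicted):
    """Root Mean Squared Error between two 1D sequences."""
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))


# Toy data: column 0 plays the role of pollution, columns 1-2 of weather.
raw = np.array([[10.0, 1.0, 5.0], [20.0, 2.0, 6.0], [30.0, 3.0, 7.0], [50.0, 4.0, 8.0]])
scaler = MinMaxScaler().fit(raw)
scaled = scaler.transform(raw)

test_X = scaled[:-1, :]   # inputs at t-1, already reshaped back to 2D
test_y = scaled[1:, 0]    # scaled pollution at t
yhat = test_y + 0.05      # pretend forecast, slightly off in scaled units

inv_yhat = invert_first_column(scaler, yhat, test_X[:, 1:])
inv_y = invert_first_column(scaler, test_y, test_X[:, 1:])
print("Test RMSE: %.3f" % rmse(inv_y, inv_yhat))
```

Note that the actual values come from test_y (pollution at time t), not from the first column of test_X, which holds pollution at t-1.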

Complete Example

The complete example is listed below.

NOTE: This example assumes you have prepared the data correctly, e.g. converted the downloaded “raw.csv” to the prepared “pollution.csv”. See the first part of this tutorial.

Running the example first creates a plot showing the train and test loss during training.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

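Interestingly, we can see that test loss drops below training loss. This is the opposite of the usual sign of overfitting; the model may be underfitting the training data, or the test period may simply be easier to predict. Measuring and plotting RMSE during training may shed more light on this.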

Line Plot of Train and Test Loss from the Multivariate LSTM During Training

The train and test loss are printed at the end of each training epoch. At the end of the run, the final RMSE of the model on the test dataset is printed.

We can see that the model achieves a respectable RMSE of 26.496, which is lower than an RMSE of 30 found with a persistence model.
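For readers who have not met it, a persistence model simply forecasts each hour with the previous hour's observation. A minimal sketch of how its RMSE could be computed is below; the function name persistence_rmse is hypothetical and the four readings are toy numbers, not values from the dataset.

```python
import numpy as np


def persistence_rmse(series):
    """RMSE of forecasting each value with the previous observation."""
    series = np.asarray(series, dtype=float)
    predicted, actual = series[:-1], series[1:]
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Toy pollution readings in original units (not real data).
pollution = [100.0, 110.0, 105.0, 120.0]
print("Persistence RMSE: %.3f" % persistence_rmse(pollution))
```

Running the same function on the unscaled pollution column of the test period gives a baseline that any fitted model should beat.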

This model is not tuned. Can you do better?
Let me know your problem framing, model configuration, and RMSE in the comments below.

4. Train On Multiple Lag Timesteps Example

There have been many requests for advice on how to adapt the above example to train the model on multiple previous time steps.

I had tried this and a myriad of other configurations when writing the original post and decided not to include them because they did not lift model skill.

Nevertheless, I have included this example below as a reference template that you could adapt for your own problems.

The changes needed to train the model on multiple previous time steps are quite minimal, as follows:

First, you must frame the problem suitably when calling series_to_supervised(). We will use 3 hours of data as input. Also note, we no longer explicitly drop the columns from all of the other fields at ob(t).

Next, we need to be more careful in specifying the column for input and output.

We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

Next, we can reshape our input data correctly to reflect the time steps and features.
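The column arithmetic and the reshape can be checked on a dummy array before touching the real data. In the sketch below the random array stands in for the framed dataset (3 * 8 lagged columns followed by the 8 columns at time t), and the split point n_train is a hypothetical value for this toy array.

```python
import numpy as np

n_hours = 3
n_features = 8
n_obs = n_hours * n_features  # 24 input columns

# Stand-in for the framed array (random values, for shape checking only).
rng = np.random.default_rng(1)
values = rng.random((1000, n_obs + n_features))

n_train = 800  # hypothetical split point for this toy array
train, test = values[:n_train], values[n_train:]

# Inputs: every lagged column. Output: var1(t), the first column at time t.
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]

# Reshape to [samples, timesteps, features].
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```

Because the framed columns are ordered oldest lag first, the reshape places the observations from t-3 in the first time step and those from t-1 in the last.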

Fitting the model is the same.

The only other small change is in how to evaluate the model, specifically in how we reconstruct the rows with 8 columns suitable for reversing the scaling operation, so that we get y and yhat back into the original scale and can calculate the RMSE.

The gist of the change is that we concatenate the y or yhat column with the last 7 features of the test dataset in order to invert the scaling, as follows:
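A sketch of that reconstruction on stand-in arrays is given below. Everything here is random toy data (so the RMSE it prints is meaningless); the point is only the shapes: the 3D inputs are flattened back to 2D, and the last 7 columns are borrowed to pad y and yhat out to the 8 columns the scaler expects.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

n_hours, n_features = 3, 8
rng = np.random.default_rng(2)

# Toy stand-ins: a scaler fitted on 8 raw features, 3D test inputs,
# and scaled targets and forecasts.
raw = rng.random((200, n_features)) * 100
scaler = MinMaxScaler().fit(raw)
test_X = rng.random((50, n_hours, n_features))
test_y = rng.random(50)
yhat = rng.random((50, 1))

# Flatten the inputs back to 2D: [samples, n_hours * n_features].
test_X = test_X.reshape((test_X.shape[0], n_hours * n_features))
padding = test_X[:, -(n_features - 1):]  # last 7 feature columns

# Invert scaling for the forecast.
inv_yhat = np.concatenate((yhat, padding), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)[:, 0]

# Invert scaling for the actual values.
inv_y = np.concatenate((test_y.reshape(-1, 1), padding), axis=1)
inv_y = scaler.inverse_transform(inv_y)[:, 0]

rmse = np.sqrt(np.mean((inv_y - inv_yhat) ** 2))
print("Test RMSE: %.3f" % rmse)
```

Since MinMaxScaler rescales each column independently, the padded columns only need to have the right width; their values do not affect the inverted pollution column.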

We can tie all of these modifications to the above example together. The complete example of multivariate time series forecasting with multiple lag inputs is listed below:

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The model is fit as before in a minute or two.

A plot of train and test loss over the epochs is created.

Plot of Loss on the Train and Test Datasets

Finally, the Test RMSE is printed, not really showing any advantage in skill, at least on this problem.

I would add that the LSTM does not appear to be suitable for autoregression type problems and that you may be better off exploring an MLP with a large window.

I hope this example helps you with your own time series forecasting experiments.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


In this tutorial, you discovered how to fit an LSTM to a multivariate time series forecasting problem.

Specifically, you learned:

  • How to transform a raw dataset into something we can use for time series forecasting.
  • How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
  • How to make a forecast and rescale the result back into the original units.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

2,731 Responses to Multivariate Time Series Forecasting with LSTMs in Keras

  1. zorg August 14, 2017 at 7:08 pm

    except wind *dir*, which is categorical.

  2. Francois AKOA August 15, 2017 at 7:16 am

    Great post Jason. Thank you so much for making this material available for the community..

  3. yao August 15, 2017 at 2:02 pm

    Hi Jason. I ran into a problem in my environment, which uses Keras 2.0.4 and tensorflow-GPU 0.12.0rc0.

    The error was “TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.”

    It was raised at the line “model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))”.

    Could you please help me with that?



    • Jason Brownlee August 15, 2017 at 4:54 pm

      I would recommend this tutorial for setting up your environment:

      • yao August 16, 2017 at 7:18 pm

        Thx a lot, doctor, it works! fabulous! 🙂

        • Jason Brownlee August 17, 2017 at 6:40 am

          I’m glad to hear that.

          • Shirley Yang August 18, 2017 at 12:00 pm

            Dr.Jason, I update TensorFlow then it works!
            Sorry to bother you.
            Thank you very much !
            Best wishes !

          • Jason Brownlee August 18, 2017 at 4:40 pm

            I’m glad to hear that!

        • Shirley Yang August 17, 2017 at 8:54 pm

          I met the same problem .

          Did you uninstall all the programs previously installed or just set up the environment again?

          Thx a lot!

      • Shirley Yang August 18, 2017 at 11:43 am

        Hi Jason, I set up my environment as described in your tutorial.

        scipy: 0.19.0
        numpy: 1.12.1
        matplotlib: 2.0.2
        pandas: 0.20.1
        statsmodels: 0.8.0
        sklearn: 0.18.1

        tensorflow: 0.12.1
        Using TensorFlow backend.
        keras: 2.0.5

        But the bug still exists. Is the version of TensorFlow too old? What should I do?

        • Jason Brownlee August 18, 2017 at 4:39 pm

          It might be, I am running v1.2.1.

          Perhaps try running Keras off Theano instead (e.g. change the backend in the ~/.keras/keras.json config)

  4. Songbin Xu August 15, 2017 at 10:42 pm

    It seems that inv_y = scaler.inverse_transform(test_X)[:,0] is not the actual, should inv_yhat be compared with test_y but not pollution(t-1)? Because I think this inv_y here means pollution(t-1). Is this prediction equals to only making a time shifting from the current known pollution value (which means the models just take pollution(t) as the prediction of pollution(t+1))?

    • Jason Brownlee August 16, 2017 at 6:35 am

      Sorry, I’m not sure I follow. Can you please restate your question, perhaps with an example?

      • Songbin Xu August 16, 2017 at 7:36 pm

        Sorry for the confusing expression. In fact, the series_to_supervised() function would create a DataFrame whose columns are: [ var1(t-1), var2(t-1), …, var1(t) ] where ‘var1’ represents ‘pollution’. Therefore, the first column in test_X (that is, test_X[:,0]) would be ‘pollution(t-1)’. However, in the code you calculate the RMSE between inv_yhat and test_X[:,0]. Even though the RMSE is low, it only shows that the model’s prediction for t+1 is close to what it already knew at t.
        I am asking this question because I’ve run the code and saw that the model’s prediction for pollution(t+1) looks just like pollution(t). I’ve also tried using t-1, t-2 and so on for training, but it still changed nothing.
        Do you think the model tends to learn to just take the pollution value at current moment as the prediction for the next moment?

        thanks 🙂

        • Jason Brownlee August 17, 2017 at 6:42 am

          If we predict t for t+1 that is called persistence, and we show in the tutorial that the LSTM does a lot better than persistence.

          Perhaps I don’t understand your question? Can you give me an example of what you are asking?

          • Songbin Xu August 17, 2017 at 10:53 am

            Hmm, it’s difficult to explain without a graph.

            In a word, and also it’s an example, I want to ask two questions:

            1. In the “make a prediction” part of your codes, why it computes rmse between predicted t+1 and real t, but not between predicted t+1 and real t+1?

            2. After the “make a prediction” part of your codes run, it turns out that rmse between predicted t+1 and real t is small, is it an evidence that LSTM is making persistence?

          • Jason Brownlee August 17, 2017 at 4:52 pm

            RMSE is calculated for y and yhat for the same time periods (well, that was the intent), why do you think they are not?

            Is there a bug?

          • David Righart August 18, 2017 at 5:30 am

            I think Songbin Xu is right. By executing the statement at line 90: inv_y = inv_y[:,0], you compare the inv_yhat with inv_y. inv_y is the pollution(t-1) and inv_yhat is the predicted pollution(t).

            On line 50 the second parameter of the function series_to_supervised can be changed to 3 or 5, so more days of history are used. If you do so, an error occurs in the scaler.inverse_transform (line 89).

            No worries, great tutorial and I learned a lot so far!

          • Jason Brownlee August 18, 2017 at 6:54 am

            I see now, you guys are 100% correct. Thank you!

            I have updated the calculation of RMSE and the final score reported in the post.

            Note, I ran a ton of experiments on AWS with many different lag values > 1 and none achieved better results than a simple lag=1 model (e.g. an LSTM model with no BPTT). I see this as a bad sign for the use of LSTMs for autoregression problems.

          • Chen-Yeou Yu February 3, 2019 at 2:21 am

            Hi Dr. Jason,

            As for this:
            Updated Aug/2017: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.

            It seems to have some errors on calculating RMSE based on (t-1) vs (t) different time slots before. I’m just curious how it is corrected? Can you elaborate that little bit more? Because for me, I’m still thinking it is RMSE based on (t-1) vs (t)


          • Jason Brownlee February 3, 2019 at 6:20 am

            I have updated tutorials that I think have better code and are easier to follow, you can get started here:

          • SUNNY April 5, 2019 at 3:39 pm

            Hey Jason. The RMSE before you updated it was 3.386. Is the RMSE of 26.496 in this article the correct value after the update? In other words, inv_y = scaler.inverse_transform(test_X)[:,0] is not correct, and the following is the correct code, is that right?
            test_y = test_y.reshape((len(test_y), 1))
            inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
            inv_y = scaler.inverse_transform(inv_y)
            I find that many people use the incorrect code.

          • Jason Brownlee April 6, 2019 at 6:39 am

            I don’t recall.

            I recommend starting with a more recent tutorial using modern methods:

  5. Simone August 16, 2017 at 1:11 am

    Hi Jason, great post!

    Is it necessary to remove seasonality (by seasonal differencing) when we are using LSTM?

  6. Slavenya August 16, 2017 at 5:18 am

    Good article, thanks.

    Two questions:
    What changes will be required if your data is sporadic? Meaning sometimes it could be 5 hours without the report.

    And how do you add more timesteps into your model? Obviously you have to reshape it properly but you also have to calculate it properly.

    • Jason Brownlee August 16, 2017 at 6:41 am

      You could fill in the missing data by imputing or ignore the gaps using masking.

      What do you mean by “add more timesteps”?

      • Slavenya August 16, 2017 at 7:00 pm

        But what should I do if all data is stochastic time sequence?

        For example predicting time till the next event – when events frequency is stochastically distributed on the timeline.

  7. Jack Dan August 16, 2017 at 5:48 am


    Thank you for an awesome post.
    (I was practicing on load forecast using MLP and SVR (You also suggested on a comment in your other LSTM tutorials). I also tried with LSTM and it did almost perform like SVR. However, in LSTM, I did not consider time lags because I have predicted future predictor variables that I was feeding as test set. I will try this method with time lags to cross validate the models)

  8. Adam August 16, 2017 at 1:03 pm

    Hi Jason,

    Can I use ‘look back’ (using the t-2 and t-1 steps to predict air pollution at step t) in this case?
    If so, my input data shape will be [samples, look back, features], won’t it?

    • Jason Brownlee August 16, 2017 at 5:00 pm

      You can Adam, see the series_to_supervised() function and its usage in the tutorial.

      • Adam August 18, 2017 at 6:07 pm

        Hi Jason,

        If I use n_in=5 in the series_to_supervised() function, the input shape in your tutorial will be [samples, 1, features*5]. Can I reshape it to [samples, 5, features]? If so, what is the difference between these two shapes?

        • Jason Brownlee August 19, 2017 at 6:09 am

          The second dimension is time steps (e.g. BPTT) and the third dimension are the features (e.g. observations at each time step). You can use features as time steps, but it would not really make sense and I expect performance to be poor.

          Here’s how to build a model with multiple time steps for multiple features:

          And that’s it. I just tested and it looks good. The RMSE calculation will blow up, but you guys can fix that up I figure.

          • George Khoury August 19, 2017 at 11:55 pm

            Jason, great post, very clear, and very useful!! I’m about 90% with you and think a few folks may be stuck on this final point if they try to implement multi-feature, multi-hour-lookback LSTM.

            Seems like by making adjustments above, I’m able to make a prediction, but the scaling inversion doesn’t want to cooperate. The reshape step now that we have multiple features and multiple timesteps has a mismatch in the shape, and even if I make the shape work, the concatenation and inversion still don’t work. Could you share what else you changed in this section to make it work? I’m not so concerned about the RMSE as much as that I can extract useful predictions. Thank you for any insight since you’ve been able to do it successfully.

            # make a prediction
            yhat = model.predict(test_X)
            test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
            # invert scaling for forecast
            inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
            inv_yhat = scaler.inverse_transform(inv_yhat)
            inv_yhat = inv_yhat[:,0]

          • Lg September 2, 2017 at 12:40 am

            Hi Jason,

            Great and useful article.

            I am somewhat puzzled by the number of features you specify to forecast the pollution rate based on data from the previous 24 hours.

            Do not we have 8 features for each time-step and not 7?

            After generating data to supervise with the function series_to_supervised(scaled,24, 1), the resulting array has a shape of (43800, 200) which is 25 * 8.

            To invert the scaling for forecast I made few modifications. I used scaled.shape[1] below but in my opinion it could be n_features. Moreover, I don’t know if the values concatenated to yhat and test_y really matter, as long as they have been scaled with fit_transform and the array has the right shape.

            yhat = model.predict(test_X)
            test_X = test_X.reshape((test_X.shape[0], n_obs))

            # invert scaling for forecast
            inv_yhat = concatenate((yhat, test_X[:, 1:scaled.shape[1]]), axis=1)
            inv_yhat = scaler.inverse_transform(inv_yhat)
            inv_yhat = inv_yhat[:,0]

            # invert scaling for actual
            test_y = test_y.reshape((len(test_y), 1))
            inv_y = concatenate((test_y, test_X[:, 1:scaled.shape[1]]), axis=1)
            inv_y = scaler.inverse_transform(inv_y)
            inv_y = inv_y[:,0]

            The model has 4 layers with dropout.
            After 200 epochs I have got
            loss: 0.0169 – val_loss: 0.0162
            And a rmse = 29.173


          • Jason Brownlee September 2, 2017 at 6:13 am

            We have 7 features because we drop one in section “2. Basic Data Preparation”.

          • lg September 2, 2017 at 5:59 pm

            Hi Jason,

            It’s really weird to me :(, as I used your code to prepare the data (pollution.csv) and I have 9 fields in the resulting file.

            [date, pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain]


          • Jason Brownlee September 3, 2017 at 5:40 am

            Date and wind direction are dropped during data preparation, perhaps you accidentally skipped a step or are reviewing a different file from the output file?

          • Lg September 3, 2017 at 6:22 pm

            Hi Jason,

            So that’s fine, in my case I have 8 features.

            When reading the file, the field ‘date’ becomes the index of the dataframe and the field ‘wnd_dir’ is later label encoded, as you do above in “The complete example” lines 42-43.

            It is now much clearer for me. I am not puzzled anymore. 😉

            Thanks a lot for all the information contained in your articles and your e-books.

            They are really very informative.


          • Jason Brownlee September 4, 2017 at 4:26 am

            I’m glad to hear that!

          • Cloud September 20, 2017 at 8:06 pm

            Hi Jason,
            I think the output is column var1(t), that means:
            train_X, train_y = train[:, 0:n_obs], train[:, -(n_features+1)]
            am I right?
            In case the “pollution” is in the last column, it is easy to get train[:, -1]
            am i right?
            I just want to verify that I understand your post.
            Thank you, Jason

          • Hesam October 11, 2017 at 9:39 pm

            I have some confusion for this problem.

            I want to use a bigger windows (I want to go back in time more, for example t-5 to include more data to make a prediction of the time t) and use all of this to predict one variable (such as just the pollution), like you did. I think predicting one variable will be more accurate than predicting many. Such as pollution and temperature.

            What should I do to apply more shift?

          • Jason Brownlee October 12, 2017 at 5:29 am

            I show in another comment how to update the example to use lag obs as input.

            I will update the post and add an example to make it clearer.

          • Kentor October 19, 2017 at 10:01 pm

            First of all, thanks for your work and the effort you put in!

            I tried to implement your suggestion for increasing the timesteps (BPTT). I have integrated your code but I keep getting this error when reshaping test_X in the prediction step:

            test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
            ValueError: cannot reshape array of size 490532 into shape (35038,7)

            Do you have any tips on how to proceed?

          • Jason Brownlee October 20, 2017 at 5:34 am

            I will update the post with a worked example. Adding to trello now…

          • Robert Dan November 23, 2017 at 10:29 pm

            Hi Jason.
            In the code you wrote above, should the following code:

            train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))

            be actually

            train_X = train_X.reshape((train_X.shape[0]/n_hours, n_hours, n_features))

          • Jason Brownlee November 24, 2017 at 9:44 am

            Why is that?

          • vivi March 7, 2020 at 2:10 pm

            Hi Jason. I am a new learner. First, thank you for sharing! But when I run the complete code, it raises an error: pyplot.plot(history.history[‘val_loss’], label=’test’)
            KeyError: ‘val_loss’

            How can I solve it?

          • Jason Brownlee March 8, 2020 at 6:03 am

            Perhaps you did not use a validation dataset when fitting the model. In that case you cannot plot validation loss.

          • Anjana Rajakumar August 27, 2020 at 12:48 am

            Hi Jason,
            Thank you for this excellent tutorial. I recently started working on LSTM methods. I have a doubt regarding this input shape. If n_hours > 1, how do I inverse transform the scaled values? Thanks in advance.

  9. Arun August 18, 2017 at 12:45 am

    Hi Jason, I get the following error from line # 82 of your ‘Complete Example’ code.

    ValueError: Error when checking : expected lstm_1_input to have 3 dimensions, but got array with shape (34895, 8)

    I think LSTM() is looking for (sequences, timesteps, dimensions). In your code, line # 70, I believe 50 is timesteps while input_shape (1,8) represents the dimensions. Maybe it’s missing ‘sequences’?

    Appreciate your response.

    • Avatar
      Jason Brownlee August 18, 2017 at 6:25 am #

      Ensure that you first prepare the data (e.g. convert “raw.csv” to “pollution.csv”).

    • Avatar
      Sameer January 31, 2018 at 11:53 pm #

      I have the same error too. Cannot figure out what’s wrong

      • Avatar
        Timmy January 25, 2019 at 2:18 am #

        Something changed, the problem is on the model evaluation section, specifically the reshape line

        test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

        as it is, is 2 dimensions (34895, 8)

        we need to add one dimension but I can’t figure out how (noob here)

        tried this: test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

        but didn’t work (IndexError: tuple index out of range)

        any ideas anyone?
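
        One way to read the error above: the LSTM expects 3D input of shape (samples, timesteps, features), so a 2D array such as (34895, 8) needs a timestep axis added before calling predict, and is flattened back to 2D only afterwards for the inverse scaling. A minimal sketch with NumPy, using a small random array as a stand-in for test_X (the sizes are made up for illustration):

```python
import numpy as np

# Stand-in for the 2D test inputs, shape (samples, features).
test_X = np.random.rand(5, 8)

# Add a timestep axis of length 1: (samples, timesteps, features).
test_X_3d = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(test_X_3d.shape)

# After predicting, flatten back to 2D for the inverse scaling step.
test_X_2d = test_X_3d.reshape((test_X_3d.shape[0], test_X_3d.shape[2]))
print(test_X_2d.shape)
```

        Indexing shape[2] on an array that is still 2D is what raises "tuple index out of range", so checking test_X.shape before each reshape is a quick way to see which form the array is in.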

    • Avatar
      Edward October 26, 2018 at 2:42 am #

      Greetings Sir..

      I’ve run into the same problem as well. And I’m confident that I’m using “pollution.csv” data.. How can I rectify this?

  10. Avatar
    Neal Valiant August 18, 2017 at 2:35 am #

    Hi Jason, I am wondering what is causing the issue I’m getting; maybe it is a different type of dataset than the example one. Basically, when I fit the model and then check history.history.keys(), I only get back ‘loss’ as my only key.

    • Avatar
      Jason Brownlee August 18, 2017 at 6:27 am #

      You must specify the metrics to collect when you compile the model.

      For example, in classification:

      • Avatar
        max ver April 15, 2019 at 4:40 am #

        Hi Jason,

        If you replace in this example the target by a binary target, let us say one that says if the var_1 goes up or not in the next move, thus:

        reframed[‘target_diff’]=reframed[‘var1(t)_diff’].apply(lambda x : (x>0)*1)

        it gives this error :
        You are passing a target array of shape (8760, 1) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

        I have :
        test_y.shape as (35038,)

        but if we follow another example from you with the PIMA dataset on a simple classification :

        which was :
        X = dataset[:,0:8]
        Y = dataset[:,8]
        model = Sequential()
        model.add(Dense(12, input_dim=8, activation='relu'))
        model.add(Dense(8, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        model.fit(X, Y, epochs=150, batch_size=10)

        it gives no error whereas the Y have the same shape … why ?

        How can we make it work for the lstm classification please ?


  11. Avatar
    Aman Garg August 18, 2017 at 4:18 pm #

    Hello Jason,

    Thank you for such a nice tutorial.

    Since you have published a similar topic and few other related topics in one of your paid books (LSTM networks), should the reader also expect some different topics covered in it?

    I’m an ardent fan of your blog since it covers most of the learning material, so it makes me wonder what will be different in your book.

    • Avatar
      Jason Brownlee August 18, 2017 at 4:42 pm #

      Thanks Aman.

      The book does not cover time series, instead it focuses on teaching you how to implement a suite of different LSTM architectures, as well as prepare data for your problems.

      Some ideas were tested on the blog first, most are only in the book.

      You can see the full table of contents here:

      The book provides all the content in one place, code as well, more access to me, updates as I fix bugs and adapt to new APIs, and it is a great way to support my site so I can keep doing this.

  12. Avatar
    Songbin Xu August 18, 2017 at 6:54 pm #

    Thank you for accepting my opinions, such a pleasure!

    Running the code you modified, something still puzzles me:

    1. Have you drawn the waveforms of inv_y and inv_yhat in the same plot? I think they look quite like persistence.

    2. Curiously, I computed the rmse between pollution(t) and pollution(t-1) in test_X, it’s 4.629, much lower than your final score 26.496, does it mean LSTM performs even worse than persistence?

    3. I’ve tried to remove var1 at t-1, t-2, … , and I’ve also tried to use lag values>1, and also assign different weights to the inputs at different timesteps, but none of them improved, they performed even worse.

    Do you have any other ideas to keep the whole model from learning persistence?

    Looking forward to your advice 🙂

  13. Avatar
    Varuna Jayasiri August 19, 2017 at 2:51 pm #

    Why are you only training with a single timestep (or sequence length)? Shouldn’t you use more timesteps for better training/prediction? For instance in they use 40 (maxlen) timesteps

    • Avatar
      Jason Brownlee August 20, 2017 at 6:05 am #

      Yes, it is just an example to help you get started. I do recommend using multiple time steps in order to get the full BPTT.

      • Avatar
        Long.Ye August 23, 2017 at 11:06 am #

        Hi Jason and Varuna,

        When the timesteps = 1 as you mentioned, does it mean the value at time t-1 was used to predict the value at time t? Is a moving window a method to use multiple time steps? Is there any other way? Does Keras have any moving-window functions?

        Thank you very much.

        • Avatar
          Jason Brownlee August 23, 2017 at 4:23 pm #

          Keras treats the “time steps” of a sequence as the window, kind of. It is the closest match I can think of.
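
          The window idea can be sketched in plain NumPy, independent of Keras. The helper name make_windows is hypothetical, and the sketch assumes the target is the first column of the row that follows each window:

```python
import numpy as np

def make_windows(values, n_steps):
    """Return X with shape (samples, n_steps, features) and y holding the
    first column of the row that follows each window."""
    X, y = [], []
    for i in range(len(values) - n_steps):
        X.append(values[i:i + n_steps])   # n_steps consecutive rows
        y.append(values[i + n_steps, 0])  # next row, column 0 as target
    return np.array(X), np.array(y)

# Toy data: 10 rows, 2 features.
values = np.arange(20, dtype='float32').reshape(10, 2)
X, y = make_windows(values, n_steps=3)
print(X.shape, y.shape)
```

          With n_steps=1 this reduces to using the row at t-1 to predict the value at t; larger values give the LSTM several time steps per sample.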

  14. Avatar
    lymlin August 20, 2017 at 4:28 pm #

    Hi Jason,
    I met some problem when learning your codes.

    dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)
    Traceback (most recent call last):
    File “”, line 1, in
    dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)
    NameError: name ‘parse’ is not defined

    • Avatar
      Jason Brownlee August 21, 2017 at 6:04 am #

      It looks like you have specified a function “parse” but not defined it.
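
      A minimal sketch of such a function, which has to be defined before the read_csv call. It assumes the year, month, day and hour columns are joined with single spaces before being passed in; adjust the format string if your data differs:

```python
from datetime import datetime

def parse(x):
    # x is the joined 'year month day hour' string, e.g. '2010 1 2 0'.
    return datetime.strptime(x, '%Y %m %d %H')

print(parse('2010 1 2 0'))
```

      Note also that the pandas keyword is date_parser; data_parser, as typed in the call above, would be rejected once parse is defined.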

  15. Avatar
    guntama August 21, 2017 at 11:30 am #

    Hi Jason,
    Can I use “keras.layers.normalization.BatchNormalization” as a substitute for “sklearn.preprocessing.MinMaxScaler”?

  16. Avatar
    Naveen Koneti August 21, 2017 at 10:56 pm #

    Hi Jason, it’s a very informative article. Thanks. I have a question regarding forecasting in time series. You have used the training data with all the columns while learning after variable transformations, and the same has been done for the test data too. The test data along with all the variables were used during prediction. For instance, if I want to predict the pollution for a future date, should I know the other inputs like dew, pressure, wind dir etc. on a future date, which I’m not aware of? Another question: suppose we have the same data about multiple regions (let us consider that the pollution among these regions is not negligible). How can we model this so that the input argument for prediction is the region name along with the time to forecast, just for that one region?

    • Avatar
      Jason Brownlee August 22, 2017 at 6:43 am #

      It depends on how you define your model.

      The model defined above uses the variables from the prior time step as inputs to predict the next pollution value.

      In your case, maybe you want to build a separate model per region, perhaps a model that improves performance by combining models across regions. You must experiment to see what works best for your data.
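
      The framing described above (all variables at t-1 as inputs, pollution at t as the output) can be sketched with array slicing. The numbers and the three-column layout are made up for illustration:

```python
import numpy as np

# Toy data: 5 hourly rows, columns = [pollution, dew, temp].
values = np.array([[10., 1., 20.],
                   [12., 2., 21.],
                   [15., 3., 19.],
                   [11., 4., 18.],
                   [ 9., 5., 17.]])

X = values[:-1, :]  # all variables at t-1
y = values[1:, 0]   # pollution at t

for inputs, target in zip(X, y):
    print(inputs, '->', target)
```

      Under this framing, forecasting the next hour only needs the most recent observed row as input, so future values of dew, pressure and so on are not required for a one-step forecast.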

      • Avatar
        Naveen Koneti August 24, 2017 at 4:12 pm #

        Thanks! I missed the trick of converting the time-series to supervised learning problem. That alone is sufficient even for multiple regions I guess. We just have to submit the input parameters of the previous time stamp for the specific region during prediction. We may also try one-hot encoding on the region variable too during data preprocessing.

      • Avatar
        LY September 7, 2017 at 8:12 pm #

        Thank you for your excellent blog, Jason. I’ve really learnt a lot from your nice work recently. After this post, I now know how to transform data into the format an LSTM expects and how to construct an LSTM model.

        Like Naveen Koneti, I have the same puzzle.
        Recently I’ve worked on some clinical data. The data is not like the one we used in this demo. It consists of hundreds of patients, and each patient has several vital sign records. If it were one individual’s records over many years, I could process the data as you showed us. I wonder how I can handle this kind of data. Could you give me some advice, or tell me where I can find any solutions for it?
        If I didn’t state my question clearly and you’re interested in it, please let me know.
        Thanks in advance.

        PS. the data set in my situation is like this
        [ID date feature1 feature2 feature3 ]
        [patient1 date1 value11 value12 value13 ]
        [patient1 date2 value21 value22 value23 ]
        [patient2 date1 value31 value32 value33 ]
        [patient2 date2……………………………………..]
        [patient3 ……………………………………………..]

    • Avatar
      Fabio Ferrari March 28, 2018 at 7:12 pm #

      Hi Naveen, I have the same question as you: the model is defined such that if you know the input features at time t, then you can predict the target value at time t+1. If you want to predict the target variable at time t+2, though, you would need to know the input features at time t+1. If a feature does not change over time, it is no problem; but if a feature changes over time, then its value at time t+1 is not known and may be different from its value at time t.
      I am thinking that to solve this, you would need to define such features as output of the model as well as the target variable. In this way, at time t, you can predict the target variable for time t+1, but also the feature for time t+1, so that this predicted value can be used as input to predict the target variable for time t+2.

      What do you think about that? Did you think of a different solution?
      Many thanks

  17. Avatar
    Chris August 21, 2017 at 11:23 pm #

    again a nice post for the use of lstm’s!

    I had the following idea when reading.

    I would like to build a network, in which each feature has its own LSTM neuron/layer, so that the input is not fully connected.
    My idea is adding a lstm layer for each feature and merge it with the merge layer and feed these results to the output neurons.

    Is there a better way to do this? Or would you recommend to avoid this because the features are poorly abstracted? On the other hand, this might also be interesting.

    Thank you!

    • Avatar
      Jason Brownlee August 22, 2017 at 6:44 am #

      Try it and see if it can out-perform a model that learns all features together.

      Also, contrast to an MLP with a window – that often does better than LSTMs on autoregression problems.

  18. Avatar
    Tryfon August 22, 2017 at 5:20 am #

    Hi Jason,

    I have two questions:

    1) I have a question/ notice regarding the scaling of the Y variable (pollution). The way you implement the rescaling between [0-1] you consider the entire length of the array (all of the 43799 observations -after the dropna-).

    Is it right to rescale it that way? By doing so we are incorporating information from the future (test set) into the past (train set), because the scaler is “exposed” to both of them, and therefore we introduce bias.

    If you agree with my point what could be a fix?

    2) Also the activation function of the output (Y variable) is sigmoid, that’s why we rescale it within the [0,1] range. Am I correct?

    Thanks for sharing the article!

    • Avatar
      Jason Brownlee August 22, 2017 at 6:49 am #

      No, ideally you would develop a scaling procedure on the training data and use it on test and when making predictions on new data.

      I tried to keep the tutorial simple by scaling all data together.

      The activation on the output layer is ‘linear’, the default. This must be the case because we are predicting a real-value.
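
      The train-only scaling mentioned above can be sketched as follows. The arithmetic is written out by hand in NumPy so each step is visible; it is intended to mirror calling fit on the training rows of a scikit-learn MinMaxScaler and then transform on the test rows. The data and the helper names scale and unscale are made up for illustration:

```python
import numpy as np

values = np.array([[1., 10.], [2., 20.], [3., 30.], [4., 40.], [5., 50.]])
train, test = values[:3], values[3:]

# Learn the per-column min and max from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(a):
    return (a - col_min) / (col_max - col_min)

def unscale(a):
    return a * (col_max - col_min) + col_min

train_scaled = scale(train)
test_scaled = scale(test)  # may fall outside [0, 1]; that is expected
print(train_scaled.min(), train_scaled.max())
print(test_scaled)
```

      Test values above the training maximum scale to more than 1. That is the price of not peeking at the test set, and a linear output layer can still represent such values.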

      • Avatar
        Fati March 7, 2018 at 9:44 pm #


        First, I want to thank you for your helpful and practical blog.

        I tried to separate the train and test sets to do normalization on training only, but I got an error related to the test set shape, something like “ValueError: cannot reshape array of size 136 into shape (34,2,4)”, which I don’t know how to fix!
        Do you have an example of an LSTM that fits normalization on train and applies it to test, or do you explain that in your book?


      • Avatar
        Fati March 7, 2018 at 10:25 pm #


        I made some changes and just used the transform method on the test set. Is that correct?
        First, I divided my dataset into two different sets (train and test).
        Second, I ran fit_transform on the train set and transform on the test set.

        But I get rmse=0, which seems weird. Am I correct?

        • Avatar
          Jason Brownlee March 8, 2018 at 6:30 am #

          Sounds correct.

          An RMSE of zero suggests a bug or a very simple modeling problem.

  19. Avatar
    WCH August 22, 2017 at 5:25 pm #

    Thank you very much for your tutorial.

    I have one question:

    I failed to read the NW value in pollution.csv (cbwd column).

    values = values.astype(‘float32’)
    ValueError: could not convert string to float: NW

    How do you fix it?

    • Avatar
      WCH August 22, 2017 at 5:30 pm #

      sorry, I saw the text above and solved it.

    • Avatar
      Juno Huang June 29, 2018 at 7:08 am #

      Hi, I would like to know how you fixed it. I still have that problem; I tried to find the solution above but didn’t find one. Thank you!

      • Avatar
        Can Altas August 17, 2018 at 3:35 pm #

        You have to prepare the data before you convert it (see “Basic Data Preparation”). In Jason’s complete example of the LSTM this preparation step is missing (more likely left out).

        • Avatar
          Jason Brownlee August 18, 2018 at 5:33 am #

          Yes the note above the complete example says clearly:

          NOTE: This example assumes you have prepared the data correctly, e.g. converted the downloaded “raw.csv” to the prepared “pollution.csv“. See the first part of this tutorial.

  20. Avatar
    Dmitry August 22, 2017 at 5:58 pm #

    Hi Jason!
    I think there is a small mistake when you calculate RMSE on the test data.
    You must add this code before calculating RMSE:

    inv_y = inv_y[:-1]
    inv_yhat = inv_yhat[1:]

    Thus, RMSE equals 10.6 (on the same data, in my case), that is much less than 26.5 in your case.

    • Avatar
      Jason Brownlee August 23, 2017 at 6:44 am #

      Sorry, I don’t understand your comment and snippet of code, can you spell out the bug you see?

      • Avatar
        Tommy November 12, 2017 at 2:50 pm #

        This bears further exploration

      • Avatar
        Azhar Khan December 22, 2017 at 11:42 pm #

        I agree with @Dmitry here. The prediction “inv_yhat” is one index ahead of real output “inv_y”.

        It can be seen by plotting predicted output v/s real output:
        pyplot.plot(inv_y[:-1,], color=’green’, marker=’o’, label = ‘Real Screening Count’)
        pyplot.plot(inv_yhat[1:,], color=’red’, marker=’o’, label = ‘Predicted Screening Count’)

        Compute RMSE by skipping the first element of inv_yhat, and a better RMSE score is obtained:
        rmse = sqrt(mean_squared_error(inv_y[:-1,], inv_yhat[1:,]))
        print(‘Test RMSE: %.3f’ % rmse)

        rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
        print(‘Test RMSE: %.3f’ % rmse)

  21. Avatar
    jan August 22, 2017 at 11:01 pm #

    Hi Jason,

    great post! I was waiting for meteo problems to infiltrate the machinelearningmastery world.

    Could you write something about the changed scenario where, given the weather conditions and pollution for some time, we can predict the pollution for another time or place with given weather conditions?

    For example: We have the weather conditions and pollution given for Beijing in 2016, and we have the weather conditions given for Chengde (city close to Beijing) also in 2016. Now we want to know how was the pollution in Chengde in 2016.

    Would be great to learn about that!

    • Avatar
      Jason Brownlee August 23, 2017 at 6:52 am #

      Great suggestion, I like it. An approach would be to train the model to generalize across geographical domains based only on weather conditions.

      I have tried not to use too many weather examples – I came from 6 years of work in severe weather, it’s too close to home 🙂

  22. Avatar
    Simone August 23, 2017 at 9:43 am #

    Hi Jason,
    I have read many of your posts about LSTM. The difference between the parameters batch_size and time_steps is not completely clear to me. Batch_size determines when the memory is reset (right?), but shouldn’t this have the same value as time_steps, which, if I have understood correctly, means how often the system makes a prediction?

    • Avatar
      Jason Brownlee August 23, 2017 at 4:22 pm #

      Great question!

      Batch size is the number of samples (e.g. sequences) that are used to estimate the gradient before the weights are updated. The internal state is reset at the end of each batch after the weights are updated.

      One sample is comprised of 1 or more time steps that are stepped over during backpropagation through time. Each time step may have one or more features (e.g. observations recorded at that time).

      Time steps and batch size are generally not related.

      You can split up a sequence to have one time step per sequence. In that case you will not get the benefit of learning across time (e.g. BPTT), but you can reset state at the end of the time steps for one sequence. This is an odd config though, and really only good for showing off the LSTM’s memory capability.

      Does that help?
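
      A small sketch of why the two are independent, assuming the last partial batch is still used for a weight update. The sizes are made up for illustration:

```python
import math
import numpy as np

# Input shape is (samples, timesteps, features).
n_samples, n_timesteps, n_features = 100, 3, 8
X = np.zeros((n_samples, n_timesteps, n_features))

# Batch size groups samples, not time steps.
batch_size = 32
updates_per_epoch = math.ceil(n_samples / batch_size)
print(X.shape, updates_per_epoch)
```

      Changing n_timesteps changes the shape of each sample but leaves the number of weight updates per epoch untouched; only n_samples and batch_size affect that.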

      • Avatar
        Simone August 24, 2017 at 6:26 am #

        Thanks, now it’s more clear!

  23. Avatar
    Pedro August 23, 2017 at 8:58 pm #

    Hi, I get this error at this step. Could you help me please?

    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    TypeError Traceback (most recent call last)
    in ()
    —-> 1 model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    C:\Anaconda3\lib\site-packages\keras\ in add(self, layer)
    431 # and create the node connecting the current layer
    432 # to the input layer we just created.
    –> 433 layer(x)
    435 if len(layer.inbound_nodes) != 1:

    C:\Anaconda3\lib\site-packages\keras\layers\ in __call__(self, inputs, initial_state, **kwargs)
    241 # modify the input spec to include the state.
    242 if initial_state is None:
    –> 243 return super(Recurrent, self).__call__(inputs, **kwargs)
    245 if not isinstance(initial_state, (list, tuple)):

    C:\Anaconda3\lib\site-packages\keras\engine\ in __call__(self, inputs, **kwargs)
    556 ‘‘)
    557 if len(input_shapes) == 1:
    –> 558[0])
    559 else:

    C:\Anaconda3\lib\site-packages\keras\layers\ in build(self, input_shape)
    1010 initializer=bias_initializer,
    1011 regularizer=self.bias_regularizer,
    -> 1012 constraint=self.bias_constraint)
    1013 else:
    1014 self.bias = None

    C:\Anaconda3\lib\site-packages\keras\legacy\ in wrapper(*args, **kwargs)
    86 warnings.warn(‘Update your ' + object_name +
    87 '
    call to the Keras 2 API: ‘ + signature, stacklevel=2)
    —> 88 return func(*args, **kwargs)
    89 wrapper._legacy_support_signature = inspect.getargspec(func)
    90 return wrapper

    C:\Anaconda3\lib\site-packages\keras\engine\ in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint)
    389 if dtype is None:
    390 dtype = K.floatx()
    –> 391 weight = K.variable(initializer(shape), dtype=dtype, name=name)
    392 if regularizer is not None:
    393 self.add_loss(regularizer(weight))

    C:\Anaconda3\lib\site-packages\keras\layers\ in bias_initializer(shape, *args, **kwargs)
    1002 self.bias_initializer((self.units,), *args, **kwargs),
    1003 initializers.Ones()((self.units,), *args, **kwargs),
    -> 1004 self.bias_initializer((self.units * 2,), *args, **kwargs),
    1005 ])
    1006 else:

    C:\Anaconda3\lib\site-packages\keras\backend\ in concatenate(tensors, axis)
    1679 return tf.sparse_concat(axis, tensors)
    1680 else:
    -> 1681 return tf.concat([to_dense(x) for x in tensors], axis)

    C:\Anaconda3\lib\site-packages\tensorflow\python\ops\ in concat(concat_dim, values, name)
    998 ops.convert_to_tensor(concat_dim,
    999 name=”concat_dim”,
    -> 1000 dtype=dtypes.int32).get_shape(
    1001 ).assert_is_compatible_with(tensor_shape.scalar())
    1002 return identity(values[0], name=scope)

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)
    668 if ret is None:
    –> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
    671 if ret is NotImplemented:

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in _constant_tensor_conversion_function(v, dtype, name, as_ref)
    174 as_ref=False):
    175 _ = as_ref
    –> 176 return constant(v, dtype=dtype, name=name)

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in constant(value, dtype, shape, name, verify_shape)
    163 tensor_value = attr_value_pb2.AttrValue()
    164 tensor_value.tensor.CopyFrom(
    –> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
    166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)
    167 const_tensor = g.create_op(

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in make_tensor_proto(values, dtype, shape, verify_shape)
    365 nparray = np.empty(shape, dtype=np_dt)
    366 else:
    –> 367 _AssertCompatible(values, dtype)
    368 nparray = np.array(values, dtype=np_dt)
    369 # check to them.

    C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ in _AssertCompatible(values, dtype)
    300 else:
    301 raise TypeError(“Expected %s, got %s of type ‘%s’ instead.” %
    –> 302 (, repr(mismatch), type(mismatch).__name__))

    TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.

  24. Avatar
    Neal Valiant August 24, 2017 at 2:49 am #

    Hi Jason,
    I was curious if you can point me in the right direction for converting data back to the actual values instead of scaled.

    • Avatar
      Jason Brownlee August 24, 2017 at 6:48 am #

      Yes, you can invert the scaling.

      This tutorial demonstrates how to do that Neal.

      • Avatar
        Neal Valiant August 25, 2017 at 7:34 am #

        Hi Jason, I did have an issue converting back to actual values, but was able to get past it by dropping columns on the reframed data.

        When looking at my predicted values vs actual values, I’m noticing that my first column has a prediction and a true value, but for every other variable, I only see what I can assume is a prediction. Does this make a prediction on every column, or just one particular one?

        I’m sorry for asking a question such as this, I just think I’m confusing myself looking at my results.

        • Avatar
          Jason Brownlee August 25, 2017 at 3:56 pm #

          The code in the tutorial only predicts pollution.

  25. Avatar
    Jack Dan August 24, 2017 at 3:24 am #

    Dr. Jason,
    I have been trying with my own dataset and I am getting an error “ValueError: operands could not be broadcast together with shapes (168,39) (41,) (168,39)” when I try to do inv_yhat = scaler.inverse_transform(inv_yhat) as you have in line 86 in your script. I still can not figure out where my issue is. I have yhat.shape as (168,1) and test_X.shape as (168,38). When I do this, inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1), my inv_yhat.shape is (168,39). I still can not figure why inverse_transform gives that error.

    • Avatar
      Jason Brownlee August 24, 2017 at 6:50 am #

      The shape of the data must be the same when inverting the scale as when it was originally scaled.

      This means, if you scaled with the entire test dataset (all columns), then you need to tack the yhat onto the test dataset for the inverse. We jump through these exact hoops at the end of the example when calculating RMSE.
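
      The shape rule above can be sketched with a hand-written stand-in for scaler.inverse_transform. The function inverse_scale, the column layout and all numbers are made up for illustration:

```python
import numpy as np

# Scaler statistics learned on 4 columns: [pollution, f1, f2, f3].
col_min = np.array([0., 0., 10., 100.])
col_max = np.array([200., 1., 30., 300.])

def inverse_scale(a):
    # Broadcasting needs a.shape[1] == len(col_min), just as
    # inverse_transform needs the same column count as at fit time.
    return a * (col_max - col_min) + col_min

yhat = np.array([[0.5], [0.25]])           # scaled predictions, (2, 1)
test_X = np.array([[0.4, 0.1, 0.2, 0.3],
                   [0.6, 0.9, 0.8, 0.7]])  # scaled inputs, (2, 4)

# Swap the prediction in for column 0 so the width stays at 4.
inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = inverse_scale(inv_yhat)[:, 0]
print(inv_yhat)
```

      An error such as "shapes (168,39) (41,)" suggests the scaler was fit on 41 columns while the array being inverted has 39, so the array needs to be rebuilt with the same 41 columns, in the same order, before inverting.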

      • Avatar
        Jay Regalia August 24, 2017 at 7:29 am #

        This seems to be the same issue I am having at the moment also. I concatenate my inv_yhat with my test_X like you said, but the shape of inv_yhat afterwards still does not match the second shape in the error (in the post’s case, (41,)).

        • Avatar
          Jack Dan August 26, 2017 at 6:00 am #

          Ask a question in stackoverflow and post the link, I should be able to help. I spent lots of time on this and have a decent idea now.

      • Avatar
        Jack Dan August 24, 2017 at 7:39 am #

        Yes, you’re right! I did that and it worked, nice! Thank you for your comment!

      • Avatar
        John Regilina August 24, 2017 at 8:38 am #

        I am having the same problem, but cannot solve the issue. Every time I try to concatenate them together, there is no change to my inv_yhat variable. I am still unable to understand this issue; if you could expand a bit more, that would be amazing.

        • Avatar
          Jack Dan August 26, 2017 at 6:08 am #

          @John Regilina,
          Check the shape of the data after you scale it, and then check the shape again after you do the concatenation. Remember, your yhat shape will be (rowlength,1), and after concatenation inv_yhat should have the same shape as the data had when you scaled it. Look at Dr. Jason’s answer to my comment/question. Hope that will help. (Thanks to Dr. Jason, who saved me a lot of time.)

      • Avatar
        Sabyasachi Purkayastha May 18, 2018 at 10:48 pm #

        Hello Sir, thank you for the awesome tutorial. But I still couldn’t understand what exactly needs to be done. I am getting the error:
        > operands could not be broadcast together with shapes (12852,27) (14,) (12852,27) ”
        This the line which generates the error:
        inv_yhat = scaler.inverse_transform(inv_yhat).fit()
        Could you please give me a small example to understand what went wrong. Thanks in advance Sir.

    • Avatar
      Shan September 19, 2017 at 1:59 pm #

      I am also stuck with same thing. How did you fix it?

      • Avatar
        anna March 26, 2018 at 11:33 pm #

        Same question here, how did everyone fix this? From your answers I cannot deduce what exactly went wrong in your case, and what you did to solve it.

    • Avatar
      Machiraju Yashwanth May 10, 2021 at 5:55 am #

      I am suffering from the same problem when trying it on my dataset, where np.shape(test_X) is (89070,13). Kindly help me out if you have found the solution.

  26. Avatar
    Lizzie August 24, 2017 at 4:23 am #

    Hi Jason, In dataset.drop(‘No’, axis =1, inplace = True), what is the purpose of ‘axis’ and ‘inplace’?

    • Avatar
      Jason Brownlee August 24, 2017 at 6:50 am #

      Great question.

      We specify to remove the column with axis=1 and to do it on the array in memory with inplace rather than return a copy of the array with the column removed.
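
      A small sketch of the difference, using a made-up three-column DataFrame in place of the real dataset:

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2], 'pm2.5': [129.0, 148.0], 'DEWP': [-16, -15]})

# axis=1 targets a column; without inplace a new DataFrame is returned.
copy = df.drop('No', axis=1)
print(list(df.columns))    # 'No' is still present in df
print(list(copy.columns))

# inplace=True modifies df itself and returns None.
result = df.drop('No', axis=1, inplace=True)
print(list(df.columns), result)
```

      With axis=0 (the default), drop would look for a row labelled ‘No’ instead of a column.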

  27. Avatar
    Lizzie August 24, 2017 at 4:44 am #

    Fabulous tutorials Jason!

  28. Avatar
    Jaskaran August 24, 2017 at 5:19 am #

    Can you show what the multivariate forecast looks like?
    Looks like you missed it in the article.

    • Avatar
      Jason Brownlee August 24, 2017 at 6:56 am #


      You can plot all predictions as follows:

      You get:

      It’s a mess, you can plot the last 100 time steps as follows:

      You get:

      The predictions look like persistence.

      • Avatar
        BEN BECKER August 29, 2017 at 1:33 pm #

        Jason, what am I missing, looking at your plot of the most recent 100 time steps, it looks like the predicted value is always 1 time period after the actual? If on step 90 the actual is 17, but the predicted value shows 17 for step 91, we are one time period off, that is if we shifted the predicted values back a day, it would overlap with the actual which doesn’t really buy us much since the next hour prediction seems to really align with the prior actual. Am I missing something looking at this chart?

        • Avatar
          Jason Brownlee August 29, 2017 at 5:16 pm #

          This is what a persistence forecast looks like, that value(t) = value(t-1).
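
          A persistence baseline is cheap to compute and gives a number to compare the LSTM against. A minimal sketch, with made-up readings and a hypothetical rmse helper:

```python
import numpy as np

def rmse(actual, predicted):
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy hourly pollution readings (made-up numbers).
series = np.array([20., 22., 21., 25., 30., 28.])

actual = series[1:]        # value(t)
persistence = series[:-1]  # value(t-1) used as the forecast for t
print('Persistence RMSE: %.3f' % rmse(actual, persistence))
```

          If a model’s test RMSE is no lower than this baseline on the same data, it has probably learned little beyond copying the last observed value.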

          • Avatar
            BECKER August 29, 2017 at 9:22 pm #

            So how would you get the true predicted value(t)? I am thinking of the last record in the time series where we are trying to predict the value for the next hour.

          • Avatar
            Jason Brownlee August 30, 2017 at 6:15 am #

            Sorry, I don’t follow. Perhaps you can restate your question?

          • Avatar
            Anna October 2, 2017 at 4:38 pm #

            Hello Jason Brownlee

            Thank you for your great posts. I ran the model above on my data and it works perfectly; however, when I plot the real data (the blue one, inv_y) and the prediction (the orange one, inv_yhat), the result shows the prediction is delayed by 1 step. It should be predicted one step before, as in your graph. Your model is the same as the MATLAB tool:

            After running the model, I applied it in real time to my problem to compute inv_yhat at every step. The result was really bad, since I never have the real inv_y. I fed the prediction back in as input (instead of the real data inv_y).

            My problem is: I receive some signals as inputs, then I label them offline to get the output (real data inv_y or the first column in train_X).

            Do you have a model that trains without the real data in the first column? Thank you.

          • Avatar
            Jason Brownlee October 3, 2017 at 5:40 am #

            Your model may have low skill and be simply predicting the input as the output (e.g. persistence).

            You may need to continue to develop your model, I list some ideas for lifting model skill here:

        • Avatar
          Li Yue March 20, 2018 at 6:46 pm #

          Hi, I have the same confusion as you. I think the prediction problem should be value_predict(t-1) = value_real(t). The label “train_y” indicates value_real(t+1). We input train_x(t) into the model to get the prediction, and the prediction should match “train_y”, not one step after “train_y”. Did you solve this problem?

      • Avatar
        Tyler Byers October 26, 2017 at 3:40 am #

        It’s definitely similar to a persistence model since we trained the model using the var1(t-1) feature (i.e. the lagged pollution feature). The model certainly found that to be the strongest predictor. This would be ok if we were doing predictions later on an hour-by-hour basis. But, if, say we want to predict the pollution 20 hours from now, we aren’t yet going to know what the hour-19 pollution is. So it seems like cheating to include this variable in the training and prediction sets.

        I removed this variable to train the model, leaving other parameters about the same, and was then only able to get a minimum validation loss of 0.55 and test RMSE of 87.02

      • Avatar
        xeo December 26, 2017 at 4:00 am #

        It looks like the prediction is pretty good. Can we say the LSTM model is good?

      • Avatar
        Fiona January 27, 2019 at 10:51 pm #

        Hi, Jason. I have a question about the transform: I found that the predicted data after inverse_transform() was not in the same range as the original values. For example, my original data ranges from 0 to 850, but the predicted data ranges from 0 to 8. Is there a problem?

      • Avatar
        Jay October 23, 2019 at 11:17 am #

        Hi Jason

        I have two questions:

        (a) based on the graphs that you have shown for the y_inv and yhat_inv, it looks like your model has overfit on the test set. Don’t you agree ?

        (b) In all time series prediction posts I have seen, the validation part uses the tail of the data to do validation (predict(yhat)). How can we modify the code in order to predict the future which is not covered in the dataset.

        • Avatar
          Jason Brownlee October 23, 2019 at 1:50 pm #

          The model in this tutorial is probably underfit – e.g. it learned a persistence model.

          Fit the model on all available data then call model.predict() to predict out of sample.

  29. Avatar
    gammarayburst August 24, 2017 at 11:32 pm #

    Wind dir is label encoded not wind speed!!!

  30. Avatar
    Filipe August 27, 2017 at 4:16 am #

    First of all, thanks. All of this material on the blog is super interesting, and helpful and making me learn a lot.

    Of course… I have a question.

    I’m surprised by the use of LSTMs here. The property of them being “stateful” I guess is being used. But is there “sequence” information flowing?

    So when I used LSTMs in Keras for text classification tasks (sentence, outcome), each “sentence” is a sequence. Each observation is a sequence. It’s an ordered array of the words in the sentence (and its outcome).
    In this example, I could not see a sense in which var1(t-1) is linked to var1(t-2). Aren’t they being treated as independent Xs in a regression problem? (predicting var8(t))

  31. Avatar
    STYLIANOS IORDANIS August 27, 2017 at 5:23 am #

    Awesome article, as always.
    Btw, what is your view on using an autoencoder or restricted Boltzmann layer to compress features before feeding an LSTM network? For example, if one has a financial time series to forecast, e.g. a classifier trying to predict an increase or decrease in a look-ahead time window, via numerous technical indicators and/or other candidate exogenous leading indicators.
    Could you write an article based on that idea?

    • Avatar
      Jason Brownlee August 27, 2017 at 5:53 am #

      I have seen better results from large MLPs, nevertheless, try it and see how you go.

      • Avatar
        STYLIANOS IORDANIS August 27, 2017 at 7:25 am #

        Autoencoder and restricted Boltzmann layers also deal with multicollinearity issues. Do MLPs also cope with multicollinearity in the features?

        • Avatar
          Jason Brownlee August 28, 2017 at 6:46 am #

          MLPs are more robust to multicollinearity than linear models.

  32. Avatar
    Hee Un August 29, 2017 at 12:28 am #

    Hi, I am always amazed at your article. Thank you.
    I have a question.
    Does this LSTM code weight each feature?
    At the moment I’m predicting precipitation: the trend is correct, but the amount is not right.
    What’s wrong with that? :(

    • Avatar
      Jason Brownlee August 29, 2017 at 5:06 pm #


      Sorry, I’m not sure I understand the question, perhaps you could rephrase it?

      I can say that I would expect better skill if the data was further prepared – e.g. made stationary.

  33. Avatar
    Vipul August 30, 2017 at 7:53 pm #

    Hi Jason,

    Thanks for wonderful explanation!
    Could you please help me understand the concept of dimensionality reduction? Should PCA or a statistical approach be used before feeding the data to the LSTM, or will the LSTM learn the correlations between the provided inputs on its own? How should I approach a regression problem with an LSTM when I have a large set of features?

    Your reply is greatly appreciated!

    • Avatar
      Jason Brownlee August 31, 2017 at 6:18 am #

      Generally, if you make the problem simpler using data preparation, the LSTM or any model will perform better.

  34. Avatar
    Nader August 31, 2017 at 2:42 am #

    How can I predict a single input ?
    for example :

    [0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001]

    How do I reshape it and call model.predict()?

    Thank you

      • Avatar
        Nader August 31, 2017 at 12:48 pm #

        Thank you, Jason.
        I applied:

        my_x = np.array([0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001])
        print(my_x.shape) # (8,)
        my_x = my_x.reshape((1, 1, 8))
        my_pred = model.predict(my_x)

        The answer is the “scaled” answer which is 0.03436

        I tried applying the scaler.inverse_transform(my_pred) to GET the actual number

        But I get the following error:

        non-broadcastable output operand with shape (1,1) doesn’t match the broadcast shape (1,8)

        Thank you

        • Avatar
          Jason Brownlee September 1, 2017 at 6:40 am #

          Yes, the transform requires data in the same form as when you “fit” it.
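
          A minimal sketch of what that can look like, assuming the scaler was fit on 8 columns with the predicted variable in column 0. The toy data and the zero placeholders are assumptions for illustration; because min-max scaling works column by column, the placeholder values do not affect the recovered column 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the 8-column dataset (column 0 plays the role of pollution).
rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 100.0, size=(50, 8))

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)      # the scaler is fit on 8 columns

my_pred = np.array([[0.03436]])         # shape (1, 1), like a model.predict() result

# Pad the prediction with 7 placeholder columns so the row has 8 columns again.
padded = np.concatenate([my_pred, np.zeros((1, 7))], axis=1)
inv_pred = scaler.inverse_transform(padded)[:, 0]   # keep only column 0

# The same inversion done by hand for column 0.
col_min, col_max = raw[:, 0].min(), raw[:, 0].max()
by_hand = my_pred[0, 0] * (col_max - col_min) + col_min
print(inv_pred[0], by_hand)
```

          The two printed values should agree, which is a quick way to check that the padding puts the prediction in the right column.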

          • Avatar
            David September 23, 2017 at 3:27 pm #

            Then what if I use multi-time step prediction? (use several lags for prediction)
            The y_hat and X_test can not have the same dimension.

          • Avatar
            Jason Brownlee September 24, 2017 at 5:13 am #

            If the size of X or y must vary, you can use padding.

  35. Avatar
    Fejwin August 31, 2017 at 3:52 am #

    Hi Jason,
    Thanks for the tutorial!
    Maybe I missed something, but it seems that you provided the model with all of remaining data as ‘testdata’ and then tried predicting it? Isn’t that kind of pointless, since we should be interested in predicting unknown data in the future, instead of data that the model has already seen? Wouldn’t it make more sense to try the model to predict a first timestep into the future that neither the training nor the test data knew anything about? (Perhaps only give the model training data, but no test data, and afterwards ask it to predict first time step after training data?) How would I have to change the code to achieve that?

    • Avatar
      Jason Brownlee August 31, 2017 at 6:25 am #

      The model is fit on the training data, then makes a prediction for each step in the test data. The model did not “know” the answer to the test data prior to making each prediction.

      Normally we would use walk-forward validation:

      I did use walk-forward validation on other LSTM examples (use the blog search), but it seems to confuse readers more than it helps.
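
      For readers who want to see the idea, here is a rough sketch of one-step walk-forward validation. The toy series and the persistence forecast in predict_next() are assumptions for illustration; in practice that function would wrap a fitted model.

```python
import numpy as np

series = np.array([10.0, 12.0, 13.0, 12.0, 15.0, 16.0, 18.0, 17.0])
n_test = 3
history = list(series[:-n_test])   # observations known so far
test = series[-n_test:]

def predict_next(history):
    # Placeholder model (an assumption for this sketch): persistence forecast.
    return history[-1]

predictions = []
for obs in test:
    predictions.append(predict_next(history))  # forecast one step ahead
    history.append(obs)                        # then reveal the true observation

rmse = np.sqrt(np.mean((np.array(predictions) - test) ** 2))
print(predictions, rmse)
```

      Each forecast only uses observations that would have been available at that point in time, and the true value is added to the history before the next step.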

      • Avatar
        Guillermo November 8, 2017 at 9:19 pm #

        Hi Jason.

        I am digging into your example and maybe missing something because I agree with Fejwin.

        I mean, as long as real Pollution in t-1 is introduced in the test_X set, instead of predicted Pollution in t-1, when you run model.predict(test_X) each output is not considered for future prediction.

        That is, with all the features, including real Pollution(t-1), the model predicts an output: predicted Pollution(t). But on the next step, when the model predicts Pollution(t+1), it doesn’t take predicted Pollution(t); it takes real Pollution(t) instead.

        Can you clarify this point please?

        Thank you.

    • Avatar
      David September 24, 2017 at 1:01 pm #

      Can I use part of trainX to predict testY ? (lags needed to predict testY is in trainX) Not sure if it is a logical way to do it.

  36. Avatar
    hadi September 1, 2017 at 12:08 pm #

    Dear Jason Brownlee,

    I have a slightly different question. I have a sequence of characters as input and I want to project it into a multidimensional space.
    I mean I want to project each sequence of chars (let’s say a word) to a vector of 100 real numbers across my corpus, so my input is a sequence of chars (any char embedding is welcome) and my output is a vector for each sequence (which is a word), and I’m really confused about how to define the model.
    I would appreciate any clue, help or sample code for defining my model.

    Thanks a lot in advance.

    • Avatar
      Jason Brownlee September 1, 2017 at 3:26 pm #

      Keras provides an Embedding layer that you can use directly:

    • Avatar
      Balint Takacs May 1, 2020 at 1:09 am #

      I am also having trouble understanding the difference between the walk-forward validation (prediction) method, and the “simple” prediction method being carried out here in the example.

      Why does the walk-forward prediction (with an appended history) give different predictions than simply calling predict on the test set, if the model is not re-fitted (that is, including the newly available observations and training again)?
      Has the cumbersome walk-forward any advantage over this approach here in the example?
      Can the walk-forward be carried out also for multivariate-multistep forecasting ?


      • Avatar
        Jason Brownlee May 1, 2020 at 6:41 am #

        Walk-forward validation simulates how we expect to use the model in practice, it evaluates the model under those conditions.

        The procedure can be adapted based on how you want to use the model, e.g. when to refit, when new obs are available, how many steps to predict, etc.

        You can learn more about walk-forward validation here:

        • Avatar
          Balint Takacs May 1, 2020 at 9:41 pm #

          Hey, thanks for the quick answer.

          So as far as I see your point, the walk forward approach, without refitting the model at each iteration, is the same as calling model.predict(X_test) at once.
          And the reason why you still implement it without refitting, is to provide the framework properly, and make it easier for us to work further with it, right ?

          If I am wrong, and it is not the same, why is it not the same? I went through many of your posts, including the one you posted, but I haven’t managed to understand the difference so far, if there is any.

          For example:

          Here you explain the updating, which is awesome, but in the baseline part, where you do not apply updating (so no iterative re-fit), you still do iterative walk-forward predicting instead of calling model.predict() on the test set as a whole. Would that be the same in the no-update case?
          Sorry for being annoying. I really appreciate your help, and time.

          Many thanks

  37. Avatar
    Sai k September 2, 2017 at 12:12 am #

    Hi Jason,

    Thanks for the wonderful tutorial!
    Could you please explain how to deal with the following situation: “Predict the pollution for the complete month (assume the month has 30 days, t+1…t+30) given the “expected” weather features for that month, assuming we have been provided with historic pollution and weather data on a daily basis”?

    How should the data be prepared and how should it be fed into the LSTM?

    As I am new to LSTM models, I have trouble understanding the data preparation and how to feed the data to the LSTM.

    Thanks in advance for your response

  38. Avatar
    Adrian September 5, 2017 at 5:29 am #

    Hi Jason,

    Thanks for sharing. I added accuracy info to the model while training using metrics=['accuracy'].

    So model.compile(loss='mae', optimizer='adam') becomes:

    model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])

    This adds acc & val_acc to output. After 100 epochs the acc value appears quite low : (0.0761) :
    Epoch 100/100
    1s – loss: 0.0143 – acc: 0.0761 – val_loss: 0.0132 – val_acc: 0.0393

    The accuracy of the model appears very low ? Is this expected ?

    Further info on acc & val_acc values : “acc is the accuracy of a batch of training data and val_acc is the accuracy of a batch of testing data.”

    • Avatar
      Jason Brownlee September 7, 2017 at 12:38 pm #

      This is a regression problem. Accuracy does not make sense.
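
      A small illustration of why, using made-up scaled values. Accuracy is a classification metric that counts how often a predicted label equals the true label; the exact-match rate below is a simplified stand-in for that idea (an assumption for illustration, not Keras’s internal computation). Error metrics such as MAE and RMSE measure how far off the predictions are instead.

```python
import numpy as np

y_true = np.array([0.129, 0.148, 0.159, 0.182, 0.138])
y_pred = np.array([0.131, 0.150, 0.155, 0.180, 0.140])

# Exact matches almost never happen for real-valued targets.
exact_match_rate = np.mean(y_true == y_pred)

# Error metrics are the meaningful summary for regression.
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(exact_match_rate, mae, rmse)
```

      Here the predictions are all close to the targets, yet the match rate is zero, which is why a low acc value says little about the quality of a regression model.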

  39. Avatar
    Eric H September 5, 2017 at 6:33 am #

    Hi Jason, I’ve recently discovered your site and have been so pleased with your information – thank you. I’ve been trying to model data which is much like the air quality data described here, but every few time steps there will be a change in the number of features present.
    Example: in my data a time step = 1 day and a sequence can be 800 – 1200 days long. Normally the data consists of features
    – pm2.5: PM2.5 concentration
    – DEWP: Dew Point
    – TEMP: Temperature
    – PRES: Pressure
    – cbwd: Combined wind direction
    – Iws: Cumulated wind speed
    – Is: Cumulated hours of snow
    – Ir: Cumulated hours of rain

    But then every (random-ish amount of time) there will be an additional number of features for a day and then back to the baseline number of features.

    I’ve no idea how to handle a variable number of features. I’ve seen and played with plenty of variable sequence length examples, but I have both variable sequence lengths and variable features. I’d love your input!

    • Avatar
      Jason Brownlee September 7, 2017 at 12:40 pm #

      You will need to normalize the number of features to be consistent for all time.
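
      One possible way to do that, sketched with pandas: give every day the same full set of columns, fill the ones that are absent, and add a flag so the model can tell “absent” from “zero”. The column name extra_sensor and the fill value 0.0 are hypothetical choices for this sketch.

```python
import pandas as pd

# "extra_sensor" is a hypothetical feature that only appears on some days.
all_features = ["pm2.5", "TEMP", "PRES", "extra_sensor"]

days = [
    {"pm2.5": 129.0, "TEMP": -4.0, "PRES": 1020.0},
    {"pm2.5": 148.0, "TEMP": -4.0, "PRES": 1020.0, "extra_sensor": 3.5},
    {"pm2.5": 159.0, "TEMP": -5.0, "PRES": 1021.0},
]

# Same columns for every time step; missing entries become NaN.
df = pd.DataFrame(days).reindex(columns=all_features)

# Flag where the extra feature was really observed, then fill the gaps.
df["extra_sensor_present"] = df["extra_sensor"].notna().astype(int)
df["extra_sensor"] = df["extra_sensor"].fillna(0.0)
print(df)
```

      Whether a zero fill, a mean fill or something else works best is something to test on your own data.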

      • Avatar
        Eric Hiller September 10, 2017 at 5:21 am #

        Is it possible to use (what in TensorFlow – land is called) SparseFeatures or SparseTensors to represent sparse datasets, or is there a fundamental issue with handling sparse datasets within RNNs?

        • Avatar
          Jason Brownlee September 11, 2017 at 12:04 pm #

          Good question, I’m not sure off the cuff. Keras may support sparse numpy arrays – try it and see?

  40. Avatar
    Ali Haidar September 8, 2017 at 1:56 am #

    Hi Jason,

    Thanks for the amazing articles. They are really helpful.

    Let’s say I want to forecast with lead 2. By that I mean forecasting values at time t using t-2 values, without using t-1 elements. I have to remove columns from reframed after running the series_to_supervised function, right? That is, remove all columns with t-1 values?


  41. Avatar
    Inna September 11, 2017 at 7:53 pm #

    Thanks for articles.

    I have a question related with time series. Is it possible to forecast all variables? For example, I have ‘pollution’, ‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’ and want to predict all of them for the next hour. We know about trends and common rules (because of data amount: few years), so we can do forecasting. Where can I find more info about it?

    • Avatar
      Jason Brownlee September 13, 2017 at 12:22 pm #

      Yes, this example can be modified to predict each variable.

  42. Avatar
    appreciator September 12, 2017 at 10:59 am #

    Thank you Jason for the great tutorial! I’m adapting it for different data, and I’m trying to use more than one time step. However, I noticed something strange in series_to_supervised: since the first loop ends at 0 and the last loop starts at 0, won’t there be two columns that are the same?

  43. Avatar
    Eric September 12, 2017 at 11:49 am #

    Hi Jason,

    Thanks for the tutorial. I had just one question though.
    I’ve seen tutorials using multivariate time series to train on many datasets (all correlated with each other) at the same time, and they were able to predict for each dataset used.

    For the sake of argument, let’s say that one of the datasets is broken: the sensor that feeds it is out of service (say, at some point one of the data columns only contains 0 instead of real values). Do you think we could use the other spots to continue predicting the broken one? (They are correlated, and there would be a lot of non-broken data from before the failure.)

    Best regards,

    • Avatar
      Jason Brownlee September 13, 2017 at 12:27 pm #

      Yes, you could try it and see. Or impute the missing data and see if that is better.
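
      A rough sketch of the imputation option, assuming the outage shows up as zeros in one column. The column names and values are made up, and linear interpolation is just one of several ways to fill the gap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sensor_a": [5.0, 6.0, 0.0, 0.0, 9.0],   # zeros mark the outage in this toy data
    "sensor_b": [4.8, 6.1, 7.0, 8.2, 9.1],
})

# Treat the outage zeros as missing, then fill them by linear interpolation.
df["sensor_a"] = df["sensor_a"].replace(0.0, np.nan).interpolate(method="linear")
print(df)
```

      This only makes sense if zero is not a legitimate reading for that sensor; otherwise the outage needs to be marked some other way.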

      • Avatar
        Eric September 14, 2017 at 2:22 pm #

        Thank you Jason,

        I shall try that as soon as possible. I guess the overall accuracy will drop for every prediction (since my goal is to use a multivariate model, feed it the data from every spot and predict each of them, with the possibility of predicting a broken one), so one spot being fed “wrong” data should lower the accuracy for every spot, no?

        Best regards,

  44. Avatar
    Shan September 13, 2017 at 3:46 am #

    Is there any time parser like date parser? I am working with data which is in milliseconds.

    • Avatar
      Jason Brownlee September 13, 2017 at 12:33 pm #

      It can handle parsing dates and times I believe.
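
      For example, assuming pandas is doing the parsing, fractional seconds can be read with the %f directive, and epoch timestamps in milliseconds with unit="ms". The sample values below are made up.

```python
import pandas as pd

# Timestamps as strings with millisecond precision.
raw = ["2017-09-13 03:46:00.125", "2017-09-13 03:46:00.250"]
stamps = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f")

# Timestamps stored as integer milliseconds since the Unix epoch.
from_epoch = pd.to_datetime([1505274360125], unit="ms")
print(stamps[1] - stamps[0], from_epoch[0])
```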

  45. Avatar
    kumar September 13, 2017 at 10:00 pm #

    i got this error when i tried to run the program

    pyplot.plot(history.history['val_loss'], label='test')
    KeyError: 'val_loss'
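
    A likely cause, assuming standard Keras behavior: val_loss is only recorded when fit() is given validation_data (or validation_split). The sketch below uses a plain dict as a stand-in for history.history and only collects the curves that are actually present.

```python
# Stand-in for history.history when fit() ran without validation data.
history_dict = {"loss": [0.051, 0.023, 0.016]}

# Collect (or plot) only the curves that were actually recorded.
curves = {}
for key, label in (("loss", "train"), ("val_loss", "test")):
    if key in history_dict:
        curves[label] = history_dict[key]
print(sorted(curves))
```

    Passing validation_data=(test_X, test_y) to fit(), as in the tutorial, should make the val_loss key available.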

  46. Avatar
    Simon September 15, 2017 at 9:55 pm #

    Hi Jason,

    Wouldn’t it be better to scale the data after you run the series_to_supervised function? As it stands now, the inverse scaling doesn’t work if n_in > 1 since the dimensions don’t line up anymore.

    • Avatar
      Jason Brownlee September 16, 2017 at 8:41 am #

      It would, but the scaling would be column-wise and incorrect.

      • Avatar
        Simon September 17, 2017 at 11:26 am #

        Could you expand more on this and how the code might be modified to incorporate multi-step? I’m also playing around with turning this into a classification problem, would it still work if the feature we are trying to predict is a classifier?

        • Avatar
          Jason Brownlee September 18, 2017 at 5:42 am #

          I give the code to do this in another comment.

          For classification, you will need to change the number of neurons in the output layer, the activation function in the output layer and the loss function.

  47. Avatar
    Agrippa Sulla September 16, 2017 at 5:18 am #

    I have a little question. I’ve successfully built my own LSTM multivariate NN using your code as a basis (thanks!). It forecasts export growth for the UK using past export growth and GDP. It performs decently but the financial crisis kinda messes things up.

    Now I want to add data to this model, but I can’t go further back than 1980 for the time-series (not for now at least). So what I want to do is add the GDP growth rate of all the UK’s major trading partners. Should I be worried about adding another 20 input neurons (e.g. countries)? Do you have a post talking about the risks of using data that is low in rows (e.g. years) but high in columns (e.g. inputs).

    I hope my question makes sense.


    • Avatar
      Jason Brownlee September 16, 2017 at 8:46 am #

      I don’t have posts on the topic of more columns than rows. It does require careful handling.

      As a start, I would recommend developing a strong test harness, then try adding data and see how it impacts the model skill. Experiment.

  48. Avatar
    Ed September 16, 2017 at 6:00 am #

    Thanks a lot for your tutorial!
    Is there a feature importance plot for cases like this?
    Sometimes it is very important to know.

    • Avatar
      Jason Brownlee September 16, 2017 at 8:47 am #

      Good question. I’m not sure about feature importance plots for LSTMs. I would expect that if feature importance can be calculated for MLPs, then it could be calculated for LSTMs, but this is not something I have looked into sorry.

  49. Avatar
    Kuldeep September 20, 2017 at 12:53 am #

    Hi Jason,

    Great post as always!

    I have a question regarding scaling. My problem is quite different as I have to apply series to supervised function first on the data coming from different source and then combine the data… my question is, can I apply scaling at the end? Should scaling be applied column wise or on complete matrix/array?

    • Avatar
      Jason Brownlee September 20, 2017 at 5:58 am #

      The key is being able to scale the data consistently. The place in the pipeline is less important.

  50. Avatar
    Nejra September 21, 2017 at 1:25 am #

    Hi Jason thank you very much for your tutorials!
    I’m trying to develop an LSTM for time prediction having as input 3 features (2 measurements and a third one is a sort of control of the system) and the output (value to predict) is not a single value but a vector of 6 values. So, at every time step my network should be able to predict this entire vector. Two questions:
    1. Since my inputs are not correlated between them, their order in the input array will not influence my predictions?
    2. How can I shape my output in order to estimate all the 6 values of the vector for each time step?
    Thanks for any kind of help!

  51. Avatar
    Mitchel Myers September 22, 2017 at 5:34 am #

    I replicated the example described on this page, and saved my test_y and yhat vectors to csv so that I could manually check how my prediction compared with the true values. However, when I did this, I discovered that every yhat value in my array is the exact same value (~34). I was expecting a unique yhat value for each input vector. Do you have any suggestions to help fix this?

  52. Avatar
    Mitchel Myers September 23, 2017 at 3:25 am #

    Follow up on this — when this error arose, I was using my own data set that I want to perform time series forecasting on. When I duplicated the guide exactly as described above, the issue goes away. Do you have any idea why this issue comes up (where every predicted yhat value is the exact same) when I use a different data set?

    • Avatar
      Jason Brownlee September 23, 2017 at 5:44 am #

      Perhaps the model needs to be tuned to your specific dataset?

  53. Avatar
    zwj September 25, 2017 at 1:10 pm #

    Hi Jason, thank you very much for your tutorials! I tried deleting the columns [‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’] from the train_X data, and I get almost the same test RMSE. It is 26.461. It seems to show that the 8 weather conditions have no effect on the prediction result. The code is below.

    # fit an LSTM network to training data
    def fit_lstm(train, test, batch_size, neurons):
        # split into input and outputs (only the first column is used as input)
        train_X, train_y = train[:, 0:1], train[:, -1]
        test_X, test_y = test[:, 0:1], test[:, -1]

        # reshape input to be 3D [samples, timesteps, features]
        train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

        # design network
        model = Sequential()
        model.add(LSTM(neurons, input_shape=(train_X.shape[1], train_X.shape[2])))
        model.compile(loss='mae', optimizer='adam')

        # fit network
        history = model.fit(train_X, train_y, epochs=50, batch_size=batch_size, validation_data=(test_X, test_y), verbose=2, shuffle=False)
        #history = model.fit(train_X, train_y, epochs=50, batch_size=72, verbose=2, shuffle=False)

        return model

    # make a prediction
    def make_forecasts(model, test_X):
        # keep only the first column, then reshape to 3D
        test_X = test_X[:, 0:1]
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        forecasts = model.predict(test_X)

        return forecasts

  54. Avatar
    Mitchel September 27, 2017 at 1:39 am #

    Can you explain why the train_X and test_X data sets are reshaped to this?

    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
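
    As a rough illustration of what that reshape does: Keras LSTM layers expect 3D input of the form [samples, timesteps, features], so the 2D [samples, features] array gets a timesteps axis of length 1. The toy array below is made up, and the three-lag variant is an assumed layout with the columns ordered lag by lag.

```python
import numpy as np

n_samples, n_features = 5, 8
train_X = np.arange(n_samples * n_features, dtype=float).reshape(n_samples, n_features)

# 2D [samples, features] -> 3D [samples, timesteps=1, features]
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
print(train_X.shape, "->", train_X_3d.shape)

# With several lags stored side by side (3 lags x 8 features = 24 columns),
# the same idea gives 3 timesteps of 8 features each.
n_lags = 3
lagged = np.zeros((n_samples, n_lags * n_features))
lagged_3d = lagged.reshape((n_samples, n_lags, n_features))
print(lagged.shape, "->", lagged_3d.shape)
```

    The values are unchanged; only the array’s shape differs.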

  55. Avatar
    Lino September 28, 2017 at 12:59 pm #

    Hi Jason

    Great post.
    Suppose i want to predict the next 24h using previous one year dataset. How can we do it?

  56. Avatar
    Nels September 29, 2017 at 5:56 am #

    I think I’m missing something fundamental in my understanding of LSTM/s and BPTT. I’ve read through many of your posts and have come to understand RNN’s and LSTM in particular much better because of them, so thank you for that!

    My question that I hope you can shed some light on is what is the difference between passing the past information, i.e. var(t-n)…var(t-1) in the input vector for a single sample, and passing multiple sequences, of length n as a single sample?

    To help clarify, using timesteps of length N, I have a configuration that looks like this:

    Input to LSTM is [samples, timesteps, features].
    Each sample/observation consists of a vector of timestamps (of size N+1) where each of these vector’s values corresponds to the input feature’s values I.e.

    Observations for each time t, with features f and r
    time t
    [ f(t-N) r(t-N) ]
    [ f(t-N+1) r(t-N+1) ]
    [ f(t-N+2) r(t-N+2) ]
    . .
    . .
    . .
    [ f(t) r(t) ]
    And for each observation/sequence the target is Y(t).

    Or, as many of your examples do, you can include the past information in the form of a windowed input, with a single time step, so something like:

    Input is [samples, 1, features]. So for every observation, we include previous time values as features

    Observations for each time t, with features f and r
    time t
    [ f(t-N), r(t-N), f(t-N+1), r(t-N+1), f(t-N+2), r(t-N+2), f(t), r(t) ]
    And again, for each observation, the target is Y(t).

    I understand that having sequences longer than 1 allows BPTT to work over the length of those sequences, but I don’t think I really understand the difference in these two methods.

    I have tried the two options described, and I find that the latter performs better based on preliminary tests. I can use a window size of 3 and a sequence length of 1 and get good results, but if I use the first approach and a window size of 12, the model actually fails to learn within the same amount of time.

    Hence, I wonder if I don’t have a fundamental misconception. If you have some time, I would like to hear your explanation on this difference and how the LSTM responds in terms of “memory” based on these two different types of input setup. (I have read a lot of articles, blogs, git hub issues, and stack overflow posts trying to wrap my head around this, but I haven’t found anything that address this directly.)


  57. Avatar
    Paul September 29, 2017 at 12:28 pm #

    With this line…

    # drop columns we don’t want to predict
    reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

    I don’t understand the numbers used here. Does the data even have that many columns? There are 8 feature columns and 1 index column.

    I’m adapting this code for my own use and have very different features but I’m not sure I’m getting that line adapted right.

    Thanks for the great post!

    • Avatar
      Paul September 29, 2017 at 1:29 pm #

      Nevermind! I figured it out.

    • Avatar
      Jason Brownlee September 30, 2017 at 7:33 am #

      It does have that many columns after we reshape it to be a supervised learning problem.
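
      To make the column count concrete, here is a simplified sketch in the spirit of the tutorial’s series_to_supervised() (a re-implementation for illustration, not the exact original code). With 8 variables, one lag and one output step, the framed data has 16 columns, and positions 9 to 15 are var2(t) to var8(t).

```python
import numpy as np
import pandas as pd

def series_to_supervised(data, n_in=1, n_out=1):
    """Frame a 2D array as lagged inputs followed by outputs (simplified sketch)."""
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    for i in range(n_in, 0, -1):          # lags t-n_in ... t-1 (stops before 0)
        cols.append(df.shift(i))
        names += [f"var{j + 1}(t-{i})" for j in range(n_vars)]
    for i in range(n_out):                # t, t+1, ... (starts at 0)
        cols.append(df.shift(-i))
        suffix = "t" if i == 0 else f"t+{i}"
        names += [f"var{j + 1}({suffix})" for j in range(n_vars)]
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    return agg.dropna()

values = np.arange(40, dtype=float).reshape(5, 8)   # 5 time steps, 8 variables
reframed = series_to_supervised(values, 1, 1)
print(reframed.shape)                               # 8 lagged + 8 current columns

# Keep the 8 lagged inputs plus var1(t); drop var2(t)..var8(t) at positions 9..15.
reframed = reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1)
print(list(reframed.columns))
```

      When adapting this to a different number of features, the positions to drop change accordingly, so it can help to print the column names first.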

  58. Avatar
    Wenhan Wang September 30, 2017 at 2:05 pm #

    This is awesome!
    Helping me a lot in my real work!

  59. Avatar
    Vilmara Sanchez October 4, 2017 at 3:54 pm #

    Hi Dr. Jason, I am working on a project for sleep stage classification where the number of timesteps (observations) in the input series (ECG signal) is different than the number of timesteps in the output series (sleep stage scores).

    The issue here is that the input and output time series are not equal in terms of timesteps as the examples you have shown in your problems.

    I have tried to frame the problem in different ways without getting results that make sense. Could you please provide guidance on how to approach this problem?



  60. Avatar
    Devakar Verma October 6, 2017 at 6:06 pm #

    Hi Jason,
    What if we want to predict multiple features as output while also having multiple features as input? How can we solve this problem? For example, the input variables are temperature and humidity and we want to predict both temperature and humidity. Can we solve this with a single LSTM model?

    Thanks for your anticipated response.

    • Avatar
      Jason Brownlee October 7, 2017 at 5:50 am #

      Yes you can. Change the multivariate input model to output more than one value in the output layer.

  61. Avatar
    Brent October 7, 2017 at 5:55 am #

    Hi Jason,

    Thank you for taking the time to write such an excellent post and follow up with questions. The mechanics of the data conversion & training work great.

    However, my first reaction is that the LSTM doesn’t seem to have learned anything more than to copy the previous value. As BECKER states:

    > it looks like the predicted value is always 1 time period after the actual?

    These are the same results as in your Shampoo example: the predicted value appears to be equal to the previous value (possibly with some constant offset).

    Have you found a different network architecture that performs better than a DNN without LSTM layers?

  62. Avatar
    sathvik October 9, 2017 at 1:34 pm #

    Thank you so much Jason for the wonderful article, I learnt a lot. I wanted to show a comparison between multivariate statistical methods and neural networks, and I was looking for a post or article on multivariate time series modeling using ARIMA. I would be glad to hear of anything you know on the subject.

    Thank you

    • Avatar
      Jason Brownlee October 9, 2017 at 4:46 pm #

      You will need to look into using SARIMAX, sorry I do not have an example at this stage.

  63. Avatar
    Shan October 12, 2017 at 4:34 am #

    Hi Jason, is there any library available to perform feature extraction or dimensionality reduction for a sequential LSTM model?

    • Avatar
      Jason Brownlee October 12, 2017 at 5:37 am #

      Often an embedding layer is used to project observations at each time step prior to feeding them into the LSTM.

  64. Avatar
    Terry October 12, 2017 at 6:15 pm #

    How does multivariate LSTM compare to Multivariate ARIMAX? Are there use cases where one model outperforms the other?

    • Avatar
      Jason Brownlee October 13, 2017 at 5:45 am #

      I would recommend using a linear model first and only moving to a neural net if it delivers better results on your specific problem.

  65. Avatar
    Hesam October 13, 2017 at 4:27 am #


    There is a problem with scaling back when we use more than one shift in time. I mean something like this:

    reframed = series_to_supervised(scaled, 6, 1)

    I can train and test the model, but some errors appear in the scaling-back section which I couldn’t fix.

    Please have a look. I really appreciate it.

  66. Avatar
    Anil Maddala October 13, 2017 at 9:59 am #

    Hi Jason, thanks for the great series of articles. How should I modify the LSTM code to change it from prediction to classification?

    One sample of input data is 60 time steps over 2 features, and I want to classify the 60-step input sequence into 3 classes. To start with, is an LSTM the right approach?

    Hoping that you take requests, I would definitely love to see an article on multivariate classification in Keras using LSTM/GRU. It would be really helpful for analyzing sensor data. You could look at the Human Activity Recognition dataset.

    • Avatar
      Jason Brownlee October 13, 2017 at 2:55 pm #

      Change the loss function and the activation function of the output layer to categorical_crossentropy and softmax respectively.
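
      As a rough numpy illustration of what those two pieces compute for 3 classes (the labels and raw scores below are made up): the targets are one-hot encoded, softmax turns the raw outputs into class probabilities, and categorical cross-entropy scores them. In Keras this would correspond to an output layer such as Dense(3, activation='softmax') with loss='categorical_crossentropy'.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])        # one class id per 60-step input sequence
n_classes = 3
one_hot = np.eye(n_classes)[labels]    # shape (4, 3), target format for the loss

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Made-up raw network outputs, one row per sample.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.1, 0.2, 3.0],
                   [0.3, 1.5, 0.2],
                   [1.0, 1.0, 1.0]])
probs = softmax(logits)

# Categorical cross-entropy averaged over the samples.
loss = -np.mean(np.sum(one_hot * np.log(probs), axis=1))
print(probs.sum(axis=1), loss)
```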

  67. Avatar
    heeun October 13, 2017 at 6:31 pm #

    Hi Jason, thanks yor nice article.

    I have a question!

    That algorithm is many to one right?

    How can I solve many-to-many? For example, I want to predict pollution and rain.

    • Avatar
      Jason Brownlee October 14, 2017 at 5:42 am #

      It is many-to-one in terms of features.

      You can change it to be many-to-many by outputting multiple features.
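
      As a rough sketch of what outputting multiple features means for the data: the target array gets one column per predicted variable, and the output layer would need the same number of units. The array below is a made-up stand-in for the framed dataset, and the column positions of pollution and rain are assumptions.

```python
import numpy as np

# Hypothetical framed data: 8 lagged inputs var1(t-1)..var8(t-1),
# followed by the 8 current values var1(t)..var8(t).
n_features = 8
rng = np.random.default_rng(0)
values = rng.random((100, 2 * n_features))

# Assumed column order: index 0 is pollution and index 7 is rain.
target_cols = [n_features + 0, n_features + 7]

X = values[:, :n_features]            # inputs at t-1
y = values[:, target_cols]            # two outputs at t: pollution and rain

# Reshape input to 3D [samples, timesteps, features] as in the tutorial.
X = X.reshape((X.shape[0], 1, X.shape[1]))
n_outputs = y.shape[1]                # the output layer would need this many units
```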

  68. Avatar
    Pau October 14, 2017 at 1:13 pm #

    3 Things:
    1) Thanks so much for this. I’ve used this as a basis for some code I’m writing and it gave me a great head start.
    2) One thing that would be great to help with understanding the meanings of variables you’re using is to first put them into variables rather than using the integers. For example,

    x_size = 1
    train_X, train_y = train[:, :-x_size], train[:, -x_size:]
    test_X, test_y = test[:, :-x_size], test[:, -x_size:]

    This way, as people are reading the code they understand why it’s “-1” in case their adapted usage has different dimensions, they can change one variable and have it used everywhere it’s needed.

    3) For instance, I’m trying to make this code output multiple predictions and am having a bit of trouble figuring out all the variables I need to change.

    I have 368 columns of data, the first 168 are what will be predicted based on the other 200 points.

    x_size = 200
    # split into input and outputs
    train_X, train_y = train[:, :-x_size], train[:, -x_size:]
    test_X, test_y = test[:, :-x_size], test[:, -x_size:]

    # reshape input to be 3D [samples, timesteps, features]
    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
    print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

    I get the error:
    ValueError: Error when checking target: expected dense_1 to have shape (None, 1) but got array with shape (659, 200)

    Should the Dense(1) be Dense(x_size) where for me that is 200? (this is why it would be great to use variables so I know what that 1 means). When I try it as 168 (which is what it seems like it should be), I get an error.

    When I switch to x_size, it actually runs without errors, but I’m not sure if that means I’m correct or not.

    I’m so confused.


    • Avatar
      Jason Brownlee October 15, 2017 at 5:18 am #

      I have an example of multiple timestep outputs here that you could use as a starting point:

      • Avatar
        Paul October 16, 2017 at 4:35 pm #

        Rather than trying to predict many timestep outputs, I’m looking to output multiple predicted values per timestep.

        One thing I don’t understand is this section:

        # invert scaling for forecast
        inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
        inv_yhat = scaler.inverse_transform(inv_yhat)
        inv_yhat = inv_yhat[:,0]

        Why is it inserting the yhat values as the *first* column? The scaler has a different scale per column so positioning is important, and the Y data had been the last column in the row, hadn’t it? So won’t it get scaled incorrectly?

        • Avatar
          Jason Brownlee October 17, 2017 at 5:38 am #

          The first column is the pollution value, we remove it from the test data, concat our prediction so we have enough columns for the transform’s expectations, then invert the transform and get the predicted pollution values in the correct scale.

          Does that help?
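
          A small sketch of why the prediction goes in the first column, assuming the target was column 0 when the scaling was fit. The 0-1 scaling is written by hand in NumPy here as a stand-in for MinMaxScaler, and the numbers are made up.

```python
import numpy as np

# Hypothetical raw data: column 0 is the target (pollution), columns 1-2 are other inputs.
raw = np.array([[10.0, 1.0, 100.0],
                [20.0, 2.0, 200.0],
                [40.0, 4.0, 400.0]])

# Per-column min-max scaling to [0, 1], done by hand.
col_min = raw.min(axis=0)
col_max = raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

def inverse_transform(arr):
    # Undo the scaling; arr must have the same number of columns as raw.
    return arr * (col_max - col_min) + col_min

# Pretend the model predicted the scaled target for each row.
yhat = scaled[:, :1]                     # shape (3, 1)
other_cols = scaled[:, 1:]               # remaining columns, shape (3, 2)

# Put the prediction in column 0, where the target was when the scaling was fit.
padded = np.concatenate((yhat, other_cols), axis=1)
inv_yhat = inverse_transform(padded)[:, 0]
```

          If the prediction were placed in a different column, it would be rescaled with that column's minimum and maximum and come out in the wrong units.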

  69. Avatar
    Rui October 14, 2017 at 9:35 pm #

    First of all, thanks a lot for the great tutorial Jason.

    I just have one question regarding the achieved predictions using the LSTM network.

    I just don’t understand why you are calling “trainPredict = model.predict(trainX)”.

    I understand calling the predict method on the test set testX, but isn’t using it on trainX in some way cheating? I say this because we train the network using trainX and trainY, and trainY corresponds to the labels you are trying to predict when calling predict on trainX.

    Is it performed for validation purposes only?

    I’m still learning to work with the Keras API so I might be confused with the syntax of it

    Many thanks

  70. Avatar
    Kai Li October 17, 2017 at 1:05 pm #

    Thanks a lot for your tutorial!
    I still have a question and look forward to your answer.
    If I want to use feature(t), feature(t-1) and pollution(t-1) to predict pollution(t), how should I reshape my input?

  71. Avatar
    DC October 17, 2017 at 8:21 pm #

    Hi Jason, Thank you very much for the wonderful post. I have a few questions.

    1. You did not de-trend by using diff for the above example. Diff from multi step only works for series. Can you please share how we can de-trend a multivariate time series?

    2. I’d like to use past 3 days of above data to predict 3 time steps for multivariate data as above. Can you please let me know how I can do that with the example above?

    Thanks for your help.

  72. Avatar
    Xie October 19, 2017 at 12:30 am #

    Hi, Jason. First of all, many thanks for your post. And I have some problems.

    1. I don’t really get the meaning of hidden_units? Can you please explain a little bit.
    2. I am building an LSTM network as you do. I followed your approach to build the network but got an error, as described here. Can you please help me?


    • Avatar
      Jason Brownlee October 19, 2017 at 5:37 am #

      A hidden unit is a neuron or cell in a hidden layer.

      A hidden layer is a layer that is not the output or the input layer.

      Change your code to set “return_sequences” to be “False”.

  73. Avatar
    Argie October 19, 2017 at 3:16 am #

    So in your example you are using the data this way:

    No,year, month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
    1, 2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0

    Is it possible to use the data in a way that, let’s say, we could have multiple input numbers in one of the columns, for example having
    No, year, month, day, hour, pm2.5, newVariable
    and in the new variable position instead of having just one integer like 20
    to have a sequence of integers like (5,10,3,50,23)

    Would that be possible in the same context, or is there any scenario in which we could
    use the data the way I mentioned?

    • Avatar
      Jason Brownlee October 19, 2017 at 5:40 am #

      If you mean, can you predict a sequence output, then yes. Here is an example:

      • Avatar
        Argie October 19, 2017 at 7:31 am #

        I might have not been clear enough, and sorry for that.

        What I mean is that as an input I will have 4 different categories of data lets call them A, B, C, and D, that each one of them will have more than one integer, to be exact they will have 10 integers
        so for example:

        A = {3,4,6,8,34,65,43,1,54} and so on with the other three categories.

        The sequences of numbers within the four categories belong to different time stamps, for example 3 -> t0, 4 -> t1 and so on.

        So what I need is to classify them for different data samples.

        • Avatar
          Jason Brownlee October 19, 2017 at 3:55 pm #

          These would be parallel series (columns) that could be all fed to one LSTM model like the example in the above tutorial.

          The model will process the parallel series one time step at a time.

          If the series extends beyond 200-400 time steps, then they could be split into multiple samples (e.g. multiple sub-parallel series).

          Does that help?
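
          A minimal sketch of splitting long parallel series into multiple shorter samples, assuming non-overlapping chunks and dropping any incomplete tail. The helper name, the 1,000-step length and the chunk size of 200 are illustrative assumptions.

```python
import numpy as np

def split_into_samples(series, n_steps):
    # series: 2D array [time steps, features].
    # Drop the incomplete tail so the length is a multiple of n_steps.
    n_samples = series.shape[0] // n_steps
    trimmed = series[:n_samples * n_steps]
    # Result is 3D [samples, timesteps, features], with no overlap between samples.
    return trimmed.reshape((n_samples, n_steps, series.shape[1]))

# Hypothetical data: 1,000 time steps of 4 parallel series (A, B, C, D).
series = np.arange(4000, dtype=float).reshape((1000, 4))
samples = split_into_samples(series, n_steps=200)
```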

          • Avatar
            Argie October 20, 2017 at 11:31 am #

            So so helpful, I tried it and worked like a charm.

            Great job, and so helpful all the material you provide, and the way you do it !!

            Thanks a lot Jason !!

          • Avatar
            Jason Brownlee October 21, 2017 at 5:23 am #

            I’m glad to hear that, well done!

  74. Avatar
    Tim October 19, 2017 at 4:59 am #

    Really appreciate all the work you have done!

  75. Avatar
    Abhinav October 19, 2017 at 6:36 am #

    Hi Dr Brownlee. Thank you for this tutorial.

    inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)

    inv_yhat = scaler.inverse_transform(inv_yhat)

    What do these steps do?

    Because I am getting a ValueError: operands could not be broadcast together with shapes (1822,11) (6,) (1822,11) on this step.
    I am applying on my own dataset

    • Avatar
      Jason Brownlee October 19, 2017 at 3:52 pm #

      These steps add the prediction to the test input data so that we can inverse the transform and get the prediction back into the scale we care about.
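
      A broadcast error at this step usually means the array passed to inverse_transform has a different number of columns than the data the scaler was fit on. One way to sidestep the column-count requirement, sketched below under the assumption of 0-1 min-max scaling, is to invert only the target column using that column's own minimum and maximum. The helper name and the data are hypothetical.

```python
import numpy as np

def invert_single_column(scaled_col, col_min, col_max):
    # Inverse of 0-1 min-max scaling for one column only.
    return scaled_col * (col_max - col_min) + col_min

# Hypothetical training data with 6 columns; column 0 is the target.
rng = np.random.default_rng(1)
train = rng.uniform(0.0, 50.0, size=(20, 6))
data_min = train.min(axis=0)
data_max = train.max(axis=0)

# Scaled predictions for the target column, shape (n,).
yhat_scaled = np.array([0.0, 0.5, 1.0])
inv_yhat = invert_single_column(yhat_scaled, data_min[0], data_max[0])
```

      With scikit-learn, a fitted MinMaxScaler stores the per-column values as data_min_ and data_max_, so the same arithmetic could be applied with the default feature_range of (0, 1).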

    • Avatar
      Neha Aggarwal December 21, 2018 at 12:12 pm #

      Hi Abhinav,

      I am facing a similar problem. What did you do to rectify it?


  76. Avatar
    TvT October 19, 2017 at 8:08 pm #

    Hi Jason,

    Thanks for sharing your awesome work, I’ve been learning a lot from you!

    I have been struggling with increasing the second dimension to fully benefit from the BPTT though. I keep getting lost in the shapes. Would you mind sharing your code for multiple time steps as well?
    That would be awesome!

    Keep up the good work!

  77. Avatar
    Dirk October 20, 2017 at 7:42 pm #

    Awesome work, thanks for sharing it!

    Could it be possible that you switched up the chronological order of your predictions?
    It looks to me that you predict the pollution of the previous hour, instead of predicting the future.

    • Avatar
      Jason Brownlee October 21, 2017 at 5:33 am #

      That is what a persistence model looks like exactly.
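
      For reference, a persistence forecast simply repeats the previous observation, which is why its plot looks like the actual series shifted by one step. A minimal sketch with made-up hourly readings:

```python
def persistence_forecast(series):
    # Predict each value as the previous observation: yhat(t) = y(t-1).
    return series[:-1]

# Hypothetical hourly pollution readings.
observations = [30.0, 32.0, 35.0, 33.0, 31.0]
actual = observations[1:]                      # values at hours 1..4
yhat = persistence_forecast(observations)      # forecasts for hours 1..4

# Mean absolute error of the baseline.
mae = sum(abs(a - p) for a, p in zip(actual, yhat)) / len(actual)
```

      A fitted model whose predictions resemble this shifted copy is worth comparing against the baseline error to check whether it adds any skill.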

  78. Avatar
    Craig October 21, 2017 at 3:22 am #

    Hi Jason, I’m new to Deep Learning, so sorry if this is a fundamental question. I am trying to use an LSTM NN to create a super fast surrogate for a coastal circulation model (something sort of similar to this, but with time dependency:

    My training set looks something like this:

    -samples: 2000 – (I modeled a year with hourly output)
    -timesteps: 7 – (t-6, t-5, …, t)
    -features: 4 – (offshore boundary tide, 1st derivative of offshore boundary tide, boundary river discharge for river-1, and boundary river discharge for river-2)

    Currently, my target is velocity magnitude for one node in my model domain ([2000,1]).

    My question is: When you do this tutorial, you assign the time steps as additional features (i.e. for my problem, our train_X = [2000,1,28]). I did this and it works fine, but eventually I’d like to scale this, and I thought I’d try to reshape my data to its intended shape for the model (i.e. [2000,7,4]). However, when I do this, my training time goes way up (it’s probably 3-4x slower).

    Does the model treat these two shapes differently? If not, why does it take so much longer to train with the latter shape?

  79. Avatar
    Amir Aaron October 22, 2017 at 5:58 pm #

    Hi Jason,
    Great article.
    I have a small question:
    In previous article you pointed out that we need to make the data stationary,
    Do we need to do it for multivariate data as well?

  80. Avatar
    Andriy October 24, 2017 at 12:39 pm #

    Nice article! I think one question remains unanswered. Why use RNNs if we only use one previous step to predict the next step? Why not SVM for example?

    • Avatar
      Jason Brownlee October 24, 2017 at 4:00 pm #

      No reason at all, we cannot know what will work best for a given problem.

      Try it and compare the results!

  81. Avatar
    Ali Abdul October 25, 2017 at 7:39 pm #

    Hi Jason,

    Thanks for this very informative post! Before applying to my financial dataset, I would like to consult you about my case. The type of my data is almost the same. I have financial risk factors like equity values, interest rates, foreign exchanges etc. values on daily basis and their corresponding dependent variable which is profit or loss of a portfolio. My goal is to detect the patterns and features (if any) responsible for the highest profits or lowest losses. So my question is can I convert your code above to a classification problem if I label my classes as 0 for the lowest losses and 1 for the highest profits?

    Thanks in advance!

    • Avatar
      Jason Brownlee October 26, 2017 at 5:25 am #


      • Avatar
        Ali Abdul October 27, 2017 at 1:28 am #

        Great! One more small thing. When dealing with tails (let’s say 0 for lower, 1 for other than tail, 2 for upper tail), the classes and the features of course will be highly imbalanced. What would your approach be?

        • Avatar
          Jason Brownlee October 27, 2017 at 5:23 am #

          You might need to adjust the distribution via rescaling to make the least represented classes better represented.

  82. Avatar
    Mehmet Abd October 26, 2017 at 8:28 pm #

    Hi Jason,

    Thanks for this very informative post! Before applying to my financial dataset, I would like to consult you about my case. The type of my data is almost the same. I have financial risk factors like equity values, interest rates, foreign exchanges etc. values on daily basis and their corresponding dependent variable which is profit or loss of a portfolio. My goal is to detect the patterns and features (if any) responsible for the highest profits or lowest losses. So my question is can I convert your code above to a classification problem if I label my classes as 0 for the lowest losses and 1 for the highest profits?

    Thanks in advance!

  83. Avatar
    Hesam October 29, 2017 at 8:22 pm #


    What we should do if the time itself would be a value that we must predict, such as predicting time and date for the next rainfall?

    • Avatar
      Jason Brownlee October 30, 2017 at 5:37 am #

      You could predict the likelihood of rainfall for each hour and then use code (an if statement) to interpret those predictions and only output the predictions with a probability above a given threshold.
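
      A minimal sketch of that interpretation step, assuming the model outputs one rain probability per future hour. The function name, the probabilities and the 0.7 threshold are made up for illustration.

```python
def next_rainfall_hour(hourly_probabilities, threshold=0.5):
    # Return the index of the first hour whose predicted probability of rain
    # meets the threshold, or None if no hour does.
    for hour, prob in enumerate(hourly_probabilities):
        if prob >= threshold:
            return hour
    return None

# Hypothetical model outputs: probability of rain for each of the next 6 hours.
probs = [0.05, 0.10, 0.35, 0.72, 0.90, 0.40]
first_hour = next_rainfall_hour(probs, threshold=0.7)
```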

  84. Avatar
    Thabet October 30, 2017 at 3:33 am #

    Hello Jason,

    Could you perhaps show me exactly where to change as to predict the temperature instead of pollution?

    • Avatar
      Jason Brownlee October 30, 2017 at 5:42 am #

      You can change the column used as the output variable when fitting the model.

      Around line 52 in the full example where we drop columns we don’t care about. Change it to drop the pollution as well and not drop temperature.
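
      A sketch of that change expressed as column selection, assuming the loaded columns are ordered pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain and a single lag time step. The random array stands in for the framed dataset.

```python
import numpy as np

n_features = 8
# Assumed column order after loading: pollution, dew, temp, press,
# wnd_dir, wnd_spd, snow, rain (so temperature is feature index 2).
target_index = 2

# Hypothetical framed array: 8 columns for t-1 followed by 8 columns for t.
rng = np.random.default_rng(0)
framed = rng.random((50, 2 * n_features))

# Keep all t-1 inputs plus only the chosen variable at time t.
keep = list(range(n_features)) + [n_features + target_index]
reduced = framed[:, keep]

X, y = reduced[:, :-1], reduced[:, -1]
```

      Under the same assumption, the multiple-lag version of the split would select train[:, -n_features + 2] as the output, and the inverse-scaling step would need the prediction placed in column index 2 rather than column 0.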

      • Avatar
        Thabet October 31, 2017 at 10:14 am #

        Can you please help me further, as I can’t manage to find where to change the code to predict the temperature instead of pollution?

        “” Next, we need to be more careful in specifying the column for input and output.
        We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

        # split into input and outputs
        n_obs = n_hours * n_features
        train_X, train_y = train[:, :n_obs], train[:, -n_features]
        test_X, test_y = test[:, :n_obs], test[:, -n_features]
        print(train_X.shape, len(train_X), train_y.shape)

        Where and how should I change this to choose the temperature column?

  85. Avatar
    Allen November 1, 2017 at 7:03 pm #

    Hi Jason,

    Thanks for sharing your awesome work, I’ve been learning a lot from you!

    I have a small question:

    In previous article you pointed out that “Predict the pollution for the next hour as above and
    given the “expected” weather conditions for the next hour.” , eg “pollution,dew,temp”.

    What would your approach be?

    • Avatar
      Jason Brownlee November 2, 2017 at 5:11 am #

      For the case: “Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.”

      You would not need to transform the dataset, you would simply pretend that the actual weather conditions for the next hour are a forecast and predict the pollution value at that time.
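
      One possible way to arrange the arrays for this framing, sketched with made-up numbers and only three columns: the input row pairs the previous hour's pollution with the current hour's weather (standing in for a forecast), and the output is the current hour's pollution.

```python
import numpy as np

# Hypothetical hourly data, columns: pollution, dew, temp.
data = np.array([[30.0, -5.0, 1.0],
                 [32.0, -4.0, 2.0],
                 [35.0, -4.0, 3.0],
                 [33.0, -3.0, 3.0]])

prev_pollution = data[:-1, :1]     # pollution(t-1)
weather_now = data[1:, 1:]         # dew(t), temp(t), standing in for a forecast
X = np.concatenate((prev_pollution, weather_now), axis=1)
y = data[1:, 0]                    # pollution(t)
```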

  86. Avatar
    Ali November 2, 2017 at 3:42 am #

    First, thanks for the post, I learned a lot. I have a fundamental question about LSTM. Let’s say I have 3 variables X, Y, and Z. I want to predict Z.

    if I make the input(train_X in example above) time lagged. So I pass it x(t), x(t-1), x(t-2), x(t-3) etc…. then will the time component of LSTM matter or not? For example we have:

    t, x, y, x-1, x-2, y-1, y-2, z-1, z-2, z
    1, 1, 2, 0, 0, 0, 0, 0, 0, 3
    2, 2, 4, 1, 0, 2, 0, 3, 0, 3
    3, 3, 6, 2, 1, 4, 2, 3, 3, 6
    4, 4, 8, 3, 2, 6, 4, 6, 3, 6
    5, 5, 10, 4, 3, 8, 6, 6, 6, 9

    Traditionally we would train on variables (x, y, x-1, x-2, y-1, y-2, z-1, z-2) on the first 4 time steps then evaluate on the 5th.

    my question is if I train it on time step,(1, 2, 4, 5) and evaluate on step 5, will I have the same result? mainly if I add the time-lag as an input can I reshuffle the data?

  87. Avatar
    Ali November 2, 2017 at 4:40 am #

    Hi Jason,

    if we pass in previous time lag can we shuffle the data around in the model? in other words make the input timeless?

    • Avatar
      Ali November 2, 2017 at 4:41 am #

      sorry when I refreshed my question didn’t appear, I thought it did not go through….did not mean to impatiently spam. apologies.

      • Avatar
        Jason Brownlee November 2, 2017 at 5:14 am #

        No problem, I moderate comments so there is some delay before they appear.

  88. Avatar
    Gus C November 3, 2017 at 3:41 am #

    Thanks for this great post.
    So how do you assess graphically your forecast with the actual?

  89. Avatar
    Num November 3, 2017 at 4:44 am #

    Hello, I have a problem that’s highly related to this guide.

    I have a time series where the predicted variable is (allegedly) in part dependent on some features from that time step, and these features are known before it (they are “planned prices” and “expected value” for different features). I would like to include them as input into the LSTM.
    For one output, this turned out to be easy (just keep them in), but if I try to predict several outputs, I am having trouble formatting the input correctly.

    For better understanding, the desired input would be features x1 through x8 for t-1,t-2…etc and then x1 through x7 for t,t+1,t+2…etc.

    Is this even possible with the example given here?

  90. Avatar
    Geoffrey Anderson November 3, 2017 at 4:58 am #

    PM2.5 is just one time series to predict, clearly. Predicting say 3 (or even 100,000) time series would be nice to look at too. A real-life example where it’s useful is inventory management in retailing businesses. How many units will be sold in the next day of eggs, mascara, paper plates, frozen corn, 2% milk, skim milk, etc. Many of these TS will be correlated. Might need multi-tasking neural network outputs. LSTM would offer more automatic feature engineering than, say, using a boosted tree traditional machine learning algorithm which is natively unaware of time series. The latter needs manual feature creation of time-windowed aggregates by the data scientist. The LSTM, by contrast, just inputs the raw time series values directly, finding its own features. A bonus when using the LSTM is there may be some time-window or other features the human didn’t know about in advance. Another bonus is multiple-output (multitasking) that neural networks can naturally provide, unlike boosted trees for example. I’d suggest starting with only 2 or 3 TS at first, because a whole grocery store’s worth of items for even just a one day example is way too cumbersome to look at and manipulate easily on one small monitor screen. Just a warning: This may be frontier research, believe it or not.

    • Avatar
      Jason Brownlee November 3, 2017 at 5:23 am #

      Thanks for the suggestion Geoffrey. I hope to spend more time on this soon.

  91. Avatar
    Lu November 6, 2017 at 8:35 pm #

    I plot inv_yhat and inv_y in a same figure, and I found an interesting fact, that the training result is shifted to right for an hour compared with the ground truth. That’s to say the predicted result is almost the one hour ago data, or X_t = X_{t-1} approximately.
    Actually, the best estimation for RNN is to output the latest result, without doing any prediction. What do you think about this?

  92. Avatar
    Rafael November 7, 2017 at 6:32 am #

    I’m using my own dataset and I’m not using the series_to_supervised method because I already have the dataset prepared in 2 files, train and test files. I still have the error:

    Traceback (most recent call last):
    File “”, line 64, in
    inv_yhat = scaler.inverse_transform(inv_yhat)
    File “C:\Users\rafae\AppData\Local\Programs\Python\Python35\lib\site-packages\sklearn\preprocessing\”, line 385, in inverse_transform
    X -= self.min_
    ValueError: operands could not be broadcast together with shapes (52,12585) (12586,) (52,12585)

    • Avatar
      Rafael November 7, 2017 at 6:34 am #

      To load the datasets

      #Train dataset
      dataset = read_csv('trainning_small.csv', header=None, index_col=None)
      dataset.drop(dataset.columns[[0]], axis=1, inplace=True)
      train = dataset.values

      encoder = LabelEncoder()
      train[:,-1] = encoder.fit_transform(train[:,-1])
      train = train.astype('float32')

      scaler = MinMaxScaler(feature_range=(0, 1))
      train = scaler.fit_transform(train)

      #Test dataset
      dataset_test = read_csv('test_passare.csv', header=None, index_col=None)
      dataset_test.drop(dataset_test.columns[[0]], axis=1, inplace=True)
      test = dataset_test.values

      encoder = LabelEncoder()
      test[:,-1] = encoder.fit_transform(test[:,-1])
      test = test.astype('float32')

      test = scaler.fit_transform(test)

      train_x, train_y = train[:, :-1], train[:, -1]
      test_x, test_y = test[:, :-1], test[:, -1]

      train_x = train_x.reshape((train_x.shape[0], 1, train_x.shape[1]))
      test_x = test_x.reshape((test_x.shape[0], 1, test_x.shape[1]))
      print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)

      (838, 1, 12585) (838,) (52, 1, 12585) (52,)

  93. Avatar
    Fred November 7, 2017 at 4:30 pm #

    Dr. Brownlee,

    First of all, thanks for this wonderful post. I have applied your code with the following parameters:
    lags=8, features=8, epochs=50, batch=104, neurons=150

    And got almost perfect match between train and test. The test RMSE is 26.526.

    My question is: what does this result stand for?

    • Avatar
      Jason Brownlee November 8, 2017 at 9:18 am #

      Well done. The result is a summary of the error between predicted and expected values.
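
      For reference, the RMSE reported by the tutorial is the square root of the mean squared difference between expected and predicted values, so it is in the same units as the variable being forecast. A minimal sketch with made-up numbers:

```python
import math

def rmse(actual, predicted):
    # Root mean squared error, in the same units as the data.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical pollution values in their original units.
actual = [30.0, 40.0, 50.0]
predicted = [33.0, 36.0, 50.0]
score = rmse(actual, predicted)
```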

  94. Avatar
    Vlad November 12, 2017 at 5:37 am #

    I launched this example on my notebook (AMD FX-8800P Radeon R7, 8GB RAM). It has already been running for 4 hours and I can’t even see what is going on with the model training or how long it will run. Is it possible to include in the example some monitoring and visualization of the training process, e.g. using callbacks.RemoteMonitor?

    P.S. Previously I worked with Matlab, and it was so nice to see the number of epochs, accuracy, error, and many other parameters during the training process. It helped a lot to understand whether I should continue training or change the model.

    • Avatar
      Jason Brownlee November 12, 2017 at 9:08 am #

      You should see the progress for each epoch and across epochs as output on the command line.

  95. Avatar
    Vlad November 12, 2017 at 7:56 am #

    Hm, relaunched the example step-by-step and found out it’s stuck not at training, but at model compilation. Working for hours at 100% CPU load on block:
    # design network
    model = Sequential()
    model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
    model.compile(loss='mae', optimizer='adam')
    What’s wrong?
    Ubuntu 16.4, Keras 2.0.6, Theano 0.9.0, Python 3.6.2, Anaconda custom

    • Avatar
      Jason Brownlee November 12, 2017 at 9:09 am #

      Are you running on the command line? If you run in a notebook, you may hide error or verbose messages.

  96. Avatar
    Vlad November 12, 2017 at 9:57 am #

    I updated all libraries, Anaconda and Python, and now it works! Sorry for the disturbance 🙂 BTW, a monitoring tool that can be used with callbacks.RemoteMonitor is hualos-master.

  97. Avatar
    Tommy November 13, 2017 at 5:20 am #

    Thanks for the very well written article. I really appreciate the detailed walkthrough.

    I have been looking for a way to apply multivariate input to a machine learning prediction model of any sort. I’m doing this in order to predict the growth of compute systems in excess of hundreds of thousands of nodes based on 6 years of daily samples. Simply looking at the Y growth over time and feeding that into something like Facebook Prophet has proved somewhat insufficient because it only looks at the problem as a function of past behavior.

    In reality there are more variables at play that control or affect that line of growth. As such, simple univariate approaches fall short and the predictions can be very good or very bad.

    When I found this article I thought to myself, Eureka! I will be able to use this approach in order to feed in multivariate data along with the growth of my systems in order to get better predictions. However I was somewhat crestfallen at the revelation of 2 key problems discussed over the last several months here in the comments…

    One problem you acknowledged as a potential/known issue and linked to another article explaining why autoregression time series problems may not be best solved with lstm neural networks. The article posits that better results might be obtained by stacking or using more layers. Have you tried this? If so, what did it look like and what results did you get?

    The second and more concerning problem was when one commenter performed the same exercise as laid out in this article, but removed all of the multivariate data and still obtained the same rmse rate as you did. It was as if none of the other variables had any bearing on the prediction. This is deeply concerning, because as I see it, either this event was anomalous and driven by the input data, or the overall approach itself may be flawed, or the implementation thereof is broken. I’m not sufficiently versed in the technology to make a value statement on any of those points.

    I’m hoping that you would be willing to share your thoughts on possible answers to these questions.

    • Avatar
      Jason Brownlee November 13, 2017 at 10:22 am #

      The tutorial is a demonstration of a method, not the best way of solving or even framing the presented problem.

      I should have made that clearer, but that is the philosophy behind every single blog post on my site. I show how to use the methods, not how to get the best results (for a specific problem). The former problem is tractable; the latter is not.

      • Avatar
        Tommy November 13, 2017 at 12:14 pm #

        Thanks for the clarity and candor! As a long-time comp-sci person, I find it very strange to run these tensorflow sessions and get different results for the same inputs (I’ve been putting your code through the paces) … I found I needed to add this, or every subsequent run would result in predictions that seemed to augment each previous run:


        For what it’s worth, I zeroed out all the other variables (instead of eliminating them) and it /did/ have bearing on the output. I don’t think this methodology can be dismissed as ineffective. It seems to be approximating a workable solution. More exploration is necessary.

        Thank you for setting me on the path!

        • Avatar
          Jason Brownlee November 14, 2017 at 10:06 am #


          Well, these are stochastic algorithms in general, but a single trained model should be deterministic and when it’s not, we’re in trouble.

          • Avatar
            Tommy November 14, 2017 at 11:48 am #

            Have you tried running multiple iterations and examining yhat_inv?

            I keep getting different output, and I didn’t expect that. Am I looking in the wrong place?

            I can send a catalog of my results if that helps…

          • Avatar
            Jason Brownlee November 15, 2017 at 9:45 am #

            I have not.

            In general, we do expect different results across different runs given the stochastic nature of neural networks (forgive me if I am missing the point):

  98. Avatar
    sam November 15, 2017 at 10:23 pm #

    Hi Jason,

    Is multivariate time series forecasting possible for multi-step?

    • Avatar
      Jason Brownlee November 16, 2017 at 10:30 am #


      • Avatar
        sam November 16, 2017 at 6:23 pm #


        Jason, can you please explain how to prepare a dataset for training models? Let’s suppose I have 5 features and I want to predict the t + 5 value.

        For example..

        x1 = (2,3,4,3,1,6,8,9,4,1)
        x2 = (5,2,5,7,9,9,6,3,1,3)
        x3 = (2,3,4,8,1,6,8,9,1,1)
        x4 = (5,1,5,7,9,9,6,3,1,7)
        x5 = (2,3,4,6,8,3,1,3,5,7)
        y = (8,7,6,5,4,3,2,8,9,7)


  99. Avatar
    Tommy November 18, 2017 at 3:54 pm #

    What do you think about putting a dropout layer between the LSTM and Dense layers to address the overfitting phenomenon?

    • Avatar
      Jason Brownlee November 19, 2017 at 11:08 am #

      Try it and see, I’d love to hear how it goes.

  100. Avatar
    Abdulrauf Garba November 19, 2017 at 10:36 pm #

    Hi, Jason, we need a similar tutorial of Multivariate time series using the Recurrent neural network in R.

  101. Avatar
    Louis November 22, 2017 at 1:51 am #

    Hello Jason!

    You say in your post:

    “We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.”

    Is it possible to do the same without prior knowledge of the pollution levels?

    I am working on a very similar time series forecasting problem. However, in my case, I don’t have access to intermediate level of pollution.

    Thank you

    • Avatar
      Jason Brownlee November 22, 2017 at 11:13 am #

      Yes, but it is important to spend time exploring different framings of the problem.

  102. Avatar
    Shantanu November 22, 2017 at 5:50 am #


    I have a question about splitting the data.
    I have the data month wise for around 20 years.
    How should I split it?

  103. Avatar
    michael November 22, 2017 at 9:21 am #

    Hi Jason,

    Thank you for this excellent tutorial!

    This may or may not be a slight variation of your “Train On Multiple Lag Timesteps Example”, but I was wondering how I should modify your example to do a multivariate one to multiple time step prediction i.e. look at one time step of 8 dimensional data and predict 10 time steps of 8 dimensional data. Or a multivariate seq2seq prediction i.e. show 10 time steps of 8 dimensional data and predict 10 time steps of 8 dimensional data.


  104. Avatar
    Sammy November 23, 2017 at 1:20 pm #

    Hi Jason,
    First of all, thank you very much for this excellent post. I would be grateful if you could show how to do multivariate time series forecasting per group. In other words, let’s say we have data for many cities and we would like to forecast per city. How can we feed the data to the LSTM for a given city and get inv_y and inv_yhat to compare and see how the model does?
    Thanks again,

    • Avatar
      Jason Brownlee November 24, 2017 at 9:31 am #

      You could model each city separately or combine all cities into a single dataset, or do both and ensemble the result.
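
      As a starting point for the per-city option, the sketch below groups records into one time-ordered series per city so that each can be framed, scaled and modeled on its own. The record layout and city names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (city, hour, pollution).
records = [
    ("Beijing", 0, 30.0), ("Shanghai", 0, 20.0),
    ("Beijing", 1, 32.0), ("Shanghai", 1, 22.0),
    ("Beijing", 2, 35.0), ("Shanghai", 2, 21.0),
]

# Group into one time-ordered series per city.
series_by_city = defaultdict(list)
for city, hour, value in sorted(records, key=lambda r: (r[0], r[1])):
    series_by_city[city].append(value)
```

      Keeping a separate scaler per city would then allow inv_y and inv_yhat to be computed and compared city by city.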

  105. Avatar
    Nagabhushan S Baddi November 23, 2017 at 7:50 pm #

    Hi Jason.
    I have a dataset of 169307 rows and 41 features. I want to use a timestep of 5. So, when I use X=np.reshape(X, (169307, 5, 41)), I get an error: “cannot reshape array of size 6941587 into shape (169307,5,41)”. Does this mean that n_samples*n_features in the original dataset should be divisible by n_timesteps? If this is true, then how can I use a timestep of my choice?
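
On the reshape error above: np.reshape only rearranges existing elements, so it cannot turn 169307 x 41 = 6,941,587 values into the 169307 x 5 x 41 = 34,707,935 values that the requested shape needs. Making the row count divisible by 5 would only allow non-overlapping blocks of 5 rows. One common alternative is to build overlapping windows of consecutive rows, which gives n_rows - n_timesteps + 1 samples. The following is only a sketch of that idea on a small stand-in array; the data and the helper name make_windows are made up for illustration, and sliding_window_view needs NumPy 1.20 or newer.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_windows(data, n_timesteps):
    """Return overlapping windows shaped [samples, timesteps, features]."""
    # sliding_window_view appends the window axis last: [samples, features, timesteps]
    windows = sliding_window_view(data, window_shape=n_timesteps, axis=0)
    # move the timestep axis to the middle, the layout an LSTM input uses
    return windows.transpose(0, 2, 1)


# small stand-in for the 169307 x 41 dataset (assumption: 20 rows, 3 features)
data = np.arange(20 * 3, dtype=float).reshape(20, 3)
X = make_windows(data, n_timesteps=5)
print(X.shape)  # 20 - 5 + 1 = 16 samples of 5 timesteps and 3 features
```

Each sample then holds 5 consecutive rows, and any timestep count up to the number of rows can be used.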

  106. Avatar
    Chris November 25, 2017 at 11:27 pm #

    Hi Jason,
    I’m a little confused about the range of scaling.

    In many other posts you mentioned the following:
    “Transform the observations to have a specific scale. Specifically, to rescale the data to values between -1 and 1 to meet the default hyperbolic tangent activation function of the LSTM model.”

    Is there a reason for the use of 0 to 1 ?
    Isn’t -1 to 1 better for scaling, since the activation function is tanh?

    Thank you,

    • Avatar
      Jason Brownlee November 26, 2017 at 7:32 am #

      Great question, a scale of 0-1 results in better skill in my experience.

  107. Avatar
    Somayeh November 28, 2017 at 1:44 am #

    Hi Jason,

    Thank you so much for the wonderful tutorial! That was so helpful for me.
    Reading your post answered my questions about how to predict a multi-input, multi-output system over multiple time steps, thanks to your great illustration.

    But I have a question: in my problem, we have many observations for some cases at each time step (about 500), so we have multiple input and output series at each time step.

    Could you please help me with how to solve this issue?

    Any help would be useful. I would very much appreciate it.

    Thank you,


    • Avatar
      Jason Brownlee November 28, 2017 at 8:39 am #

      I would recommend exploring many different framings of the problem to see what works best and consider a baseline MLP model.

    • Avatar
      Max July 20, 2018 at 12:55 am #

      May I ask how you solved your problem of multiple outputs? I am having trouble implementing it.

  108. Avatar
    Michael November 29, 2017 at 6:35 am #

    I see this question has been raised before, I’m sorry for beating a dead horse. I’ve been struggling with the inverse_transform step.
    I tried to implement this algorithm using my own dataset and had trouble with it. Then I tried to run the example with the example dataset as in the tutorial and also had an error on the inverse_transform step.

    inv_yhat = scaler.inverse_transform(inv_yhat)

    (on my data)
    ValueError: operands could not be broadcast together with shapes (15357,287) (8,) (15357,287)

    on the tutorial data set:
    ValueError: operands could not be broadcast together with shapes (35037,24) (8,) (35037,24)

    P.S. Your blog is great. Keep up the good work!

    • Avatar
      Jason Brownlee November 29, 2017 at 8:30 am #

      Generally, you must make sure that the data has the same shape and that columns have the same index when transforming and inverse transforming.

      Confirm this before performing each operation.

      Does that help? Let me know how you go.

    • Avatar
      Cynthia June 20, 2020 at 1:16 pm #

      This error is because he applied scaler.fit_transform on the dataframe that only had 8 columns (the original dataframe), but then he applied scaler.inverse_transform on the test_X dataframe, which had 16 columns; hence the mismatch. I don’t know how he was able to upload the full code without reproducing this error.
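
The column mismatch discussed in this thread can be illustrated with a rough sketch. It uses plain NumPy as a stand-in for a min-max scaler fitted on 8 columns (the helper names scale and inverse_scale and the random data are assumptions for illustration, not the tutorial's code). The point is that the array passed to the inverse step must have the same 8 columns as the data the scaler was fitted on, so the single prediction column is padded with 7 other scaled columns before inverting, and only column 0 is kept afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 8

# Stand-in for a scaler fitted on the original 8 columns: per-column min and max.
raw = rng.uniform(0.0, 100.0, size=(50, n_features))
col_min, col_max = raw.min(axis=0), raw.max(axis=0)


def scale(x):
    # min-max scaling to the range 0-1, column by column
    return (x - col_min) / (col_max - col_min)


def inverse_scale(x_scaled):
    # only works for arrays with exactly n_features columns
    return x_scaled * (col_max - col_min) + col_min


scaled = scale(raw)
yhat = scaled[:, :1]   # pretend predictions for column 0, shape (50, 1)
test_X = scaled        # scaled inputs, shape (50, 8)

# Pad the prediction with the 7 remaining scaled columns to get 8 columns again.
padded = np.concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = inverse_scale(padded)[:, 0]

# A wider array (e.g. 16 lag columns) fails with the broadcast error quoted above.
try:
    inverse_scale(np.zeros((50, 16)))
except ValueError as err:
    print("mismatch:", err)
```

With more than one lag timestep, the same idea applies: select exactly as many columns as the scaler was fitted on before inverting.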

  109. Avatar
    Abdur Rehman Nadeem November 29, 2017 at 8:21 am #

    Hi Jason,

    Thanks for the great tutorial. I have a few questions. How should I choose the number of timesteps, given that you always use 1 timestep? Where can I see the predicted values, since the graph only shows the training of the model? And how can I predict values for different horizons (e.g. the next 1, 2, or 4 hours)?

  110. Avatar
    Ahmed Ali Mbarak November 29, 2017 at 4:07 pm #

    Hello Mr Jason Brownlee, Your tutorial is awesome, it helped me in my project. I have been really interested in machine learning and this place has given me a lot.

    My next step is to find a way to feed new data to my code and predict a future value. For example, when predicting air pollution, a user would enter today’s data, such as NO2 and wind speed, and the code would output tomorrow’s air pollution. In other words, how do I apply the code in practice?

    Thank you.

  111. Avatar
    Abdur Rehman Nadeem November 29, 2017 at 8:25 pm #

    Hi Jason,

    In the series_to_supervised() function, when we change the value of “n_in” (e.g. to 2 in this example), does it mean we are now predicting the next two hours, because the dataframe will now have 16 columns instead of 8? Please also explain how the value of “n_out” affects the result.

    Best Regards,

  112. Avatar
    Abdur Rehman Nadeem November 30, 2017 at 12:21 am #

    Hi Jason,

    I took the “yhat” array as my predicted values and the “test_X” array as the actual values, because we predicted on test_X, and drew a plot using matplotlib. Did I do this right?

  113. Avatar
    Sammy November 30, 2017 at 7:15 am #

    Hi Jason,
    I wanted to set n_in (the number of lag observations as input (X)) to 3, using my own data, as can be seen below:
    # frame as supervised learning
    reframed = series_to_supervised(scaled, 3, 1)

    I make the data samples, and at the line
    inv_yhat = scaler.inverse_transform(inv_yhat)
    I get the following error:
    File “/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/”, line 385, in inverse_transform
    X -= self.min_
    ValueError: operands could not be broadcast together with shapes (67112,57) (19,) (67112,57)
    I initially have 19 variables and have set the number of lag observations to 3, so test_X has the following shape:
    >>> test_X.shape
    (67112, 1, 57)
    yhat = model.predict(test_X) and
    >>> yhat.shape
    (67112, 1)

    I don’t understand the error above. I would be grateful if you can help me see what I am doing wrong.
    Again, thanks a lot. You are awesome !

    • Avatar
      Jason Brownlee November 30, 2017 at 8:40 am #

      Hi Sammy, did you try the section “Update: Train On Multiple Lag Timesteps Example”?

      • Avatar
        Sammy November 30, 2017 at 9:00 am #

        No, I didn’t see the update before. I will try it now. Thanks a lot.

  114. Avatar
    Miha December 1, 2017 at 2:37 am #

    Hi Jason,

    First of all, many thanks for this great tutorial!

    I’m trying to apply this to my own problem. However, I’m facing some problems.
    Let’s say we have the time series of multivariate data structured like this:

    x1,x2,x3,…x30, y1
    x1,x2,x3,…x30, y2

    where x1 – x30 are numeric (continuous) values and y1 – yn are the labels I want to predict.
    Y can only be 1 (on) or 0 (off). Some of these parameters are raw sensor data, which increase or decrease over n samples, so I know that this problem is ideal for an RNN.

    But I am not sure if my approach is OK.

    Is it OK to restructure the data so that I take the first 10 samples (without the y values, of course), create a 2D array from them, and try to predict the output of sample n10, then move along by one place, take the next 10 samples, and predict sample n11, and so on? That is, not combining them into one vector like you did.

    For example, if I have 10,000 samples, each covering 100ms, and I want to look at the last 10 samples (1 second), I train on data of shape (99990, 10, 30), where 99990 represents the number of samples, each containing 10 readings (1 second) with a dimension of 30.

    My current model looks like this, but it is not as successful as I want it to be (I think it can be a lot better):

    model = Sequential()
    model.add(LSTM(100, input_shape=(nsamples, nbatch, ndimension)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam')

    Can you please point me in the right direction?
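
The windowing described in this comment can be sketched as follows, assuming each window of 10 consecutive readings is paired with the label of its last reading (that pairing, the helper name windows_with_labels, and the random data are assumptions for illustration). Note that in Keras the LSTM input_shape describes a single sample, i.e. (timesteps, features), so for the arrays below it would be X.shape[1:] rather than a tuple that includes the number of samples.

```python
import numpy as np


def windows_with_labels(features, labels, n_steps):
    """Pair each block of n_steps consecutive rows with the label of its last row."""
    X, y = [], []
    for end in range(n_steps, len(features) + 1):
        X.append(features[end - n_steps:end])  # rows end-n_steps .. end-1
        y.append(labels[end - 1])              # label of the final row in the window
    return np.array(X), np.array(y)


rng = np.random.default_rng(1)
features = rng.normal(size=(100, 30))  # 100 readings, 30 sensor values each
labels = rng.integers(0, 2, size=100)  # binary on/off label per reading

X, y = windows_with_labels(features, labels, n_steps=10)
print(X.shape, y.shape)   # 100 - 10 + 1 = 91 windows
print(X.shape[1:])        # (timesteps, features) for one sample
```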

    • Avatar
      Abdur Rehman Nadeem December 2, 2017 at 9:28 am #

      Hi Miha,

      Can you tell me why you are applying an activation function only to the output layer? I mean, why is there no activation function for the hidden layer?

      • Avatar
        Jason Brownlee December 3, 2017 at 5:22 am #

        We are using the default activation functions for the LSTM hidden layers.

  115. Avatar
    Silvia December 3, 2017 at 4:01 am #

    train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
    test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

    I’m having a lot of trouble with these two lines.

    I don’t understand why it isn’t written like this:

    train_X = train_X.reshape((1, train_X.shape[0], train_X.shape[1]))
    test_X = test_X.reshape((1, test_X.shape[0], test_X.shape[1]))

    I thought (and obviously I’m wrong, but I want to know why) that we had 1 sample because we have one city, but have multiple timesteps one for each set of measurements.

    If we had 3 cities would we then have 3 instead of 1?
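
A small sketch may help with the two reshapes in this question. Keras LSTM layers read 3D input as [samples, timesteps, features], where the first axis counts training examples (input/output pairs) rather than cities. In the tutorial, each row of the reframed data is one pair (previous hour in, next hour out), so there are as many samples as rows, each with one timestep. The array below is a made-up stand-in for train_X.

```python
import numpy as np

n_hours, n_features = 6, 8
train_X = np.arange(n_hours * n_features, dtype=float).reshape(n_hours, n_features)

# Tutorial framing: every row is its own sample with a single timestep.
per_row = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

# Framing from the question: one sample that holds every row as a timestep.
one_sequence = train_X.reshape((1, train_X.shape[0], train_X.shape[1]))

print(per_row.shape)       # 6 samples, 1 timestep, 8 features
print(one_sequence.shape)  # 1 sample, 6 timesteps, 8 features
```

With the second framing there would be a single sample, so a train_y holding one target per hour would no longer line up with one target per sample.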

  116. Avatar
    Mahesh December 3, 2017 at 12:50 pm #

    Hi Jason,

    If I have data for every city, how can I build one LSTM model? Here the data is for only one city, and we forecast its pollution. Suppose I append data for other cities; can we then predict pollution using a single LSTM?
    Yes, we can build a model for each city separately, but can we build a single model?

    • Avatar
      Jason Brownlee December 4, 2017 at 7:44 am #

      There is no one best way. I would encourage you to explore different ways to frame this problem, perhaps one model per city, perhaps one model for regions or all cities, perhaps ensembles of models. See what works best for your data.

  117. Avatar
    lucy80 December 3, 2017 at 10:47 pm #

    Hi Jason,

    If instead of a single time series we have multiple time series, how should we normalize the data?
    I.e., if we have pollution data for 100 cities, should normalization be done city-wise or across all cities?

    • Avatar
      Jason Brownlee December 4, 2017 at 7:47 am #

      It really depends on the model that you are constructing.

      Your goal is to ensure input data to the model is consistent.

  118. Avatar
    Mangesh Divate December 9, 2017 at 7:38 am #

    Hello Jason, one question: why didn’t you use the scikit-learn train_test_split function instead of

    # split into train and test sets
    values = reframed.values
    n_train_hours = 365 * 24
    train = values[:n_train_hours, :]
    test = values[n_train_hours:, :]

    • Avatar
      Jason Brownlee December 9, 2017 at 9:22 am #

      By all means, try it. Note that you cannot shuffle the series.
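
To illustrate the point about not shuffling: a chronological split keeps every training row earlier in time than every test row, which is what slicing by row index does. The sketch below uses a tiny made-up array in place of reframed.values and a hypothetical cut-off of 7 rows. If scikit-learn's train_test_split is used instead, it shuffles by default, so shuffle=False would be needed to preserve the order.

```python
import numpy as np

# Stand-in for reframed.values: 10 hourly rows, last column is the target.
values = np.arange(10 * 3, dtype=float).reshape(10, 3)

n_train = 7  # hypothetical cut-off; the tutorial uses 365 * 24 hours
train, test = values[:n_train, :], values[n_train:, :]

# Split into inputs and output; all training rows come before all test rows.
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```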

  119. Avatar
    james December 11, 2017 at 1:16 am #

    On my computer, every epoch takes 191s! Hmm... that is too long.
    I want to ask: did you use a GPU to speed it up, or is there some other problem?
    Thank you!

    • Avatar
      Jason Brownlee December 11, 2017 at 5:27 am #

      GPU can speed up LSTMs somewhat, but not as much as MLPs.

  120. Avatar
    Mark December 11, 2017 at 8:23 am #

    Hi Jason,

    Thank you so much for your brilliant website helping us all get good at machine learning!

    Please could you clarify the line of code that outputs the next hour’s pollution reading? I’ve run the model and it returns the RMSE, but I’m interested in seeing the t+1 prediction.

    What code would I add at the end so that when the model has finished running it prints the next hour’s predicted pollution reading?

    Many thanks!

  121. Avatar
    Mark December 13, 2017 at 12:49 am #

    Thank you, Jason.

    I’m almost ready to apply what you’ve taught me here to my use case. The only other thing that isn’t 100% clear to me is the column numbers 9,10,11,12,13,14,15 referenced when dropping columns (below):

    # drop columns we don’t want to predict
    reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

    I get that you’re dropping the columns after ‘pollution’ because you only want to predict the pollution readings but why are they referenced 9-15?

    Thank you in advance!

    • Avatar
      Jason Brownlee December 13, 2017 at 5:40 am #

      We are dropping variables that we do not want to predict at the next time step. We only want to predict pollution.

      • Avatar
        Mark December 13, 2017 at 7:50 am #

        I understand that. My question was around the numbering. If we’re dropping columns ‘dew’ through to ‘rain’ i.e. columns number 3 to 9 in the prepared “pollution.csv” dataset above then why isn’t the code written:

        reframed.drop(reframed.columns[[3,4,5,6,7,8,9]], axis=1, inplace=True)

        It’s the 9 – 15 that I just need an explanation for please.

        Many thanks

        • Avatar
          Jason Brownlee December 13, 2017 at 4:10 pm #

          We are dropping them from the new dataset that has lag variables.

          Try printing the version of the dataset that we are modifying to get an idea of its shape.
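
A quick way to see where indices 9 to 15 come from is to list the columns of the reframed dataset in order. The sketch below only generates column names, assuming the naming and ordering used by the tutorial's series_to_supervised() (lag columns first, then the columns at time t); the helper supervised_column_names is hypothetical and does not touch real data.

```python
def supervised_column_names(n_vars, n_in=1, n_out=1):
    """List column names with lag columns first, then forecast columns."""
    names = []
    # input sequence: t-n_in, ..., t-1
    for lag in range(n_in, 0, -1):
        names += [f"var{j + 1}(t-{lag})" for j in range(n_vars)]
    # forecast sequence: t, t+1, ..., t+n_out-1
    for step in range(n_out):
        suffix = "(t)" if step == 0 else f"(t+{step})"
        names += [f"var{j + 1}{suffix}" for j in range(n_vars)]
    return names


names = supervised_column_names(n_vars=8, n_in=1, n_out=1)
for index, name in enumerate(names):
    print(index, name)

# Columns 9-15 are var2(t)..var8(t): the non-pollution variables at time t.
dropped = [names[i] for i in [9, 10, 11, 12, 13, 14, 15]]
kept = [name for name in names if name not in dropped]
print(kept)
```

With 8 variables and one lag, indices 0 to 7 are the inputs at t-1, index 8 is pollution at time t (the target), and indices 9 to 15 are the other variables at time t, which are the ones removed.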

  122. Avatar
    Chris December 13, 2017 at 11:07 pm #

    Hello Jason,
    Again, a very successful contribution.

    What I would like to do is something like a early warning system that predicts as early as possible, as safely as possible for example in the case of natural disasters, financial forecast or driving data from the prediction output of a Multivariate Time Series LSTM Forecast.

    Suppose I get the prediction, e.g. x, y and z and each area labeled with x or z must be K-units long, each time they occur. X and z make up 10 percent of the data.

    The ground truth and Prediction would then look like e.g.
    GT:y y y y y y y y x x x x x x y y y y y y z z z z z z y y y y y y y y y y y y y y y y
    PR:y y y x x y y y x x x x x x y y y x y y y z z z y y y y y y y y y z z y y y x x y y

    Now I would like to determine an overall probability for an event, based on the PR sequence.
    Op:y – – – – – – – – X – – – – – – Y – – – – – – Z- – – – – -Y – – – – – – – – – – – – – – – – –

    I had the idea of a window with a threshold or a sequence classification task.

    Since I am fairly new to machine learning and co, but I’m thinking that this problem has probably been discussed and solved very often, I would be very happy about your advice.

    • Avatar
      Jason Brownlee December 14, 2017 at 5:39 am #

      There is not one best way to solve a problem like this, but many. I’d encourage you to brainstorm different ways of framing this as a prediction problem and see what works best.

  123. Avatar
    Abdur Rehman Nadeem December 14, 2017 at 4:14 am #

    Hi Jason,

    These days LSTMs are also popular for sentiment analysis. Have you written any tutorial on sentiment analysis using LSTMs or something like that?

  124. Avatar
    Mike December 14, 2017 at 5:42 pm #

    Can I save my model? I don’t want to train it every time.
    Oh, and do you have an article about how to predict the next n steps in multivariate time series forecasting with LSTMs in Keras?
    Thank you!

  125. Avatar
    Tony December 15, 2017 at 11:26 pm #

    Hi Jason,
    I read your article and ran the code, but I have some questions. Can you give me some suggestions?
    1. In this article, you prepare the pollution dataset for the LSTM: all features are normalized and the dataset is transformed into a supervised learning problem. Why is the code ‘MinMaxScaler(feature_range=(0, 1))’ rather than ‘MinMaxScaler(feature_range=(-1, 1))’? I remember that the default activation function for LSTMs is the hyperbolic tangent (tanh), which outputs values between -1 and 1. Why do we set (0, 1) here?
    2. In this code, we don’t transform the time series to be stationary. Why? I think we must make the time series stationary. It’s necessary, right?
    3. The important arguments are batch_size, n_neuron and epochs. How should I adjust them?
    4. Can I use a CNN to predict multivariate time series? Many people think the LSTM is the best way. Is it really?
    Thank you very much!

    • Avatar
      Jason Brownlee December 16, 2017 at 5:29 am #

      Results are better if you normalize the data.

      Making the data stationary may improve the skill of the model. I was trying to keep the example simple.

      Use experiments to see what values give the best results. Be systematic.

      I think MLP is better at time series, here’s why:

      • Avatar
        Tony December 16, 2017 at 7:15 pm #

        Thank you, Jason.
        Your reply is very useful. But I still don’t understand why the code is MinMaxScaler(feature_range=(0, 1)). In your other article, you use feature_range=(0, 1),
        so I’m wondering: what is the reason? Is the activation function for LSTMs changeable?

        • Avatar
          Jason Brownlee December 17, 2017 at 8:51 am #

          Sorry, I don’t follow?

          • Avatar
            Tony December 17, 2017 at 1:47 pm #

            I was foolish and wrote it wrongly, I am sorry.
            My question is:
            I still don’t understand why the code is MinMaxScaler(feature_range=(0, 1)). In your other article, you use feature_range=(-1, 1). The activation function for LSTMs is tanh, and I think tanh is in (-1, 1), so why do we use (0, 1) here?
            Thank you so much.

          • Avatar
            Jason Brownlee December 18, 2017 at 5:20 am #

            LSTMs generally perform better with normalized data (in the range 0-1).

          • Avatar
            slouchpie January 18, 2018 at 12:49 pm #

            Hi Jason, great article.
            Can you please explain why it is OK to use feature_range [0, 1] as opposed to [-1, 1]?
            In another article you said that the feature_range should be [-1, 1] in order to match the range of the hyperbolic tangent (tanh) function, which the default LSTM uses. In fact, you said “This is the preferred range for the time series data.”
            I am not sure why it is OK to now use [0, 1]. Are you taking absolute value of tanh somewhere in your LSTM layer?

          • Avatar
            Jason Brownlee January 19, 2018 at 6:26 am #

            The range [0,1] results in better skill.

  126. Avatar
    soloyuyang December 16, 2017 at 12:06 am #

    The work you have done is wonderful. I’m interested in time series forecasting with LSTMs.
    I have two questions.
    1.In some cases in time series forecasting, especially the single series, the features are the data of previous time(t-1,t-2…). For example,only the series of pm2.5, i want to predict the value on t+1,depending on the data of t-k……t-1,t. how should i set the “time-steps” and “features”, [samples, k+1, 1]or [samples, 1, k+1](