Multi-Step LSTM Time Series Forecasting Models for Power Usage

Given the rise of smart electricity meters and the wide adoption of electricity generation technology like solar panels, there is a wealth of electricity usage data available.

This data represents a multivariate time series of power-related variables that in turn could be used to model and even forecast future electricity consumption.

Unlike other machine learning algorithms, long short-term memory recurrent neural networks are capable of automatically learning features from sequence data, support multivariate data, and can output variable-length sequences that can be used for multi-step forecasting.

In this tutorial, you will discover how to develop long short-term memory recurrent neural networks for multi-step time series forecasting of household power consumption.

After completing this tutorial, you will know:

  • How to develop and evaluate univariate and multivariate Encoder-Decoder LSTMs for multi-step time series forecasting.
  • How to develop and evaluate a CNN-LSTM Encoder-Decoder model for multi-step time series forecasting.
  • How to develop and evaluate a ConvLSTM Encoder-Decoder model for multi-step time series forecasting.

Let’s get started.

Note: This is a reasonably advanced tutorial. If you are new to time series forecasting in Python, start here. If you are new to using deep learning for time series, start here. If you really want to get started with LSTMs for time series, start here.

  • Update Jun/2019: Fixed bug in to_supervised() that dropped the last week of data (thanks Markus).
  • Update Nov/2021: Fixed a typo (thanks Sandy)
How to Develop LSTM Models for Multi-Step Time Series Forecasting of Household Power Consumption

Photo by Ian Muttoo, some rights reserved.

Tutorial Overview

This tutorial is divided into nine parts; they are:

  1. Problem Description
  2. Load and Prepare Dataset
  3. Model Evaluation
  4. LSTMs for Multi-Step Forecasting
  5. LSTM Model With Univariate Input and Vector Output
  6. Encoder-Decoder LSTM Model With Univariate Input
  7. Encoder-Decoder LSTM Model With Multivariate Input
  8. CNN-LSTM Encoder-Decoder Model With Univariate Input
  9. ConvLSTM Encoder-Decoder Model With Univariate Input

Python Environment

This tutorial assumes you have a Python SciPy environment installed, ideally with Python 3.

You must have Keras (2.2 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this tutorial:

A GPU is not required for this tutorial. Nevertheless, you can access GPUs cheaply on Amazon Web Services. Learn how in this tutorial:

Let’s dive in.

Problem Description

The ‘Household Power Consumption’ dataset is a multivariate time series dataset that describes the electricity consumption for a single household over four years.

For more about this dataset, see the post:

The data was collected between December 2006 and November 2010 and observations of power consumption within the household were collected every minute.

It is a multivariate series comprised of seven variables (besides the date and time); they are:

  • global_active_power: The total active power consumed by the household (kilowatts).
  • global_reactive_power: The total reactive power consumed by the household (kilowatts).
  • voltage: Average voltage (volts).
  • global_intensity: Average current intensity (amps).
  • sub_metering_1: Active energy for kitchen (watt-hours of active energy).
  • sub_metering_2: Active energy for laundry (watt-hours of active energy).
  • sub_metering_3: Active energy for climate control systems (watt-hours of active energy).

Active and reactive energy refer to the technical details of alternating current.

A fourth sub-metering variable can be created by subtracting the sum of three defined sub-metering variables from the total active energy as follows:
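As a minimal sketch of that calculation, the helper below converts one minute of global active power (kilowatts) into watt-hours and subtracts the three metered values. The function and argument names are illustrative assumptions, not part of the dataset or of any library.

```python
def sub_metering_remainder(global_active_power_kw, sub_1_wh, sub_2_wh, sub_3_wh):
    """Active energy (watt-hours) used in one minute that the three sub-meters do not cover."""
    # Average power in kW over one minute -> energy in watt-hours for that minute
    total_wh = global_active_power_kw * 1000.0 / 60.0
    return total_wh - (sub_1_wh + sub_2_wh + sub_3_wh)


# Made-up reading: 4.2 kW for one minute, with 18 Wh recorded by the sub-meters
remainder = sub_metering_remainder(4.2, 0.0, 1.0, 17.0)
print(remainder)
```

The factor of 1000 / 60 is needed because the power is reported in kilowatts per one-minute observation while the sub-metering columns are in watt-hours.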

Load and Prepare Dataset

The dataset can be downloaded from the UCI Machine Learning repository as a single 20 megabyte .zip file:

Download the dataset and unzip it into your current working directory. You will now have the file “household_power_consumption.txt” that is about 127 megabytes in size and contains all of the observations.

We can use the read_csv() function to load the data and combine the first two columns into a single date-time column that we can use as an index.

Next, we can mark all missing values indicated with a ‘?’ character with a NaN value, which is a float.

This will allow us to work with the data as one array of floating point values rather than mixed types, which is less efficient.
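A minimal sketch of this loading step is shown here on three made-up rows rather than the real file. It assumes the file is semicolon-separated with day-first dates and the column names used in the sample; the helper name load_power_data() is hypothetical.

```python
import io

import pandas as pd

# Three made-up rows laid out like household_power_consumption.txt (assumed layout)
sample = io.StringIO(
    "Date;Time;Global_active_power;Global_reactive_power;Voltage;"
    "Global_intensity;Sub_metering_1;Sub_metering_2;Sub_metering_3\n"
    "16/12/2006;17:24:00;4.216;0.418;234.840;18.400;0.000;1.000;17.000\n"
    "16/12/2006;17:25:00;?;?;?;?;?;?;\n"
    "16/12/2006;17:26:00;5.374;0.498;233.290;23.000;0.000;2.000;17.000\n"
)


def load_power_data(source):
    """Load the raw file, mark '?' as NaN and index the rows by date-time."""
    df = pd.read_csv(source, sep=";", na_values=["?"], low_memory=False)
    # Combine the first two columns into a single date-time index
    stamps = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d/%m/%Y %H:%M:%S")
    df = df.drop(columns=["Date", "Time"])
    df.index = pd.DatetimeIndex(stamps.values, name="datetime")
    # With '?' replaced by NaN, every remaining column can be stored as a float
    return df.astype("float32")


df = load_power_data(sample)
print(df.isna().sum())
```

For the real dataset, the same function could be called with the path to the unzipped text file instead of the in-memory sample.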

We also need to fill in the missing values now that they have been marked.

A very simple approach would be to copy the observation from the same time the day before. We can implement this in a function named fill_missing() that will take the NumPy array of the data and copy values from exactly 24 hours ago.

We can apply this function directly to the data within the DataFrame.
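One way such a function might look is sketched below. The one_day parameter and the guard that leaves gaps in the first day untouched are assumptions added for illustration; the toy example uses a "day" of three rows so the effect is easy to see.

```python
import numpy as np


def fill_missing(values, one_day=60 * 24):
    """Replace NaNs in place with the value observed `one_day` rows earlier."""
    n_rows, n_cols = values.shape
    for row in range(n_rows):
        for col in range(n_cols):
            # Assumption: gaps within the first day have no earlier value and are left as NaN
            if np.isnan(values[row, col]) and row >= one_day:
                values[row, col] = values[row - one_day, col]
    return values


# Toy demonstration: pretend a day is 3 rows long
demo = np.array([[1.0], [2.0], [3.0], [np.nan], [5.0], [np.nan]])
fill_missing(demo, one_day=3)
print(demo.ravel())
```

With per-minute data the default of 60 * 24 rows corresponds to exactly 24 hours, and the function can be applied to the DataFrame's underlying array, for example via its values attribute.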

Now we can create a new column that contains the remainder of the sub-metering, using the calculation from the previous section.

We can now save the cleaned-up version of the dataset to a new file; in this case we will just change the file extension to .csv and save the dataset as ‘household_power_consumption.csv’.

Tying all of this together, the complete example of loading, cleaning-up, and saving the dataset is listed below.

Running the example creates the new file ‘household_power_consumption.csv’ that we can use as the starting point for our modeling project.

Model Evaluation

In this section, we will consider how we can develop and evaluate predictive models for the household power dataset.

This section is divided into four parts; they are:

  1. Problem Framing
  2. Evaluation Metric
  3. Train and Test Sets
  4. Walk-Forward Validation

Problem Framing

There are many ways to harness and explore the household power consumption dataset.

In this tutorial, we will use the data to explore a very specific question; that is:

Given recent power consumption, what is the expected power consumption for the week ahead?

This requires that a predictive model forecast the total active power for each day over the next seven days.

Technically, this framing of the problem is referred to as a multi-step time series forecasting problem, given the multiple forecast steps. A model that makes use of multiple input variables may be referred to as a multivariate multi-step time series forecasting model.

A model of this type could be helpful within the household in planning expenditures. It could also be helpful on the supply side for planning electricity demand for a specific household.

This framing of the dataset also suggests that it would be useful to downsample the per-minute observations of power consumption to daily totals. This is not required, but makes sense, given that we are interested in total power per day.

We can achieve this easily using the resample() function on the pandas DataFrame. Calling this function with the argument ‘D’ allows the loaded data indexed by date-time to be grouped by day (see all offset aliases). We can then calculate the sum of all observations for each day and create a new dataset of daily power consumption data for each of the eight variables.
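The downsampling step can be sketched on synthetic data as follows. The two days of made-up per-minute readings and the single column are assumptions chosen to keep the example small; the real dataset has eight columns.

```python
import numpy as np
import pandas as pd

# Two days of made-up per-minute readings (2 * 1440 rows), one column for brevity
index = pd.date_range("2006-12-17", periods=2 * 1440, freq="min")
minute_data = pd.DataFrame({"Global_active_power": np.ones(len(index))}, index=index)

# Group the date-time index by calendar day and total the observations in each group
daily_data = minute_data.resample("D").sum()
print(daily_data)
```

Each daily row is the sum of the 1,440 per-minute rows that fall on that calendar day.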

The complete example is listed below.

Running the example creates a new daily total power consumption dataset and saves the result into a separate file named ‘household_power_consumption_days.csv’.

We can use this as the dataset for fitting and evaluating predictive models for the chosen framing of the problem.

Evaluation Metric

A forecast will be comprised of seven values, one for each day of the week ahead.

It is common with multi-step forecasting problems to evaluate each forecasted time step separately. This is helpful for a few reasons:

  • To comment on the skill at a specific lead time (e.g. +1 day vs +3 days).
  • To contrast models based on their skills at different lead times (e.g. models good at +1 day vs models good at +5 days).

The units of the total power are kilowatts and it would be useful to have an error metric that was also in the same units. Both Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) fit this bill, although RMSE is more commonly used and will be adopted in this tutorial. Because errors are squared before they are averaged, RMSE penalizes large forecast errors more heavily than MAE does.

The performance metric for this problem will be the RMSE for each lead time from day 1 to day 7.

As a short-cut, it may be useful to summarize the performance of a model using a single score in order to aid in model selection.

One possible score that could be used would be the RMSE across all forecast days.

The function evaluate_forecasts() below will implement this behavior and return the performance of a model based on multiple seven-day forecasts.

Running the function will first return the overall RMSE regardless of day, then an array of RMSE scores for each day.
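A minimal NumPy sketch of a function with this behavior is given below. It assumes both arguments are arrays of shape [n_weeks, 7]; the vectorized implementation is one possible approach rather than a definitive one.

```python
import numpy as np


def evaluate_forecasts(actual, predicted):
    """Return (overall RMSE, per-day RMSE array) for weekly forecasts of shape [n_weeks, 7]."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2
    # One RMSE per lead time: average over weeks (axis 0), then take the root
    scores = np.sqrt(squared_error.mean(axis=0))
    # Single summary score: average over every forecast day of every week
    score = np.sqrt(squared_error.mean())
    return score, scores


# Made-up example: two weeks, forecasts wrong by 3 units on the first day only
actual = np.zeros((2, 7))
predicted = np.zeros((2, 7))
predicted[:, 0] = 3.0
score, scores = evaluate_forecasts(actual, predicted)
print(score, scores)
```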

Train and Test Sets

We will use the first three years of data for training predictive models and the final year for evaluating models.

The data in a given dataset will be divided into standard weeks. These are weeks that begin on a Sunday and end on a Saturday.

This is a realistic and useful way of using the chosen framing of the model, where the power consumption for the week ahead can be predicted. It is also helpful with modeling, where models can be used to predict a specific day (e.g. Wednesday) or the entire sequence.

We will split the data into standard weeks, working backwards from the test dataset.

The final year of the data is in 2010 and the first Sunday for 2010 was January 3rd. The data ends in mid November 2010 and the closest final Saturday in the data is November 20th. This gives 46 weeks of test data.

The first and last rows of daily data for the test dataset are provided below for confirmation.

The daily data starts in late 2006.

The first Sunday in the dataset is December 17th, which is the second row of data.

Organizing the data into standard weeks gives 159 full standard weeks for training a predictive model.

The function split_dataset() below splits the daily data into train and test sets and organizes each into standard weeks.

Specific row offsets are used to split the data using knowledge of the dataset. The split datasets are then organized into weekly data using the NumPy split() function.
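The sketch below shows one way the split could be written. The offsets are assumptions: one leading row before the first Sunday, 322 test days (46 weeks), and six trailing days after the last full test week, which presumes a daily file of 1,442 rows. Random numbers stand in for the real daily data.

```python
import numpy as np


def split_dataset(data, n_lead=1, n_test=322, n_trail=6):
    """Split daily rows into train/test arrays of shape [weeks, 7, features].

    The default offsets are assumptions about the daily file: 1 row before the
    first Sunday, 322 test days and 6 unused days after the last test Saturday.
    """
    data = np.asarray(data)
    train = data[n_lead:-(n_test + n_trail)]
    test = data[-(n_test + n_trail):-n_trail]
    # Restructure each set into consecutive windows of seven days
    train = np.array(np.split(train, len(train) // 7))
    test = np.array(np.split(test, len(test) // 7))
    return train, test


# Synthetic stand-in for the daily dataset: 1,442 days of eight variables
daily = np.random.rand(1442, 8)
train, test = split_dataset(daily)
print(train.shape, test.shape)
```

Note that np.split() raises an error if a set does not divide evenly into weeks, which is a useful check that the offsets match the data.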

We can test this function out by loading the daily dataset and printing the first and last rows of data from both the train and test sets to confirm they match the expectations above.

The complete code example is listed below.

Running the example shows that indeed the train dataset has 159 weeks of data, whereas the test dataset has 46 weeks.

We can see that the total active power for the train and test dataset for the first and last rows match the data for the specific dates that we defined as the bounds on the standard weeks for each set.

Walk-Forward Validation

Models will be evaluated using a scheme called walk-forward validation.

This is where a model is required to make a one-week prediction, then the actual data for that week is made available to the model so that it can be used as the basis for making a prediction on the subsequent week. This is both realistic for how the model may be used in practice and beneficial to the models, allowing them to make use of the best available data.

We can demonstrate this below with separation of input data and output/predicted data.

The walk-forward validation approach to evaluating predictive models on this dataset is provided below named evaluate_model().

The train and test datasets in standard-week format are provided to the function as arguments. An additional argument n_input is provided that is used to define the number of prior observations that the model will use as input in order to make a prediction.

Two new functions are called: one to build a model from the training data called build_model() and another that uses the model to make forecasts for each new standard week called forecast(). These will be covered in subsequent sections.

We are working with neural networks, and as such, they are generally slow to train but fast to evaluate. This means that the preferred usage of the models is to build them once on historical data and to use them to forecast each step of the walk-forward validation. The models are static (i.e. not updated) during their evaluation.

This is different to other models that are faster to train where a model may be re-fit or updated each step of the walk-forward validation as new data is made available. With sufficient resources, it is possible to use neural networks this way, but we will not in this tutorial.

The complete evaluate_model() function is listed below.
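A self-contained sketch of the walk-forward loop is given below. To keep it runnable on its own, it differs from the description in two assumed ways: the build_model and forecast functions are passed in as arguments, and the RMSE is computed inline. The naive persistence functions are hypothetical stand-ins for a real model.

```python
import numpy as np


def evaluate_model(train, test, n_input, build_model, forecast):
    """Walk-forward validation over standard weeks; returns (overall RMSE, per-day RMSE)."""
    model = build_model(train, n_input)  # fit once, then keep the model static
    history = [week for week in train]
    predictions = []
    for i in range(len(test)):
        # Forecast the coming week from everything observed so far
        predictions.append(forecast(model, history, n_input))
        # Then reveal the real observations for that week
        history.append(test[i, :])
    predictions = np.array(predictions)
    actual = test[:, :, 0]  # total active power is feature index 0
    squared_error = (actual - predictions) ** 2
    return np.sqrt(squared_error.mean()), np.sqrt(squared_error.mean(axis=0))


# Hypothetical stand-ins: a "model" that simply repeats the last observed week
def build_naive(train, n_input):
    return None


def forecast_naive(model, history, n_input):
    return history[-1][:, 0]


train = np.zeros((3, 7, 2))
test = np.zeros((2, 7, 2))
test[0, :, 0], test[1, :, 0] = 1.0, 2.0
score, scores = evaluate_model(train, test, 7, build_naive, forecast_naive)
print(score, scores)
```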

Once we have the evaluation for a model, we can summarize the performance.

The function below named summarize_scores() will display the performance of a model as a single line for easy comparison with other models.
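A possible implementation is sketched here. Splitting the formatting into a separate format_scores() helper is an assumption made so the output string can be inspected; the name and the scores in the example are made up.

```python
def format_scores(name, score, scores):
    """Build a one-line summary: model name, overall RMSE, then per-day RMSE values."""
    s_scores = ", ".join("%.1f" % s for s in scores)
    return "%s: [%.3f] %s" % (name, score, s_scores)


def summarize_scores(name, score, scores):
    """Print the one-line summary for a model."""
    print(format_scores(name, score, scores))


# Made-up scores for illustration only
summarize_scores("lstm", 399.456, [400.0, 350.5, 410.0])
```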

We now have all of the elements to begin evaluating predictive models on the dataset.

LSTMs for Multi-Step Forecasting

Recurrent neural networks, or RNNs, are specifically designed to work with, learn from, and predict sequence data.

A recurrent neural network is a neural network where the output of the network from one time step is provided as an input in the subsequent time step. This allows the model to make a decision as to what to predict based on both the input for the current time step and direct knowledge of what was output in the prior time step.

Perhaps the most successful and widely used RNN is the long short-term memory network, or LSTM for short. It is successful because it overcomes the challenges involved in training a recurrent neural network, resulting in stable models. In addition to harnessing the recurrent connection of the outputs from the prior time step, LSTMs also have an internal memory that operates like a local variable, allowing them to accumulate state over the input sequence.

For more information about Recurrent Neural Networks, see the post:

For more information about Long Short-Term Memory networks, see the post:

LSTMs offer a number of benefits when it comes to multi-step time series forecasting; they are:

  • Native Support for Sequences. LSTMs are a type of recurrent network, and as such are designed to take sequence data as input, unlike other models where lag observations must be presented as input features.
  • Multivariate Inputs. LSTMs directly support multiple parallel input sequences for multivariate inputs, unlike other models where multivariate inputs are presented in a flat structure.
  • Vector Output. Like other neural networks, LSTMs are able to map input data directly to an output vector that may represent multiple output time steps.

Further, specialized architectures have been developed that are specifically designed to make multi-step sequence predictions, generally referred to as sequence-to-sequence prediction, or seq2seq for short. This is useful as multi-step time series forecasting is a type of seq2seq prediction.

An example of a recurrent neural network architecture designed for seq2seq problems is the encoder-decoder LSTM.

An encoder-decoder LSTM is a model comprised of two sub-models: one called the encoder that reads the input sequence and compresses it to a fixed-length internal representation, and an output model called the decoder that interprets the internal representation and uses it to predict the output sequence.

The encoder-decoder approach to sequence prediction has proven much more effective than outputting a vector directly and is the preferred approach.

Generally, LSTMs have been found to not be very effective at auto-regression type problems. These are problems where forecasting the next time step is a function of recent time steps.

For more on this issue, see the post:

One-dimensional convolutional neural networks, or CNNs, have proven effective at automatically learning features from input sequences.

A popular approach has been to combine CNNs with LSTMs, where the CNN is used as an encoder to learn features from sub-sequences of input data, which are provided as time steps to an LSTM. This architecture is called a CNN-LSTM.

For more information on this architecture, see the post:

A powerful variation on the CNN-LSTM architecture is the ConvLSTM that uses the convolutional reading of input subsequences directly within an LSTM’s units. This approach has proven very effective for time series classification and can be adapted for use in multi-step time series forecasting.

In this tutorial, we will explore a suite of LSTM architectures for multi-step time series forecasting. Specifically, we will look at how to develop the following models:

  • LSTM model with vector output for multi-step forecasting with univariate input data.
  • Encoder-Decoder LSTM model for multi-step forecasting with univariate input data.
  • Encoder-Decoder LSTM model for multi-step forecasting with multivariate input data.
  • CNN-LSTM Encoder-Decoder model for multi-step forecasting with univariate input data.
  • ConvLSTM Encoder-Decoder model for multi-step forecasting with univariate input data.

If you are new to using LSTMs for time series forecasting, I highly recommend the post:

The models will be developed and demonstrated on the household power prediction problem. A model is considered skillful if it achieves performance better than a naive model, which is an overall RMSE of about 465 kilowatts across a seven day forecast.

We will not focus on the tuning of these models to achieve optimal performance; instead, we will stop short at skillful models as compared to a naive forecast. The model structures and hyperparameters were chosen with a little trial and error. The scores should be taken as just an example rather than a study of the optimal model or configuration for the problem.

Given the stochastic nature of the models, it is good practice to evaluate a given model multiple times and report the mean performance on a test dataset. In the interest of brevity and keeping the code simple, we will instead present single runs of models in this tutorial.

We cannot know which approach will be the most effective for a given multi-step forecasting problem. It is a good idea to explore a suite of methods in order to discover what works best on your specific dataset.

LSTM Model With Univariate Input and Vector Output

We will start off by developing a simple or vanilla LSTM model that reads in a sequence of days of total daily power consumption and predicts a vector output of the next standard week of daily power consumption.

This will provide the foundation for the more elaborate models developed in subsequent sections.

The number of prior days used as input defines the one-dimensional (1D) subsequence of data that the LSTM will read and from which it will learn to extract features. Some ideas on the size and nature of this input include:

  • All prior days, up to years’ worth of data.
  • The prior seven days.
  • The prior two weeks.
  • The prior one month.
  • The prior one year.
  • The prior week and the week to be predicted from one year ago.

There is no right answer; instead, each approach and more can be tested and the performance of the model can be used to choose the nature of the input that results in the best model performance.

These choices define a few things:

  • How the training data must be prepared in order to fit the model.
  • How the test data must be prepared in order to evaluate the model.
  • How to use the model to make predictions with a final model in the future.

A good starting point would be to use the prior seven days.

An LSTM model expects data to have the three-dimensional shape [samples, timesteps, features].

One sample will be comprised of seven time steps with one feature for the seven days of total daily power consumed.

The training dataset has 159 weeks of data, so the shape of the training dataset would be [159, 7, 1].

This is a good start. The data in this format would use the prior standard week to predict the next standard week. A problem is that 159 instances is not a lot to train a neural network.

A way to create a lot more training data is to change the problem during training to predict the next seven days given the prior seven days, regardless of the standard week.

This only impacts the training data, and the test problem remains the same: predict the daily power consumption for the next standard week given the prior standard week.

This will require a little preparation of the training data.

The training data is provided in standard weeks with eight variables, specifically in the shape [159, 7, 8]. The first step is to flatten the data so that we have eight time series sequences.

We then need to iterate over the time steps and divide the data into overlapping windows; each iteration moves along one time step and predicts the subsequent seven days.

For example:

We can do this by keeping track of start and end indexes for the inputs and outputs as we iterate across the length of the flattened data in terms of time steps.

We can also do this in a way where the number of inputs and outputs are parameterized (e.g. n_input, n_out) so that you can experiment with different values or adapt it for your own problem.

Below is a function named to_supervised() that takes a list of weeks (history) and the number of time steps to use as inputs and outputs and returns the data in the overlapping moving window format.

When we run this function on the entire training dataset, we transform 159 samples into 1,100; specifically, the transformed dataset has the shapes X=[1100, 7, 1] and y=[1100, 7].
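A minimal sketch of such a function, run on random numbers shaped like the training set, is shown below. The loop structure is one possible implementation under the description above, with total active power assumed to be feature index 0.

```python
import numpy as np


def to_supervised(train, n_input, n_out=7):
    """Turn weekly data [weeks, 7, features] into overlapping input/output windows."""
    # Flatten the weekly structure into one row per day
    data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
    X, y = [], []
    in_start = 0
    for _ in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        # Keep the window only if the full output sequence is available
        if out_end <= len(data):
            X.append(data[in_start:in_end, 0].reshape((n_input, 1)))
            y.append(data[in_end:out_end, 0])
        in_start += 1  # slide the window forward by one day
    return np.array(X), np.array(y)


# Random stand-in for the training set: 159 standard weeks of eight variables
train = np.random.rand(159, 7, 8)
X, y = to_supervised(train, n_input=7)
print(X.shape, y.shape)
```

The 159 weeks flatten to 1,113 days, and each window spans 14 days (seven in, seven out), so there are 1,113 - 14 + 1 = 1,100 windows.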

Next, we can define and fit the LSTM model on the training data.

This multi-step time series forecasting problem is an autoregression. That means it is likely best modeled by treating the next seven days as some function of observations at prior time steps. This and the relatively small amount of data mean that a small model is required.

We will develop a model with a single hidden LSTM layer with 200 units. The number of units in the hidden layer is unrelated to the number of time steps in the input sequences. The LSTM layer is followed by a fully connected layer with 100 nodes that will interpret the features learned by the LSTM layer. Finally, an output layer will directly predict a vector with seven elements, one for each day in the output sequence.

We will use the mean squared error loss function as it is a good match for our chosen error metric of RMSE. We will use the efficient Adam implementation of stochastic gradient descent and fit the model for 70 epochs with a batch size of 16.

The small batch size and the stochastic nature of the algorithm mean that the same model will learn a slightly different mapping of inputs to outputs each time it is trained. This means results may vary when the model is evaluated. You can try running the model multiple times and calculate an average of model performance.

The build_model() function below prepares the training data, defines the model, and fits the model on the training data, returning the fit model ready for making predictions.

Now that we know how to fit the model, we can look at how the model can be used to make a prediction.

Generally, the model expects data to have the same three dimensional shape when making a prediction.

In this case, the expected shape of an input pattern is one sample with seven days of one feature for the daily power consumed, that is, [1, 7, 1].

Data must have this shape when making predictions for the test set and when a final model is being used to make predictions in the future. If you change the number of input days to 14, then the shape of the training data and the shape of new samples when making predictions must be changed accordingly to have 14 time steps. It is a modeling choice that you must carry forward when using the model.

We are using walk-forward validation to evaluate the model as described in the previous section.

This means that we have the observations available for the prior week in order to predict the coming week. These are collected into an array of standard weeks called history.

In order to predict the next standard week, we need to retrieve the last days of observations. As with the training data, we must first flatten the history data to remove the weekly structure so that we end up with eight parallel time series.

Next, we need to retrieve the last seven days of daily total power consumed (feature index 0).

We will parameterize this as we did for the training data so that the number of prior days used as input by the model can be modified in the future.

Next, we reshape the input into the expected three-dimensional structure.

We then make a prediction using the fit model and the input data and retrieve the vector of seven days of output.

The forecast() function below implements this and takes as arguments the model fit on the training dataset, the history of data observed so far, and the number of input time steps expected by the model.
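The steps above can be sketched as follows. Because no fitted network is available here, a hypothetical LastValueModel class with a Keras-like predict() method stands in for the trained model; it is an assumption for illustration only.

```python
import numpy as np


class LastValueModel:
    """Hypothetical stand-in for a fitted model: repeats the last input value seven times."""

    def predict(self, x, verbose=0):
        # x has shape [1, n_input, 1]; the result has shape [1, 7]
        return np.repeat(x[:, -1, :], 7, axis=1)


def forecast(model, history, n_input):
    """Forecast the next standard week from the most recent n_input days in history."""
    data = np.array(history)
    # Flatten [weeks, 7, features] into one row per day
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    # Last n_input days of total daily power (feature index 0)
    input_x = data[-n_input:, 0]
    # Reshape into [samples, timesteps, features] = [1, n_input, 1]
    input_x = input_x.reshape((1, len(input_x), 1))
    yhat = model.predict(input_x, verbose=0)
    return yhat[0]  # only one sample was predicted


# Two made-up weeks where feature 0 counts the days 0..13
history = np.zeros((2, 7, 8))
history[:, :, 0] = np.arange(14).reshape(2, 7)
yhat_sequence = forecast(LastValueModel(), list(history), n_input=7)
print(yhat_sequence)
```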

That’s it; we now have everything we need to make multi-step time series forecasts with an LSTM model on the daily total power consumed univariate dataset.

We can tie all of this together. The complete example is listed below.

Running the example fits and evaluates the model, printing the overall RMSE across all seven days, and the per-day RMSE for each lead time.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that in this case, the model was skillful as compared to a naive forecast, achieving an overall RMSE of about 399 kilowatts, less than the 465 kilowatts achieved by a naive model.

A plot of the daily RMSE is also created.

The plot shows that perhaps Tuesdays and Fridays are easier days to forecast than the other days and that perhaps Saturday at the end of the standard week is the hardest day to forecast.

Line Plot of RMSE per Day for Univariate LSTM with Vector Output and 7-day Inputs

We can increase the number of prior days to use as input from seven to 14 by changing the n_input variable.

Re-running the example with this change first prints a summary of performance of the model.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a further drop in the overall RMSE to about 370 kilowatts, suggesting that further tuning of the input size and perhaps the number of nodes in the model may result in better performance.

Comparing the per-day RMSE scores we see some are better and some are worse than using seven-day inputs.

This may suggest benefit in using the two different sized inputs in some way, such as an ensemble of the two approaches or perhaps a single model (e.g. a multi-headed model) that reads the training data in different ways.

Line Plot of RMSE per Day for Univariate LSTM with Vector Output and 14-day Inputs


Encoder-Decoder LSTM Model With Univariate Input

In this section, we can update the vanilla LSTM to use an encoder-decoder model.

This means that the model will not output a vector sequence directly. Instead, the model will be comprised of two sub-models: the encoder to read and encode the input sequence, and the decoder that will read the encoded input sequence and make a one-step prediction for each element in the output sequence.

The difference is subtle, as in practice both approaches do in fact predict a sequence output.

The important difference is that an LSTM model is used in the decoder, allowing it to both know what was predicted for the prior day in the sequence and accumulate internal state while outputting the sequence.

Let’s take a closer look at how this model is defined.

As before, we define an LSTM hidden layer with 200 units. This is the encoder model that will read the input sequence and will output a 200 element vector (one output per unit) that captures features from the input sequence. We will use 14 days of total power consumption as input.

We will use a simple encoder-decoder architecture that is easy to implement in Keras and that closely resembles the architecture of an LSTM autoencoder.

First, the internal representation of the input sequence is repeated multiple times, once for each time step in the output sequence. This sequence of vectors will be presented to the LSTM decoder.

We then define the decoder as an LSTM hidden layer with 200 units. Importantly, the decoder will output the entire sequence, not just the output at the end of the sequence as we did with the encoder. This means that each of the 200 units will output a value for each of the seven days, representing the basis for what to predict for each day in the output sequence.

We will then use a fully connected layer to interpret each time step in the output sequence before the final output layer. Importantly, the output layer predicts a single step in the output sequence, not all seven days at a time.

This means that we will use the same layers applied to each step in the output sequence. It means that the same fully connected layer and output layer will be used to process each time step provided by the decoder. To achieve this, we will wrap the interpretation layer and the output layer in a TimeDistributed wrapper that allows the wrapped layers to be used for each time step from the decoder.

This allows the LSTM decoder to figure out the context required for each step in the output sequence and the wrapped dense layers to interpret each time step separately, yet reusing the same weights to perform the interpretation. An alternative would be to flatten all of the structure created by the LSTM decoder and to output the vector directly. You can try this as an extension to see how it compares.
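To make the shapes concrete, the NumPy sketch below imitates the repeat step and the time-distributed dense layers. It is not Keras code: the weights are random, the decoder LSTM is replaced by a pass-through placeholder, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_hidden, n_out = 200, 100, 7

# Encoder output for one sample: a single vector with one value per LSTM unit
encoded = rng.normal(size=(1, n_units))

# Repeat step: copy the vector once per output time step -> [1, 7, 200]
repeated = np.repeat(encoded[:, np.newaxis, :], n_out, axis=1)

# Placeholder for the decoder LSTM, which would emit one 200-element vector per day
decoded = repeated

# Time-distributed dense layers: the SAME weights are applied to every time step
w1, b1 = rng.normal(size=(n_units, n_hidden)), np.zeros(n_hidden)
w2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)
hidden = np.maximum(decoded @ w1 + b1, 0.0)  # ReLU interpretation layer -> [1, 7, 100]
output = hidden @ w2 + b2                    # one value per day -> [1, 7, 1]

print(repeated.shape, hidden.shape, output.shape)
```

In the real model the decoder LSTM carries state from day to day, so its seven output vectors differ; here they are identical copies, which is why every day receives the same value.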

The network therefore outputs a three-dimensional array with the same structure as the input, with the dimensions [samples, timesteps, features].

There is a single feature, the daily total power consumed, and there are always seven time steps. A single one-week prediction will therefore have the size: [1, 7, 1].

Therefore, when training the model, we must restructure the output data (y) to have the three-dimensional structure instead of the two-dimensional structure of [samples, features] used in the previous section.
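The restructuring is a single reshape, sketched here on a placeholder array. The 1,093 samples are an assumption that follows from 14-day inputs over 1,113 training days (1,113 - 21 + 1).

```python
import numpy as np

# Placeholder targets in the two-dimensional [samples, timesteps] layout
train_y = np.zeros((1093, 7))

# Add a trailing feature axis so the targets become [samples, timesteps, features]
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
print(train_y.shape)
```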

We can tie all of this together into the updated build_model() function listed below.

The complete example with the encoder-decoder model is listed below.

Running the example fits the model and summarizes the performance on the test dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that in this case, the model is skillful, achieving an overall RMSE score of about 372 kilowatts.

A line plot of the per-day RMSE is also created showing a similar pattern in error as was seen in the previous section.

Line Plot of RMSE per Day for Univariate Encoder-Decoder LSTM with 14-day Inputs

Encoder-Decoder LSTM Model With Multivariate Input

In this section, we will update the Encoder-Decoder LSTM developed in the previous section to use each of the eight time series variables to predict the next standard week of daily total power consumption.

We will do this by providing each one-dimensional time series to the model as a separate sequence of input.

The LSTM will in turn create an internal representation of each input sequence that will together be interpreted by the decoder.

Using multivariate inputs is helpful for those problems where the output sequence is some function of the observations at prior time steps from multiple different features, not just (or including) the feature being forecasted. It is unclear whether this is the case in the power consumption problem, but we can explore it nonetheless.

First, we must update the preparation of the training data to include all eight features, not just the total daily power consumed. It requires a single line change:

The complete to_supervised() function with this change is listed below.
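A minimal sketch of such a function might look like the following. It assumes the training data is a NumPy array shaped [weeks, 7, features] and that the total daily power is stored in column 0; both are assumptions made for this illustration.

```python
import numpy as np


def to_supervised(train, n_input, n_out=7):
    """Convert weekly windows [weeks, 7, features] into input/output samples."""
    # Flatten the weeks into one continuous series of days: [days, features].
    data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
    X, y = [], []
    # Slide over the series one day at a time.
    for in_start in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        if out_end > len(data):
            break  # not enough days left for a full output week
        X.append(data[in_start:in_end, :])  # all features as input
        y.append(data[in_end:out_end, 0])   # column 0 assumed to be total power
    return np.array(X), np.array(y)


# Dummy data: 10 weeks, 7 days per week, 8 features.
train = np.random.rand(10, 7, 8)
X, y = to_supervised(train, n_input=14)
print(X.shape, y.shape)  # (50, 14, 8) (50, 7)
```

Selecting all columns for the input, rather than only column 0, is the change that makes the input multivariate; the output remains the univariate week of total power.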

We also must update the function used to make forecasts with the fit model to use all eight features from the prior time steps.

Again, another small change:

The complete forecast() function with this change is listed below:
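As a rough sketch, the function could be written as follows. MeanModel is a hypothetical stand-in for a fitted Keras model, included only so the example runs on its own; in practice the fit model would be passed in instead.

```python
import numpy as np


def forecast(model, history, n_input):
    """Forecast the next week from the last n_input days of history."""
    # Flatten the list of weekly windows into [days, features].
    data = np.array(history)
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    # Retrieve the last observations, keeping every feature column.
    input_x = data[-n_input:, :]
    # Reshape into a single sample: [1, n_input, features].
    input_x = input_x.reshape((1, input_x.shape[0], input_x.shape[1]))
    yhat = model.predict(input_x, verbose=0)
    # Only one sample was passed in, so return the first prediction.
    return yhat[0]


class MeanModel:
    """Hypothetical stand-in for a fitted model: repeats the mean of column 0."""

    def predict(self, x, verbose=0):
        mean_power = x[:, :, 0].mean(axis=1)                     # [samples]
        return np.repeat(mean_power[:, None, None], 7, axis=1)   # [samples, 7, 1]


# Four dummy weeks with eight features; week w is filled with the value w.
history = [np.ones((7, 8)) * w for w in range(1, 5)]
yhat = forecast(MeanModel(), history, n_input=14)
print(yhat.shape)  # (7, 1)
```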

The same model architecture and configuration are used directly, although we will increase the number of training epochs from 20 to 50 given the 8-fold increase in the amount of input data.

The complete example is listed below.

Running the example fits the model and summarizes the performance on the test dataset.

Experimentation found that this model appears less stable than the univariate case; this may be related to the differing scales of the eight input variables.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that in this case, the model is skillful, achieving an overall RMSE score of about 376 kilowatts.

A line plot of the per-day RMSE is also created.

Line Plot of RMSE per Day for Multivariate Encoder-Decoder LSTM with 14-day Inputs


CNN-LSTM Encoder-Decoder Model With Univariate Input

A convolutional neural network, or CNN, can be used as the encoder in an encoder-decoder architecture.

The CNN does not directly support sequence input; instead, a 1D CNN is capable of reading across sequence input and automatically learning the salient features. These can then be interpreted by an LSTM decoder as per normal. We refer to hybrid models that use a CNN and LSTM as CNN-LSTM models, and in this case we are using them together in an encoder-decoder architecture.

The CNN expects the input data to have the same 3D structure as the LSTM model, although multiple features are read as different channels that ultimately have the same effect.

We will simplify the example and focus on the CNN-LSTM with univariate input, but it can just as easily be updated to use multivariate input, which is left as an exercise.

As before, we will use input sequences comprised of 14 days of daily total power consumption.

We will define a simple but effective CNN architecture for the encoder that is comprised of two convolutional layers followed by a max pooling layer, the results of which are then flattened.

The first convolutional layer reads across the input sequence and projects the results onto feature maps. The second performs the same operation on the feature maps created by the first layer, attempting to amplify any salient features. We will use 64 feature maps per convolutional layer and read the input sequences with a kernel size of three time steps.

The max pooling layer simplifies the feature maps by keeping only the value with the largest (max) signal in each pooling window. The distilled feature maps after the pooling layer are then flattened into one long vector that can then be used as input to the decoding process.
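To see how the sequence shrinks as it passes through the encoder, the arithmetic can be sketched in plain Python. The sketch assumes no padding, a stride of one for the convolutions, and a pooling window of two; the pool size in particular is an assumption, as it is not stated above.

```python
def conv1d_output_length(n_steps, kernel_size):
    """Length after a 1D convolution with no padding and a stride of one."""
    return n_steps - kernel_size + 1


def max_pool1d_output_length(n_steps, pool_size):
    """Length after non-overlapping max pooling."""
    return n_steps // pool_size


def encoder_output_size(n_steps=14, n_filters=64, kernel_size=3, pool_size=2):
    """Trace the time dimension through conv, conv, pool, flatten."""
    after_conv1 = conv1d_output_length(n_steps, kernel_size)      # 14 -> 12
    after_conv2 = conv1d_output_length(after_conv1, kernel_size)  # 12 -> 10
    after_pool = max_pool1d_output_length(after_conv2, pool_size) # 10 -> 5
    flattened = after_pool * n_filters                            # 5 * 64
    return after_conv1, after_conv2, after_pool, flattened


print(encoder_output_size())  # (12, 10, 5, 320)
```

Under these assumptions, 14 days of input are distilled into a vector of 320 values before decoding.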

The decoder is the same as was defined in previous sections.

The only other change is to set the number of training epochs to 20.

The build_model() function with these changes is listed below.

We are now ready to try the encoder-decoder architecture with a CNN encoder.

The complete code listing is provided below.

Running the example fits the model and summarizes the performance on the test dataset.

A little experimentation showed that using two convolutional layers made the model more stable than using just a single layer.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that in this case the model is skillful, achieving an overall RMSE score of about 372 kilowatts.

A line plot of the per-day RMSE is also created.

Line Plot of RMSE per Day for Univariate Encoder-Decoder CNN LSTM with 14-day Inputs

ConvLSTM Encoder-Decoder Model With Univariate Input

A further extension of the CNN-LSTM approach is to perform the convolutions of the CNN (e.g. how the CNN reads the input sequence data) as part of the LSTM for each time step.

This combination is called a Convolutional LSTM, or ConvLSTM for short, and like the CNN-LSTM is also used for spatio-temporal data.

Unlike an LSTM, which reads the data in directly to calculate its internal state and state transitions, and unlike the CNN-LSTM, which interprets the output from CNN models, the ConvLSTM uses convolutions directly as part of reading input into the LSTM units themselves.

For more information on how the equations for the ConvLSTM are calculated within the LSTM unit, see the paper:

The Keras library provides the ConvLSTM2D class that supports the ConvLSTM model for 2D data. It can be configured for 1D multivariate time series forecasting.

The ConvLSTM2D class, by default, expects input data to have the shape [samples, timesteps, rows, cols, channels], where each time step of data is defined as an image of (rows * columns) data points.

We are working with a one-dimensional sequence of total power consumption, which we can interpret as one row with 14 columns, if we assume that we are using two weeks of data as input.

For the ConvLSTM, this would be a single read: that is, the LSTM would read one time step of 14 days and perform a convolution across those time steps.

This is not ideal.

Instead, we can split the 14 days into two subsequences with a length of seven days. The ConvLSTM can then read across the two time steps and perform the CNN process on the seven days of data within each.

For this chosen framing of the problem, the input for the ConvLSTM2D would therefore be:


  • Samples: n, for the number of examples in the training dataset.
  • Time: 2, for the two subsequences that we split a window of 14 days into.
  • Rows: 1, for the one-dimensional shape of each subsequence.
  • Columns: 7, for the seven days in each subsequence.
  • Channels: 1, for the single feature that we are working with as input.

You can explore other configurations, such as providing 21 days of input split into three subsequences of seven days, and/or providing all eight features or channels as input.

We can now prepare the data for the ConvLSTM2D model.

First, we must reshape the training dataset into the expected structure of [samples, timesteps, rows, cols, channels].
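A minimal sketch of this reshape is shown below, assuming the training inputs start out as a NumPy array shaped [samples, 14, 1]; the three dummy samples are placeholders for illustration.

```python
import numpy as np

n_steps, n_length, n_features = 2, 7, 1

# Hypothetical training inputs: 3 samples of 14 days with a single feature.
train_x = np.arange(3 * 14, dtype=float).reshape((3, 14, 1))

# Reshape into [samples, timesteps, rows, cols, channels].
train_x_5d = train_x.reshape(
    (train_x.shape[0], n_steps, 1, n_length, n_features)
)

print(train_x_5d.shape)  # (3, 2, 1, 7, 1)
first_week = train_x_5d[0, 0, 0, :, 0]   # days 0 to 6 of the first sample
second_week = train_x_5d[0, 1, 0, :, 0]  # days 7 to 13 of the first sample
print(first_week, second_week)
```

The order of the days is preserved: the first subsequence holds the earlier week and the second subsequence holds the later week.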

We can then define the encoder as a ConvLSTM hidden layer followed by a flatten layer ready for decoding.

We will also parameterize the number of subsequences (n_steps) and the length of each subsequence (n_length) and pass them as arguments.

The rest of the model and training is the same. The build_model() function with these changes is listed below.

This model expects five-dimensional data as input. Therefore, we must also update the preparation of a single sample in the forecast() function when making a prediction.

The forecast() function with this change and with the parameterized subsequences is provided below.
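A sketch of how such a function might look is given below. LastWeekModel is a hypothetical stand-in for a fitted ConvLSTM model, used only so the example is self-contained, and the total power is assumed to be in column 0.

```python
import numpy as np


def forecast(model, history, n_steps, n_length, n_input):
    """Forecast the next week with a model that expects five-dimensional input."""
    # Flatten the list of weekly windows into [days, features].
    data = np.array(history)
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    # Retrieve the last observations of the single input feature (column 0).
    input_x = data[-n_input:, 0]
    # Reshape into [samples, timesteps, rows, cols, channels].
    input_x = input_x.reshape((1, n_steps, 1, n_length, 1))
    yhat = model.predict(input_x, verbose=0)
    return yhat[0]


class LastWeekModel:
    """Hypothetical stand-in for a fitted model: repeats the most recent week."""

    def predict(self, x, verbose=0):
        assert x.ndim == 5, "expected [samples, timesteps, rows, cols, channels]"
        return x[:, -1, 0, :, :]  # [samples, n_length, channels]


# Four dummy weeks with eight features; week w is filled with the value w.
history = [np.ones((7, 8)) * w for w in range(1, 5)]
yhat = forecast(LastWeekModel(), history, n_steps=2, n_length=7, n_input=14)
print(yhat.shape)  # (7, 1)
```

Note that n_input must equal n_steps * n_length for the reshape to succeed.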

We now have all of the elements for evaluating an encoder-decoder architecture for multi-step time series forecasting where a ConvLSTM is used as the encoder.

The complete code example is listed below.

Running the example fits the model and summarizes the performance on the test dataset.

A little experimentation showed that using two convolutional layers made the model more stable than using just a single layer.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that in this case the model is skillful, achieving an overall RMSE score of about 367 kilowatts.

A line plot of the per-day RMSE is also created.

Line Plot of RMSE per Day for Univariate Encoder-Decoder ConvLSTM with 14-day Inputs

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Size of Input. Explore using more or fewer days as input for the model, such as three days, 21 days, 30 days, and more.
  • Model Tuning. Tune the structure and hyperparameters for a model and further lift model performance on average.
  • Data Scaling. Explore whether data scaling, such as standardization and normalization, can be used to improve the performance of any of the LSTM models.
  • Learning Diagnostics. Use diagnostics such as learning curves for the train and validation loss and mean squared error to help tune the structure and hyperparameters of an LSTM model.
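As a starting point for the data scaling idea, here is a minimal standardization sketch in NumPy. The helper names, the dummy data, and the assumption that total power is feature 0 are all illustrative; the statistics are estimated from the training weeks only and then reused for the test weeks.

```python
import numpy as np


def fit_standardizer(train):
    """Estimate a per-feature mean and standard deviation from training data."""
    flat = train.reshape((-1, train.shape[-1]))
    mean = flat.mean(axis=0)
    std = flat.std(axis=0)
    std[std == 0.0] = 1.0  # avoid dividing by zero for constant features
    return mean, std


def standardize(data, mean, std):
    """Scale each feature to zero mean and unit variance."""
    return (data - mean) / std


def invert_power(scaled_power, mean, std):
    """Map scaled values of feature 0 (assumed total power) back to original units."""
    return scaled_power * std[0] + mean[0]


# Dummy data shaped [weeks, 7, features] with two features on very different scales.
rng = np.random.default_rng(1)
train = rng.normal(loc=[1000.0, 0.5], scale=[300.0, 0.1], size=(20, 7, 2))
test = rng.normal(loc=[1000.0, 0.5], scale=[300.0, 0.1], size=(5, 7, 2))

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuse the training statistics
print(train_scaled.shape, test_scaled.shape)
```

If the model is trained on scaled targets, forecasts would need to be passed through invert_power() before calculating RMSE so that the scores remain comparable with those reported above.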

If you explore any of these extensions, I’d love to know.

Summary

In this tutorial, you discovered how to develop long short-term memory recurrent neural networks for multi-step time series forecasting of household power consumption.

Specifically, you learned:

  • How to develop and evaluate univariate and multivariate Encoder-Decoder LSTMs for multi-step time series forecasting.
  • How to develop and evaluate a CNN-LSTM Encoder-Decoder model for multi-step time series forecasting.
  • How to develop and evaluate a ConvLSTM Encoder-Decoder model for multi-step time series forecasting.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Note: This post was an excerpt chapter from the book “Deep Learning for Time Series Forecasting”. Take a look if you want more step-by-step tutorials on getting the most out of deep learning methods on time series forecasting problems.

846 Responses to Multi-Step LSTM Time Series Forecasting Models for Power Usage

  1. Avatar
    Konrad October 10, 2018 at 8:12 am #

    Hi Jason,

    Thanks for another great article.

    I’ve got a question about your thoughts on attention-based networks and how they compare to LSTMs. I have heard many voices in favor of the former, but I would like to know how this looks in real situations and not just in the competition world 😉


    • Avatar
      Jason Brownlee October 10, 2018 at 2:58 pm #

      Attention-based models can offer a lot of benefit on challenging sequence prediction problems.

      I have not used attention for time series forecasting though, sorry. I don’t have good off-the-cuff advice.

      • Avatar
        Konrad October 11, 2018 at 6:05 am #

        Ok, sure, thanks for reply! 🙂

    • Avatar
      Russ March 21, 2023 at 12:55 pm #

      I ran the
      Encoder-Decoder LSTM Model With Multivariate Input
      and get the following results
      lstm: [1566.582] 1611.0, 1526.1, 1515.5, 1596.3, 1494.1, 1504.0, 1707.5

      which are significantly worse than the other approaches
      What am I doing wrong?

  2. Avatar
    shamsul October 12, 2018 at 1:03 pm #

    # model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
    # model.add(Dense(100, activation='relu'))

    how do we choose LSTM unit and dense unit? for example, here 200 units for LSTM and 100 units for Dense have been used. is there any formula out there? should we guess?

    it would be great if you could explain! Thanks in advance.

  3. Avatar
    NoobAtMl October 12, 2018 at 11:08 pm #

    How to calculate the accuracy of the Convolutional LSTM model of the electricity consumption dataset. Can you please provide the code for that?

  4. Avatar
    Troy October 13, 2018 at 7:07 am #

    Hey Jason,

    Great article. I’m trying to understand how you have your encoder decoder model vs. the official Keras example below:

    I also worked through this example from JEddy92 where he adopted the Keras method to do time series analysis:

    I’ve tried building the Keras model as similar to your model as possible and running both over the same data. Your model seems significantly different from their example, and I can’t quite reconcile the differences.

    I actually can’t get the Keras model for sequence to sequence to produce any good results for time series analysis. Running 1000 epochs and I got RMSE of 466.192. Have you built any time series models using the approach they are trying? Any ideas why this approach is so much harder to train than the one you have above?

  5. Avatar
    JiaojiaoFu October 17, 2018 at 2:33 pm #

    In the multivariate prediction, which of these variables is predicted? I did not see an introduction to this part. Is the default to predict the first of the multiple variables?

    • Avatar
      Jason Brownlee October 18, 2018 at 6:23 am #

      Some of the models in the above tutorials take multivariate input and make a multi-step univariate prediction.

  6. Avatar
    Hasan October 19, 2018 at 1:27 am #

    Hi Jason,

    Great article, thanks.

    I am trying out image (spectrogram) input sequences for classification output.
    My network looks similar to “CNN-LSTM Encoder-Decoder Model With Univariate Input” with the difference that I am using TimeDistributed(Conv2D) layers and Multivariate Input.
    Your examples do not use TimeDistributed Conv layers, but I was wondering if you have any thoughts? My intention is to pass every sample of my batch individually through the Conv layer and collectively through the LSTM decoder. This I think would allow me to not have to explicitly preprocess my input data by collecting all samples representing a sequence together.
    I am not sure if that would work okay, any comments would be a great help.

    • Avatar
      Jason Brownlee October 19, 2018 at 6:08 am #

      You can adapt the above example to use a time distributed conv.

      Perhaps try it and see, use results to guide you.

  7. Avatar
    Silvia Maria October 26, 2018 at 7:01 pm #

    Hi Jason,

    another great article, thank you… and this time it is exactly what I needed for my univariate time series forecasting project!

    I learned so much from your tutorials and your book, I cannot be more grateful 🙂

    I wanted to ask you a couple of questions, with reference to both proposed models (Vanilla LSTM and Encoder-Decoder):

    1) If I wanted to make the (Vanilla LSTM / Encoder-Decoder) networks deeper, how should I insert more layers?

    2) Statefulness, i.e., memory between batches: here you are using stateless networks, I guess you do that under the hypothesis that a single training batch contains all the series variability timescales we want to model, is that right?

    If I wanted to make the models stateful to see if statefulness leads to better results with my series, how should I do that? I’m not sure in which layers I should set return_sequences = True.

    Thank you very much for your attention, best,


    • Avatar
      Jason Brownlee October 27, 2018 at 5:58 am #


      Yes, you can make a model deeper by adding more layers.

      Don’t worry about statefulness for now, it does not impact model skill in my experiments.

  8. Avatar
    Silvia Maria October 27, 2018 at 7:17 pm #

    Thank you for your prompt answer.
    Now, it is very clear to me how I can add more layers in the Vanilla case, but not so clear in the Encoder-Decoder case. Should I add layers in both the encoder and the decoder? Could you please give me an example? Thank you for your patience, best, Silvia

    • Avatar
      Jason Brownlee October 28, 2018 at 6:08 am #

      You can add more layers to the encoder or more layers to the decoder.

  9. Avatar
    Daniel October 29, 2018 at 1:27 am #

    Hi Jason, I am enjoying these posts a lot! I am trying to replicate the Encoder-Decoder LSTM Model With Multivariate Input, but instead of using daily data, I resampled the data to hourly values. The goal is to predict a full week of values at an hourly level.

    I kept the rest of the model as is, except for the number of inputs (one week = 7*24) and the split_database, which now looks like this:

    train, test = data[32:24392], data[24392:34472]
    # restructure into windows of weekly data
    train = array(split(train, len(train)/(7*24)))
    print('[samples(weeks), timesteps(hours), features]: {}'.format(train.shape))
    test = array(split(test, len(test)/(7*24)))
    print('[samples(weeks), timesteps(hours), features]: {}'.format(test.shape))
    return train, test

    When I train the RNN, I get nan values in the loss function from the very beginning.

    I tried to use a MinMaxScaler on the data, and also tried with other optimizers, but I wasn’t successful.

    Any insights on this matter? Thanks a lot 🙂

    • Avatar
      Jason Brownlee October 29, 2018 at 5:58 am #

      Perhaps double check your input data does not have any nan’s.

      • Avatar
        Daniel October 29, 2018 at 6:59 am #

        I did that, but there were no nans. I got it working using that MinMaxScaler, plus tanh activation functions instead of ReLu for the LSTM layers. Thanks a lot and keep up this awesome work you are doing.

    • Avatar
      Daniel K. August 1, 2019 at 11:22 pm #

      Hi Daniel,

      can you show me your code? I don’t know where I have to rescale the data.

      Thanks a lot and thank you Jason!

    • Avatar
      Shannon Jemina Doyle February 19, 2020 at 3:23 am #

      I had the same problem. You need to convert the pandas DataFrame to a NumPy array, then perform the split.

  10. Avatar
    Allen November 2, 2018 at 9:05 pm #

    Hi Jason,

    Thank you for the nice tutorial! It helps a lot! I noticed that you used differencing and scaling in the other tutorials for time series data, is there a reason why you don’t use it in this tutorial? Thank you!

    • Avatar
      Jason Brownlee November 3, 2018 at 7:02 am #

      To try to keep the example simple.

      I do recommend scaling input and target variables in general. It will make life easier for the learning algorithm.

  11. Avatar
    Jose Rafael November 6, 2018 at 11:42 am #

    Hello Jason Brownlee,

    You are one of my best research references, great job!

    This article has helped me to understand something about the context. However, I have a question about how I can simulate or predict future values using machine learning or deep learning, with algorithms and graphs that show the result clearly. For example, given a set of historical daily temperature data, how could I simulate a possible value for month 6, but 10 years ahead?

    Do you have another article or link of any reference?

    Thank you very much.


  12. Avatar
    Ameni November 7, 2018 at 6:37 pm #

    Hi, thanks for your very nice tutorial.
    My question is about evaluating the overall RMSE during the training phase.
    Is it correct to use this code:

    from keras import backend as K

    def root_mean_squared_error(y_true, y_pred):
        return K.sqrt(K.mean(K.square(y_pred - y_true)))

    and then use model.compile(optimizer='adam', loss=root_mean_squared_error, metrics=[root_mean_squared_error])

    instead of

    model.compile(optimizer='adam', loss='mse')

  13. Avatar
    Anubhav Srivastava November 10, 2018 at 8:39 pm #

    For the LSTM with multi-step forecasting, curious why you didn’t use LSTM layers with return_sequences=True and a Dense(1) output layer? Instead you have used two Dense layers, one with 100 outputs and a final Dense(7).

    Would the return_sequences=True in an LSTM followed by a Dense(1) approach be wrong?

    • Avatar
      Jason Brownlee November 11, 2018 at 6:02 am #

      Note, we do use this approach in the encoder-decoder, which requires the use of a TimeDistributed wrapper layer.

      • Avatar
        Anubhav Srivastava November 11, 2018 at 3:29 pm #

        Got it. So I take that to be a valid approach too?

        • Avatar
          Jason Brownlee November 12, 2018 at 5:36 am #

          Valid? I don’t follow, sorry.

          Try a suite of models and the one that gives the best performance is the one to use. Whether a model works or not is not enough.

  14. Avatar
    Simone Faragalli November 16, 2018 at 2:44 am #

    Hi, Great Article.

    Should it not be “test” instead of “train” series?

    Many thanks

    • Avatar
      Jason Brownlee November 16, 2018 at 6:17 am #

      No, here we are seeding the history with the training set.

    • Avatar
      Eva April 15, 2023 at 6:03 pm #

      I had the exact same question.
      The code does not seem to use test_x anywhere.
      It looks like train_x is used to predict test_y inside evaluate_model.
      Are you sure this is correct?

  15. Avatar
    Daniel November 17, 2018 at 4:20 am #

    Hi Jason! Any insights or rule of thumb to set input_size and batch size? Should these two be related?

    Thanks a lot!

    • Avatar
      Jason Brownlee November 17, 2018 at 5:52 am #


      Input size for lstms is the shape of each sample, e.g. timesteps and variables.

      Batch size is the number of samples to process before estimating the error gradient and updating weights via backprop.

  16. Avatar
    Daniel November 17, 2018 at 9:44 pm #

    Thanks for your reply! So, if I am trying to forecast a full week with hourly granularity, and I have, let’s say, a full year of hourly observations, would a large batch size better capture the variation in the dependencies across variables in the past? Or would it depend only on the input size?

    I would like the network to remember not only the recent behaviour, but also the past! 🙂

    Thanks a lot!

  17. Avatar
    Pierre December 1, 2018 at 2:28 am #

    I really like your Tutorial.

    I am trying to improve the model by using forecast weather to improve the load forecast.
    I have a dataset with many weather variables. I want to build a model that uses past_load, past_weather and future_weather to forecast future load.
    I would like to know what is the best way to prepare the dataset to optimally use LSTM.
    My problem is how to arrange the data in timesteps and features for each sample when there are some features that are not available at all timesteps.

    I have tested many approaches:

    1) I have tried training my models with 1 timestep per sample and inputting all past weather and load and future weather as distinct features.
    2) I also tried with many timesteps and one feature per time step, but inputting a dummy value in the future load so that the model puts zero weight on the future loads that will not be available when the model is used in prediction mode.

    I am sure that this is a common prediction problem and I am sure that there is a better way to proceed.

    • Avatar
      Jason Brownlee December 1, 2018 at 6:52 am #

      For missing data, you could try using a masking layer and mark the missing values to be ignored.

      There is no best way in applied machine learning, I recommend testing a suite of framings of the problem in order to discover what works best for your specific dataset.

  18. Avatar
    Abbas December 6, 2018 at 6:35 pm #

    Thanks for your article. I am working on crypto-price prediction, but I have lag in my predictions. I mean that my prediction is only based on my previous data: if the price at t is $10, my prediction would also be $10, which means that at time t+1 we should expect the price to be $10; in effect, I predict nothing. I have run your article’s code, and found that you may also have lag in your prediction. In addition, I have read your article about determining the baseline when predicting time series, and I want to know what the baseline is for household power consumption. Is it greater than 370? Can you explain more about LSTM lags?

  19. Avatar
    kany December 11, 2018 at 9:30 pm #

    Hello!! Great Article.
    I have a question. Why is your forecast model different from the one mentioned below:

    Does it really depend on the dataset?

    Thanks so much!!

    • Avatar
      Jason Brownlee December 12, 2018 at 5:53 am #

      Yes, models should suit the dataset.

      Also, the post you linked to should not be used as a reference as the approach is pretty poor.

      • Avatar
        kany December 13, 2018 at 2:51 am #

        Thanks Jason for replying to me!! I am new to and interested in this LSTM domain. To summarize, your program evaluates the model by calculating MSE and RMSE. How can I know exactly the total power that will be consumed, for example, next Sunday or Friday?
        In your code you use “yhat_sequence”, which contains each week’s prediction.
        Is it from this variable “yhat_sequence” that we know the total power that will be consumed?


  20. Avatar
    mk December 24, 2018 at 4:21 pm #

    To make a forecast, you retrieve the last observations as input data. I don’t think that’s the right way to do it, although this method is used in many papers and programs.

    A more realistic way to reflect the performance of the model is as follows:
    use the last 7 days of train data as input and forecast the next 7 days, and then use this output as the next input to forecast another 7 days. We use recurring forecasts to get all of the 2010 results. We compare the results with the whole test set, but without using the test data as input.

    In this way, we can avoid leakage of time in the test data.

    Thanks for your kind attention and look forward your prompt reply.

  21. Avatar
    mk January 4, 2019 at 7:38 pm #

    I want to achieve ConvLSTM Encoder-Decoder Model With Multivariate Input
    and modify the source 2 functions as follow

    8 means features

    • Avatar
      Jason Brownlee January 5, 2019 at 6:52 am #

      Nice work. Sorry, I don’t have the capacity to review your code changes.

  22. Avatar
    mk January 4, 2019 at 7:46 pm #

    The model can be trained, but forecasting and evaluation fail.
    Can you give me some advice?

    • Avatar
      Jason Brownlee January 5, 2019 at 6:52 am #

      What do you mean exactly?

      • Avatar
        mk January 5, 2019 at 5:35 pm #

        # reshape into [samples, time steps, rows, cols, channels]
        input_x = input_x.reshape((1, n_steps, 1, n_length, 8))

        The reshape function fails.

        For the ConvLSTM Encoder-Decoder Model With Multivariate (e.g. 8 features) Input, I’m not sure whether this model supports multivariate input.

        • Avatar
          Jason Brownlee January 6, 2019 at 10:17 am #

          Perhaps confirm that you worked through all steps and copied all code and that your environment is up to date.

  23. Avatar
    mk January 7, 2019 at 5:20 pm #

    Learning Diagnostics. Use diagnostics such as learning curves for the train and validation loss and mean squared error to help tune the structure and hyperparameters of a LSTM model.

    The train dataset is split into validation and train data. Validation sets are used to adjust loss.

    Validation sets do not use the scheme called walk-forward validation.

    The test dataset will use the scheme called walk-forward validation.

    Is my understanding of this proposal correct?

    • Avatar
      Jason Brownlee January 8, 2019 at 6:46 am #

      Not quite. The train/test/validation split is challenging or may not even make sense when using walk-forward validation (e.g. sequence or time series data).

  24. Avatar
    mk January 8, 2019 at 7:49 pm #

    All of the code uses this: mse = mean_squared_error(actual[:, i], predicted[:, i])
    The actual shape is 2D, but the predicted shape is 3D in some of the code.
    I’m not sure whether this is correct.
    predicted = array([[[1 ],
    [2 ],

    actual = array([[1, 2, 3, 4, 5, 6],….

  25. Avatar
    James January 11, 2019 at 5:02 am #

    Hi Jason,

    Another great post, thank you!

    I had a very general question: if my understanding is correct, these examples deal with splitting the data into train and test sets and then comparing the prediction with the test set with an RMSE. How do we make a prediction beyond the test set?

    For example:

    We train the model based on week 1 – week 9 data.
    We pass the model a sample of week 10 data
    How do we predict week 11?

    • Avatar
      Jason Brownlee January 11, 2019 at 7:54 am #

      In exactly the same way. e.g. model.predict()

      • Avatar
        James January 16, 2019 at 2:49 am #

        Thanks for the prompt response! Just a quick follow up – if I were to separate the training phase by saving the model and then performing predictions later on – would I still require the full history of the train data?

        Reason being, I notice that when calling evaluate_model you are not only training the model with the training data but also using it as history:

        history = [x for x in train]

        Does that imply that I would need the full training set data again for the prediction phase? Or is it enough to just use new test data as history and run predictions against the saved model?

        • Avatar
          Jason Brownlee January 16, 2019 at 5:50 am #

          Only enough history to make a prediction is required.

      • Avatar
        Noah January 17, 2019 at 12:04 am #

        Hi Jason, great post. I have a question related to James’ above.

        If I call model.predict() using the final week (e.g., Week 10) of my testing set as input data, I am predicting Week 11 values, not Week 10 values, correct?

        Thanks again

  26. Avatar
    Galen January 16, 2019 at 11:21 pm #

    Thanks Jason.

    I wonder, do you have a simpler example focusing only on the multi-step forecasting? This would be very helpful, since I’m only interested in that at the moment.

  27. Avatar
    PJ January 19, 2019 at 12:33 am #

    If you have multiple features predicting some dependent variable different from those features, can you think of each time step of these features as a sequence? That is, assuming each row is a time step and each column a feature (and that all features are normalized, Z-scored), does it make sense to use a plain LSTM on this sequence, even though the sequence is not temporal?

    • Avatar
      Jason Brownlee January 19, 2019 at 5:45 am #

      Sorry, I don’t follow your problem definition, perhaps you can elaborate it?

      • Avatar
        PJ January 20, 2019 at 6:31 am #

        Let’s say I am predicting US stock market (my Y) by looking at time series features such as UK and German stock market (X1 and X2). So, with 2 features, and let’s say the last week of time values, your Keras input would be (samples, 7, 2) in shape. Is this inherently better than just using X1 and X2 at the current time step to predict Y? That is, using (X1, X2) to predict Y in a way where input would be (samples,seq length = 2, channels = 1). Does this ever depend on the specific domain as well? To me, it makes sense that past values have a particular ‘pattern’ that correlates with future values. If you, on the other hand, combine X1 and X2 together, you are looking for a pattern/correlation *across* the features that determines the value. I have seen situations where the same problem has been tackled both ways, but I wonder if one is more likely to be successful than another

        • Avatar
          Jason Brownlee January 21, 2019 at 5:27 am #

          It depends on the specifics of the problem.

          Also, in general, LSTMs are terrible at time series forecasting compared to other methods.

  28. Avatar
    mk January 20, 2019 at 12:54 pm #

    Is the result of the model a single training or an average of multiple training in this post?

    • Avatar
      Jason Brownlee January 21, 2019 at 5:29 am #

      In general, model weights are updated at the end of each batch, there are many batches in each epoch and we fit for many epochs.

      • Avatar
        mk January 21, 2019 at 12:34 pm #

        The small batch size and the stochastic nature of the algorithm means that the same model will learn a slightly different mapping of inputs to outputs each time it is trained. This means results may vary when the model is evaluated.
        Are your results an average of model performance?

        • Avatar
          Jason Brownlee January 22, 2019 at 6:17 am #

          You can reduce the variance of the model by fitting the model many times and averaging the performance.

          That is not done in this tutorial.
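          A minimal sketch of the repeat-and-average idea, assuming a hypothetical evaluate_once() that stands in for one full fit-and-evaluate run of a model (here it only draws a synthetic score, it is not the tutorial's code):

```python
import numpy as np

def evaluate_once(seed):
    # Hypothetical stand-in for fitting the model once and returning its
    # overall RMSE on the test set; a real run would train the network here.
    rng = np.random.default_rng(seed)
    return 400.0 + rng.normal(scale=10.0)

def repeat_evaluate(n_repeats=10):
    # Run the stochastic evaluation several times and summarize the scores.
    scores = np.array([evaluate_once(seed) for seed in range(n_repeats)])
    return scores.mean(), scores.std()

mean_rmse, std_rmse = repeat_evaluate()
print(f"RMSE: {mean_rmse:.1f} (+/- {std_rmse:.1f})")
```

          Reporting the mean together with the standard deviation makes it easier to judge whether a difference between two models is larger than the run-to-run noise.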

  29. Avatar
    Ivan January 20, 2019 at 10:27 pm #

    Hi Jason, can you clarify how to evaluate multi-step forecasting, i.e. the mathematical formula behind it? In this case it is a 7-step forecast, so is the formula sum( sqrt(mse(t1)+mse(t2)+…+mse(t7)), sqrt(mse(t8)+…+mse(t14)), …)? Here ti is the difference between predicted and actual for time i.

    • Avatar
      Jason Brownlee January 21, 2019 at 5:34 am #

      You can evaluate each forecasted lead time separately, or combine all lead times and calculate a single error measure.

      I recommend the former approach and that is the approach used in this tutorial.
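      As a rough sketch of the two options, assuming the forecasts and actual values are stored as arrays of shape [n_forecasts, n_lead_times] (the function name and the toy numbers below are illustrative only):

```python
import numpy as np

def rmse_per_lead_time(actual, predicted):
    # actual, predicted: arrays of shape [n_forecasts, n_lead_times]
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_error = (actual - predicted) ** 2
    # One RMSE per lead time: average over the forecasts (rows) for each column.
    per_step = np.sqrt(squared_error.mean(axis=0))
    # One RMSE overall: average over every forecast and every lead time.
    overall = np.sqrt(squared_error.mean())
    return overall, per_step

# Toy data: 2 forecasts, 3 lead times each.
actual = np.array([[10.0, 12.0, 14.0], [11.0, 13.0, 15.0]])
predicted = np.array([[11.0, 12.0, 12.0], [12.0, 13.0, 17.0]])
overall, per_step = rmse_per_lead_time(actual, predicted)
print(per_step)  # RMSE for lead times +1, +2, +3
print(overall)
```

      The per-lead-time scores show where in the forecast horizon a model struggles, while the single overall score is convenient for ranking models.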

      • Avatar
        Ivan January 21, 2019 at 6:34 am #

        Thank you for your reply. How can we choose the model using this approach? There may be cases where model 1 has lower error for Monday to Wednesday and model 2 has lower error for Thursday to Saturday.

        • Avatar
          Jason Brownlee January 21, 2019 at 11:58 am #

          Same thing.

          If you can achieve lower overall error using a mixture of predictions of different models, then that is your “model”.
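          One way to picture such a mixture, as a sketch only: pick the better model for each lead time based on per-lead-time RMSE. The scores below are made-up numbers for illustration, and in practice the choice should be made on validation data rather than on the final test set.

```python
import numpy as np

# Hypothetical per-lead-time RMSE scores for two models (made-up numbers).
rmse_model_1 = np.array([380.0, 400.0, 390.0, 420.0, 430.0, 410.0, 450.0])
rmse_model_2 = np.array([395.0, 410.0, 400.0, 405.0, 415.0, 400.0, 455.0])

scores = np.vstack([rmse_model_1, rmse_model_2])
# Index of the model with the lowest error at each lead time (0 or 1).
best_model = scores.argmin(axis=0)
# Error of the mixture if each lead time uses its best model.
best_scores = scores.min(axis=0)
print(best_model)
print(best_scores)
```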

          • Avatar
            Ivan January 21, 2019 at 3:48 pm #

            Thank you so much! It sounds like an ensemble method.

          • Avatar
            Jason Brownlee January 22, 2019 at 6:20 am #

            No, you can achieve this with one model. You can achieve it with a direct model for each forecast lead time, and I show this in another tutorial:

          • Avatar
            Ivan January 22, 2019 at 11:45 am #

            Say I’m interested in predicting the probability distribution of household power consumption over the following 1-day period. Are there any methods that can predict the probability distribution? If so, how would you evaluate the accuracy of these stochastic predictions?

          • Avatar
            Jason Brownlee January 23, 2019 at 8:40 am #

            Probability refers to an event, what is the event? Usage above a threshold?

            If so, it is a 2D probability distribution. A start would be a probability per time interval, with a metric for comparing distributions per interval, like KL divergence.

          • Avatar
            Ivan January 23, 2019 at 9:37 am #

            Please let me clarify the question a bit. The models you developed in the tutorial deal with mean predictions, i.e. one prediction for one time step (the model may predict the consumption will be 500 for tomorrow). The result (500 consumption) is a mean prediction because the consumption is stochastic in nature (50% chance to be 450 and 50% chance to be 550). Is there any way to analyze this stochastic nature, or the probability distribution of each possible consumption outcome?

          • Avatar
            Jason Brownlee January 23, 2019 at 12:03 pm #

            Not quite. One model will make one deterministic forecast for each day.

            For a range of forecasts for each day, an ensemble (e.g. a bootstrap) of models is required from which a distribution could be estimated and interpreted as a prediction uncertainty.
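            A small sketch of the bootstrap idea, under loud assumptions: the history is synthetic, and fit_and_forecast() is a hypothetical stand-in for training one model on a resampled training set and forecasting the next week. The spread obtained this way only reflects variation between ensemble members, not all sources of uncertainty.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic history: 50 weeks x 7 days of daily consumption (not real data).
history = 1500.0 + rng.normal(scale=100.0, size=(50, 7))

def fit_and_forecast(weeks):
    # Hypothetical stand-in for "train a model, forecast the next week":
    # forecast each weekday as its mean over the sampled weeks.
    return weeks.mean(axis=0)

n_members = 200
forecasts = np.empty((n_members, 7))
for m in range(n_members):
    # Resample whole weeks with replacement to build one bootstrap training set.
    idx = rng.integers(0, len(history), size=len(history))
    forecasts[m] = fit_and_forecast(history[idx])

# Point forecast and a 95% band per day, estimated from the ensemble.
point = forecasts.mean(axis=0)
lower, upper = np.percentile(forecasts, [2.5, 97.5], axis=0)
print(point.round(1))
print(lower.round(1), upper.round(1))
```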

          • Avatar
            Ivan January 23, 2019 at 9:41 am #

            In other words, if the model predicts 500 for tomorrow, is there any way to evaluate the likelihood of it being 500 tomorrow and the probability of other possible outcomes?

          • Avatar
            Jason Brownlee January 23, 2019 at 12:03 pm #

            Yes, this is called a prediction interval:

          • Avatar
            Ivan January 23, 2019 at 12:20 pm #

            An ensemble of models sounds like a great idea to approximate a distribution for a range of forecasts. Can we then evaluate the accuracy by using KL divergence to compare the predicted distribution with the empirical distribution from the dataset?

            Do you think a Poisson distribution could be used to approximate the distribution of power consumption?

          • Avatar
            Jason Brownlee January 24, 2019 at 6:37 am #

            These are separate ideas, I don’t think they mix. E.g. prediction intervals and predicting a probability. A prediction interval is not a predicted probability of an event, it is the scope of uncertainty of a point prediction.

            I recommend reading some of the background material I’ve posted and perhaps start with a strong definition of what you want to model:

          • Avatar
            Ivan January 24, 2019 at 9:23 am #

            Sorry for the confusion. I referred to predicting probability distribution for all possible outcomes in the next time interval, not a prediction interval.

          • Avatar
            Jason Brownlee January 24, 2019 at 1:23 pm #

            The model cannot do this, as stated.

          • Avatar
            Ivan January 24, 2019 at 1:32 pm #

            Right. I am going to try an ensemble method and a Poisson distribution to do this. Thank you very much! It’s very helpful.

  30. Avatar
    mk January 21, 2019 at 5:25 pm #

    In the Encoder-Decoder model, which plays the more important role: the encoder or the decoder?
    In the ConvLSTM Encoder-Decoder model, the LSTM may play an important role.

    Does the encoder play a role in extracting features?

  31. Avatar
    Neha January 24, 2019 at 5:08 am #

    Hi, great tutorial! Thanks a lot for doing this.

    I have a question. Can you please explain what the evaluate_forecasts function is doing?
    Is it calculating the RMSE for all the days of all the weeks, or just for the last week predicted?
    Also, are the ‘scores’ for just the last week predicted? I ask because there are 7 of them.

    • Avatar
      Jason Brownlee January 24, 2019 at 6:49 am #

      It is calculating the RMSE for each day in the weekly forecast.

  32. Avatar
    Aryorobo January 24, 2019 at 4:29 pm #

    Hi Jason,
    I have a question. If I have 3 features (A, B, C) and I can access future information for 2 of them (B, C), how can I predict feature A multiple steps ahead? What does the input array look like for an RNN/LSTM? What is the best framing of the problem for this situation?

  33. Avatar
    Aryorobo January 25, 2019 at 12:10 pm #

    Thanks Jason, do you have experience with LSTM in NARX or something like A Dual-Stage Attention-Based Recurrent Neural Network for Time Series Prediction ?

  34. Avatar
    imGaboy January 31, 2019 at 3:59 am #

    Hi Jason. Thanks for your amazing tutorials. I have already read almost all the articles about this topic, but I’m trying to implement an LSTM model to make binary (or multiclass) classifications from raw log data (raw MOOC course log data -> user-level dropout/grade prediction).

    Sample from data:

    I have read lots of publications and tutorials which seem to be what I’m looking for, but couldn’t find any example of how to use them.

    My biggest challenge is making a prediction for individual users, and feeding the network with a user-level dataset.

    Do you have any idea?

  35. Avatar
    Lito February 2, 2019 at 11:59 pm #

    Hello Mr Brownlee

    I have to implement a multi-step forecasting project and I’m really confused, so I would appreciate it if you could help me.

    I have a lot of papers and for each paper a sequence of citations per year.

    Let’s say, for example:
    paper1 : (2000,1), (2002, 2), (2008, 3), (2011, 4), (2012, 5)
    paper2: (1990,3), (2003,1), (2015,4)
    paperN: (2007,3)

    My goal is to predict a paper’s citations in the next year (let’s say t+1) and also 5 years later (let’s say t+5), depending on the previous years’ citations.

    Which model is more suitable?
    Is it an autoregression problem?
    How do I deal with the different lengths of the sequences? Should I pad the sequences with zeros?
    Also, each sequence corresponds to a different paper.

    Any ideas or suggestions?

    Thanks in advance!

  36. Avatar
    Peter Shaw February 6, 2019 at 3:48 am #


    Great tutorial. I’m trying to understand whether a ConvLSTM Encoder-Decoder model, but with multivariate input, is the best model for my dataset.

    I have a simplified plasma simulation which has around 22,000 timesteps of data. For each timestep the plasma parameters are recorded at one of 200 locations, and at each location 12 different variables are recorded. The 12 variables are a function of each other and a function of their location.

    I have created the dataset so it is a 2D array of appended matrices so that for each variable, you have the spatial data of the 200 locations. i.e. Var1-Loc(0,1,2…198,199), Var2-Loc(0,1,2…198,199)….. Var12-Loc(0,1,2…198,199).

    So the 2D dataset is 2400 columns (12 variables @ 200 locations) with 22,000 rows

    There is a need to train the neural network and predict how the plasma will behave n-timesteps into the future. Would a ConvLSTM Encoder-Decoder Model With Multivariate Input be the best architecture to go for or do you suggest an alternative architecture?



    • Avatar
      Jason Brownlee February 6, 2019 at 7:52 am #

      Generally, my advice would be to test a suite of models/framings of the problem and discover what works best.

      It does sound like a convlstm is a good fit though.

      Let me know how you go.

  37. Avatar
    kany February 9, 2019 at 3:40 am #

    Hello Jason,
    Which function should I change if I want to predict one step?

    # split a univariate dataset into train/test sets
    def split_dataset(data):
        # split into standard weeks
        train, test = data[1:-328], data[-328:-6]
        # restructure into windows of weekly data
        train = array(split(train, len(train)/1))
        test = array(split(test, len(test)/1))
        return train, test

  38. Avatar
    Johan Ericson February 15, 2019 at 9:38 pm #

    Love it, saved my final year thesis!

    It would, however, be nice to have a tutorial on how to actually use the trained model to predict on new data and how to display the results in a useful way. By useful I mainly mean plotting the known data and the predicted data in a plot with dates (or time in general) on the x-axis.

    Your site and email courses have been gold when trying to learn this stuff! Keep it up!

  39. Avatar
    Jack Maugon February 19, 2019 at 2:35 am #

    Hey Jason – how would the CNN LSTM extend to multiple input time series & predicting multiple output time series features? Is it as simple as reshaping the Y to

    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], n_features))

    and changing the final layer of the network:


    Or do you also need to change the structure of the initial convolutional layers?



  40. Avatar
    Sriram February 25, 2019 at 11:26 pm #

    Hey Jason,

    In the example Encoder-Decoder LSTM Model With Multivariate Input, the model takes in multivariate input. I would like to know which feature it predicts and where that is specified in the code. I assume that it predicts the 1st input feature; correct me if I am wrong.


    • Avatar
      Jason Brownlee February 26, 2019 at 6:23 am #

      Yes, it is in the name of the section:

      “7. Encoder-Decoder LSTM Model With Multivariate Input”

      Perhaps re-read the tutorial?

  41. Avatar
    mk123qwe February 26, 2019 at 7:03 pm #

    With the walk-forward validation used in this post, how do you develop and tune the model?

  42. Avatar
    Mingkuan February 27, 2019 at 9:17 am #

    Hi Jason,

    Thanks for your post! I tried multivariate input for the CNN-LSTM and ConvLSTM models. I took the average of 100 iterations and compared it with the univariate input case. It looks like multivariate input does not improve the forecast a lot. Maybe it’s because I haven’t tuned the model yet. So my general question is: do more input variables always result in a better forecast?


    • Avatar
      Jason Brownlee February 27, 2019 at 2:38 pm #

      Well done!

      No, it really depends on the specifics of the dataset and model.

  43. Avatar
    J.Z February 28, 2019 at 9:37 pm #

    Thanks so much for this tutorial!

    I would like to predict some image characteristics, such as size, position, etc., based on search keywords.
    I have a CSV where image characteristics are given for each keyword (training data). For instance:

    Keyword  X0   Y0  Xn   Yn   Width  Height  position  ImgID
    cat      261  49  872  690  611    283     top       2
    cat      23   43  866  565  603    270     bottom    3

    Which LSTM model best fits such a task?
    Can it be considered a time series problem?

    Thank you

    • Avatar
      Jason Brownlee March 1, 2019 at 6:18 am #

      It looks like a regression problem, but not a time series prediction problem.

      Perhaps try an MLP?

  44. Avatar
    Niko March 1, 2019 at 8:39 am #

    Can you tell, how well your model is performing in terms of MAPE?

  45. Avatar
    Jc.zhu March 4, 2019 at 8:42 pm #

    In this case, why don’t you normalize the dataset during processing? I found the loss is very large when I train the networks.
    Thanks for your reply!

    • Avatar
      Jason Brownlee March 5, 2019 at 6:36 am #

      Yes, it is a good idea to normalize the input and output data prior to modeling.

      I left out that step of data preparation to focus on the modeling part of the tutorial. In other tutorials when I included data prep, more people were confused.

      • Avatar
        Jc.zhu March 5, 2019 at 1:26 pm #

        Thank you, Jason.
        I normalized the input and got the ideal loss. I want to invert the normalization when calculating RMSE, but the calculation still uses the normalized values. Maybe you can give me some advice.

        • Avatar
          Jason Brownlee March 5, 2019 at 2:23 pm #

          You can use inverse_transform() on the scikit-learn scaler object.

        • Avatar
          sanjay January 13, 2020 at 9:49 pm #

          @Jc.zhu did you work out where to put the inverse_transform()?

          • Avatar
            Nick February 20, 2020 at 7:40 am #

            I too am struggling with this. I think the inverse_transform would be placed in the function “evaluate_forecasts” but I haven’t worked out the right way to apply it. As I understand it, the whole matrix that was initially passed into the “fit_transform” function needs to be passed. Not sure how to do that when it seems we are chunking only part of the matrix through the “evaluate_forecasts” function. Anyone figured this one out???


          • Avatar
            Jason Brownlee February 20, 2020 at 11:27 am #

            See this re transforms and inverse transforms:
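            One possible way around the "whole matrix" issue raised above, as a sketch under stated assumptions: fit a separate scaler on the target column only, so forecasts can be inverted on their own before computing RMSE. The numbers are synthetic and the variable names are illustrative; MinMaxScaler and inverse_transform() are standard scikit-learn API.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic daily totals for the target variable only (assumption: one column).
train_target = np.array([1200.0, 1500.0, 1800.0, 900.0, 2100.0]).reshape(-1, 1)

# A scaler dedicated to the target, fit on training data only.
target_scaler = MinMaxScaler()
train_scaled = target_scaler.fit_transform(train_target)

# Pretend these are scaled forecasts from the model: [n_forecasts, 7 days].
pred_scaled = np.array([[0.25, 0.5, 0.75, 0.0, 1.0, 0.5, 0.25]])

# inverse_transform expects [n_samples, 1], so flatten, invert, then restore shape.
pred = target_scaler.inverse_transform(pred_scaled.reshape(-1, 1)).reshape(pred_scaled.shape)
print(pred)
```

            The same inversion would be applied to the scaled actual values, so that the RMSE is computed in the original units.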

  46. Avatar
    Sha March 5, 2019 at 3:12 pm #

    Hi Jason,

    Thanks for your post!
    I would like to know how to obtain the internal representation values of the last model (ConvLSTM Encoder-Decoder Model With Univariate Input).

    Thanks again.

  47. Avatar
    Oguz March 12, 2019 at 7:31 pm #

    Thank you for the tutorial.

    I have a domain-related question. How reasonable is it to sum the power values over 1-day periods? It is like you measure your velocity every minute (80 mph, 75 mph, 85 mph…) then you sum all those up to say you have a velocity of ~24 * 60 * 80 mph for that day. It doesn’t make sense physically but it may not be affecting the forecasting accuracy. If we definitely want to downsample to daily intervals it should be for energy, not power (you can indeed sum up distance covered, but not velocity).
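    To make the units concrete, here is a tiny sketch assuming the readings are minute-averaged power in kilowatts (an assumption about the data, with a synthetic constant load). Under that assumption, the daily sum differs from daily energy in kWh only by a constant factor of 60.

```python
import numpy as np

# One hypothetical day of minute-averaged power readings in kW (1440 minutes).
power_kw = np.full(24 * 60, 1.5)

# Summing the readings gives "kW-minutes", not a power value.
daily_sum = power_kw.sum()
# Dividing by 60 converts kW-minutes into energy in kWh.
energy_kwh = daily_sum / 60.0
# The mean is the quantity that is still a power (kW).
mean_power_kw = power_kw.mean()
print(daily_sum, energy_kwh, mean_power_kw)
```

    Because the conversion is a constant rescaling, a model trained on daily sums learns the same patterns as one trained on daily kWh; only the units of the reported error change.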

  48. Avatar
    Kaiche March 13, 2019 at 9:47 pm #

    Hi Jason

    Thanks for your tutorial, it helps a lot.

    I want to stack two ConvLSTM layers, that is, replace the LSTM with a ConvLSTM. For example, with a time_step of 3, the input is [10,3,25,25,1] and the output is [10,3,2].

    The question is about the line model.add(RepeatVector(n_outputs)). When I set n_outputs = 3 as the time step, I get an error that the ConvLSTM expects ndim = 5 but found ndim = 3.

    Based on your experience, what could the problem be? We need the encoded output to be repeated once per time step.

    • Avatar
      Jason Brownlee March 14, 2019 at 9:23 am #

      I’m not sure the ConvLSTM can be used directly in the encoder-decoder; some changes to the model may be required. I don’t have an example, so you may have to prototype a few approaches.

  49. Avatar
    Jaime March 14, 2019 at 11:51 am #

    Hello Jason,
    I am adapting the code from the last section of this post to predict trajectories, so I need an output such as (1,18,2). The 18 is because I am predicting 18 time steps ahead and the 2 is because I am predicting x,y.
    How can I adapt the model to have that output? Currently I am having this error:

    ValueError: Error when checking target: expected time_distributed_2 to have shape (18, 1) but got array with shape (18, 2)

    By the way, your posts are amazing. Thanks very much for creating them 😀

  50. Avatar
    Melisa March 21, 2019 at 8:04 pm #

    Hi Jason,

    Congrats for the blog, it is great and really useful.

    I am trying to do a multi-step prediction of a continuous signal. Based on the past 100 samples of the signal I try to predict the next 10. It is univariate input and output but multi-step prediction. I used the model you propose in the “Encoder-Decoder LSTM Model With Univariate Input” section.

    My results are a bit curious as I observe that the first 2 or three immediate samples have a higher error than the rest. Basically, it is more difficult for the network to guess what is going to happen on the next second than 3 seconds from now. Do you by chance have any clue of what can be happening? Maybe I am not using the right approach/model?

    Thank you!

    • Avatar
      Jason Brownlee March 22, 2019 at 8:23 am #

      Perhaps try a suite of different models in order to see what works best for your specific dataset.

      This process may be helpful:

      • Avatar
        Melisa Lopez March 26, 2019 at 6:08 pm #

        Thank you! I changed the activation functions and the optimizer and it worked.

        I have yet another question. Is it possible to predict two features at the same time? What I mean is that my output vector Y would have the shape Y(samples, timesteps, 2). Would that be possible? I could not find any example in your blog and neither in a quick search in Google and I was wondering if that is possible at all or you should use the same model twice for each of the features that you would like to predict.

        Thanks again. 🙂

  51. Avatar
    Sai March 23, 2019 at 9:57 pm #

    In the to_supervised function, the output array dimensions were X=[1099, 7, 1] and y=[1099, 7].
    Why can’t they be the same?
    Will the model still fit if we give y the same dimensions as X?

    • Avatar
      Jason Brownlee March 24, 2019 at 7:05 am #

      The model output must be one vector for each sample.

      The output may be 3d in the case of using an encoder-decoder model.
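      A short shape-only sketch of the two cases, using the dimensions quoted in the question (the arrays are zeros, purely to show shapes). It assumes the encoder-decoder ends in a time-distributed output layer that emits one value per output step.

```python
import numpy as np

n_samples, n_input, n_out = 1099, 7, 7

# Input is always 3D: [samples, timesteps, features].
X = np.zeros((n_samples, n_input, 1))
# Vector-output model: one 7-value vector per sample, so y stays 2D.
y = np.zeros((n_samples, n_out))

# Encoder-decoder with a time-distributed output: y is reshaped to 3D,
# [samples, output timesteps, 1 value per step].
y_3d = y.reshape((y.shape[0], y.shape[1], 1))
print(X.shape, y.shape, y_3d.shape)
```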

      • Avatar
        sai March 24, 2019 at 8:53 pm #

        1. Does the vector-output model remember patterns from long ago and let them influence the present prediction?
        2. I want to use this model in a web app. If I first submit my data and request 12 output values, and then submit the same data again and request 24 output values, will values from the previous model carry over into the new model? Will this work?

  52. Avatar
    Antonio March 24, 2019 at 6:16 am #

    Hi Jason,

    Is there a reason for this well-known problem in time series forecasting with LSTMs: 1-step lagged predictions? See below:

    I’m trying to use multivariate input to predict a univariate value 1 step into the future using an LSTM, and I’m facing this problem of shifted predictions. Any light on this problem would be awesome.

    Great job on this website, congrats. I’m buying your book on Deep Learning for Time Series Forecasting now 🙂

    • Avatar
      Jason Brownlee March 24, 2019 at 7:10 am #

      Yes, it suggests the model has learned a persistence model, learn more here:
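      A quick way to check for this, sketched on synthetic data: compare the model's error with a naive persistence forecast and see how closely the predictions match the previous observation. Here model_pred is a hypothetical model output constructed to behave like persistence.

```python
import numpy as np

def rmse(a, b):
    # Root mean squared error between two equally shaped sequences.
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=200))  # synthetic random walk

actual = series[1:]        # value at t+1
persistence = series[:-1]  # naive forecast: repeat the value at t

# Hypothetical model output that has effectively learned persistence.
model_pred = persistence + rng.normal(scale=0.01, size=persistence.shape)

# Skill near 0 means the model is no better than persistence.
skill = 1.0 - rmse(actual, model_pred) / rmse(actual, persistence)
# A tiny value here means the predictions are just the previous observation.
lag_match = rmse(model_pred, persistence)
print(round(skill, 3), round(lag_match, 3))
```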

      • Avatar
        Antonio March 24, 2019 at 7:28 am #

        Thanks. I’ll try to solve it.

        I’m training a LSTM with multiple sequences of 100 time steps (t-100, t-99, …, t ) for 8 sensor measurements (multivariate time series). Then I try to predict for t+1 the value of one of the 8 sensors (and then the problem of shifting happens).

        Do you think there is any model better suited to making accurate predictions for this kind of problem than an LSTM? Maybe a CNN-LSTM? I have bought your book, but I’d like to focus on the most promising techniques for modelling this problem as well as possible. I’d appreciate any help in focusing my search through your book.

        Thanks Jason!

        • Avatar
          Jason Brownlee March 25, 2019 at 6:37 am #

          I have had good success with CNNs and CNN-LSTMs.

          Nevertheless, I’d recommend starting with a naive method, then a linear method, then try a suite of neural nets in order to discover what works well/best.

  53. Avatar
    Antonio March 25, 2019 at 10:32 am #

    I understand.

    Please, could you give me a few names of promising algorithms for my specific problem of time series forecasting that support multivariate data?

    Deep Learning: CNN, CNN-LSTM
    Others for neural nets or classical ones?


  54. Avatar
    steven March 26, 2019 at 4:57 am #

    how can we extract the predicted values for the corresponding actual values for the week?

  55. Avatar
    steven March 26, 2019 at 7:32 am #

    # plot scores
    days = ['sun', 'mon', 'tue', 'wed', 'thr', 'fri', 'sat']
    pyplot.plot(days, scores, marker='o', label='lstm')

    Which week is the above code snippet for? Can you please provide the dates instead of the days in the plot?

  56. Avatar
    msu March 26, 2019 at 4:18 pm #

    Why is it that whenever I use an LSTM it displays a “using tensorflow backend” error? I searched and couldn’t find any solution.

    • Avatar
      Jason Brownlee March 27, 2019 at 8:55 am #

      That is not an error, it is an information message.

  57. Avatar
    Sanjoy Datta March 27, 2019 at 11:04 pm #

    Learned a lot of things today. Want to try out on stock prices.

    • Avatar
      Jason Brownlee March 28, 2019 at 8:15 am #


      Perhaps not stock prices though, they’re not predictable.

  58. Avatar
    steven March 29, 2019 at 6:06 am #


    Every time I run the model I get different predicted values. How can I get consistent values?

    I added these lines at the start of my code, but the problem still persists:

    from numpy.random import seed
    from tensorflow import set_random_seed, random_uniform

  59. Avatar
    Elie Santi April 5, 2019 at 8:56 pm #


    I’m trying to use the ConvLSTM Encoder-Decoder to encode and decode an 8 Hz time series dataset, but I’m completely confused by input_shape. At the beginning I have a dataset of 1219810 rows and 8 columns. I decided to reshape it into 121981 samples of 10 rows and 8 columns. How should I then reshape my data for the ConvLSTM model? I tried data.reshape(121981, 1, 10, 8, 1), but that returned an error saying the TimeDistributed layer does not expect this shape. I read the article but I’m too confused to adapt it to my dataset.

    I would really appreciate some help ! Thanks !

    • Avatar
      Jason Brownlee April 6, 2019 at 6:47 am #

      Perhaps you can use the above example as a starting point, then adapt it for your specific dataset?

  60. Avatar
    SURAVI KALITA April 7, 2019 at 4:06 am #

    Hi Jason,
    Do you have a solution to the same problem in R?

    That is, a version of How to Develop LSTM Models for Multi-Step Time Series Forecasting of Household Power Consumption in which you solve the problem using R. Thanks

    • Avatar
      Jason Brownlee April 7, 2019 at 5:34 am #

      Sorry, I don’t have examples of time series in R. I may cover the topic in the future.

  61. Avatar
    Fazano April 8, 2019 at 4:35 pm #

    Hi Jason, if I want to forecast 30 days ahead based on your code, must the final Dense layer have 30 units?

  62. Avatar
    Marthijn April 10, 2019 at 10:33 pm #

    Hi Jason,

    In the text you mention that we need to iterate over the time steps and divide the data into overlapping windows for training the neural network. You basically have an input-output pair, feed it to the neural network, then move one day ahead so that the first value that was previously in the output list is now the last value in the input list, and add another unseen value to the output list. There will be scenarios where the input list is the same as the output list from 7 steps back. Won’t this cause data leakage? You are using input that the neural network has already seen, namely as output.

    It seems to fit the description you use of data leakage pretty well, as seen here:
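    The overlap described above can be seen in a small sketch. to_windows() is a simplified, hypothetical stand-in for the tutorial's to_supervised(), applied to a toy integer series. Note that the overlap occurs between training windows only; as long as the test weeks are split off before windowing, no test values appear in training.

```python
import numpy as np

def to_windows(series, n_input=7, n_out=7):
    # Slide one step at a time, pairing n_input past values with the
    # n_out values that follow them.
    X, y = [], []
    for start in range(len(series) - n_input - n_out + 1):
        in_end = start + n_input
        X.append(series[start:in_end])
        y.append(series[in_end:in_end + n_out])
    return np.array(X), np.array(y)

series = np.arange(28)  # toy "daily" series: 0, 1, ..., 27
X, y = to_windows(series)
print(X.shape, y.shape)
print(y[0])  # output of the first window
print(X[7])  # input of the window 7 steps later holds the same values
```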

  63. Avatar
    Michael April 12, 2019 at 8:41 am #

    Hey Jason, Thank you so much for these posts I really love them and rely on them.

  64. Avatar
    Amelie April 16, 2019 at 7:32 pm #

    Hello Jason,

    I found this type of error.
    I would like to make a forecast in minutes

    Using TensorFlow backend.
    [samples(weeks), timesteps(minutes), features]: (36, 10080, 1)
    [samples(weeks), timestemps(minutes), features]: (5, 10080, 1)
    (36, 10080, 1)
    (5, 10080, 1)
    Traceback (most recent call last):
    File “C:\Users\Desktop\”, line 149, in
    model = build_model(train, n_input)
    File “C:\Users\Desktop\”, line 74, in build_model
    train_x, train_y = to_supervised(train, n_input)
    File “C:\Users\Desktop\”, line 69, in to_supervised
    return array(X), array(y)

    I need your advice!

    • Avatar
      Jason Brownlee April 17, 2019 at 6:58 am #

      Sorry to hear that.

      Perhaps try using less data?
      Perhaps try running on a different machine?

  65. Avatar
    Wasabi April 20, 2019 at 12:13 am #

    Dear Dr. Jason:
    I’m a big fan of you, and I’m very interested in your LSTM research. When I run # univariate multi-step lstm, I set up the Python 3 running environment according to your instructions. What’s the matter? I need your help! Thank you!
    Traceback (most recent call last):
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\”, line 78, in
    globals = run_file(file, None, None)
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\”, line 35, in run_file
    pydev_imports.execfile(file, globals, locals) # execute the script
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\_pydev_imps\”, line 18, in execfile
    exec(compile(contents+”\n”, file, ‘exec’), glob, loc)
    File “D:/univariate multi-step”, line 6, in
    from sklearn.metrics import mean_squared_error
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\_pydev_bundle\”, line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
    File “E:\anaconda3-4.4.0\envs\dlwin36\lib\sklearn\metrics\”, line 7, in
    from .ranking import auc
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\_pydev_bundle\”, line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
    File “E:\anaconda3-4.4.0\envs\dlwin36\lib\sklearn\metrics\”, line 29, in
    from ..utils.multiclass import type_of_target
    File “D:\Deep Learning\PyCharm 2017.2.3\helpers\pydev\_pydev_bundle\”, line 21, in do_import
    module = self._system_import(name, *args, **kwargs)
    File “E:\anaconda3-4.4.0\envs\dlwin36\lib\sklearn\utils\”, line 21, in
    from ..externals.six import string_types
    ImportError: cannot import name ‘string_types’

  66. Avatar
    Johan Ericson May 7, 2019 at 12:34 am #

    There’s one (well there’s many but mainly one) part that’s confusing me a bit…
    In the walk forward validation in the evaluate_forecast function.
    What if we have only one week of test data, would this still give valid predictions? Wouldn’t that mean that the predictions are based only on the data from the training set used for seeding?

    I’ve followed this tutorial and applied to a problem to predict two hours ahead based on the last 24 hours of power consumption. But this part is giving me a headache…

  67. Avatar
    George May 9, 2019 at 10:42 pm #

    Hi Jason and thanks for the awesome tutorial.

    I must tell you that the code does not work due to using the split numpy function in the split_dataset function.

    It returns array split does not result in an equal division.

    I tried to use np.array_split, which is OK, but then we have a problem at the

    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2])) line inside the to_supervised function.

    It returns tuple index out of range because train has just one dimension.

    If you change it to data = train.reshape((train.shape[0], 1)), it returns:

    setting an array element with a sequence when calling fit method

    (For the last error message above, take into account that I tested it with the TensorFlow 2.0 API, so I am not sure whether version 1 has the same problem.)

    Also, if we have enough weeks (not 159 as you say, but 800 for example), can we instead use the version with the shift in the to_supervised function?


    def to_supervised(input_arr, shift=1):
        df = pd.DataFrame(input_arr)
        columns = [df.shift(i) for i in range(1, shift + 1)]
        df = pd.concat(columns, axis=1)
        df.fillna(0, inplace=True)
        values = df.values
        X, y = values[:, 0], values[:, 1]
        X = np.expand_dims(X, axis = 1)
        return X, y

    and use these X,y inputs in the build model?



  68. Avatar
    George May 9, 2019 at 11:18 pm #

    Jason , I was wrong!

    I just saw that you are using the consumption_days.csv and not the consumption.csv.

    I didn’t notice that, sorry.

    You can ignore my previous message, except for the last part where I ask if we can somehow use the shift function.


  69. Avatar
    Adam May 16, 2019 at 3:54 am #


    Thank you for putting these amazing tutorials together. I really appreciate the effort you put in to make various machine learning techniques understandable.

    I’m trying to understand the difference between your model architecture and the one presented in the Keras blog. Is the main difference that the number of steps you are trying to predict is fixed, whereas the approach in the Keras blog would work for a variable output length?

    • Avatar
      Jason Brownlee May 16, 2019 at 6:35 am #

      Yes, the Keras blog uses a dynamic RNN, whereas I typically use a fixed length input/output as it greatly simplifies the code.

      • Avatar
        Eli October 10, 2019 at 8:34 am #

        So I want to jump on this thread. Can’t that be fixed by changing the number of states that are returned with the RepeatVector(n) layer? For instance, take an input of 100 timesteps with an output of 5 timesteps. The input would have shape (100, features) for your input layer, and the next layer would be RepeatVector(5). Say you wanted to predict 10 timesteps into the future next; would you then just change 5 to 10 in the RepeatVector layer?

        Or am I missing something?

  70. Avatar
    George May 17, 2019 at 6:29 pm #

    Hi Jason!

    I wanted to ask you.

    If I use the metrics ‘mae’ and ‘mape’ in the compilation:

    model.compile(loss='mse', optimizer='adam', metrics=['mae', 'mape'])

    the mae has values around 0.15-0.14 and mape around 110.

    Can we say something about this (a small/good mae and a large/bad mape)?
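
One common reason for a small MAE alongside a very large MAPE is that some actual values are close to zero, which is typical after scaling data to [0, 1]. The sketch below uses made-up numbers (not values from the tutorial) to show the effect with plain NumPy.

```python
import numpy as np

# Hypothetical scaled values: several actuals sit close to zero.
actual = np.array([0.02, 0.50, 0.01, 0.80, 0.05])
predicted = np.array([0.15, 0.40, 0.12, 0.70, 0.20])

abs_err = np.abs(actual - predicted)
mae = abs_err.mean()
# MAPE divides each error by the actual value, so near-zero actuals dominate.
mape = 100.0 * (abs_err / np.abs(actual)).mean()

print(f"MAE:  {mae:.3f}")
print(f"MAPE: {mape:.1f}%")
```

Here the absolute errors are all similar in size, but the two smallest actual values contribute most of the percentage error.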


  71. Avatar
    单沙嘉 May 20, 2019 at 7:05 pm #

    I have a question: how can I use multiple features (input variables) in the CNN-LSTM?

  72. Avatar
    Bosko May 21, 2019 at 8:08 pm #

    In the section “LSTM Model With Univariate Input and Vector Output”, looking at the complete code, I noticed something that I don’t understand in the for loop at line 107. Why isn’t line 113 (history.append(test[i,:])) placed before line 109 (where you make the prediction yhat_sequence)? I ask because you first send the last 7 days of the train set to the prediction (although the model was trained on them). Shouldn’t you first update history from line 104 with new data from the test set (unknown to the model) and then make a prediction on it, rather than on known data like the last 7 days of the train set?

    • Avatar
      Jason Brownlee May 22, 2019 at 8:03 am #

      We update history after making a prediction to simulate receiving the real observation only after the forecast has been made. This is called walk-forward validation.
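
The ordering described above can be sketched as follows. This is a minimal illustration, not the tutorial's code: `naive_forecast` is a hypothetical stand-in for the fitted model, and the data is synthetic.

```python
import numpy as np

def naive_forecast(history, n_out=7):
    # Stand-in for model.predict(): repeat the last observed week.
    return np.array(history[-1][:n_out])

# Hypothetical data: 3 training weeks and 2 test weeks of 7 daily values.
train = np.arange(21, dtype=float).reshape(3, 7)
test = np.arange(21, 35, dtype=float).reshape(2, 7)

history = [week for week in train]
predictions = []
for i in range(len(test)):
    # 1. forecast the next week using only what has been observed so far
    yhat = naive_forecast(history)
    predictions.append(yhat)
    # 2. only now does the real observation for that week become available
    history.append(test[i])

predictions = np.array(predictions)
```

The first forecast is made from the last training week. Test week `i` is never seen before its own forecast has been stored, which is why the append comes after the prediction.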

  73. Avatar
    Markus May 23, 2019 at 8:01 pm #


    Thanks a lot for this blog post. Could you please explain why, in its first loop, the evaluate_forecasts function iterates through the columns (the features) and not the rows (the samples)?

    I would expect:
    mse = mean_squared_error(actual[i, :], predicted[i, :])
    instead of
    mse = mean_squared_error(actual[:, i], predicted[:, i])

    • Avatar
      Jason Brownlee May 24, 2019 at 7:51 am #

      We are calculating the error for each column, e.g. forecasted day across the forecasts made.

      We are answering the question, when we forecast +1 how much error do we make, ditto for +2, +3, etc.
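
A small sketch of the per-column calculation, using made-up forecasts where the error grows with lead time. The array names mirror the discussion, but the values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical results: 3 weekly forecasts (rows) of 7 days each (columns).
actual = np.array([[10., 12., 11., 13., 12., 15., 16.],
                   [11., 13., 12., 14., 13., 16., 17.],
                   [ 9., 11., 10., 12., 11., 14., 15.]])
# Each forecast is off by 1 on day +1, by 2 on day +2, and so on.
predicted = actual + np.array([1., 2., 3., 4., 5., 6., 7.])

# One RMSE per column: the error at lead time +1, +2, ..., +7.
rmse_per_day = np.sqrt(((actual - predicted) ** 2).mean(axis=0))
# A single overall RMSE across every forecast value.
rmse_overall = np.sqrt(((actual - predicted) ** 2).mean())
```

Averaging over `axis=0` collapses the rows (the separate weekly forecasts) and keeps one score per forecast day.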

      • Avatar
        Markus May 24, 2019 at 4:12 pm #

        Please correct me if I am wrong:

        – Each of the actual and predicted arrays has 2 axes
        – shape[0] corresponds to the number of days the model has predicted, e.g. +1, +2, +3
        – shape[1] corresponds to the features we have, i.e. global_active_power, sub_metering_1, sub_metering_2, etc.

        • Avatar
          Jason Brownlee May 25, 2019 at 7:40 am #

          Not quite, we are only forecasting power consumption.

          Output is rows and cols where rows are the weekly forecasts made and cols are the days in each forecast.

          • Avatar
            Markus May 26, 2019 at 1:09 am #

            I guess that everywhere it says:

            if out_end < len(data):

            Should be replaced with

            if out_end <= len(data):

            As otherwise we would miss one training sample we could learn from.

          • Avatar
            Jason Brownlee May 26, 2019 at 6:47 am #

            We are working with zero-offset arrays, I believe your change will introduce a bug.

          • Avatar
            Markus May 26, 2019 at 7:31 am #

            the out_end variable is used 4 lines afterwards as the following:

            y.append(data[in_end:out_end, 0])

            where, with this change, its exclusive upper bound would go up to len(data), which means the last index included in the slice would be len(data) - 1 (as expected); currently it only goes up to len(data) - 2.

            Without this change, the following array is missing as the last element of the training feature array train_x:

            [1455.2 ]

            And the corresponding missing training label inside train_y for that is:

            [2195.452 2094.098 2047.968 2451.11 2211.892 1224.252 1309.268]

          • Avatar
            Jason Brownlee May 27, 2019 at 6:42 am #

            Thanks, I’ll schedule time to investigate.

            Update: Fixed!
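
The off-by-one discussed in this thread can be reproduced on a toy series. The function below is a simplified sketch of a windowing routine in the spirit of `to_supervised()`; the `inclusive` flag is a hypothetical addition used only to compare the two conditions.

```python
import numpy as np

def to_supervised(data, n_input, n_out, inclusive=True):
    # data: 2D array [days, features]; column 0 is the forecast target.
    X, y = [], []
    for in_start in range(len(data)):
        in_end = in_start + n_input
        out_end = in_end + n_out
        # out_end is an exclusive slice bound, so it may equal len(data)
        ok = out_end <= len(data) if inclusive else out_end < len(data)
        if ok:
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 0])
    return np.array(X), np.array(y)

data = np.arange(20, dtype=float).reshape(20, 1)  # hypothetical 20 days
X_strict, y_strict = to_supervised(data, 7, 7, inclusive=False)
X_incl, y_incl = to_supervised(data, 7, 7, inclusive=True)
```

With `<` the last possible window is skipped (6 samples); with `<=` it is kept (7 samples), and the final target window ends on the last observation in `data`.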

  74. Avatar
    Jack May 28, 2019 at 12:47 pm #

    Hi, Jason
    When I used the ConvLSTM, the amount of data and the parameters were similar to those used in your tutorial, and my loss function was ‘mae’. When training the model, the loss on the training set and the validation set did not change and stayed at the same value. What is the most likely reason?

    • Avatar
      Jason Brownlee May 28, 2019 at 2:43 pm #

      Perhaps the model requires tuning to your specific dataset.

      • Avatar
        Jack May 28, 2019 at 3:51 pm #

        Thank you for answering my question, but I still don’t quite understand why the loss value does not decrease or becomes nan. The data I used is the precipitation data from your other tutorial.

        • Avatar
          Jason Brownlee May 29, 2019 at 8:35 am #

          Perhaps try scaling the data prior to modeling to see if that makes a difference?

      • Avatar
        Jack May 29, 2019 at 1:13 pm #

        I am grateful for your answer. The first thing I do to the data is apply MinMaxScaler with range (0, 1), but the loss still does not fall, or it becomes nan. I resampled your precipitation data to daily precipitation and try to predict a whole year of daily precipitation. My network structure is the same as yours in the tutorial, but the 363 days of precipitation in a year are divided into 33 subsequences; each subsequence contains 11 days of precipitation data. That is: train_x.reshape((train_x.shape[0], 33, 1, 11, 1)) and train_y.reshape((train_y.shape[0], train_y.shape[1], 1)).

        • Avatar
          Jason Brownlee May 29, 2019 at 2:48 pm #

          Perhaps try varying the model configuration and try debugging inputs and outputs to the model each epoch to debug/nail down the cause of your issue.

    • Avatar
      Jack May 28, 2019 at 3:47 pm #

      In addition, when ‘mse’ is used as the loss function, the loss value quickly becomes nan, and again I don’t know the reason.

  75. Avatar
    Tim May 29, 2019 at 8:05 pm #

    Hi Jason, Thanks for such a great tutorial.

    I’m using an adapted multi-step LSTM model for forecasting, and am getting slightly strange behaviour. Single-step predictions look reasonable, but when comparing multi-step predictions, they all take the same forecast trajectory, which follows the data’s curve. Imagine each N-step projection taking the same ‘shape’ but being translated at every step to be centered around the curve.

    Is there some obvious reason why this may happen?

    Thanks again!

  76. Avatar
    ICHaLiL May 30, 2019 at 1:08 am #

    Dear Jason,

    In this example (using a different dataset), I tried to increase the input size (timesteps) from 7 to 70 and the loss is always NaN, so the code fails.

    Which ways can I try to overcome it?

    • Avatar
      Jason Brownlee May 30, 2019 at 9:02 am #

      Perhaps investigate the cause, e.g. data, exploding gradient, vanishing gradients, etc.

  77. Avatar
    al krinker May 30, 2019 at 6:41 am #

    your forecast method is really a validation method, i.e. it uses existing data to make predictions in supervised learning model.

    not really the same as forecasting values 2-3 months ahead.

    • Avatar
      Jason Brownlee May 30, 2019 at 9:07 am #

      Correct, we are evaluating the models.

      To use a model, fit it on all available data and forecast as follows:

      yhat = model.predict(newX)

      • Avatar
        al krinker June 6, 2019 at 11:15 am #

        what’s newX variable?

        • Avatar
          al krinker June 6, 2019 at 11:21 am #

          guess what i mean was… in your example, how do you forecast future year of values? what do you supply as newX

        • Avatar
          Jason Brownlee June 6, 2019 at 2:18 pm #

          New input to the model required to make a prediction.

          Perhaps this will help:

          • Avatar
            al krinker June 6, 2019 at 11:28 pm #

            yeah, i saw that, but it seems you need to construct your own data for the future to predict the outcome.
            i used stats package in python such as ARIMA, VAR and there is a handy function that would predict future without any input data, so i was hoping to see something like that here.

            LSTM just does not seem to be a very good tool to use in this case if you simply want to know future predictions and are uncertain about what the future feature values would be.

            i almost think that LSTM can give you answers about the future if you want to throw in some feature values and are curious about what this input would produce

            but it is not meant to be a true forecasting tool like arima or var, since you can’t say: given this input, predict 2 years’ worth of data for me in the future

          • Avatar
            Jason Brownlee June 7, 2019 at 8:01 am #

            They are models that take input and predict an output.

            yhat = f(X)

            A predictive model that takes no input would be odd!?

            Perhaps I don’t understand what you’re trying to achieve?

          • Avatar
            al krinker June 7, 2019 at 11:55 pm #

            Here is an example to make my point:
            Let’s say that you are using time series weather data to predict temperature
            you have let’s say following columns:
            – observed temp for that day, wind, precipitation, humidity

            you train your model based on the historical data,
            you have the model

            now… to forecast temperature for the upcoming weekend, what do you do?
            Do you supply your model wind, precipitation and humidity? How do you know them in advance?

            I guess you can use LSTM model to predict the temp given these values… but you cant predict the weather few days out unless you try to guess what the precipitation and humidity might be like for that weekend?

            almost feel i need another model to predict humidity and precipitation for the weekend via regression, and then use LSTM after that step

          • Avatar
            Jason Brownlee June 8, 2019 at 7:00 am #

            You must design the inputs and outputs to the model based on what needs to be predicted and what you will have available in order to make the prediction.
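
For the models in this tutorial, the input is a window of past observations, so the new input for a real forecast can be built from the most recent history alone. The sketch below assumes a single variable and uses `MeanModel`, a hypothetical stand-in for a fitted Keras model, so that it runs without Keras installed.

```python
import numpy as np

class MeanModel:
    # Hypothetical stand-in for a fitted Keras model: predict() returns
    # the mean of each input window repeated n_out times.
    def __init__(self, n_out):
        self.n_out = n_out

    def predict(self, X, verbose=0):
        window_mean = X[:, :, 0].mean(axis=1, keepdims=True)
        return np.repeat(window_mean, self.n_out, axis=1)

n_input, n_out = 7, 7
# Hypothetical history: 28 days of a single variable, shape [days, features].
history = np.arange(28, dtype=float).reshape(28, 1)

# newX is the most recent n_input observations, shaped [1, n_input, features].
newX = history[-n_input:, :].reshape((1, n_input, history.shape[1]))
model = MeanModel(n_out)
yhat = model.predict(newX, verbose=0)[0]  # forecast for the next n_out days
```

No future values of any variable are needed here: everything in `newX` has already been observed by the time the forecast is made.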

  78. Avatar
    Sandipan Banerjee June 1, 2019 at 12:59 am #

    Suppose you had the data for each household (where the houses are spatially related), and a prediction for each house is required to get the total prediction. How would you change the inputs to the ConvLSTM2D for that? I mean, where would the number of houses go?

    Samples: n, for the number of examples in the training dataset.
    Time: 2, for the two subsequences that we split a window of 14 days into.
    Rows: 1, for the one-dimensional shape of each subsequence.
    Columns: 7, for the seven days in each subsequence.
    Channels: 1, for the single feature that we are working with as input.

  79. Avatar
    Emanuel June 3, 2019 at 5:40 am #

    Hi, thanks for the tutorial

    I have been writing code to forecast my own dataset, but I have run into a problem. The line in the prediction graph is one step ahead of the test graph. It looks like the prediction graph is just following the real graph.

  80. Avatar
    Vane June 5, 2019 at 8:30 pm #


    is there a method of using multiple sliding windows “glued” as a final sliding window?
    I mean I’d use for example the latest 15 data rows by minute (window 1), then the last 5 data rows by hours, from the end of those hours (window 2) and finally the last 5 data rows by days, from the end of those days (window 3) and these 3 would be used as the aggregated sliding window of 15+5+5 = 25 rows of data.
    This way I suppose I could get time series data showing short, mid and longer term data, but not with continuous time indices.
    If this above is not possible, do you have a workaround dealing with such non-continuous sliding window data scheme?

    • Avatar
      Jason Brownlee June 6, 2019 at 6:23 am #

      Generally, you fit a final model on all available data then use it to start making predictions.

      • Avatar
        Vane June 7, 2019 at 2:11 am #

        I think I understand you but I mean I’d use not a sliding window of continuous 10 rows (like from t-1 to t-10), but I’d use a non-continuous sliding window.
        According to my original post’s example, I’d use a sliding window of 15+5+5 rows, as:
        from t-1 to t-15,
        t-60, t-120, t-180, t-240, t-300
        t-1440, t-2880, t-4320, t-5760, t-7200.
        Can I use such a sliding window scheme?

        Thank you!

        • Avatar
          Jason Brownlee June 7, 2019 at 8:07 am #

          Sure, you can define the model with any inputs and outputs you wish.
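
One way to build such a non-contiguous window is to index the series with an explicit list of lags. This is a sketch using the lags from the comment above on a synthetic minute-level series; `window_at` is a hypothetical helper name.

```python
import numpy as np

# Lags (in minutes before t) taken from the example in the comment.
lags = np.array(list(range(1, 16))
                + [60, 120, 180, 240, 300]
                + [1440, 2880, 4320, 5760, 7200])

def window_at(series, t, lags):
    # Gather series[t - lag] for every lag, oldest observation first.
    idx = t - lags[::-1]
    return series[idx]

series = np.arange(10000, dtype=float)   # hypothetical minute-level series
max_lag = lags.max()
# One sample per valid t; each sample has 25 non-contiguous time steps.
ts = np.arange(max_lag, len(series))
X = np.stack([window_at(series, t, lags) for t in ts])
X = X.reshape((X.shape[0], X.shape[1], 1))  # [samples, timesteps, features]
```

The result has the usual [samples, timesteps, features] layout, so it can be fed to an LSTM like any other window. Whether the uneven spacing helps is something to test on the specific dataset.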

  81. Avatar
    Jack June 6, 2019 at 12:04 am #

    Hi, thanks for the tutorial,
    My question is about how you expand the dataset using ‘to_supervised’. Since a CNN cannot learn the ordering of what comes before and after in time, will the CNN-LSTM model predict better if the dataset is shuffled?

    • Avatar
      Jason Brownlee June 6, 2019 at 6:32 am #

      It is not a good idea to shuffle time series data.

      But technically, the CNN model has no state, so it is possible to shuffle the training dataset.
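
If the training samples are shuffled, the shuffle should reorder whole windows and leave the time steps inside each window in order. A minimal sketch with synthetic windows, assuming the [samples, timesteps, features] layout used in the tutorial:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical windows as produced by a to_supervised()-style function:
# X is [samples, timesteps, features], y is [samples, n_out].
X = np.arange(5 * 7).reshape(5, 7, 1).astype(float)
y = np.arange(5 * 7).reshape(5, 7).astype(float) + 100.0

# Shuffle whole samples with one shared permutation; the order of the
# time steps inside each window is left untouched.
perm = rng.permutation(len(X))
X_shuffled, y_shuffled = X[perm], y[perm]
```

Using one permutation for both arrays keeps every input window paired with its own target. Keras `fit()` also shuffles samples between epochs by default (`shuffle=True`).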

  82. Avatar
    gustavz June 6, 2019 at 6:32 pm #

    Hi Jason,

    first of all thank you for all your great tutorials!

    My question in this case is:
    I am not able to train the encoder-decoder nor the cnn-lstms with batch size > 1.
    (Contrary to this with the vanilla lstm it works)

    I always see the following error:
    InvalidArgumentError: Incompatible shapes: [10,24,1] vs. [10,24]
    [[{{node loss/dropout_loss/SquaredDifference}}]]

    where here batch size is 10 and my output window is 24.

    It seems that it struggles with the last dimension of size 1. Why is that? Is it possible to somehow reshape or squeeze the output to get rid of it?

    Best, Gustav

    • Avatar
      Jason Brownlee June 7, 2019 at 7:54 am #

      Hmm, that is very odd.

      Perhaps you are using a stateful LSTM? If so, try not using a stateful LSTM?

      • Avatar
        gustavz June 13, 2019 at 4:56 pm #

        it was due to a wrongly shaped validation data i used when calling
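
The mismatch in the error message ([10,24,1] vs. [10,24]) is what appears when the model output is three-dimensional and the targets are two-dimensional. The sketch below reproduces the shapes with plain NumPy arrays; the assumption is that the decoder ends in a per-time-step output of size 1, as in the encoder-decoder models in this tutorial.

```python
import numpy as np

batch, n_out = 10, 24
# Output of a decoder that emits one value per time step: [batch, n_out, 1]
y_pred = np.zeros((batch, n_out, 1))
# Targets built as a plain 2D array: [batch, n_out]
y_true = np.ones((batch, n_out))

# Give the targets the same trailing dimension as the model output.
y_true_3d = y_true.reshape((y_true.shape[0], y_true.shape[1], 1))
```

The same reshape has to be applied to every target array passed to the model, including validation targets.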

  83. Avatar
    Shubha Manikarnike June 11, 2019 at 9:28 pm #

    Hi Jason,

    I have a dataset where the observations are unevenly spaced in time. How should I approach the problem?
    Is it necessary to convert it to an evenly spaced frequency before applying the LSTM models?


    • Avatar
      Jason Brownlee June 12, 2019 at 8:03 am #

      I recommend testing a suite of methods to see what works, e.g. try modeling as is, try resampling to evenly spaced, try padding, etc.
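
As one example of the "resample to evenly spaced" option, observations can be interpolated onto a regular grid. This sketch uses linear interpolation via `np.interp` on made-up timestamps; linear interpolation is only one of several possible choices.

```python
import numpy as np

# Hypothetical unevenly spaced observations: times in hours, and values.
t = np.array([0.0, 0.5, 2.0, 2.25, 5.0, 6.0])
v = np.array([1.0, 2.0, 5.0, 5.5, 11.0, 13.0])

# Resample onto an evenly spaced hourly grid by linear interpolation.
t_even = np.arange(t[0], t[-1] + 1.0, 1.0)
v_even = np.interp(t_even, t, v)
```

The resampled series `v_even` can then be windowed in the same way as the daily data in the tutorial.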

  84. Avatar
    gustavz June 13, 2019 at 4:55 pm #

    Hi Jason,

    How would I use Dropout and Batchnormalization correctly in all of the presented models?

    – I read that applying batchnorm right after LSTM layers is not a good idea, do you agree with that or do you have other experiences with that?

    – For the vanilla lstm I would add dropout and batchnorm between the dense layers and dropout inside the lstm like:

    model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
    model.add(Dense(int(n_blocks/2), activation=activation))

    – for the encoder_decoder I would do it the same way between the dense layers. But do I need to wrap them in TimeDistributed() layers?


    – same question for conv_lstm and cnn_lstm. In addition, is it possible or advisable to add batchnorm after each Conv1D or ConvLSTM2D layer, respectively?

    Thank you very much in advance!

    • Avatar
      Jason Brownlee June 14, 2019 at 6:37 am #

      Typically one or the other method is used. In practice, they don’t work well together.

      • Avatar
        gustavz June 14, 2019 at 5:30 pm #

        Hi Jason, could you go into more detail on the questions? I am not sure how to interpret your answer.

        • Avatar
          Jason Brownlee June 15, 2019 at 6:26 am #

          Sorry, I meant typically we use either BatchNorm or Dropout in a given model.

          These two techniques don’t work well together in the same model.

          Does that help?

  85. Avatar
    gerb June 18, 2019 at 7:15 am #

    hi jason,

    what changes do I have to make if i want to have parallel predictions for all features for 1 day out in the future?

    I have made these changes
    from numpy import array

    def to_supervised(train, n_steps_in, n_steps_out=1):
        # data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))
        X, y = list(), list()
        # in_start = 0
        for i in range(len(train)):
            # find the end of this pattern
            end_ix = i + n_steps_in
            out_end_ix = end_ix + n_steps_out
            # check if we are beyond the dataset
            if out_end_ix > len(train):
                break
            # gather input and output parts of the pattern
            seq_x, seq_y = train[i:end_ix, :], train[end_ix:out_end_ix, :]
            X.append(seq_x)
            y.append(seq_y)
        return array(X), array(y)

    but i am getting reshape array error in build model function

  86. Avatar
    Jamie June 19, 2019 at 10:52 pm #

    Dear Jason,

    Thank you very much for these more than informative tutorials. They have been really helpful.

    There is one question that goes unanswered for me, nevertheless. If for instance I want to predict not the default 7 days ahead, but let’s say a custom 14 days, I cannot find an easy way to implement this. When I change for instance the variable n_out = 7 to n_out = 14, python throws an error.

    I have tried a few things like using the reshape function, but no attempts have succeeded so far.

    Could you please shed any light on this for me?

    Thank you in advance!

  87. Avatar
    Dave Gee June 20, 2019 at 5:01 am #

    Great article, thank you for writing it. I implemented a version of your “LSTM Encoder-Decoder Model With Multivariate Input”, where I am inputting a (10 row x 100 col) dataset and outputting a 10-value sequence. Each row is a day’s worth of data, and I am trying to predict the next day’s value. The output sequence is 10 days, with the last day in the set being tomorrow. The results I get in training and in testing with holdout data are too good to be true, but not so good that I immediately suspect that I am just giving it the answer. Would you be able to tell me, based on this information, if the models you describe would just calculate the “answer” for the current day’s prediction from the next day’s row in the input data?

  88. Avatar
    Ala June 25, 2019 at 12:21 pm #

    Hi Jason. Would you please explain how to handle multi-step multivariate forecasting? Do you have an article on forecasting these sequences using an LSTM? I am mostly confused about the last layer. I know it should be dense, but I want to predict, for example, 3 features, so it should be Dense(3), and I need them for the next 10 time steps, so it should be Dense(10). Can you suggest how to handle this?

  89. Avatar
    Sandra June 25, 2019 at 10:33 pm #

    Hi Jason, thanks for all your tutorials. I’m looking for a model for sequence generation, similar to a language model. I created my own trajectory data, where each trajectory is a sequence of [xi, yi] pixel points, and I couldn’t find a good tutorial for an LSTM sequence generator. Can you help, please? I am confused about how to preprocess the data for a sequence generator model.

    • Avatar
      Jason Brownlee June 26, 2019 at 6:41 am #

      I believe you could adapt an example of a language model for your purposes.

      Have a go and let me know if you have any problems along the way.

  90. Avatar
    ICHaLiL June 26, 2019 at 4:40 am #

    Hi Jason,

    I have a dataset whose size is [1000,500,1], meaning 1000 samples with 500 timesteps. The question is: can I train my network with this dataset if the LSTM gets only 1 input and the final layer gives only 1 output (Dense(1))?

    Is it Possible?

    • Avatar
      ICHaLiL June 26, 2019 at 5:39 am #

      For example, I’m reading a paper about this. In the paper, the training set is 600 hours (equal to 750,000 timesteps). The authors divide this dataset into a number of non-overlapping blocks with a block length of 5000 timesteps, so they end up with a [150, 5000, 1] dataset. But they train the network with the previous 10 timesteps [x(t), x(t-1),…,x(t-9)] as input and get one output [y(t)].

      The authors use Theano for this process. They could also train the network on a flattened dataset such as [750,000, 10, 1], but they chose to divide it into a number of blocks.

      What is the reason for that?
      How do they train the network with 10 inputs and 1 output on a [150, 5000, 1] dataset?

      Thanks for your tutorials and answers.
      Best Regards

      • Avatar
        Jason Brownlee June 26, 2019 at 6:48 am #

        I recommend using model skill to determine all framings and transforms for a specific dataset.

    • Avatar
      Jason Brownlee June 26, 2019 at 6:47 am #


      What are you trying to model exactly?

      • Avatar
        ICHaLiL June 26, 2019 at 6:53 am #

        Actually it’s a kind of power disaggregation. Honestly, it’s better if I give you the topic of the paper. You can easily understand when you look at paper. “A New Approach for Supervised Power Disaggregation by using a Deep Recurrent LSTM Network”.

  91. Avatar
    Jiawei June 26, 2019 at 1:29 pm #

    Hi Jason !
    Thanks for the detailed tutorial.
    I am doing an LSTM project which uses the previous 30 days to predict one day of output. Each day has multivariate input. Do you have a tutorial that provides guidance on accomplishing this task?

    Much thx !!

    • Avatar
      Jason Brownlee June 26, 2019 at 2:35 pm #

      You can adapt the above tutorial to achieve what you describe.

  92. Avatar
    Sandra June 27, 2019 at 12:14 am #

    Hi Jason, the problem is that in a text generator the model predicts a probability per class (each character or each word is treated as a class), while in my case I have a sequence of feature points (“trajectories”). How should I define my classes, given that the first element of a sample is [x, y]?

  93. Avatar
    Jack June 27, 2019 at 12:51 am #

    Hi Jason, I want to ask about a more specific usage of the ConvLSTM model. For example, if I use eight one-dimensional features as input to predict rainfall, what should the shapes of train_x and train_y be? Should channels = 8 instead of 1? Since each feature is one-dimensional, I assume the value of rows is 1, and that train_y only needs to have the same number of samples as train_x. Is that right?

  94. Avatar
    Jiawei Zhang June 28, 2019 at 5:40 pm #

    Hi Jason !

    I wanted to ask for a strategy for my current LSTM project. I would much appreciate it if you could give me some guidance.

    I have a dataset which contains multivariate features. The dataset is already sorted day by day (10 days total), and I want to predict the next day’s output. How can I use cross-validation in this LSTM project in order to prevent overfitting?

    Thank you so much if you could help me !

  95. Avatar
    tanbh July 3, 2019 at 2:09 am #

    can we extract the weights of the rnn which we trained?

    • Avatar
      Jason Brownlee July 3, 2019 at 8:38 am #

      Yes, call the get_weights() function on the model or layer.

  96. Avatar
    Al. July 5, 2019 at 7:17 am #

    Hi Jason,

    I have 10000 univariate timeseries (namely 10000 samples) and each of them has 5000 time instances, so the reshape should be [10000,5000,1]. But as I have seen in many posts and by experimenting on my own, more than 200 timesteps is not recommended.
    So, if i reshaped my data into [10000,10,500] instead of [10000,5000,1] would it have some meaning or it would be invalid? And if it is ok, what is essentially the intuition of reshaping into [10000,10,500]?

    Thanks in advance!

    • Avatar
      Jason Brownlee July 5, 2019 at 8:11 am #

      Yes, try resampling or truncating. Test different methods and see what works well for your specific dataset.
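
The options mentioned here, and the reshape from the question, can be compared side by side. This is a sketch on a smaller synthetic array (100 series instead of 10000); the window of 200 steps and the block size of 25 are arbitrary assumptions.

```python
import numpy as np

# Hypothetical smaller stand-in for [10000, 5000, 1]: 100 series of 5000 steps.
data = np.random.rand(100, 5000, 1)

# Option 1: truncate, keeping only the most recent 200 steps of each series.
truncated = data[:, -200:, :]

# Option 2: downsample by averaging non-overlapping blocks of 25 steps.
block = 25
downsampled = data.reshape(100, 5000 // block, block, 1).mean(axis=2)

# The reshape from the question: 10 time steps, each a block of 500
# consecutive values treated as 500 features.
as_features = data.reshape(100, 10, 500)
```

Truncating and downsampling both keep one feature per time step. The [samples, 10, 500] reshape is valid as an array operation, but it changes the meaning: the LSTM then steps over 10 blocks and sees the 500 values within each block as an unordered feature vector.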

      • Avatar
        AI. July 5, 2019 at 8:17 pm #

        And i have one more question regarding the timesteps!

        In the input [samples,timesteps,features], by timesteps do we mean the number of steps over which the output of one step becomes the input to the next matrix multiplication?

        Namely, if i reshape my data into [10000,10,500] the LSTM model will unroll in x=10 timesteps, where each timestep xi will have dimension d=500 and will take as input the output of the previous time step?

  97. Avatar
    Leen July 5, 2019 at 9:12 pm #

    Hello Jason,

    If we consider this tutorial and set the number of timesteps ahead to forecast to 1, will it be considered as single step forecasting?

    Thanks in advance.

  98. Avatar
    ICHaLiL July 7, 2019 at 4:50 am #

    Hi Jason,

    In the function of “evaluate model”, you call another function “forecast” in the for loop and so each time you run the “model.predict”. But it takes too much time. However, we could do the same process with splitting test_x with specific timesteps and with one command like below:

    model.predict(test_x, verbose=0)

    which one is more effective?

    • Avatar
      ICHaLiL July 7, 2019 at 4:51 am #

      I wrote the codes by mistake. I forgot to delete them.

    • Avatar
      Jason Brownlee July 7, 2019 at 7:54 am #

      forecast() and predict() do exactly the same thing.

  99. Avatar
    prathu July 9, 2019 at 10:57 am #

    Hi Jason, there seems to be a mistake, shouldn’t you exclude the target variable from input feature, in multivariate multistep forecast?

    X.append(data[in_start:in_end, :])
    y.append(data[in_end:out_end, 0])

    here basically you are taking all features including power consumed for X and again assign y with power consumed

    • Avatar
      Jason Brownlee July 10, 2019 at 7:55 am #

      Not in this case.

      It really depends on the nature of the problem (e.g. autoregression) and the nature of the evaluation (e.g. walk-forward validation)

      • Avatar
        George August 16, 2021 at 9:55 pm #

        So, you mean that if we don’t use walk-forward validation, but simply use X and y as user @prathu described above, it is OK?


        X.append(data[in_start:in_end, 1:]) #use all features except target
        y.append(data[in_end:out_end, 0]) # use target

        Don’t use walk forward validation, just use the model, fit and check the score at the end.

        Normally, this is ok, right?

        • Avatar
          Adrian Tam August 17, 2021 at 7:47 am #

          You’re correct. Validation gives your model a score (so you know whether it is good enough or not). Therefore, nothing prevents you from trying a different combination like this.
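
The two input choices discussed in this thread differ only in the column slice. A minimal sketch on synthetic data, assuming column 0 holds the target as in the tutorial:

```python
import numpy as np

# Hypothetical data: 30 days, 4 columns; column 0 is the target.
data = np.arange(120, dtype=float).reshape(30, 4)
n_input, n_out = 7, 7
in_start = 0
in_end = in_start + n_input
out_end = in_end + n_out

X_all = data[in_start:in_end, :]         # all columns, target included
X_no_target = data[in_start:in_end, 1:]  # drop column 0 (the target)
y = data[in_end:out_end, 0]              # next 7 days of the target
```

Dropping column 0 removes the target's own history from the input, so the model can no longer use past consumption to predict future consumption. Comparing both versions on the same test split shows which works better for a given dataset.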

  100. Avatar
    Chang July 9, 2019 at 12:27 pm #

    Hi Jason, thanks for the great guide and overview on LSTMs! How would you actually print out and visualise the forecasts generated by these models using your code?

    • Avatar
      Jason Brownlee July 10, 2019 at 7:56 am #

      You can use matplotlib to visualize the predictions.

      e.g. try the plot() function to create a line plot.

  101. Avatar
    gustavz July 11, 2019 at 5:42 pm #

    Hi Jason,

    One reads everywhere that RNNs (and therefore LSTMs) have the big advantage that they can process input data of any length (so the input dims need not be fixed).

    In all your tutorials and all the projects that I found, the input dims of Keras LSTM layers are fixed.

    For training this may make perfect sense, but how do you make the model flexible to variable input lengths at inference time?


    • Avatar
      Jason Brownlee July 12, 2019 at 8:30 am #

      Yes, Keras can support dynamic RNNs, but I focus on static RNNs for efficiency.

  102. Avatar
    Jem July 23, 2019 at 5:21 pm #

    Hi Jason,

    I was implementing the Encoder-Decoder LSTM Model With Multivariate Input method for producing electricity through photovoltaic panels and I noticed that relu (activations) for LSTM has problems when it has several zeroes as input. How can I solve this problem? Or do I have to change the activation layer? Do you recommend one in particular?

    Thank you.

    • Avatar
      Jason Brownlee July 24, 2019 at 7:51 am #

      Perhaps try scaling the input data?
      Perhaps try using a different activation function?
      Perhaps try a different model?

      • Avatar
        Jem July 24, 2019 at 4:46 pm #

        The data is already scaled. I tried to use LeakyReLU as an activation function (which slightly improved the output). I will try to change the model hoping to improve the result.

        Thanks for your advice.

        • Avatar
          Jason Brownlee July 25, 2019 at 7:41 am #

          Nice one!

          • Avatar
            Kennard July 26, 2019 at 11:49 am #

            Hi, Jason

            Your tutorial helps me a lot, thank you very much!

            I have a question: how can I adjust the learning rate of the LSTM network in the CNN-LSTM code you’ve mentioned above?

            I’m looking forward to your reply, thank you!

  103. Avatar
    Petr July 29, 2019 at 7:23 am #

    I modified the LSTM model with univariate input and vector output code to include all 8 variables in hopes that I would get a better RMSE. However, this didn’t occur. I also added an additional 2 LSTM layers with 100 cells each to the model architecture.

    However, the RMSE is at 413 with 70 epochs of training. I would like to understand the potential cause for this. Is it that these additional variables are actually adding noise, or is the model architecture not complex enough to tease out the patterns from the additional variables?

    • Avatar
      Jason Brownlee July 29, 2019 at 2:18 pm #

      It is possible that an LSTM or the chosen configuration is a bad fit for the data.

      Perhaps try alternate models and/or LSTM configurations?

  104. Avatar
    Michael July 30, 2019 at 2:42 am #

    Hey Jason,

    Suppose I had the data set you had, but in the future, and I trained the model as you did. Suppose later on my meters for sub_metering_1 and sub_metering_2 went down, and I wanted to forecast sub_metering_3 on its own, without current data for 1 & 2. How would I do this with your model?

    For example, utilities companies can forecast the usage of individual houses, even though houses begin and end utility plans at irregular intervals.

    House 1: Data from 01/01/2007 – 01/01/2016
    House 2: Data from 01/01/2010 – Now

    In this situation, how would I continue to forecast data for House 2?

    • Avatar
      Jason Brownlee July 30, 2019 at 6:22 am #

      You would fit a new model on exactly this problem, e.g. using what is available predict the required column.

  105. Avatar
    Petr July 30, 2019 at 3:42 am #

    All I’m saying is that I’m surprised that I get worse performance when I add more variables. The univariate LSTM works well with 1 LSTM layer (RMSE of 390). I would have expected that adding additional variables to this same model would only improve the model.

    • Avatar
      Jason Brownlee July 30, 2019 at 6:23 am #

      Adding more capacity to a model requires a change to the training algorithm (lrate, etc.) and does not always improve performance.

      If adding capacity always improved performance, applied ML would not be as challenging. We’d just use large models for all problems.

  106. Avatar
    Sai Vineeth August 5, 2019 at 7:43 pm #

    Hi Jason,

    I’m doing multi-variate time series weekly sales forecasting using Random Forest Regression, I have 260 weeks of data, I wanted to know if it’s possible to forecast the target variable sales without the feature variables, can you please provide me any articles related to multi-variate time series forecasting using Regression models like RF ,SVR, Gradient Boosting etc.

    I have 260 Weeks of Feature Variables(19) and 260 weeks of Target Variable(1).

    I wanted to predict the 261st week of Target Variable using the Feature Variables.

    Thanks in Advance

  107. Avatar
    Fabio August 7, 2019 at 6:16 pm #

    Thank you Jason, I think all your articles are very interesting!

    I’m a beginner and I know that time series are very challenging. Anyway, in trying to predict time series I’m facing two main problems:
    1) predictions on the test set don’t follow the real output closely enough, and the model can’t predict the real magnitude of peaks
    2) predictions are mainly positive values…even though the real values are equally distributed in [-1,1].

    What I did is to rescale the input so now all values are in the range [-1, 1] and represent the % of variation between two consecutive values.

    What I’d like to do is: given a time frame of consecutive values as input, predict the “next output”.

    Unfortunately I can’t attach the prediction vs real output image…but here you can find a simple model I’m using.

    Have you any suggestion?

    thanks Fabio

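    # LSTM Stateful
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    def model01b(batch_size, n_steps_in, n_features, n_steps_out):
        model = Sequential()
        input_shape = (n_steps_in, n_features)
        # first LSTM layer returns the full sequence for the second LSTM layer
        model.add(LSTM(150, return_sequences=True, input_shape=input_shape))
        model.add(LSTM(20, return_sequences=False, stateful=False, activation='tanh'))
        model.add(Dense(n_steps_out, activation='tanh'))  # activation='relu'
        return model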

  108. Avatar
    CharlesYuan August 8, 2019 at 2:14 pm #

    When you need to predict the power usage, such as electricity, assuming that there is a negative number of electricity consumption, can i still use this code case?

  109. Avatar
    CharlesYuan August 8, 2019 at 3:16 pm #

    using the above sample code, i have a dataset with negative numbers in train/test but only get positive forecasts. is that correct, or does the evaluate_forecasts method do some check?

    • Avatar
      Jason Brownlee August 9, 2019 at 8:04 am #

      Perhaps try scaling data prior to fitting, e.g. normalizing?

      Perhaps try changing activation functions, e.g. use defaults?

      • Avatar
        CharlesYuan August 12, 2019 at 12:43 pm #

        thanks, i will try defaults.

        btw, when i try epoch = 30, loss value is around (15 to 30), is that too large or fine?

        try epoch = 70, still get loss value around 15 to 30.

        • Avatar
          Jason Brownlee August 12, 2019 at 2:12 pm #

          Loss values are relative and should be compared to a naive model on the same data.
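A minimal sketch of that comparison, assuming weekly windows of daily values and a "week-ago persistence" baseline (predict each week with the week before it). The numbers and the `model_forecast` stand-in are made up for illustration; in practice the model forecast would come from the fitted network.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error over all forecast steps."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy data: each row is one week of seven daily values (made-up numbers).
weeks = np.array([
    [10.0, 12.0, 11.0, 13.0, 12.0, 15.0, 16.0],
    [11.0, 12.0, 12.0, 13.0, 13.0, 16.0, 17.0],
    [10.0, 13.0, 11.0, 14.0, 12.0, 15.0, 18.0],
])

# Naive baseline: forecast each week with the observations from the prior week.
actual = weeks[1:]
naive_forecast = weeks[:-1]
naive_rmse = rmse(actual, naive_forecast)

# Stand-in for model output (assumption); normally the result of model.predict().
model_forecast = actual + 0.5
model_rmse = rmse(actual, model_forecast)

print('naive RMSE: %.3f, model RMSE: %.3f' % (naive_rmse, model_rmse))
```

A model is only showing skill if its error is lower than the naive error on the same data, whatever the absolute size of the loss.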

          • Avatar
            CharlesYuan August 13, 2019 at 5:24 pm #

            get it, does this sample support multi-threads?

            I mean load a single model and make forecast in different threads at same time.


          • Avatar
            Jason Brownlee August 14, 2019 at 6:36 am #

            I don’t believe Keras/TF backend is thread safe (I could be wrong), instead I believe it will leverage multiple cores when using the model.

      • Avatar
        CharlesYuan August 22, 2019 at 9:27 am #

        Besides the tanh and relu activation functions, are there any others i can try?


  110. Avatar
    Arpith August 14, 2019 at 2:41 pm #

    I have a data-set with timestamp(each day) and 2 other attributes(temperature and resistance), i would like to predict the values of 2 attributes after a week. Can you suggest which model (uni variate or multi variate) is suitable for this and how to change input and output dimensions for LSTM ?

  111. Avatar
    Isaac August 24, 2019 at 5:32 am #


    I’ve been using a variation of your multi-input timeseries forecasting scripts but for whatever reason I am unable to create a y_hat sequence with values that are different… meaning for my 20 predictions into the future.. each of them are the same… could you direct me to possible places of errors?

  112. Avatar
    Lei August 27, 2019 at 11:35 am #

    Hi Jason,
    For the ConvLSTM Encoder-Decoder model, can you let me know why you added an LSTM layer after the ConvLSTM2D layer?

    • Avatar
      Jason Brownlee August 27, 2019 at 2:15 pm #

      Good question.

      No major reason, I think to interpret the output and reduce dimensionality. Try any architecture you want.

  113. Avatar
    Leqin Chen September 3, 2019 at 12:16 pm #

    Hi Jason,

    Your post is really helpful and unique in a way that no other posts can compare!

    I have a quick question here: when building LSTM related models, do you need to scale the input variables ? For example, use the MinMaxScaler(feature_range=(0,1)).

    Thanks a lot

    • Avatar
      Jason Brownlee September 3, 2019 at 2:08 pm #


      It is often a good idea to scale the data prior to fitting the model. Try with and without scaling and compare the results.

    • Avatar
      Leqin Chen September 3, 2019 at 4:01 pm #

      In addition, do you think it is necessary to transform non-stationary time series to stationary ones before fitting the model ?

      • Avatar
        Jason Brownlee September 4, 2019 at 5:55 am #

        Often yes. I recommend testing with and without this transform.

  114. Avatar
    Simon Guertin September 5, 2019 at 12:56 am #

    How would you modify this example to forecast both the global_active_power and the sub_metering_1 ?

    • Avatar
      Jason Brownlee September 5, 2019 at 6:56 am #

      Change the dataset to the framing you require, then change the model to match your dataset.
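As a rough sketch of what changing the framing could look like, the function below builds input/output windows where the output holds two columns instead of one. The name `to_supervised_multi` and the column indices 0 and 4 are assumptions for illustration (check which columns hold the variables in your own data), not code from the tutorial.

```python
import numpy as np

def to_supervised_multi(data, n_input, n_out, target_cols):
    """Build (X, y) windows where y holds several target columns.

    data: 2D array shaped [time steps, features]
    returns X shaped [samples, n_input, features]
            y shaped [samples, n_out, len(target_cols)]
    """
    X, y = [], []
    for in_start in range(len(data) - n_input - n_out + 1):
        in_end = in_start + n_input
        out_end = in_end + n_out
        X.append(data[in_start:in_end, :])
        y.append(data[in_end:out_end, target_cols])
    return np.array(X), np.array(y)

# Toy data: 30 days, 8 features.
data = np.arange(30 * 8, dtype=float).reshape(30, 8)
X, y = to_supervised_multi(data, n_input=14, n_out=7, target_cols=[0, 4])
print(X.shape, y.shape)
```

With this framing the model would need to emit two values per output step, for example by making the final layer of an encoder-decoder model a time-distributed Dense layer with two units.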

  115. Avatar
    Charles Vanya September 11, 2019 at 12:03 am #

    Dear Jason,
    I am interested in forecasting rainfall from sea surface temperatures using a neural network. However, I am having a challenge in understanding how I can use rainfall and sea surface temperature data in a neural network. I would be more than happy to know what I need to do.
    My data looks like this:
    Year Rainfall
    1981 231.3
    1982 321.0
    …… …….
    …… …….
    2010 301.4

    While for temperature is

    Year Temp
    1980 23.1
    1981 25.3
    1982 20.3
    … ……
    … ……
    2010 24.7

    and I would like to forecast for say 2019

    I would appreciate any assistance.
    Charles Vanya

  116. Avatar
    Steven September 14, 2019 at 12:31 am #

    Hi Jason,

    Great article.

    In another article you cover LSTM as well. But I’m dealing with 1000 of rows and in your code we iterate over every row. Is there a faster way to iterate over the rows?

    • Avatar
      Jason Brownlee September 14, 2019 at 6:20 am #

      You can load all of the rows as samples and let Keras perform the iteration for you.

      • Avatar
        Steven Pauly October 18, 2019 at 9:28 pm #


  117. Avatar
    Nour September 18, 2019 at 9:00 pm #

    Dear Jason,

    Thank you so much for your helping articles/courses.
    I was wondering if I could split my training set into windows of different sizes
    (like what you did in:

    # restructure into windows of weekly data
    train = array(split(train, len(train)/7))
    Can I use something like :
    train = array(split(train, i for i in list))
    or will this influence the architecture of the neural network (since their architecture must be kept constant)?

  118. Avatar
    Andrius September 26, 2019 at 7:18 am #

    Thank you for such a nice lesson !

    What if I am trying to train an LSTM RNN to reproduce a certain time series? Let me give an example: suppose that I have a collection of time series like [a, b, c, d, e], each of which depends on the initial value a (at t=0). That is, I have some dynamic that changes a to b, b to c and so on. Across the collection the initial value is different, but the dynamic that governs the value changes is the same. How can I model my RNN to learn that dynamic so that I can use it to predict the whole time evolution if I only give one initial state as input?

    • Avatar
      Jason Brownlee September 26, 2019 at 1:41 pm #

      Sounds like you might be better served with a HMM instead of an LSTM.

      You can try modeling these transitions with an LSTM, but I would be skeptical that it would be the best approach.

      You could frame the problem as one-step prediction with many input-output pair examples at different points in the sequence, with zero padding to make all input sequences the same length.

      If you are totally new to LSTMs, perhaps see some of the more general sequence prediction tutorials here:

      Let me know how you go.
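The one-step framing with zero padding might be sketched as follows. The helper name `prefix_pairs`, the choice to pad at the front, and the toy sequence are all assumptions for illustration.

```python
import numpy as np

def prefix_pairs(sequence, max_len):
    """Turn one sequence into (zero-padded prefix, next value) training pairs."""
    X, y = [], []
    for i in range(1, len(sequence)):
        # Everything seen so far, truncated to the most recent max_len values.
        prefix = sequence[:i][-max_len:]
        # Pad at the front so every input has the same length.
        padded = [0.0] * (max_len - len(prefix)) + list(prefix)
        X.append(padded)
        y.append(sequence[i])
    # LSTM input is [samples, time steps, features]; here there is one feature.
    return np.array(X).reshape(-1, max_len, 1), np.array(y)

X, y = prefix_pairs([1.0, 2.0, 4.0, 8.0, 16.0], max_len=4)
print(X[:, :, 0])
print(y)
```

Pairs built this way from many sequences with different initial values can be concatenated into one training set, so the model sees the same dynamic from many starting points.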

  119. Avatar
    Hongqian Qin October 19, 2019 at 2:24 am #

    Hi, Jason. Thank you for your nice example! I still feel confused about one problem. As you explained, the walk-forward validation approach uses the output on time t as the part of input on time t+1. What about the multi-step LSTM and the encoder-decoder LSTM? Do they also use the output of the first prediction output as a part of input to predict the next output?

    • Avatar
      Jason Brownlee October 19, 2019 at 6:48 am #

      You can use walk-forward validation with a one step or a multi-step forecast.

      We use it with a multi-step forecast in this tutorial.

      Does that help?

  120. Avatar
    Franz Götz-Hahn October 24, 2019 at 2:07 am #

    Hi Jason! Thank you for your constant work on giving examples of how to implement certain things incrementally. I have a question regarding ConvLSTMs.

    Suppose I have a slightly different problem than the one posed in this tutorial, namely that I don’t just have a single one-dimensional sequence (which was the power consumption here), but a multivariate case (let’s assume power consumption, water consumption and heat dissipation of houses), and I’m not trying to do step forecasting; instead the task at hand is a classification task (let’s assume binary classification of whether the house is inhabited). Let’s also assume that I have recordings of 14 days.

    Can I use a ConvLSTM layer with input shape [n, 1, 1, 14, 3]? My reasoning for this shape: From n examples I only use a single sequence of one-dimensional data of the length of 14, but consider three channels.

    Does the kernel_size argument of the ConvLSTM2D layer then work on the 14×3 tensor? So say I use kernel_size=(3,3) does it slide a 3×3 kernel along the temporal domain and the output will be (12,3)? Or is there even a way to slide a kernel across the channels along the temporal domain?

    Thanks in advance for an answer!

    • Avatar
      Jason Brownlee October 24, 2019 at 5:43 am #

      Not sure that such a tiny input span makes sense.

      Perhaps try a suite of framings of the problem and discover what works well for your data?

      When using a convlstm, you are contriving the spatial domain from the temporal domain – so you can achieve any effect you desire.

      • Avatar
        Franz Götz-Hahn October 24, 2019 at 5:36 pm #

        After sleeping on it, do I understand correctly that the ConvLSTM layer basically chunks the temporal domain of whatever input size and convolves the kernel over that chunk? I have some trouble imagining the process.

        I’m envisioning it as a kind of network-in-network where the first network convolves the kernel over a sequence which is a chunk of the entirety of the timeframe, and the output is fed into a LSTM. Does that go in the right direction?

        • Avatar
          Jason Brownlee October 25, 2019 at 6:38 am #

          Yes, something a lot like that. A cleaner approach than a CNN-LSTM model.

  121. Avatar
    Lopa October 31, 2019 at 5:14 am #

    Hi Jason,

    I am working on daily-level data which has weekly seasonality & I am using an LSTM to generate forecasts. However, for some months in the future the forecasts are not showing enough variation & look quite flat.

    What are the options that I can try to solve this problem?

    I have tried increasing the number of epochs and adding layers, but it doesn’t seem to help a lot.

    Thanks in advance.

    • Avatar
      Jason Brownlee October 31, 2019 at 5:37 am #

      Yes, some ideas:

      Perhaps try tuning the model?
      Perhaps try diagnosing the model?
      Perhaps try an alternate model?
      Perhaps try alternate data preparation?
      Perhaps try an alternate framing?

  122. Avatar
    Lopa October 31, 2019 at 7:04 am #

    Hi Jason,

    What alternate model/framing do you recommend ?

    Rizal Maulana October 31, 2019 at 12:41 pm #


    Thank You for the very nice tutorial.
    I want to ask how to continue training from the last checkpoint or last training run (not from the beginning again). Like this:

    I have trained on 50 data samples, then I get 10 NEW data samples (so now there are 60). Instead of training again from the beginning, is there any way to train with only the 10 NEW samples?


      Jason Brownlee October 31, 2019 at 1:37 pm #

      You can load the weights and then train directly.

    Julian November 11, 2019 at 4:03 am #

    Hello Mr. Brownlee, first of all: Thank you very much for this great tutorial and your other work in this blog! I used your code for my own dataset, it is multivariate with 3 features, one of these should be predicted 7 steps ahead.
    Now I’d like to normalize the input data in order to get a slightly better RMSE. But the MinMaxScaler throws the error “ValueError: non-broadcastable output operand with shape (7,1) doesn’t match the broadcast shape (7,3)”.

    Transformation is done by:

    dataset = read_csv(…

    values = dataset.values

    scaler =
    scaled = scaler.transform(values)

    train, test = split_dataset(scaled)

    Can you please give me a hint how to solve this? Or point out a better way to transform/inverse transform the data for this multivariate multi-step system?

    Thanks, Julian
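This kind of broadcast error typically appears when a scaler that was fit on three columns is asked to inverse-transform a single column of predictions. One way around it is to keep the scaling statistics per column and invert only the target column. The sketch below does this by hand with NumPy rather than with MinMaxScaler; the helper names `scale` and `invert_target` and the toy values are assumptions for illustration.

```python
import numpy as np

# Toy data: 6 time steps, 3 features; column 0 is the variable being forecast.
values = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 300.0],
    [3.0, 15.0, 200.0],
    [4.0, 30.0, 400.0],
    [5.0, 25.0, 500.0],
    [6.0, 35.0, 600.0],
])
train, test = values[:4], values[4:]

# Per-column minimum and range, learned from the training rows only.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

def scale(data):
    """Min-max scale every column using the training statistics."""
    return (data - col_min) / col_range

def invert_target(scaled_target, column=0):
    """Undo the scaling for one column, e.g. a (7, 1) block of predictions."""
    return scaled_target * col_range[column] + col_min[column]

scaled_train = scale(train)
scaled_test = scale(test)

# Pretend the model produced these scaled predictions for the target column only.
yhat_scaled = np.array([[0.0], [0.5], [1.0]])
yhat = invert_target(yhat_scaled)
print(yhat.ravel())
```

An equivalent option is to fit a second scaler on the target column alone and use that one for the inverse transform.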

    Radhouane Baba November 14, 2019 at 8:50 am #

    Hi Jason,

    I am trying to forecast the load of 1 day (144 points) from an input of 7 days (7*144 points).
    While developing walk-forward validation for an LSTM, I see that the result is the same for some days…
    The curves do not change that much.
    And the values at the beginning of the day are not close to the last values of the training set (which I would expect, because they are temporally close).

    Am I maybe doing something wrong?
    Does it matter if the predictions are similar for days that are close together?

    Thank you so much

      Jason Brownlee November 14, 2019 at 1:43 pm #

      Perhaps the model requires tuning?
      Perhaps the data requires further preparation?
      Perhaps try an alternate model?

        Radhouane Baba November 15, 2019 at 12:38 am #

        I suspect it is maybe because of the length of my output and my input… it is too long.
        Maybe it would be better to have a shorter input and output, and build the forecast from inputs that are themselves forecasts.

    Radhouane Baba November 14, 2019 at 10:21 pm #

    I tried both LSTM and SVR with tuning and grid search inside each walk-forward split… but the result is nearly (not exactly) the same as the day before… and it does not start from the last training point, but from some kind of mean of all previous days at that particular time step…

      Jason Brownlee November 15, 2019 at 7:47 am #

      Perhaps try 10-20 different models?
      Perhaps try tuning each model’s hyperparameters?
      Perhaps your dataset is not predictable?

        Radhouane Baba November 16, 2019 at 2:30 am #

        It is the load of household.. every 10 minutes..
        It is volatile and there are many peaks

    Fiona November 18, 2019 at 6:13 pm #

    Hi teacher,
    I am a newcomer and I have a question. Most of the time, power consumption is related to the season or to vacations. Did you take this into account in these forecasting models? If yes, can you tell me where I can find it in your code? I am building forecasting models for water and I am at a loss.
    Thank you,teacher.

      Jason Brownlee November 19, 2019 at 7:39 am #

      Yes, you can incorporate this additional information into the model to see if it impacts forecast skill.

      I don’t have an example of this, thanks for the suggestion.

      One approach would be to have another time series as input with a flag marking vacation or not.
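A small sketch of that idea, assuming daily data and a hypothetical set of vacation days. The flag is added as an extra feature column next to the consumption series before the data is split into weekly windows; all values here are made up.

```python
import numpy as np

# Toy daily consumption for 14 days (made-up values).
consumption = np.array([5.0, 6.0, 5.5, 6.2, 5.8, 7.0, 7.5,
                        3.0, 2.8, 3.1, 6.0, 5.9, 7.1, 7.4])

# Hypothetical calendar: days 7 to 9 (zero-based) are vacation days.
vacation_days = {7, 8, 9}
is_vacation = np.array([1.0 if d in vacation_days else 0.0
                        for d in range(len(consumption))])

# Stack into [time steps, features]; column 0 stays the forecast target.
dataset = np.column_stack([consumption, is_vacation])

# Restructure into weekly windows: [weeks, 7 days, 2 features].
weeks = np.array(np.split(dataset, len(dataset) // 7))
print(weeks.shape)
```

A season indicator could be added the same way, either as one more column or as several one-hot columns.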

    Fiona November 27, 2019 at 5:06 pm #

    Thanks teacher!

    Fabian Blandkjenn November 29, 2019 at 4:29 am #

    How would you recommend me to transform a non-stationary series when making 5 steps ahead forecasts?

    If I simply difference the dependent variable, then the y values would become a sequence of differences from the last 5 step ahead observation. This seems to possibly introduce bias to the model.

    Should I make the y(t+5) values a difference from y(t)?

    • Avatar
      Jason Brownlee November 29, 2019 at 6:53 am #

      Use differencing to remove trend, use seasonal differencing to remove seasonality, use a power transform to remove changes in variance.
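A minimal sketch of differencing for a multi-step forecast, assuming the model is trained to predict a run of one-step changes. The helper names `difference` and `invert_multi_step` and the numbers are for illustration only. Each forecast level is rebuilt from the last observed value plus the running sum of predicted changes, so no unobserved future value is needed to invert the transform.

```python
import numpy as np

def difference(series, interval=1):
    """Return series[t] - series[t - interval] for every valid t."""
    series = np.asarray(series, dtype=float)
    return series[interval:] - series[:-interval]

def invert_multi_step(last_observed, predicted_diffs):
    """Rebuild level forecasts from a run of predicted one-step differences."""
    return last_observed + np.cumsum(predicted_diffs)

series = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
diffs = difference(series)

# Pretend a model predicted the next five one-step changes.
predicted_diffs = np.array([6.0, 7.0, 8.0, 9.0, 10.0])
forecast = invert_multi_step(series[-1], predicted_diffs)
print(forecast)
```

Seasonal differencing uses the same function with a larger interval, for example `difference(series, interval=7)` for weekly seasonality in daily data.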

    Fabian November 29, 2019 at 7:15 am #

    sorry, maybe I was a bit unclear.

    I have used differencing to make my series stationary.

    My series of the dependent variable is the price in t+5.

    So for every step, I have price(t) as the x variable, and the y variable is price(t+5).

    When I have the price differenced, and just wish to predict price(t+5), I would predict the change from (t+4), but I don’t know what price(t+4) is at step t. So then I will not be able to get my predicted value of t+5, only the expected change from t+4.

    Do you have any suggestions on how I could structure such a problem?

    Siri November 30, 2019 at 6:14 am #

    Thank you for the tutorial. I’m finding a hard time understanding the part of the code that specifies what exactly it is that we are predicting.

    For the network with multiple inputs, where are we specifying which of the inputs we are predicting outputs for?

    Jay January 3, 2020 at 8:26 pm #

    Hi Jason

    is the method in and the method used in this tutorial different ?

    How can we frame a multi_input with multiple_output and multi_step problem and differentiate it with the same problem with multiple_parallel output? I mean what changes in the X, y preparation must be done in order to do so?

      Jason Brownlee January 4, 2020 at 8:30 am #

      You can use either approach to achieve the same effect.

    adonis January 14, 2020 at 8:37 pm #

    Hi Jason,

    I am working on a model that checks the last 7 days forecast then compares it to the last 7 days real data. Evaluate the past data and predict a better Power forecast = powerforecast’ = powerforecast corrected.

    so my final output is powerforecast’, which will be based on the learning for past data and on the future 7 days forecast.

    the column Pforecast is 7 days ( 24 hours steps) is longer than the Preal in my csv.

    I am having trouble choosing the right way to start this model.

    Murilo Souza February 3, 2020 at 1:50 am #

    In the Problem Description section, to be accurate, reactive power is not measured in watts (or kilowatts in this case), but in var (kilovar in this case).

    But this is a great tutorial! Just what i was looking for, since i’m starting to learn about time series forecasting and i wish to do a comparison between a physical model, a MLP and a LSTM model for energy generated through solar panels.

    Thanks a lot!

    Tom Leung February 10, 2020 at 2:08 pm #

    Hi Jason,

    Thanks for your excellent blog, it really helps. But I still have a question on the decoder input during the training and testing phase.

    I know that during the training phase, the ground truth of the last time step (t-1) should be given as the current input of the decoder (t). As for the testing phase, the output or internal state of the last decoder will be used.

    My question is: if I build the encoder-decoder model using the approach in this blog, will Keras take care of the decoder input properly and seamlessly?


    • Avatar
      Jason Brownlee February 11, 2020 at 5:07 am #

      You’re welcome.

      Yes. But perhaps I don’t understand your question? Can you elaborate?

      • Avatar
        Tom Leung February 14, 2020 at 2:01 pm #

        I want to know: is teacher forcing enabled in the default settings of Keras RNN models?

        • Avatar
          Jason Brownlee February 15, 2020 at 6:22 am #

          We do teacher forcing in the way we structure the data/training, Keras does not.

    Lucy February 10, 2020 at 7:24 pm #

    Thanks for your post!
    I have understood how to use CNN+LSTM with univariate input.
    But I am very confused in how to use CNN+LSTM with multivariate input.
    I have no idea how to preprocess the data and feed it into the CNN, or how to pass the data into the LSTM after it has been processed by the CNN.
    I am looking forward to your reply!

    • Avatar
      Jason Brownlee February 11, 2020 at 5:09 am #

      The above example has multi-variate input.

      Perhaps I don’t understand the problem you are having exactly?

      Perhaps this tutorial will help:

      • Avatar
        Lucy February 11, 2020 at 2:53 pm #

        Thank you for your reply!

        But I found the above example just has CNN-LSTM for univariate input.

        What I want to study is exactly how to develop and evaluate a CNN-LSTM model for multivariate-input time series forecasting!

        I am very confused in this part
        Can you give me some suggestions?

        • Avatar
          Jason Brownlee February 12, 2020 at 5:41 am #

          See the above tutorial for a multi-input CNN-LSTM model for power usage forecasting.

    Tom February 15, 2020 at 3:55 am #

    Hi Jason, these tutorials are brilliant!

    I’ve been attempting to modify this walkthrough for my own data series.
    However, instead of 7 periods, I’m trying to use around 600 periods as input, in order to predict 300 periods of output.
    When I try to fit the model in the debug I can see the loss going to nan on the first or second reading of the first epoch.

    If I set the input periods and output periods to around 50, it runs ok. Any more than this and it struggles.

    I read your page about clipping the optimiser, and attempted this, but it still didnt work.
    I also read a suggestion from another contributor, to train the model on a shorter input first, in order to stabilise it, however I’m very new to this and am not sure how to do it.

    Do you have any tips for working with large inputs and outputs?

    • Avatar
      Tom February 15, 2020 at 4:24 am #

      Hi Jason,

      I attempted to set the kernel_initializer to Zeros, and that appears to have worked.
      Are there any downsides to doing that, however?

      model.add(Dense(100, kernel_initializer='Zeros', activation='relu'))



      • Avatar
        Jason Brownlee February 15, 2020 at 6:37 am #

        Ouch. I don’t like it, model will likely be limited in what it can learn.

    • Avatar
      Jason Brownlee February 15, 2020 at 6:37 am #


      Maybe try a smaller learning rate.
      Try using relu in the hidden layers.
      Scale data prior to fitting.

    jessy February 17, 2020 at 4:33 pm #

    Hi Jason,
    could you tell me the technique to capture irregular time interval in time series data..

    Muhammad Usman February 21, 2020 at 1:05 am #

    Hi Jason,

    Nice post. Could you please explain , in the function split_dataset() ,how the list or array is converted to 3D tensor? array split is returning 2D shape.

    Thank you in advance.

    Best regards
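A short illustration of the 2D to 3D step asked about above, using the same NumPy `split` and `array` functions on toy data. `split()` returns a Python list of 2D arrays, one per week, and `array()` then stacks that list along a new leading axis. The toy shape of 14 days by 8 features is an assumption for the example.

```python
from numpy import array, split

# Toy data: 14 daily rows with 8 features each, a 2D array of shape (14, 8).
data = array([[day * 10 + f for f in range(8)] for day in range(14)])

# split() returns a Python list of equal-sized 2D arrays, one per week.
weekly = split(data, len(data) // 7)
print(type(weekly).__name__, len(weekly), weekly[0].shape)

# array() stacks that list along a new leading axis, giving a 3D array.
weeks = array(weekly)
print(weeks.shape)
```

The result is shaped [weeks, days, features], which is why the number of rows must divide evenly by seven before splitting.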

    Andreas Markaurel February 22, 2020 at 2:02 am #

    Hi Jason!

    Thanks for your awesome article!
    I would love to use some of these methods for my final thesis. Unfortunately I need some academic articles to cite from.
    Can you recommend any, besides the one for “Convolutional LSTM Network”?

    I would need articles specifically for:
    * LSTM Model With Univariate Input and Vector Output
    * Encoder-Decoder LSTM Model
    * CNN-LSTM Encoder-Decoder Model With Univariate/Multivariate Input

    Looking forward to hearing from you!

    Qadeer March 3, 2020 at 6:15 pm #

    Hi, thank you for the nice tutorial. I am new to programming and machine learning. Can you please explain how to calculate RMSE and MAE for the LSTM multivariate encoder-decoder model? Also, what changes are required to normalize or standardize the data, and to undo the scaling for the actual forecast?

    Alex March 5, 2020 at 1:35 am #

    First of all, thank you for your introduction. I’d like to ask you about the prediction time series. I want to predict the value in one minute, two minutes, three minutes, four minutes and five minutes in the future. The delay in one minute is OK, but when I predict five minutes, the predicted value always lags behind the actual value. Why?

  143. Avatar
    Qader March 5, 2020 at 11:27 pm #

    Dear Jason,

    Thank you so much for this amazing tutorial!!
    I have a question regarding the predictions of these different methods on this dataset. The predictions seem to be very biased to the history data! I was wondering if this reflects the limitation of these deep learning methods or there is still room for significant improvements (and how)?


    • Avatar
      Jason Brownlee March 6, 2020 at 5:34 am #

      There is much room for improvement, the models are just a demonstration for how to get started.

    Trevor March 17, 2020 at 3:06 am #

    Hi Jason, is it possible to add multivariate functionality to the ConvLSTM model? I was playing around with using the conv portion as the encoder but I keep getting this error.

    ValueError: Error when checking input: expected conv_lst_m2d_8_input to have shape (2, 1, 7, 2) but got array with shape (2, 1, 7, 1)

    I know you would probably need to see the code. If you could email I would appreciate it so much

    hamna March 26, 2020 at 9:38 pm #

    I have been working on an CNN-LSTM model. The link is given below, .
    I am finding problem in real time prediction of this CNN-LSTM model.I would be obliged if you would assist me.

    • Avatar
      Jason Brownlee March 27, 2020 at 6:12 am #

      Sorry, I don’t have the capacity to review/debug your example.

    Johan Tertnes March 29, 2020 at 3:30 am #

    Hi Jason,

    I’m not too familiar with python as I mostly use R, so this could be quite obvious…

    If I were to output the predictions from “evaluate_model” by simply returning “predictions”, are these predictions directly comparable to the test set as a true out-of-sample forecast? Or is the test set simply used to evaluate the model fit as a validation set, rather than for true out-of-sample forecasts?

    • Avatar
      Jason Brownlee March 29, 2020 at 6:04 am #

      Yes, exactly.

      • Avatar
        Johan Tertnes March 29, 2020 at 7:15 am #

        Thanks for such a quick answer!

        So to my understanding, for a multivariate forecast, I would need to feed “predict.model” from the “forecast” function with an additional test set containing the lagged x values for the y values I’d like to predict in order to get true out of sample?

        Is there by any chance an example of out-of-sample forecasts for multivariate time series included in your book?

        Thanks again!

      • Avatar
        Johan Tertnes March 29, 2020 at 4:51 pm #

        Clarification: Essentially I would like to use the multivariate lstm to generate predictions beyond the dataset as you have shown with your “demonstrating predictions” part in the book, eg listing 9.89.

        Eg. I want to forecast one week beyond the dataset:
        Could this be achieved by simply saving the model within the “build_model” function using, loading the model, and use “predict_model”, with arbitrary x-values (as the forecasts for the first week are only based on the training set) ..?

        • Avatar
          Jason Brownlee March 30, 2020 at 5:32 am #

          Yes, fit the model on all available data, then pass in one sample worth of input from the end of the dataset to predict beyond it.
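Preparing that final input sample might look like the sketch below, assuming the history is stored as weekly windows shaped [weeks, days, features] and the model takes 14 days of input. The helper name `last_input_window` is hypothetical, and the commented `model.predict` call assumes a fitted Keras model named `model`.

```python
import numpy as np

def last_input_window(history, n_input):
    """Take the final n_input days of weekly history as one model input sample.

    history: array shaped [weeks, days per week, features]
    returns: array shaped [1, n_input, features]
    """
    history = np.asarray(history)
    # Flatten the weeks into one continuous run of days.
    days = history.reshape(history.shape[0] * history.shape[1], history.shape[2])
    window = days[-n_input:, :]
    return window.reshape(1, window.shape[0], window.shape[1])

# Toy history: 4 weeks, 7 days, 8 features.
history = np.arange(4 * 7 * 8, dtype=float).reshape(4, 7, 8)
input_x = last_input_window(history, n_input=14)
print(input_x.shape)

# With a fitted Keras model, the forecast for the following week would then be:
# yhat = model.predict(input_x, verbose=0)
```

Because the window ends at the last observation in the dataset, the prediction covers the week that follows the data.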

          • Avatar
            Johan Tertnes March 30, 2020 at 6:49 am #

            Perfect, thank you! Your book and articles are such a great resource for learning DL!

          • Avatar
            Jason Brownlee March 30, 2020 at 7:52 am #


  147. Avatar
    Uran March 30, 2020 at 4:29 pm #

    Hi, Jason:

    How to print with datetime at evaluate output in your code?
    I mean, I want to see with date time with prediction or model evaluate results.

    • Avatar
      Jason Brownlee March 31, 2020 at 7:55 am #

      If you know the date time of the input sequence when making a prediction, you can extrapolate the date times for the predictions and print along side.

      This is simple programming, not machine learning – sorry I cannot write this code for you.
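A minimal sketch of that extrapolation using only the standard library, assuming daily forecasts and a known timestamp for the last input observation. The date and the prediction values are made up.

```python
from datetime import datetime, timedelta

def forecast_timestamps(last_input_time, n_steps, step=timedelta(days=1)):
    """Timestamps for the n_steps predictions that follow the input window."""
    return [last_input_time + step * (i + 1) for i in range(n_steps)]

# Hypothetical values: the input window ended on this day, and the model
# returned seven daily predictions.
last_input_time = datetime(2010, 1, 2)
predictions = [410.2, 398.7, 405.1, 420.9, 433.0, 470.5, 455.3]

stamps = forecast_timestamps(last_input_time, len(predictions))
for when, value in zip(stamps, predictions):
    print(when.strftime('%Y-%m-%d'), '%.1f' % value)
```

For data at another resolution, pass a different `step`, for example `timedelta(minutes=10)`.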

  148. Avatar
    Nic April 2, 2020 at 12:27 am #

    Hi Jason,

    I am looking at a faiiirly similar problem, but instead of having four years’ worth of data for one household, I have four years’ worth of data across many different households… each one varying in length! How would this change the problem?

    Thanks for this article!

    • Avatar
      Jason Brownlee April 2, 2020 at 5:57 am #

      Focus on what you want to predict, then change the data to meet that.

      E.g. you might want to make predictions for one house or all houses in general. You might want to model per house or across groups of houses or all houses.

      Experiment with different framings of the problem to help sharpen up the answer for you/stakeholders.

  149. Avatar
    Mehdi April 2, 2020 at 10:54 pm #

    I don’t understand the intuition behind the RepeatVector() in the multivariate input.
    Why don’t we just use an LSTM with return_sequences=True and TimeDistributed(Dense()) after?

    model = Sequential()
    model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features), return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='relu')))
    model.compile(loss='mse', optimizer='adam')

  150. Avatar
    Mohammad April 3, 2020 at 9:15 am #

    Hi jason
    Thanks for the awesome tutorial.

    I have a question, i don’t know whether it has been asked before or not.
    in encoder decoder architecture, shouldn’t we build two separate model for encoder and decoder ? we first encode the input using the encoder and then pass it as “initial_state” to decoder’s LSTM and not its input. and to use “initial_states” we need to call the LSTM layer as function which should be done in keras “functional” API and not the sequential.

    Thanks in advance

  151. Avatar
    Mohammad April 3, 2020 at 9:59 am #

    so in this sequential format where did you use the initial_states argument?

    i’m trying to build a conv-LSTM encoder decoder network. somewhere in your article it’s been said that we can flatten the convolution outputs and use them as initial states. the problem is that i could not do that. here is the code and the error:

    encoder = Sequential()

    encoder.add(Conv1D(filters=128, kernel_size=3, activation='tanh'))

    encoder_output = encoder(encoder_input)
    decoder_input = Input(shape=(None, 1), name='dec_inp')

    decoder_lstm = LSTM(512, return_sequences=True, return_state=True, name='dec_lstm')
    decoder_outputs, _, _ = decoder_lstm(decoder_input, initial_state=encoder_output)

    An initial_state was passed that is not compatible with cell.state_size. Received state_spec=ListWrapper([InputSpec(shape=(None, 512), ndim=2)]); however cell.state_size is [512, 512]

  152. Avatar
    Nick Yang April 4, 2020 at 6:37 am #

    Hi Jason, I was trying the convLSTM model with multiple input features, but I keep on getting this error and couldn’t figure out why
    ValueError: Error when checking input: expected conv_lst_m2d_11_input to have 5 dimensions, but got array with shape (1, 14, 8)

    Do you know why the dimensionality keeps on having problems? I reshape the input into 5 dimensions but getting this error in model.predict()


  153. Avatar
    Jesson April 7, 2020 at 11:31 pm #

    Hi Jason, your tutorial is great! But some things in the ConvLSTM model confuse me a lot.
    The number of rows in your tutorial is always 1.
    I want to know in which cases this number becomes 2 or 3 or something else.
    This number depends on what?

  154. Avatar
    JuanMost April 17, 2020 at 6:35 pm #

    Hi Jason,

    Great job! Good example for the LSTM tutorial. Lots of applications.

    To fix the errors in the multivariate ConvLSTM2D:
    input_x = data[-n_input:, :]

    instead of:
    input_x = data[-n_input:, 0]
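The shape difference behind this fix can be checked with NumPy alone. The sketch below assumes the tutorial's ConvLSTM layout of [samples, n_steps, rows, n_length, features] with n_steps=2 and n_length=7 (14 days of input) and 8 variables; the random `data` array is only a hypothetical stand-in for the flattened history.

```python
import numpy as np

n_steps, n_length = 2, 7          # 2 subsequences of 7 days each (assumed, as in the tutorial)
n_input = n_steps * n_length      # 14 days of history per sample
n_features = 8                    # all variables, not just column 0

# hypothetical stand-in for the flattened history: 100 days x 8 variables
data = np.random.rand(100, n_features)

# univariate slice keeps one column -> shape (14,)
uni = data[-n_input:, 0]
# multivariate slice keeps every column -> shape (14, 8)
multi = data[-n_input:, :]

# ConvLSTM2D expects 5 dimensions: [samples, time, rows, cols, channels]
uni_5d = uni.reshape((1, n_steps, 1, n_length, 1))
multi_5d = multi.reshape((1, n_steps, 1, n_length, n_features))

print(uni_5d.shape)    # (1, 2, 1, 7, 1)
print(multi_5d.shape)  # (1, 2, 1, 7, 8)
```

A 3D array such as (1, 14, 8) passed to predict() will trigger the "expected 5 dimensions" error, so the same reshape has to be applied in the forecast step as in training.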

  155. Avatar
    JuanMost April 18, 2020 at 6:39 pm #

    Hi Jason, sorry for my quick answer.

    There is no error in the example.

    I only wanted to comment on the error in the proposed code from “mk, January 4, 2019”, because I think Nick Yang (April 4, 2020) has the same error.


  156. Avatar
    Michael Nguyen April 23, 2020 at 2:30 am #

    Hi Jason,

    How do I design an LSTM autoencoder that uses all prior days, up to years’ worth of data?


  157. Avatar
    Jordan April 23, 2020 at 2:35 am #


    Thanks for the great tutorials. I have a small question about the RMSE metric and how it translates to the output.

    If we had a dataset of 100 houses and had to predict the price using a regression model, and the RMSE of an LSTM was ‘100’ – does this mean the model can predict house prices of the dataset within +/- $100?

    If not, what would this RMSE actually mean?


    PS: if the Mean Absolute Error were to be 100, how does this relate to the prediction of price?

    • Avatar
      Jason Brownlee April 23, 2020 at 6:11 am #

      You’re welcome.

      RMSE is in the same units as your target variable. If it is dollars, and the error is 100, then it is 100 dollars.
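A small numeric sketch may make the units point concrete. The prices below are made-up values for illustration only. Both sets of predictions have an RMSE of 100 dollars, but in the second set a single prediction is off by 200 dollars, so RMSE should be read as a typical error size in the target's units rather than a strict +/- bound.

```python
import numpy as np

# hypothetical house prices in dollars (illustrative values only)
actual = np.array([200000.0, 250000.0, 300000.0, 350000.0])

# case 1: every prediction misses by exactly 100 dollars
pred_even = actual + np.array([100.0, -100.0, 100.0, -100.0])
# case 2: three perfect predictions and one that misses by 200 dollars
pred_uneven = actual + np.array([0.0, 0.0, 0.0, 200.0])

def rmse(y_true, y_pred):
    # root mean squared error: penalizes large misses more heavily
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # mean absolute error: the average size of a miss
    return float(np.mean(np.abs(y_true - y_pred)))

print(rmse(actual, pred_even), mae(actual, pred_even))      # 100.0 100.0
print(rmse(actual, pred_uneven), mae(actual, pred_uneven))  # 100.0 50.0
```

MAE is also in dollars; an MAE of 100 means the predictions miss by 100 dollars on average.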

  158. Avatar
    Sep May 6, 2020 at 11:34 pm #

    Hi Jason,

    I have a question regarding the multi-step classification prediction for time series problems. I want to know would it be possible to predict the label (disease or normal) of for example one patient for the next three visit? If yes, would you please give me some hints how to do that? Do you have any tutorial in this regard which may help me?


  159. Avatar
    Sep May 7, 2020 at 7:52 pm #

    Hi Jason,

    I have a question regarding splitting the longitudinal data into train and test set. Imagine we have N number of patients and each has M number of visits. With usual train-test splitting method the temporal structure between the data will be destroyed. Would you please help me in this regard? How I should split the data into train and test without mixing the patient and thus destroying temporal structure of the data?

    Many thanks in advance,

  160. Avatar
    Sep May 12, 2020 at 1:23 am #

    Hi Jason,

    Regarding the use of CNN for time series forecasting either regression or classification, the length time series for all the sample data in the training data set should be the equal, right?

    Many thanks in advance,

  161. Avatar
    sarah May 17, 2020 at 4:34 am #

    Hi Jason,

    For the LSTM Model With Univariate Input and Vector Output, are we still able to difference the data (in the preprocessing phase)? If so, using the first-order difference seems not to work, as the data shape will be affected. Using the window size as the differencing order means losing lots of info. Am I right?

    • Avatar
      Jason Brownlee May 17, 2020 at 6:42 am #

      Perhaps evaluate the model with and without and compare the results.

      Differencing only affects values, not shape.
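As a rough illustration with made-up numbers: a first-order difference changes the values and drops one observation from the front of the series, but the result is still a 1D series that can be windowed exactly as before. The inverse shown here (cumulative sum plus the first observation) is one simple way to get back to the original scale.

```python
import numpy as np

# hypothetical series (illustrative values only)
series = np.array([10.0, 12.0, 15.0, 14.0, 18.0])

# first-order difference: value[t] - value[t-1]
diff = np.diff(series)                 # one element shorter than the input

# invert by cumulatively summing and adding back the first observation
restored = np.concatenate(([series[0]], series[0] + np.cumsum(diff)))

print(diff)      # [ 2.  3. -1.  4.]
print(restored)  # [10. 12. 15. 14. 18.]
```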

  162. Avatar
    Maxi May 18, 2020 at 7:17 pm #

    It’s probably a stupid question, but I’ll try it anyway.
    If we assume a correlation between a certain demand and the weather. Would it be possible to train the model with historical demand and weather data and then use the next day’s weather forecast data to predict demand? Let’s say I want to create multi-step forecasting for the next 24 hours. Can I give the model the weather forecast for the next 24 hours and the model will give me the demand values?

    • Avatar
      Maxi May 18, 2020 at 7:22 pm #

      Oh, first of all, of course, hello Jason! Also thanks a lot for the great documentary! Really very instructive, for me currently still too complex and too far advanced, but I’m trying to read a bit into the topic.

    • Avatar
      Jason Brownlee May 19, 2020 at 5:59 am #

      Sure! Try it and see.

      Try a few different framings of the idea with different model types and see what works well/best.

  163. Avatar
    Kasper May 21, 2020 at 2:22 am #

    When evaluating your CNN-LSTM Encoder-Decoder Model With Univariate Input model with your inputs, I tried to determine whether the model was trained long enough (i.e. underfitting or overfitting). To achieve this, I used the to_supervised function to generate test_x and test_y like you would for the training set, and validation_loss = ( *,* ) to extract the train- and test loss for the model.

    When plotting, I found that 20 epochs is indeed preferred, because some epochs later the overfitting commences. However, at all points in the system, the testing loss significantly outperforms the training loss.

    To the extent of my knowledge, I believe this could imply:
    – Weight regularization is applied to training, but not testing (irrelevant)
    – The testing loss is calculated after the training, and thus better (difference seems to large for this)
    – The testing set is “easier” (seems the most likely)

    What are your thoughts on this?

    • Avatar
      Jason Brownlee May 21, 2020 at 6:21 am #

      Nice work!

      It may also be the case that test loss is not a useful metric to follow when using walk-forward validation.

  164. Avatar
    maya May 22, 2020 at 8:47 am #

    Hi ,thank you so much for this blog i have tried your models on a weather forecasting problem but the loss i got is high and the accuracy of the model is that much low by the way my data size is 42480 also ive tried to add normalization, hot encoder,dropout but the models keep giving me same low accuracy

  165. Avatar
    maya May 22, 2020 at 8:59 pm #

    Thank you for ur reply i ve already tried with MLP and same really very good prediction results when ploted it but a poor accuracy and high loss i ll check the blogs u suggested me and send you what ive found

  166. Avatar
    maya May 23, 2020 at 9:31 am #

    Hi again and thanks for your efforts. I tried the naive forecast, which gave me RMSE = 3.851, and these models gave between 2.3 and 2.7. Does that mean it’s a good prediction? In most articles RMSE values are less than 1, and the loss is never less than 1 to 2 in the 4 models, which is strange for a good prediction.

    • Avatar
      Jason Brownlee May 24, 2020 at 6:02 am #

      You’re welcome.

      Good is relative to a naive model. If the model performs better than a naive model, it has skill. The next challenge is to tune the model to get the most out of it and test other types of models to confirm that they cannot do better.
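A minimal sketch of that comparison, assuming a persistence forecast (tomorrow equals today) as the naive baseline. The series and the "model" predictions are invented for illustration; only the comparison logic matters.

```python
import numpy as np

def rmse(actual, predicted):
    # root mean squared error in the units of the target
    diff = np.asarray(actual) - np.asarray(predicted)
    return float(np.sqrt(np.mean(diff ** 2)))

# hypothetical daily series (illustrative values only)
series = np.array([20.0, 22.0, 21.0, 25.0, 24.0, 26.0, 27.0])
actual = series[1:]

# persistence baseline: predict that each day equals the previous day
naive = series[:-1]
# made-up output standing in for a trained model's predictions
model = np.array([21.5, 21.5, 24.0, 24.5, 25.5, 27.5])

rmse_naive = rmse(actual, naive)
rmse_model = rmse(actual, model)
has_skill = rmse_model < rmse_naive   # skill = beats the naive baseline

print(round(rmse_naive, 3), round(rmse_model, 3), has_skill)
```

Whether an RMSE of 2.5 is good depends on the scale of the data and on the baseline, which is why values below 1 reported elsewhere are not directly comparable.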

  167. Avatar
    Iraj May 27, 2020 at 11:02 am #

    Hi Jason,
    Great presentation. Thank you.
    I have a question.
    With a multivariate multi-step LSTM composite autoencoder, the program slows down as we walk forward in forecasting. I am forecasting the next time step from the last 60 time steps. The first loop takes around 9 sec to run and it gets to 30 sec by the 100th forecast.
    It may be a caching issue, but I don’t know how to manage it.
    Can you help me with the speed?
    Thank you

    • Avatar
      Jason Brownlee May 27, 2020 at 1:30 pm #

      Some ideas:

      Perhaps use a smaller model?
      Perhaps use less data?
      Perhaps run on a faster machine?
      Perhaps use an alternative implementation?

  168. Avatar
    Iraj May 27, 2020 at 1:59 pm #

    Wondering why speed is decreasing over loops.
    Whatever the model, data, CPU, or method, it is the same for each step of the walk-forward. What happens at each step that reduces the speed? This is the question! If we use such a model, does it build up something in memory at each run? If yes, how can we reset or clear it?
    Thank you again

    • Avatar
      Jason Brownlee May 28, 2020 at 6:07 am #

      That is odd, I would not expect that.

      Perhaps there is a leak. You may have to debug the cause.

  169. Avatar
    Iraj May 28, 2020 at 3:20 am #

    Problem resolved with:
    from keras import backend as k

    Thank you for your time

  170. Avatar
    Rahul June 1, 2020 at 5:21 am #

    Hi Jason,
    Thank you for your amazing contribution; we are all so grateful. I’ve 3 questions if you can help me with them.

    1. In the encoder decoder model, can I add more LSTM layers to both encoder and decoder parts?
    2. Why did you use 2 TimeDistributed Dense layers in the decoder? Is 1 allowed?
    3. Where do I add batch normalisation layer in the encoder and decoder parts?


    • Avatar
      Jason Brownlee June 1, 2020 at 6:29 am #


      I configured the model using trial and error.

      Batch norm can be added after the layer and before the activation, or after the activation. Perhaps try both and use what works best.

      • Avatar
        Rahul June 1, 2020 at 1:37 pm #

        Thank you, Jason, for the reply. You did highlight one reason for adding the first Dense layer and called it an interpretation layer. However, I couldn’t get it. Can you please explain, if that’s okay?


  171. Avatar
    samir June 12, 2020 at 5:10 pm #

    Hello. Thanks for the great content.
    In a chart, I want to plot the values of Actual and predicted electricity loads together (for example, 150 hours).
    But in this tutorial, the prediction values are discrete and not continuous.
    Is there a way to do that?

    • Avatar
      Jason Brownlee June 13, 2020 at 5:51 am #

      The predicted values are continuous.

      • Avatar
        samir June 13, 2020 at 4:11 pm #

        thank you. I mean, I want to have an hourly forecast instead of a daily one. The chart you draw has seven values, but I want it to have, for example, 168 values. In fact, 24 values per day instead of 1

        best regards

        • Avatar
          Jason Brownlee June 14, 2020 at 6:31 am #

          If you use hourly data as input, your predictions will be hourly and if you plot, your plot will be hourly.

      • Avatar
        samir June 14, 2020 at 6:10 am #

        Is there a way to have an hourly forecast instead of a daily forecast?

  172. Avatar
    Madareto Gaeidam June 12, 2020 at 10:31 pm #

    Thanks for this post.

  173. Avatar
    samir June 18, 2020 at 12:00 am #

    Hello Mr. Jason
    I want to display predictive data and real data in 1-step.
    I used the following command:

    import numpy as np
    import matplotlib.pyplot as plt

    y = actual[:, 0]
    y2 = predictions[:, 0]

    x = np.arange(1, 47, 1)
    plt.plot(x, y)
    plt.plot(x, y2, 'r--')

    Would you please tell me if it is correct?
    If wrong, how do I display real values and prediction values for 1-step (Sunday)?


  174. Avatar
    elia July 9, 2020 at 1:20 am #

    Hi Jason,
    thanks for great tutorials!

    Can you please explain how is it possible to add a Conv1D layer right after a LSTM/BiLSTM layer? Thanks.

    • Avatar
      Jason Brownlee July 9, 2020 at 6:42 am #

      Typically 1D CNN is used first, then a LSTM, not the other way around.

  175. Avatar
    ws July 30, 2020 at 8:20 pm #

    Thanks for your great tutorials. I have read several of your books about time series and LSTMs. They are very useful for me.
    But when I read about and practiced ConvLSTM, I ran into a question.

    As we all know , the convolution will get the link and features between the near points from spatial aspect. and the lstm will get the link and features between the near points from temporal aspect.

    From traditional time series, we know nearer day means closer correlations in stationary series. Therefore, generally, third day is better than eighth day in correlation ,right?

    But, for example ,we split the 14 days into two subsequences with a length of seven days.It will make eighth day have a nearer position with first days. Will this result in a greater correlation for eighth day than third day?

    This question has confused me several days.I hope you can help me solve it.


    • Avatar
      Jason Brownlee July 31, 2020 at 6:17 am #

      You’re welcome.

      Not sure I follow, does it matter? Either the model gives better predictions than other models or it does not.

  176. Avatar
    Hajer August 3, 2020 at 5:06 am #

    Can I use Encoder-Decoder LSTM Model With Univariate Input with hourly prediction by using
    predicted_output = model.predict(, batch_size=BS) As it should predict the next 12 hours starting from the given date?

    Another question is, can I train this model twice? once to predict a certain column and another to predict another column? If so, how?

    Thanks for this great tutorial.

  177. Avatar
    Hajer August 5, 2020 at 6:18 am #

    The error was caused because (7*24) was supposed to be only 24. no need for (7*24) because the dataset is already hourly.
    However I followed the tutorial to predict new data I encountered this error:
    ValueError: Failed to find data adapter that can handle input: ,

  178. Avatar
    raj August 9, 2020 at 7:55 am #

    Hi Jason,
    Is there not a case of data leak happening in the encoder decoder example.
    In my opinion , you should exclude the target variable from input feature, in multivariate multistep forecast?

    Should be:
    X.append(data[in_start:in_end, 1:])
    y.append(data[in_end:out_end, 0])

    instead of:
    X.append(data[in_start:in_end, :])
    y.append(data[in_end:out_end, 0])

    Here you are basically taking all features, including power consumed, for X and then assigning power consumed to y again. That means the dependent variable ‘y’ to be predicted is already part of the independent variables ‘X’. What do you think?

  179. Avatar
    Sarah August 12, 2020 at 5:33 pm #

    Hi Jason ,

    I need more clarification about the input shape for encoder decoder LSTM, if I want to use more than one lag, i.e. lookback data.

    I already read your other article of LSTM input shape, but I still has some doubt.

    So, for example:
    if encoder decoder univariate LSTM model to predict 24 hours ahead, based on 3 days ago as lag,
    while the dataset has 300 examples

    then, how should the input shape of the data be ?

    For me,
    [samples =300, timesteps=24, features=1]

    so how to express the lag data ?

    I am sure I am missing something.

    I appreciate your help

  180. Avatar
    Shabnam September 4, 2020 at 6:20 pm #

    Hi Jason,

    Thank you for your interesting article.

    I am working on somehow similar problem, but in addition to time-series prediction on power consumption, I am also interested to detect long-term anomalies on a time-series by considering its similarities with neighbourhood time-series in terms of Power Grid Distribution.

    In other words, a long-term anomaly may not be distinguishable on analysing a single time-series alone, but by comparing it with time-series in neighbourhood nodes we distinguish that something goes wrong in this node.

    My question is that can LSTM models be used to solve such a problem or other models such as clustering should be used? and is there any related tutorial in mastery ML?

    Thanks in advance

  181. Avatar
    André de Sousa Araujo September 15, 2020 at 11:45 pm #

    Hi Jason,

    Thanks you again, for this amazin job!! 😉

    Just to share with you: I had tried to apply this approach to predict a multivariate time series (climate variables) with rain as a target (prediction), but I don’t get good results.

    As an example, the amount of rain is zero sometimes and starts to increase 0.4, 1.0, 2.0 arrives at some peak of 11.00 and starts to return to zero again. Do you believe that is necessary to give some special pre-processing for this target?

    So, the power supply is a float series that don’t have nulls (zero), but in my case is not a normal distribution….

    Do you have a post that comments that?

  182. Avatar
    Neha Mangal September 16, 2020 at 3:43 pm #

    Hi Jason,

    It seems that Global_active_Power and Global_Intensity are directly correlated (correlation = 1).
    Shouldn’t we remove one variable?

    • Avatar
      Jason Brownlee September 17, 2020 at 6:41 am #

      Perhaps try removing one and compare results?

  183. Avatar
    Konstantinos September 22, 2020 at 8:36 am #

    hi Jason!

    I normalized the data, but I want to use inverse_transform() to calculate the RMSE using the actual values. Where should I place the inverse_transform() call?
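One common arrangement, sketched here under assumptions, is to invert the scaling on the predictions right after the model produces them and before the error is computed, so the RMSE comes out in the original units. The sketch uses a hand-written min-max scaler in NumPy as a stand-in for scikit-learn's MinMaxScaler and its inverse_transform(); all values and the helper names `scale` and `invert` are hypothetical.

```python
import numpy as np

# hypothetical target values in original units (illustrative only)
train_y = np.array([100.0, 150.0, 200.0, 300.0])
test_y = np.array([120.0, 280.0])

# fit the scaling parameters on the training data only
y_min, y_max = train_y.min(), train_y.max()

def scale(y):
    # map values to [0, 1] using the training min and max
    return (y - y_min) / (y_max - y_min)

def invert(y_scaled):
    # undo scale(): back to the original units
    return y_scaled * (y_max - y_min) + y_min

# pretend these came from model.predict() on scaled inputs
pred_scaled = scale(test_y) + np.array([0.05, -0.05])

# invert the predictions first, then score against the unscaled actuals
pred = invert(pred_scaled)
rmse = float(np.sqrt(np.mean((test_y - pred) ** 2)))

print(pred, rmse)
```

Here the scaled error of 0.05 corresponds to 10 units on the original scale, so the reported RMSE is about 10 rather than 0.05.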

  184. Avatar
    Suhwan Chung September 27, 2020 at 5:05 pm #

    Hi Jason,

    Thanks for the great article.

    Do you have any examples (code or reference) to build function for out of sample forecast, so that I can use the same dataset and your LSTM approach for prediction?

  185. Avatar
    Adrien Borderon September 28, 2020 at 11:42 pm #

    Hello Jason,

    I follow up on the last comment.

    Indeed your subject is really great! Thank you again for your work.

    For me I am using a fairly similar dataset with 10 input features and one output feature.

    I want to predict the next 36 future hours with my model based on a look back from the past 90 hours.

    I managed to build the model by following your topic.

    To predict the next 36 future hours that does not exist in my dataset I selected the last window of my testing set:

    X_test[-1, : , :]

    That I used for the prediction:

    model.predict(X_test [-1,:,:])

    So I get a list of 36 hours.

    I am not sure of this approach, can you confirm?

    You also use RMSE values ​​to test the performance of the model, when are the loss curves for the training and the validation? Is it possible to have them also in this context of multivariate and multi step LSTM?

    Thank you.

    • Avatar
      Adrien Borderon September 28, 2020 at 11:47 pm #

      Sorry, I wasn’t talking about loss curves but rather a classic chart with Y_true and Y_pred?

    • Avatar
      Jason Brownlee September 29, 2020 at 5:40 am #

      Perhaps try it and see.

      We cannot easily calculate validation/learning curves when using walk forward validation.

      • Avatar
        Adrien Borderon September 29, 2020 at 8:31 pm #


        thanks Jason for your response.

        No it’s not a problem for using Matplotlib, it’s just that in your example of a mutivariate LSTM model with multi-step output the output shape looks like this:

        Example :
        (2000, 36)
        sample, number step

        y_test and y_pred have this shape so I don’t know how to make a graph to compare y_true and y_pred.

        Maybe that doesn’t make sense in a multi-step case?

        If not, are you confirming to me that what I propose to predict the next 36 hours seems correct using the last window?

        And what is the difference between:

        model.add(TimeDistributed(Dense(100, activation='relu')))
        model.add(TimeDistributed(Dense(1)))

        compared to:

        model.add(Dense(7))

        Both will offer a 7-value multi-step output, right?

        Thanks again.
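On the TimeDistributed question, a NumPy sketch of the shapes involved may help. It leaves out biases and the intermediate Dense(100) layer, and the sizes (4 samples, 7 steps, 200 units) are assumptions for illustration. The idea is that a TimeDistributed Dense(1) applies the same weights to each of the 7 decoder steps, while a plain Dense(7) maps one encoded vector to all 7 values with separate weights per step.

```python
import numpy as np

rng = np.random.default_rng(0)
samples, steps, units = 4, 7, 200     # assumed sizes for illustration

# decoder LSTM output with return_sequences=True: one vector per output step
decoded = rng.normal(size=(samples, steps, units))
# shared weights: the same (units -> 1) mapping is applied at every step
w_shared = rng.normal(size=(units, 1))
out_td = decoded @ w_shared           # shape (4, 7, 1)

# vector-output model: a single encoded vector per sample
encoded = rng.normal(size=(samples, units))
# separate weights for each of the 7 outputs, produced in one shot
w_vector = rng.normal(size=(units, steps))
out_vec = encoded @ w_vector          # shape (4, 7)

print(out_td.shape, w_shared.size)    # (4, 7, 1) 200
print(out_vec.shape, w_vector.size)   # (4, 7) 1400
```

Both give 7 values per sample, but the output shapes differ ((samples, 7, 1) versus (samples, 7)), so the training targets need to be shaped to match.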

    • Avatar
      Max December 15, 2020 at 3:14 am #

      Hi Adrien, nice implementation 🙂
      Are you using an Encoder decoder model?

  186. Avatar
    Lahan O September 30, 2020 at 2:05 am #

    Thanks a lot Jason for the article,
    My question is a complicated one. I adapted the code to output predictions every minute. It recieves new data to aid predictions every minute. My question is in two parts:

    1) How do I train my model on the new data coming in to update the model on the go.
    2) I get an error where, after about 7 new predictions (ie. after about 7 minutes), the model predicts the same number irrespective of the input data (which can’t be right). FYI, i put model.predict() function in a while True loop, if that makes any difference.

    Thanks again.

  187. Avatar
    Lahan O September 30, 2020 at 2:51 am #

    Hello Jason,
    Can you point me in the right direction for learning how to update the model on new data input in real time.

  188. Avatar
    lucass October 13, 2020 at 8:45 pm #

    Thank you for your great article!
    have a question….
    in the function of ‘def to_supervised’
    ‘data[in_start:in_end, 0]’ means we only use 1st feature in data?

    I expected it should have use ‘data[in_start:in_end, n] (n=0 to 7)’

    maybe I lost some code in the data processing, hope to have a answer
    once again really Thank you for share great code

    • Avatar
      Jason Brownlee October 14, 2020 at 6:17 am #

      Yes, that is a univariate (one variable) example as the title of that section says.

      • Avatar
        lucass October 14, 2020 at 2:08 pm #

        Thanks for reply!

  189. Avatar
    André de Sousa Araujo November 9, 2020 at 4:52 am #

    Hi Jason,
    Thanks for your precious time! Just a doubt about the approach above.

    Context: I used a similar architecture, but I normalized between -1 and 1 before training. For some predictions on the test dataset, after inverting the normalization I’m getting negative values, but I was predicting the amount of precipitation.

    So, I can’t have negative precipitation; it’s hard to explain that. In this case it’s similar because you can have negative power supply.

    So, is it acceptable to have a model that sometimes generates a negative output? What do you think about that?

    • Avatar
      Jason Brownlee November 9, 2020 at 6:15 am #

      Perhaps you can use a sigmoid on the output layer and normalize the target to the range [0,1]

      • Avatar
        André November 30, 2020 at 7:54 am #

        In my case, I just added the activation on the Dense layer and it works well.

        model.add(TimeDistributed(Dense(1, activation=activations.tanh)))

        During my various test, I missed this detail, this layer in my case was missing a transformation to generate the correct output.

        Thanks, man!

  190. Avatar
    NIkolay Arapov November 10, 2020 at 12:31 am #

    Dear Jason,
    Thank you for your work. I have a few questions. Every time I run the model again, I get a different result. The average error is different, although I do not change the model configuration,but only do a restart. What is the reason for such non-stationarity of the model? I also built a model based on the guide from the tensorflow site and the launch results differ slightly. What can this be related to?

  191. Avatar
    Chhayly November 16, 2020 at 4:31 pm #

    Hello Jason,
    In the LSTM Model With Univariate Input section, I changed verbose to 2 to monitor the loss, and the results showed it is over 250000 every epoch. What is the unit of the loss? And isn’t it a bit too high?

    • Avatar
      Jason Brownlee November 17, 2020 at 6:26 am #

      Sounds large.

      Perhaps try training the model a few times and compare results?
      Perhaps try adjusting the learning hyperparameters?
      Perhaps try scaling your data prior to modeling?

  192. Avatar
    Deepak Verma November 25, 2020 at 4:11 am #

    Hey I was trying LSTM Model With Univariate Input and Vector Output but my model get stuck at statement and epochs are not executing can you help me out ? I have 5min time series data.

  193. Avatar
    Deepak Verma November 26, 2020 at 6:42 pm #

    Is there any relation between dividing data in weekly in split_dataset() function and predicting next 7 time steps. Can we divide data in something different like in 14 days and predict next 7 time steps. Will it make any difference in model performance.

    • Avatar
      Jason Brownlee November 27, 2020 at 6:35 am #

      Yes, you can prepare the data any way you like for your model.

      I would encourage you to test different approaches and discover what works best for your dataset.

      • Avatar
        Deepak Verma November 27, 2020 at 4:22 pm #

        How to get predicted values(i.e submetering power values) of 7 time_steps that we predicted in future ?

        • Avatar
          Jason Brownlee November 28, 2020 at 6:35 am #

          Call model.predict() to get predicted values.

          • Avatar
            Deepak Verma December 1, 2020 at 8:49 pm #

            Thank You Jason.

            Hey, I’m using encoder-decoder LSTM for predicting 288 future values (i.e values for next day on 5min basis) but not able to get good results (RMSE value is large) any suggestions you can give ?

          • Avatar
            Jason Brownlee December 2, 2020 at 7:43 am #

            Yes, the tips here will help:

          • Avatar
            Deepak Verma December 2, 2020 at 5:59 pm #

            Hey Jason thank you
            Can you be more precise? For prediction of power demand for the next 12 h (i.e. 144 values on a 5 min basis), it takes around 13 hrs to train the model, with an RMSE of around 700. I’m training on 10 months of data.

            How can I reduce training time and RMSE value?

            Generally what length of input sequences are considered as large sequences in encoder-decoder LSTM ?

          • Avatar
            Jason Brownlee December 3, 2020 at 8:14 am #

            You’re welcome.

            Good question, see this:

  194. Avatar
    Andre November 30, 2020 at 9:18 am #

    Hi Jason,

    Just one more doubt.

    In order to get using MSE for loss and RMSE as a metric, e get OVERALL RSME just from the training phase. What do you recommend?

    a) Finishing the training do the same approach walk-forward validation over each time-lag (in your case one week), or

    b) Get the RMSE for the entire set creating a new function refactoring forecast (function)

    What’s your opinion about that? Both are correct?

    • Avatar
      Andre November 30, 2020 at 9:22 am #


      In order to get using MSE for loss and RMSE as a metric and get OVERALL RSME just from the training phase. What do you recommend?

    • Avatar
      Jason Brownlee December 1, 2020 at 6:14 am #

      Estimate model performance using walk-forward validation on a hold-out dataset and calculate the metric you like, e.g. RMSE.
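A bare-bones sketch of that procedure, assuming weekly blocks of 7 daily values as in the tutorial. The `naive_weekly_forecast` function is a hypothetical stand-in for a trained model's predict step, and the data is synthetic.

```python
import numpy as np

def naive_weekly_forecast(history):
    # stand-in for model.predict(): repeat the last observed week
    return history[-1]

# hypothetical data: 10 weeks x 7 days; first 7 weeks train, last 3 hold-out
data = np.arange(70, dtype=float).reshape(10, 7)
train, test = data[:7], data[7:]

history = [week for week in train]
predictions = []
for i in range(len(test)):
    # forecast the next week from everything seen so far
    predictions.append(naive_weekly_forecast(history))
    # then reveal the real week and add it to the history
    history.append(test[i])

predictions = np.array(predictions)
# RMSE per forecast day (column) and one overall score
rmse_per_day = np.sqrt(np.mean((test - predictions) ** 2, axis=0))
rmse_overall = float(np.sqrt(np.mean((test - predictions) ** 2)))

print(rmse_per_day, rmse_overall)
```

The metric is computed once over all hold-out forecasts, so it is separate from the MSE loss that Keras reports during training.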

  195. Avatar
    Max December 6, 2020 at 2:31 am #

    Hi Jason,
    I have a continuous dataset (Time step – Lateral position) which have some weird oscillations at some points. I guess the LSTM performance could be better if I apply a filter to the dataset. Do you have experience with “filtering datasets” ?
    Best regards 🙂

    • Avatar
      Jason Brownlee December 6, 2020 at 7:06 am #

      Yes, it is a good idea to prepare data prior to modeling, e.g. make stationary and scale values.

  196. Avatar
    john December 8, 2020 at 6:21 am #

    Hi Jason,

    Thanks for the article, great introduction to LSTM and time-series predictions with DL models.

    Currently, I’m trying to predict a score probability with supervised time data. The thing is that at each time step (let’s say 1 day) the data already have a shape like: (N_customers, N_features) where some features are dynamic and changing with time. The target shape is (N_customers,) at each time-step.

    In practice, I want to use the past 4 days to predict the 5th day score with a total of 14 days of historic. Therefore It’s not clear for me how to prepare correctly the data before feed it to the LSTM model since I have already 3D data (N_customers, N_features, N_time).

    Thanks a lot !

  197. Avatar
    JR December 10, 2020 at 4:34 am #

    Hi Jason and thanks for your awesome article !

    I do try to work with LSTM, but I am running into issues with the input shape.
    I do have an input (X) which is of shape (20, 1001) and the output (y) of shape (20,1001)

    I consider my data to be one feature, then I reshape X as 3D (20,1001,1) before feeding the LSTM model, but it returns NaN values in the process.
    If I take the same dataset and reshape it as (20,1,1001), basically considering 1 time step with multiple features, the network works and return me good enough prediction.

    Yet I feel like I’m missing the purpose of the LSTM in this case, i.e. the capability of the network to read the input as a sequence, with a direction (maybe I misunderstanding as well).

    Any advice ?
    thanks a lot !

  198. Avatar
    Faisal Mohammad December 12, 2020 at 6:30 am #

    Hi Mr Brownlee

    # evaluate a single model
    is not working in my case. It is showing error
    module ‘tensorflow’ has no attribute ‘get_default_graph’

    • Avatar
      Jason Brownlee December 12, 2020 at 6:32 am #

      Perhaps try updating your version of TensorFlow and Keras.

  199. Avatar
    Faisal Mohammad December 13, 2020 at 3:26 am #

    Thanks for your reply. I updated both of them, but still the same issue

  200. Avatar
    André Araújo December 14, 2020 at 5:30 am #

    Hi Jason,

    I’m here again! I have decided to evaluate RMSE in the training phase (this approach), but on my dataset.

    I got a higher RMSE than on the test subset. I split my dataset 80/20, where I used 80 for training and 20 for walk-forward validation.

    What do you recommend?

    Thanks again for all your support and this excellent material!

    • Avatar
      Jason Brownlee December 14, 2020 at 6:25 am #

      I recommend testing a suite of data preparation techniques, models and model configs in order to discover what works best for your dataset.

  201. Avatar
    Max December 15, 2020 at 2:43 am #

    Hi Jason, nice work!
    I implemented an Encoder-Decoder LSTM Model With Univariate Input which predicts the next 10 next steps at each iteration. My goal is to predict the next 3000 steps, which exceeds the number of my test data. Therefore, It would be good to use recursive multi-step forecast as you mention in :

    Do you have an implemented example of an encoder-decoder using a recursive forecast?


    • Avatar
      Jason Brownlee December 15, 2020 at 6:29 am #

      I may, I don’t recall sorry. Perhaps use the search box at the top of the blog.

  202. Avatar
    Max December 15, 2020 at 10:13 pm #

    Unfortunately, I couldn’t find anything useful. Is it possible to re-fit or update the model at each step of the walk-forward validation as new data is made available?

  203. Avatar
    Deepak Verma December 22, 2020 at 4:50 pm #

    hello Jason
    What is the difference in fitting the model as this
    for i in range(nb_epoch):, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
    return model
    (i.e fitting model for each epoch taking batch size as 1)

    versus this

    verbose, epochs, batch_size = 0, 20, 16, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model
    (i.e fitting model in one go taking batch size not 1)

    in both case we are doing walk forward validation.

    • Avatar
      Jason Brownlee December 23, 2020 at 5:29 am #

      If the model is stateful in both cases, then the difference is that the state is reset after each epoch in the first case and not at all in the second case.

    • Avatar
      Belloni Andre December 29, 2020 at 12:46 pm #

      Hi Jason,
      I implemented the Encoder-Decoder LSTM Model With Univariate to have a vector output of 50 and 100 predicted elements :

      n_output =50
      It works fine.

      But when I implemented it taking a window of:
      n_output =100
      I got “Nan” for every prediction.
      Any clue about that?

      P.S. The train and test datasets are split into groups of 50 for the first case and 100 for the second.

      • Avatar
        Jason Brownlee December 29, 2020 at 1:33 pm #

        Perhaps your input data has a nan?
        Perhaps you need to scale data prior to modeling?
        Perhaps you need to change the activation function in your model?

        I hope that gives you some ideas.

        • Avatar
          Belloni Andre January 7, 2021 at 11:12 pm #

          I changed the activation function of the encoder-decoder model to LeakyReLU
          and I used MinMaxScaler. Thanks Jason, keep up this awesome work you are doing.
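
The two data checks suggested in this thread can be sketched as follows. This is an illustrative helper, not the commenter's actual fix: `check_and_scale` is a hypothetical name, and the min-max scaling is written out with NumPy rather than scikit-learn's MinMaxScaler. The scaling parameters are taken from the training data only.

```python
import numpy as np

def check_and_scale(train, test):
    """Reject NaN inputs, then min-max scale each column using train statistics."""
    if np.isnan(train).any() or np.isnan(test).any():
        raise ValueError("input contains NaN values; fill or drop them first")
    lo, hi = train.min(axis=0), train.max(axis=0)
    # avoid division by zero for constant columns
    span = np.where(hi > lo, hi - lo, 1.0)
    return (train - lo) / span, (test - lo) / span, lo, span

train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
test = np.array([[25.0, 500.0]])
train_s, test_s, lo, span = check_and_scale(train, test)
print(train_s)
print(test_s)
```

Test values can fall outside [0, 1] when they exceed the range seen in training, as the second test column does here.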

  204. Avatar
    Rose January 4, 2021 at 6:17 am #

    Dear Jason,
    I am grateful for the useful tutorial.

    I am a beginner and could not work out which section I should use for my problem, which is *** multi-variate multi-step ahead*** forecasting.

    I am not sure whether I should apply
    NUMBER 4. LSTMs for Multi-Step Forecasting
    or
    NUMBER 7. Encoder-Decoder LSTM Model With Multivariate Input
    I want my model to forecast 9 features, 3 time steps ahead.

    Could you point me to a tutorial that fits my problem? I have seen several links on this topic.

    Which one is appropriate for my problem, *** multi-variate multi-step ahead***?


    • Avatar
      Jason Brownlee January 4, 2021 at 7:58 am #

      You’re welcome.

      Perhaps try both and see which might be a good fit.

  205. Avatar
    Ramin January 12, 2021 at 10:49 pm #

    Hi Jason,

    I hope all is well with you.
    In the LSTM Model With Univariate Input and Vector Output section, you explained that the input and output would be as follows:

    Input, Output
    [x1 x2 x3], [x4 x5 x6]
    [x2 x3 x4], [x5 x6 x7]
    [x3 x4 x5], [x6 x7 x8]
    [x4 x5 x6], [x7 x8 x9]

    As you can see, we have predicted x6, for example, three times. So how is the final predicted value of x6 calculated?

    Thank you for your time.

    • Avatar
      Jason Brownlee January 13, 2021 at 6:14 am #

      When training, this data must be available so we can correct the model.

      Once trained, the output is predicted from the input.

      • Avatar
        Ramin January 13, 2021 at 10:29 pm #

        Thank you for your reply. Unfortunately, I still have a problem. The output, which is predicted from the input, predicts some time steps multiple times, such as x6. So, if I want the final predicted value of x6, how can I know which one is correct?

        We should end up with a single predicted value for x6, shouldn’t we?

        • Avatar
          Jason Brownlee January 14, 2021 at 6:14 am #

          You can frame the problem any way you wish – there is no standard model/standard framing of a prediction problem.
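
One possible framing, sketched here as an assumption rather than anything the tutorial prescribes, is to average every prediction made for the same time step when forecasts from overlapping windows are available. The helper name `average_overlapping` and the prediction values are made up for illustration.

```python
import numpy as np

def average_overlapping(preds, first_step):
    """Average multi-step forecasts made from windows that slide by one step.

    preds[i] holds the forecast from window i, covering steps
    first_step + i up to first_step + i + n_out - 1.
    """
    n_windows, n_out = preds.shape
    n_steps = n_windows + n_out - 1
    total = np.zeros(n_steps)
    count = np.zeros(n_steps)
    for i in range(n_windows):
        total[i:i + n_out] += preds[i]
        count[i:i + n_out] += 1
    steps = np.arange(first_step, first_step + n_steps)
    return steps, total / count

# made-up forecasts for the first three windows in the question
preds = np.array([
    [4.0, 5.0, 6.3],   # from [x1 x2 x3] -> x4 x5 x6
    [5.1, 6.0, 7.0],   # from [x2 x3 x4] -> x5 x6 x7
    [5.7, 7.2, 8.0],   # from [x3 x4 x5] -> x6 x7 x8
])
steps, averaged = average_overlapping(preds, first_step=4)
print(dict(zip(steps.tolist(), averaged.tolist())))
```

If forecasts are instead made once per non-overlapping block, each time step is predicted only once and no combination step is needed.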

  206. Avatar
    begin_ml January 25, 2021 at 6:40 pm #

    Thanks for the great tutorial! How can I get the full set of predictions and metrics?

  207. Avatar
    helpme January 25, 2021 at 8:02 pm #

    How can I plot real values vs. predicted values in this example?

  208. Avatar
    Shilpa January 26, 2021 at 6:47 am #

    Hello Jason,

    How can we predict the output for the next 8 hours for a univariate time series?

    How do we pass hours instead of days?

    Thank you,

    • Avatar
      Jason Brownlee January 26, 2021 at 9:30 am #

      Some ideas:

      – You can model the input and output in hours directly.
      – You can model the output only in hours directly.

  209. Avatar
    Rahul January 28, 2021 at 2:13 pm #

    Hi Jason,

    Thanks for the wonderful article, which helped me a lot in coding along on time series data. However, I am having difficulty standardizing/normalizing the data, or rather inverse-transforming it. The input shape of my train_X is (1000, 60, 7) [nsamples, timesteps, features], and I want to predict the next 5 days of one feature only, so the output shape of my train_y is (1000, 5). After performing the inverse transform on the predictions, I get the error “operands could not be broadcast together with shapes”. Can you please give me a hint or suggestion?


    • Avatar
      Jason Brownlee January 29, 2021 at 5:58 am #

      You’re welcome.

      Each variable would have to be scaled first, perhaps before reshaping.

      This tutorial may also help:

      • Avatar
        Rahul February 2, 2021 at 1:36 am #

        Thanks Jason, I solved that problem. Just one more thing: in RepeatVector(), should the output from the encoder be repeated “input timesteps” times or “output timesteps” times (in my case that would be 5, as I want to predict the next 5 days)?

        Thanks in Advance

        • Avatar
          Jason Brownlee February 2, 2021 at 5:47 am #

          It is repeated “output timesteps” times, once for each time step in the output sequence.

          • Avatar
            Rahul February 3, 2021 at 7:54 am #

            That totally makes sense!
            Once again, thanks a lot Jason!!!! Learning a lot of ML/DL from you 🙂

          • Avatar
            Jason Brownlee February 3, 2021 at 9:23 am #

            You’re welcome!
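
To make the shapes in this exchange concrete, here is a NumPy sketch of what repeating the encoder output does. The function `repeat_vector` is a hypothetical stand-in written for illustration, not the Keras layer itself, and the sizes (16 samples, 200 units, 5 output steps) are example values.

```python
import numpy as np

def repeat_vector(encoded, n_out):
    """Copy each encoded vector once per output time step.

    encoded: [samples, units] -> result: [samples, n_out, units]
    """
    return np.repeat(encoded[:, np.newaxis, :], n_out, axis=1)

encoded = np.random.rand(16, 200)                # encoder output for 16 samples
decoder_input = repeat_vector(encoded, n_out=5)  # 5 = number of days to predict
print(decoder_input.shape)
```

The number of input time steps does not appear anywhere here: the encoder has already compressed them into a single vector per sample.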

        • Avatar
          Maryam August 25, 2021 at 4:55 pm #

          If I may ask, how did you solve that issue with the inverse transform?
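
The thread does not say how the broadcast error was resolved. One common approach, sketched here under that caveat, is to invert the scaling for the target column only, using that column's own minimum and maximum, so that predictions of shape [samples, 5] never need to be padded back to 7 features. All names are illustrative; scikit-learn's MinMaxScaler exposes the same per-column values as `data_min_` and `data_max_`.

```python
import numpy as np

# hypothetical dataset: 100 days, 7 features, target in column 0
data = np.random.RandomState(1).rand(100, 7) * 50.0
target_col = 0

# min-max scale every column, keeping the per-column parameters
data_min = data.min(axis=0)
data_max = data.max(axis=0)
scaled = (data - data_min) / (data_max - data_min)

def invert_target(yhat_scaled, col):
    """Undo min-max scaling using only the target column's parameters."""
    return yhat_scaled * (data_max[col] - data_min[col]) + data_min[col]

# stand-in for a scaled 5-day prediction of the target, shape [1, 5]
yhat_scaled = scaled[10:15, target_col].reshape(1, 5)
yhat = invert_target(yhat_scaled, target_col)
print(yhat.shape)
```

An alternative with the same effect is to fit a second scaler on the target column alone and use it for the inverse transform.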

  210. Avatar
    Murilo Souza February 11, 2021 at 9:56 pm #

    Hello Jason,

    I’ve seen that you imported the layers directly from keras, but I didn’t see something like:

    from tensorflow import keras


    from tensorflow.keras import layers

    Is it no longer necessary to import these from tensorflow?

  211. Avatar
    Myles February 12, 2021 at 3:20 am #


    Love this tutorial. I had a question in regards to the CNN-LSTM Encoder-Decoder Model.

    Why is the x_input reshaped from (14,) to (14, 1) on line 64 and also why is input_x reshaped from (14, ) to (1, 14, 1) on line 103 of the forecast method? I know the 14 pertains to the prior number of days of input but I’m not too sure what the other numbers represent.

    • Avatar
      Jason Brownlee February 12, 2021 at 5:48 am #


      This will help you understand the shape of input for LSTMs:
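
The two reshapes in the question can be traced with NumPy alone. This sketch assumes a univariate window of 14 daily values, as in the question; the variable names are illustrative.

```python
import numpy as np

# 14 prior daily observations of one variable: shape (14,)
x_input = np.arange(14, dtype=float)

# [timesteps, features]: 14 time steps, 1 feature per step
as_steps = x_input.reshape((len(x_input), 1))

# [samples, timesteps, features]: a batch holding this single window
as_sample = as_steps.reshape((1, as_steps.shape[0], as_steps.shape[1]))

print(x_input.shape, as_steps.shape, as_sample.shape)
```

So the trailing 1 is the number of features per time step and the leading 1 is the number of samples passed to the model in that call.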

      • Avatar
        Myles February 12, 2021 at 6:23 am #

        Hi Jason,

        I think I have a handle on it now. I managed to get Multivariate CNN-LSTM and ConvLSTM Encoder-Decoder models working. The next thing is to try to get multi-step working for parallel series.

        I have a few other questions, if you don’t mind.

        1. Is there a reason why there is no validation set included in the examples?
        2. Why is the accuracy metric not included for the model?
        3. In general, would it help the model if I encoded some features such as hour, weekday, month, is_weekend, etc.?
        4. Would it be possible to use these models as part of an ensemble with simpler models like ARIMA?

  212. Avatar
    JG February 12, 2021 at 5:40 am #

    Hi Jason,

    Sorry, I have not (yet) been able to run any experiments or variations on your tutorial code to test for myself the possible answer to the following question, so for now it is a conceptual issue.

    – This time series problem is defined mainly by 7 original features (inputs, i.e. a multivariate problem) and one output (the global active power observation). Suppose I eliminate 6 features and retain only the input associated with “global active power”, converting the problem to a univariate one that predicts the same variable (“global active power”). Compared with your multivariate problem, what would the result be? Better or worse?

    I mean, I do not understand what the model learns (or what value is added) from the 6 additional inputs/features, or whether this learning is better than using a single time series input alone to predict the future of this variable.

    • Avatar
      Jason Brownlee February 12, 2021 at 5:53 am #

      The additional inputs may or may not help predict the target.

      This would be an open question on any multivariate dataset, so we would design experiments to answer this question.

  213. Avatar
    Areti February 17, 2021 at 5:16 am #

    I have just read your article. Another great tutorial with meticulous work! I would like to ask whether the function to_supervised does the same job as the function series_to_supervised from your tutorial How to Convert a Time Series to a Supervised Learning Problem in Python.
    I am quoting the functions here.
    Thank you in advance!

    • Avatar
      Jason Brownlee February 17, 2021 at 5:32 am #

      Yes, the function does the same job – based on the same code.

      The function in the dedicated tutorial is more flexible.

      • Avatar
        Areti February 17, 2021 at 9:51 pm #

        Thank you for your prompt reply!

  214. Avatar
    JG February 21, 2021 at 4:52 am #

    Hi jason,

    Great tutorial, with very interesting Python coding details!

    That said, I find the data preparation a little confusing, particularly the splitting of the time series into inputs (X) and outputs (y), which happens inside other generic functions. I also do not see the need
    to define custom forecast and evaluation functions instead of using the quicker and simpler generic methods of Keras models (.predict(), .evaluate()).

    I am sharing my experiments, which were mainly aimed at a clearer code structure:

    One of the big and time-consuming parts of ML coding is data preparation (e.g. the “X” and “y” inputs and outputs for supervised learning). It is especially critical when splitting time series based on previous input “lags” (and/or multiple variables) with a multi-step output forecast (e.g. vector output).

    – I realised that you convert the daily dataset into weeks, but later feed the model in days (which forces the reverse weeks-to-days conversion). So I removed this part of the code for clarity.

    – I also grouped your two different splits into one common data preparation function: the first is the train/test split, resulting in a 2D numpy array [days, 8 features], and the second, performed inside other functions, is the input-output split of each of train and test, resulting in a 3D numpy array [days, lags, 1 feature].

    – I also decided to use the Keras model methods directly, such as model.predict() and model.evaluate(), after first adjusting x_test and y_hat so that equivalent data is compared. I also used a custom “rmse” metric, defined as a function and passed to the metrics argument of model.compile(), so I do not need to compute RMSE manually.

    – I got a total RMSE of 251.5, which is significantly better than your first overall RMSE result (399.4) for the first simple LSTM model, with 7 days of lagged inputs and 7 days of output predictions under walk-forward validation for the univariate (single feature) model.

    – My recommendation

    As I said, because preparing time series data manually can be confusing, I strongly recommend using your own function defined in this post ( )

    which provides a clear split of lagged previous inputs, for univariate or multivariate data, and for single or multiple time step output forecasts (sometimes called vector output). I very much appreciate that clear function you wrote.

  215. Avatar
    Murilo Souza February 22, 2021 at 11:54 pm #

    Hello Jason,

    I have two more questions here:

    1 – You said here that LSTM models expect data in the shape of [samples, timesteps, features]. Does this apply to all RNN models used for forecasting?

    2 – In the build model function, the LSTM layer has an input_shape=(n_timesteps, n_features). Why is the number of samples not present here, given what was said in my first question?

    Thanks again for the great tutorial!

    • Avatar
      Jason Brownlee February 23, 2021 at 6:20 am #


      We don’t need to specify the number of samples in the model config as it is expected to vary.

  216. Avatar
    shebiimalik February 23, 2021 at 12:35 am #

    Hi Jason,
    It’s a really helpful article.

    Have you tried a simple CNN-LSTM (using the minutely dataset and a lag size of 60)?
    I mean the shape of train_x and test_x would be (12345, 60, 7), where 12345 is just an example for the total size of the data.

    If you have tried it, please share.
    I mean a simple CNN-LSTM,
    not a CNN-LSTM encoder-decoder.

    Only a simple CNN-LSTM.

    • Avatar
      Jason Brownlee February 23, 2021 at 6:21 am #


      It is straightforward to adapt the above examples into these additional models that interest you.

  217. Avatar
    Vishy March 5, 2021 at 5:29 am #

    I am trying to run the code in this article. I am not able to copy the code in the panels. What do I need to do?


    • Avatar
      Jason Brownlee March 5, 2021 at 5:36 am #

      Click the code, press control-C to copy it, then paste it into your text file with control-V.

  218. Avatar
    JG March 6, 2021 at 9:26 pm #

    Hi Jason,
    There is a lot of information in this tutorial. Thank you!

    Please, an additional question: how can I repeat blocks or layers inside a model to get a deeper or more precise time series encoder-decoder model?

    I mean, when we define an MLP model I can repeat the Dense layer several times, and in a convolutional model (e.g. VGG16) I can repeat blocks of Conv2D + Conv2D + MaxPooling several times. That is clear.

    But what about your final ConvLSTM2D time series model? I do not see how to repeat a block, e.g. ConvLSTM2D + Flatten + RepeatVector + LSTM, or even a single ConvLSTM2D or LSTM layer (taking into account the precise matching of layer dimensions).
    I only see that I can repeat the final dense layers by repeating TimeDistributed(Dense).

    Any suggestion? Thanks

    • Avatar
      Jason Brownlee March 7, 2021 at 5:13 am #

      Good question, you would just repeat the LSTM layer or just the ConvLSTM2D layer.

      You might also want to work in some dropout and/or batchnorm into your blocks.

  219. Avatar
    Willem Pretorius March 8, 2021 at 1:25 am #

    Hi, how would one make out-of-sample 1-step-ahead forecasts? You train the model on the training set, test the trained model on the test set, and then use evaluation metrics to check the goodness of fit. My question is, how do I predict 1 value ahead in time (1 step beyond the test set)? In other words, let’s say you have data from Jan 2020 to Dec 2020, and I want to use any of these models to predict a value for the following month, Jan 2021 (the index falls outside the test set).
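
A minimal sketch of an out-of-sample forecast, assuming the usual pattern of taking the last `n_input` observations of all available data as the model input. `forecast_next` and `toy_predict` are hypothetical names; `toy_predict` stands in for a fitted model's `predict()` so the example runs on its own.

```python
import numpy as np

def forecast_next(predict_fn, series, n_input):
    """Forecast beyond the end of the data from the last n_input observations."""
    last_window = np.asarray(series[-n_input:], dtype=float)
    # [samples, timesteps, features] with one sample and one feature
    input_x = last_window.reshape((1, n_input, 1))
    return np.asarray(predict_fn(input_x))[0]

def toy_predict(input_x, n_out=1):
    # hypothetical stand-in for model.predict: mean of the input window
    return np.full((input_x.shape[0], n_out), input_x.mean())

monthly = np.arange(1.0, 13.0)   # e.g. Jan 2020 to Dec 2020
next_value = forecast_next(toy_predict, monthly, n_input=3)
print(next_value)                # forecast for the step after the last observation
```

Once evaluation is complete, it is common to refit the chosen model on all available data (train and test together) before making such a forecast.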

  220. Avatar
    JG March 8, 2021 at 8:05 am #

    thanks Jason!

  221. Avatar
    Faisal Mohammad March 9, 2021 at 1:58 pm #

    hi Jason,

    Hope you are well. I am trying to train the models with hourly data, but I cannot work out how to pass the parameters to the split data function.

    • Avatar
      Jason Brownlee March 10, 2021 at 4:37 am #

      What problem are you having precisely? Perhaps you can summarize it in a few sentences?

  222. Avatar
    Ramin March 10, 2021 at 9:00 pm #

    Hi Jason,

    Thank you for everything.

    I have a question regarding the output vector. Is it really logical to use Dense(number of outputs)? I mean, how can it understand the sequence? If we are predicting the next 7 days (Dense(7)), how can we be sure that the first dense neuron is for the first day?

    Shouldn’t we just use TimeDistributed(Dense(1)) instead of this vector output? How can I know which one is better and correct?

    Thank you for your time. Please help me here.



    • Avatar
      Jason Brownlee March 11, 2021 at 5:11 am #

      There are many ways to design a model, perhaps try a few approaches and discover what works best for your dataset.

  223. Avatar
    Momo March 14, 2021 at 8:33 pm #

    Hello Jason,
    Thank you for your introduction to developing LSTM models. I have a related question: is it true that the longer the input sequence, the more accurate the prediction? And how should I choose the length of the input sequence?

    • Avatar
      Jason Brownlee March 15, 2021 at 5:57 am #

      You’re welcome.

      Longer sequences may or may not be helpful. I recommend testing to discover the answer for your specific model and dataset.

  224. Avatar
    Mehedee Hassan March 30, 2021 at 6:47 pm #

    About the last code block before “Model Evaluation”, I think there is a mistake in line 16: it should be sep=”,” instead of sep=”;”

    • Avatar
      Jason Brownlee March 31, 2021 at 6:00 am #

      No, it is correct. Columns are separated by “;” in the raw file.

  225. Avatar
    Engineer April 11, 2021 at 2:20 am #

    Hello Sir,

    This article is an excellent tutorial and most of the times if I have any doubt, I visit your articles for a resolution.

    You have discussed the combination of CNN and LSTM for time series. However, I have a query: is the combination in the other order, with the LSTM first and then the CNN, useful?

    For example, using an LSTM with return_sequences=True followed by Conv1D?
    If it is useful, can you suggest any references or papers?

    • Avatar
      Jason Brownlee April 11, 2021 at 4:55 am #


      No, I don’t think LSTM-CNN in that order would be helpful. Perhaps try it to confirm.

  226. Avatar
    Rodney April 12, 2021 at 8:24 am #

    This tutorial is GREAT! I am confused with the dimension that the test dataset should have in the prediction step. I think I may be missing something. Could you explain, please?

    I have a univariate time series (just one vector), which I divided in train and test. I want to predict the next 12 points, so I made my test data to have only the last 12 obs of the data.

    I took the train vector and implemented your overlapping window approach. So now I have X_train = [105, 12, 1] and y_train = [105, 12]. Should my test vector have the [12, 12, 1] dimension? I did not understand which size it should have. From the evaluate_forecasts function I understood it should have a 3D format, right?

  227. Avatar
    Sanket April 28, 2021 at 9:45 pm #

    Hi Jason,

    This tutorial is great!

    I need your guidance on how I can apply it to a near-real-time business scenario.

    For example, I need to submit a prediction/forecast at the start of each day after training the model on previous data, and the LSTM takes time to train and forecast with limited GPU resources.

    Could you please suggest a possible way to implement an LSTM in a near-real-time business scenario?

    • Avatar
      Jason Brownlee April 29, 2021 at 6:27 am #

      Perhaps you can re-train your model each evening?

  228. Avatar
    Arundhuti Bushra April 29, 2021 at 4:22 am #

    This code is very helpful. Could you please share code for a solar PV power generation forecast as well? I’m new to machine learning and am building interest by trying out your code.


  229. Avatar
    Ibtissam Saadi May 10, 2021 at 10:43 pm #

    Thank you so much, it is very helpful.
    Can I use the ConvLSTM2D model with multivariate input and multiple outputs?

    • Avatar
      Jason Brownlee May 11, 2021 at 6:41 am #


      • Avatar
        basma May 14, 2021 at 7:48 pm #

        Thank you for your response.
        I have another question: in my case I am using the ConvLSTM2D model for traffic flow prediction (5 features: flow, temp, rain, density and speed). When predicting, all of these features are used to predict flow, including column number one (flow) itself. Is that normal?

        I’m confused about the result of the prediction. How will the forecast values help me?

        • Avatar
          Jason Brownlee May 15, 2021 at 6:32 am #

          Perhaps try a few different models and different framing of your prediction task and discover what works well or best.

  230. Avatar
    basma May 15, 2021 at 6:51 pm #

    In this tutorial, why do the results of all models show just a plot of loss? I mean, how do these predicted values help me?

  231. Avatar
    basma May 25, 2021 at 1:14 am #

    Hello, can you please help me plot the curves of actual and predicted values?

    • Avatar
      Jason Brownlee May 25, 2021 at 6:09 am #

      There are many examples on the blog.

      This may help:

      • Avatar
        basma May 25, 2021 at 6:37 pm #

        Thank you for your response.
        1/ I’m confused when I try to plot the actual and predicted values for 12 inputs and 12 outputs [vector output] (the tutorial has 7 outputs) to validate my work. My result shows 12 values of the actual time series and 12 values of the predicted time series in the same plot.
        Is that acceptable? In most of the articles I have read, I found one line for actual and one for predicted.

        2/ If the result shown is acceptable, what should I do for the comparison with baseline models? I mean, when comparing the results of each model with the actual data in the same plot, I think the result of each model merges with the other models.

        3/ What’s your opinion about plotting actual/predicted values for the results of this tutorial?

        • Avatar
          Jason Brownlee May 26, 2021 at 5:53 am #

          The model will make a one week prediction that can be compared to expected values in a plot.

          I recommend comparing error metrics to baseline models, not predictions directly.

          Nevertheless, you can adapt the code however you like for your project and requirements.

          • Avatar
            basma May 26, 2021 at 1:57 pm #

            Thank you so much.
            I don’t know how to compare my models with baseline models. I mean, should I build the baseline model in my code, or take the result from the article?

            ==> How can I plot the result of rooms for each baseline model together with my models in the same plot? In this case, should I build each baseline model in my code to plot the result?

          • Avatar
            Jason Brownlee May 27, 2021 at 5:34 am #

            A baseline model may be a persistence model or the mean of the historical data.

            You can first evaluate this model, then use the score to determine if other models have skill or not:
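
A minimal sketch of such a baseline, assuming data already grouped into weeks of 7 daily values and a weekly persistence rule (forecast the coming week as a copy of the last observed week). The function names and the toy numbers are illustrative only.

```python
import numpy as np

def weekly_persistence(history):
    """Forecast the coming week as a copy of the last observed week."""
    return history[-1].copy()

def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def walk_forward_baseline(train_weeks, test_weeks):
    """Score the persistence forecast with walk-forward validation."""
    history = [week for week in train_weeks]
    predictions = []
    for week in test_weeks:
        predictions.append(weekly_persistence(history))
        # the true week becomes available before the next forecast
        history.append(week)
    return rmse(np.array(test_weeks), np.array(predictions))

# toy data: each row is one week of 7 daily totals
train_weeks = np.array([[1.0] * 7, [2.0] * 7])
test_weeks = np.array([[3.0] * 7, [5.0] * 7])
score = walk_forward_baseline(train_weeks, test_weeks)
print(round(score, 3))
```

A more complex model has skill on this framing only if its RMSE under the same walk-forward procedure is lower than this score.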

  232. Avatar
    Peter May 28, 2021 at 6:48 pm #

    Thanks Jason for this article,
    I have a couple of questions regarding the Univariate Encoder-Decoder LSTM:

    1) You wrote regarding the benefits “The important difference is that an LSTM model is used in the decoder, allowing it to both know what was predicted for the prior day in the sequence and accumulate internal state while outputting the sequence.”
    –> Is this not also the case for a normal LSTM or RNN? They can also accumulate internal state. I do not see why this is an advantage.

    2) Why do you use “model.add(RepeatVector(7))” and why do you choose 7? What is done here? I have read on other sites that one advantage of an Encoder/Decoder LSTM is that it has a variable output size. Fixing it at 7 gives up this advantage, and I do not see any benefit to this approach.

    3) You wrote (about the RepeatVector) “This means that each of the 200 units will output a value for each of the seven days, representing the basis for what to predict for each day in the output sequence.”
    –> Is the number of units in the LSTM (here 200) not just the size of the hidden vectors? The basic number of LSTM units should be equal to the length of the sequence.

    4) Why do you use the TimeDistributed layer only in the decoder part and not, for example, in the encoder or in the vanilla LSTM?

    • Avatar
      Jason Brownlee May 29, 2021 at 6:49 am #

      It may or may not be the case with simpler models, likely it is but to a lesser degree.

      We choose 7 for the number of time steps desired in the output sequence. We duplicate the bottleneck layer 7 times.

      Yes, each output time step is conditional on the encoded input sequence and the prior output time step.

      The number of nodes was arbitrarily chosen based on trial and error. Unrelated to sequence size.

      More on the time distributed layer:

      • Avatar
        Peter June 1, 2021 at 6:37 pm #

        Thanks Jason for your answer. I have some follow-up questions and I’d highly appreciate it if you could answer them:

        What I do not understand is why in the Vanilla LSTM the model predicts n_outputs = 7 days with model.add(Dense(n_outputs)), while the Encoder-Decoder LSTM predicts just 1 day with model.add(TimeDistributed(Dense(1))). Why do you not predict multiple days with the Encoder-Decoder LSTM? I read your text about it but I do not understand the idea behind the change of the prediction horizon. What is even more confusing for me is that the input data structure is the same for both LSTM versions, although one of them maps the input to 1 output value and the other maps the input to 7 output values.

        In fact I tried to increase the number of days to 7 with model.add(TimeDistributed(Dense(7))), but I received an error message because of the data format. How do I have to change the data format to predict 7 days in advance with the LSTM?

        • Avatar
          Jason Brownlee June 2, 2021 at 5:41 am #

          Both models approach the problem differently, e.g. vector output vs one-time step output.

          This tutorial explains each model in turn, no need for me to repeat here:

          Use the approach that works best for your dataset.

          • Avatar
            Peter June 7, 2021 at 8:33 pm #

            Thanks a lot Jason for your answer,

            I read your suggested tutorial, but I still do not understand how to change the Encoder-Decoder LSTM, or a stacked LSTM with return_sequences=True, to predict not 1 time slot but 7 by using the code model.add(TimeDistributed(Dense(7))). I get an error message saying “InvalidArgumentError: Incompatible shapes: [16,7,7] vs. [16,7]”. This is basically the same error message that I get when using return_sequences=True in the Vanilla LSTM. So my question is either how to predict more than 1 output in the Encoder-Decoder LSTM or how to set return_sequences=True in the Vanilla LSTM. What do I have to change in the training data, and how can I do that?

            I’d appreciate every comment and would be quite thankful for your help.

          • Avatar
            Jason Brownlee June 8, 2021 at 7:15 am #

            You will need to change your data to match the expectations of the model.
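
The shape mismatch in the error message can be reproduced with NumPy. This sketch assumes a decoder that emits one vector per output day, followed by a dense layer applied separately to each time step, which is what a TimeDistributed wrapper does. `time_distributed_dense` is a hypothetical stand-in with random weights, and the sizes (batch of 16, 7 days, 100 units) are example values.

```python
import numpy as np

rng = np.random.RandomState(0)
batch, n_out, units = 16, 7, 100

# stand-in for the decoder output: one vector per output day
decoder_out = rng.rand(batch, n_out, units)

def time_distributed_dense(x, n_nodes):
    """Apply the same dense weights independently at every time step."""
    w = rng.rand(x.shape[-1], n_nodes)
    b = np.zeros(n_nodes)
    return x @ w + b

one_per_step = time_distributed_dense(decoder_out, 1)    # [16, 7, 1]
seven_per_step = time_distributed_dense(decoder_out, 7)  # [16, 7, 7]

# targets from a [samples, 7] array gain a trailing feature axis
train_y = rng.rand(batch, n_out)
train_y_3d = train_y.reshape((batch, n_out, 1))

print(one_per_step.shape, seven_per_step.shape, train_y_3d.shape)
```

Read this way, one node per time step over 7 repeated steps already yields a 7-day forecast, and its output lines up with targets reshaped to [samples, 7, 1]. Seven nodes per step would instead produce 7 values for each of the 7 days, which is the [16,7,7] side of the reported mismatch.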

          • Avatar
            Peter June 8, 2021 at 7:19 pm #

            Thanks Jason for your answer. But how can I change the data so that I can forecast not just 1 but 7 time slots? Do I have to change the to_supervised function? If so, how can I do that?

          • Avatar
            Jason Brownlee June 9, 2021 at 5:42 am #

            Yes, just change the parameters to the to_supervised() function when preparing your data.

          • Avatar
            Peter June 9, 2021 at 5:14 pm #

            Thanks Jason for your answer. I really appreciate it. The question is how I have to change the to_supervised() function in order to forecast not 1 but 7 time slots. For that I have 3 questions.

            I think I only have to change the y-labels in the function, right (Question 1)?

            For a 1-timeslot forecast (as you implemented it) the code for the y-array is:
            y.append(data[in_end:out_end, 0]). This leads to an output array of (995,7), while the input x-array has the format (995,7,8) after the to_supervised function. If I understand correctly (I might be wrong on this one), I need the labels in the y-array to have an output shape of (995,7,7). The first parameter is the batch size, the second is the number of time slots for looking back (past data), and the third is the number of time slots for the future prediction. Is this correct (Question 2)?

            If so, how can I implement this in the code (Question 3)? I tried the code y.append(data[in_end:out_end, in_end:out_end]) but it produces an error message. Then I tried y.append(data[in_end:out_end, in_end+7:out_end+7]). The idea was that instead of using only 1 value for the third dimension, I should use 7, and that those 7 values are exactly the same as the past data (2nd argument of the desired (955,7,7) array) but shifted 7 time slots into the future. Unfortunately it does not work as intended, as it creates an array of the shape (955, 7, 0). Do you have any suggestions for this?

          • Avatar
            Jason Brownlee June 10, 2021 at 5:24 am #

            The function does not need to change, only the parameters to the function – as I mentioned. If this is challenging, perhaps this tutorial will help:

          • Avatar
            Peter June 10, 2021 at 8:06 pm #

            Thanks a lot Jason for your answer. I really appreciate it.

            I read your link but it did not help. It just says that I have to adjust the parameters of the function (train, n_input, n_out), exactly as you said. This still does not work with your suggested code. I have n_out at 7 but still get the error “[16,7,7] vs. [16,7]” when trying to predict multiple time slots with the Encoder-Decoder LSTM (by using model.add(TimeDistributed(Dense(7)))). It also produces an error when I use the Vanilla LSTM with return_sequences=True.

            So I have to prepare the y label data in the format [16,7,7], and the question is how I can do this with your code. What do I have to change in order to do this? Your suggested approach of varying the input parameters of the “to_supervised” function does not do that.

            Do you have a tutorial where you predict more than 1 time slot (using model.add(TimeDistributed(Dense(7)))) with an LSTM with return_sequences=True? This would help a lot, as I have read this and the other suggested articles over and over again, but I am still quite confused about how to prepare the input data for such an LSTM with return_sequences=True and multiple forecasting steps.

          • Avatar
            Jason Brownlee June 11, 2021 at 5:14 am #

            Sorry, I don’t have the capacity to adapt the code for you.

            Yes, these examples are generic and may provide a better starting point for you to adapt for your needs:

            There are also many other tutorials on the blog that may help as a starting point for your project.

          • Avatar
            Peter June 11, 2021 at 4:42 pm #

            Thanks Jason for your answer. I really appreciate it.

            I can absolutely understand that you can’t do the coding for me. But can you tell me, from a basic point of view, what has to be changed in order to forecast multiple steps by using model.add(TimeDistributed(Dense(7))) and return_sequences=True? As far as I can see, this is not done in your suggested article.

            So basically you do not have a tutorial where you do this? Is this not a common case for LSTM forecasting, using return_sequences=True to get more training examples for backpropagation and to forecast multiple time steps?

          • Avatar
            Peter June 15, 2021 at 6:29 pm #

            @Jason: Any comments on my last comment? I’d highly appreciate any further comment from you.

          • Avatar
            Jason Brownlee June 16, 2021 at 6:19 am #

            I’ve explained the same thing a few ways, I don’t think I am the best person to help you, sorry.

          • Avatar
            Peter June 17, 2021 at 8:23 pm #

            Thanks a lot Jason for your answer,

            You said “I’ve explained the same thing a few ways”. Where exactly did you explain the problem of using return_sequences=True and predicting multiple time slots with an LSTM (or an RNN)?

            Do you have a tutorial on your blog that deals with using return_sequences=True and predicting multiple time slots with an LSTM (or an RNN)? I searched for one and could not find it.

            Or do you have a tutorial with a comprehensive explanation of how to prepare the data for LSTMs with return_sequences=True and multi-step prediction? I have read a lot of your tutorials, but you always seem to leave that case out (which I do not understand at all, because I think it is quite a common use case). But maybe I just missed one.

          • Avatar
            Peter June 21, 2021 at 4:56 pm #

            Hi Jason,

            thanks for your answers. Any comments on my last comments? I’d highly appreciate any further comment from you, as I have crucial problems understanding your tutorial.

          • Avatar
            Jason Brownlee June 22, 2021 at 6:27 am #

            I recommend adapting an existing tutorial for your needs.

          • Avatar
            Peter June 22, 2021 at 6:18 pm #

            Thanks Jason for your comment. I really appreciate it.

            But how can I adapt the tutorial? Do you have a comprehensive tutorial where you explain how to prepare the data for an LSTM in a general way? I worked through 5 of your tutorials about LSTMs, but it was never really explained in a general way how to prepare the data for an LSTM. In particular, the common case of using return_sequences=True and predicting multiple time steps was not covered, and there were also no explanations of how to adapt the data preparation procedure for it.

            There are two things that should be explained: 1) what format the data should have and 2) what the code should look like. Although I have spent quite a lot of time with your tutorials, I do not understand how to do either of those tasks.

            I can understand that you can’t explain task 2), but do you have a tutorial that covers task 1) in a comprehensive way, so that I can think about how to implement the very common case of prediction with return_sequences=True and multiple prediction horizons?

          • Avatar
            Peter June 24, 2021 at 12:26 am #

            Thanks Jason for your answer and effort,

            I read your suggested link and the things mentioned there are clear to me. Still, I do not know how I should prepare the y-label data when using return_sequences=True and wanting to predict multiple time steps.

            In the example of this tutorial the y-label data has the format (995,7), which is basically [Samples, Timesteps], right? Do I need something like (995,7,7) when using return_sequences=True and wanting to predict multiple time steps, which would be [Samples, Timesteps, Prediction horizon]?

            The x-data for training in your suggested code has the format (995,7,8), which is [Samples, Timesteps, Features].

          • Avatar
            Jason Brownlee June 24, 2021 at 6:04 am #

            Generally, predicting sequences does not involve setting return_sequences=True. That is only used as input into another layer.

          • Avatar
            Peter June 24, 2021 at 5:56 pm #

            Thanks Jason for your answer, I really appreciate it,

            Isn’t the benefit of “return_sequences=True” at the last layer that the model computes the loss and backpropagates at every time step? So it gets more feedback, which eventually helps with convergence and better learning?

            Excerpt from the Hands-on ML book: “instead of training the model to forecast the next 10 values only at the very last time step, we can train it to forecast the next 10 values at each and every time step. In other words, we can turn this sequence-to-vector RNN into a sequence-to-sequence RNN. The advantage of this technique is that the loss will contain a term for the output of the RNN at each and every time step, not just the output at the last time step”

            As far as I understand, return_sequences=True leads to better training results. The big question for me, which I have not figured out yet, is what format the y-label data must have for predicting multiple time steps with return_sequences=True (and afterwards how to implement this in Keras).
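
A minimal NumPy sketch of the target layout described in the excerpt, assuming a univariate series: every input time step gets its own vector of the next `horizon` values, so y takes the shape [samples, timesteps, horizon]. The helper name `make_seq2seq_targets` and the toy series are hypothetical and are not part of the tutorial.

```python
import numpy as np

def make_seq2seq_targets(series, n_steps, horizon):
    """Hypothetical helper: X is (samples, n_steps, 1), Y is (samples, n_steps, horizon)."""
    X, Y = [], []
    # the last usable window must leave room for the horizon after its final step
    last_start = len(series) - n_steps - horizon
    for start in range(last_start + 1):
        X.append(series[start:start + n_steps])
        # for each input step t, the target is the `horizon` values that follow it
        Y.append([series[start + t + 1:start + t + 1 + horizon] for t in range(n_steps)])
    X = np.array(X, dtype=float)[..., np.newaxis]
    Y = np.array(Y, dtype=float)
    return X, Y

series = np.arange(30, dtype=float)  # toy data: 0, 1, ..., 29
X, Y = make_seq2seq_targets(series, n_steps=7, horizon=7)
print(X.shape, Y.shape)
```

With targets in this layout, a model whose last recurrent layer returns sequences and is followed by TimeDistributed(Dense(horizon)) produces output with the same three dimensions; a two-dimensional y such as (samples, 7) would not match it.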

          • Avatar
            Jason Brownlee June 25, 2021 at 6:11 am #

            I have not used that approach in any of my models. Perhaps I am not the best person to give you advice on your model.

          • Avatar
            Peter June 25, 2021 at 5:05 pm #

            Hi Jason,

            thanks for your answer. I really appreciate it.

            Well, you have used both approaches (multi-step time series forecasting and return_sequences=True) in your tutorials separately, but not together.

            So I am wondering why you have not used them together. For me it makes a lot of sense to use them together. return_sequences=True is beneficial for training (see my previous post), and multi-step time series forecasting is generally quite common, as you normally do not forecast just 1 time step as you do (when not using return_sequences=True).

          • Avatar
            Jason Brownlee June 26, 2021 at 4:54 am #

            I use return_sequences=True in LSTM layers that provide sequences to subsequent layers, not for output directly. I do not believe it is appropriate to provide output directly in this way.

          • Avatar
            Peter June 29, 2021 at 2:38 am #

            Thanks Jason for your answer and effort. I really appreciate it.

            Basically I also use return_sequences=True only for the subsequent layers, not for the output. I just adjusted your example. Using the model type from your suggested code in the tutorial, it looks like this:
            #Stacked LSTM
            model = Sequential()
            model.add(LSTM(200, activation='relu', return_sequences=True, input_shape=(None, n_features)))
            model.add(LSTM(200, activation='relu', return_sequences=True))

            Basically this works. However, when I want to predict more than 1 time step (which is very common in time series forecasting, as it does not make much sense to predict only 1 time step as you do in your tutorial), then I have the following code:

            #Stacked LSTM
            model = Sequential()
            model.add(LSTM(200, activation='relu', return_sequences=True, input_shape=(None, n_features)))
            model.add(LSTM(200, activation='relu', return_sequences=True))

            But here I get the error message “InvalidArgumentError: Incompatible shapes: [250,7,7] vs. [250,7]”. So my question is: how can I adjust the training data so that I can use return_sequences=True (which is beneficial for the training process), not for the output layer but for the layers before it, and still predict more than 1 time step?

          • Avatar
            Jason Brownlee June 29, 2021 at 4:51 am #

            I would recommend not using return_sequences=True in the second LSTM layer.

          • Avatar
            Peter June 29, 2021 at 11:24 pm #

            Thanks Jason for your answer and effort, I really appreciate it.

            Basically I tried what you said and it works only if I do not use a TimeDistributed layer. Do you think that it is generally okay not to use a TimeDistributed layer or is the TimeDistributed layer important for good results?

            But generally thanks a lot for your great help.

          • Avatar
            Jason Brownlee June 30, 2021 at 5:20 am #

            It is not for me to say. You use whatever model you like that gives the best performance on your specific dataset.

  233. Avatar
    ezer elbaz June 1, 2021 at 5:53 am #

    “Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome”

    Is there a way to add automatic repetition to the model, with a variable n for the number of repetitions?

    thanks 🙂

    • Avatar
      Jason Brownlee June 2, 2021 at 5:34 am #

      You could try to repeat the walk-forward validation process, e.g. wrap the whole thing in a loop and average the results.
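
One way this suggestion might look in code, as a sketch only: the stand-in function `evaluate_once` below is hypothetical and simply returns noisy numbers in place of a real fit-and-evaluate run, so that the looping and averaging logic can be run on its own.

```python
import numpy as np

def evaluate_once(rng):
    """Stand-in for one full walk-forward evaluation: returns (overall score, per-day scores)."""
    per_day = 400.0 + rng.normal(0.0, 10.0, size=7)      # fake per-day RMSE values
    overall = float(np.sqrt(np.mean(per_day ** 2)))      # fake overall RMSE
    return overall, per_day

def repeat_evaluation(n_repeats, seed=1):
    """Run the evaluation n_repeats times and collect every result."""
    rng = np.random.default_rng(seed)
    overall_scores, daily_scores = [], []
    for _ in range(n_repeats):
        score, scores = evaluate_once(rng)
        overall_scores.append(score)
        daily_scores.append(scores)
    return np.array(overall_scores), np.array(daily_scores)

overall, daily = repeat_evaluation(10)
print('overall RMSE: mean=%.3f std=%.3f' % (overall.mean(), overall.std()))
print('mean RMSE per day:', daily.mean(axis=0).round(1))
```

In a real run, the body of `evaluate_once` would be replaced by a call that fits and evaluates the model; reporting the standard deviation alongside the mean shows how much of a difference between configurations is just run-to-run noise.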

      • Avatar
        ezer elbaz June 8, 2021 at 6:48 pm #

        thanks Jason,

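        I wrote a simple repeat loop with two lists that I append to, so that afterwards I can calculate the statistics and plot by day (scores) or in total (score).

        all_score = []
        all_scores = []

        num = 10
        for x in range(num):
            n_input = 1
            score, scores = evaluate_model(train, test, n_input)
            # collect the overall and per-day RMSE of each repeat
            all_score.append(score)
            all_scores.append(scores)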

        I have a few more questions:

        n_input: I saw in one of the examples that the number refers to the number of days used as input, but if the data is divided into 7-day weeks, why is n_input not equal to 1 week?

        If it is in days, is it possible to input 1 day and forecast 7, or must the minimum be 7 days?

        thanks again for this great article and for your responsiveness and help with questions 🙂

        • Avatar
          Jason Brownlee June 9, 2021 at 5:42 am #

          The number of input time steps can be anything you want – whatever leads to the best prediction results.

          Forecasting 7 days from 1 day input sounds hard for the model to do. Perhaps try it and see.

  234. Avatar
    LoudEye June 1, 2021 at 3:51 pm #

    This does not work.

  235. Avatar
    betty June 5, 2021 at 4:47 am #

    Hello sir,

    I want to ask you about the CNN in this case. Is the filter used to extract the features?
    Can you give me an example of those features?

    In my traffic flow prediction problem, I use a CNN to extract the spatial features.
    I read this sentence in an article but I don’t understand what it means: “Some unnecessary
    information is filtered out during the pooling process to obtain more abstract data”.
    Can you help me understand what it means? I need an example of “unnecessary
    information” and “abstract data”.

    Another question, please: do you have an idea about complex linear traffic flow data?
    What does “linear traffic flow data” mean?

  236. Avatar
    basma June 6, 2021 at 8:03 pm #

    Hello sir,
    thanks for this tutorial. I want to ask you about the encoder-decoder ConvLSTM.
    Where is the decoder in this case? ConvLSTM2D is the encoder!
    Can I consider the LSTM layer to be the decoder?

    Second question, please: in my prediction problem I use 3 ConvLSTM2D layers as the encoder and 1 Bi-LSTM layer as the decoder. Is that possible?

    Thank you.

  237. Avatar
    Bill June 10, 2021 at 4:20 am #

    Can you explain in more detail why you split the dataset into train and test in the format [len(), 7, 8]? I can’t understand how the 8 works, and why does the train set in our problem become bigger with this?

    Thank you in advance

  238. Avatar
    Kostas June 10, 2021 at 7:37 am #

    Can I make out-of-sample predictions using LSTM and if yes, how?

  239. Avatar
    Bill June 11, 2021 at 3:25 am #


    Can an LSTM forecast out-of-sample data?


    • Avatar
      Jason Brownlee June 11, 2021 at 5:16 am #

      Yes, call model.predict() with the input required by your model.

      There are many examples on the blog.

      • Avatar
        Bill June 13, 2021 at 1:51 am #

        First of all thank you for your response!
        Yes, but in your examples you call model.predict() with testX, where testX has the actual values inside. For example, testX[2]=testY[1], so it is not out-of-sample. I’ve tried to make a prediction and then use this prediction as input to get another prediction, but it didn’t work; that’s why I am asking you.

        Thank you in advance

        • Avatar
          Jason Brownlee June 13, 2021 at 5:50 am #

          The input X to predict() is whatever data is required to predict y. The X and y parts of a sample are related, and the definition of X and y is provided by you when you prepare the data.

          For example, maybe your model takes 7 days of input to predict 7 days. Therefore, to predict the next 7 days you need the previous 7 days of input.
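
A small sketch of that idea, assuming the history is stored as an array of weeks with shape [weeks, 7, features]: the most recent `n_input` days are cut out and reshaped into the single three-dimensional sample that would be passed to model.predict(). The helper name `last_window` and the toy data are hypothetical.

```python
import numpy as np

def last_window(history, n_input):
    """Return the most recent n_input days as one sample of shape (1, n_input, features)."""
    data = np.array(history)
    # flatten [weeks, 7, features] into one long run of days: [days, features]
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    window = data[-n_input:, :]
    return window.reshape((1, n_input, window.shape[1]))

# toy history: 3 weeks of 7 days with 8 features each
history = np.arange(3 * 7 * 8, dtype=float).reshape(3, 7, 8)
input_x = last_window(history, n_input=7)
print(input_x.shape)
```

For a true out-of-sample forecast, the history would simply be all the data observed so far, and the prediction made from `input_x` would cover the 7 days that follow it.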

          • Avatar
            Bill June 15, 2021 at 3:51 am #

            OK, thank you!
            One last thing: when you provide model.predict() with the last elements of train, does this predict the next week or the last week of train?

          • Avatar
            Jason Brownlee June 15, 2021 at 6:08 am #

            Sure, you can do that if you like.

  240. Avatar
    Said June 11, 2021 at 11:27 pm #

    Hello Jason, I really appreciate your efforts. Your blog is amazing and really helpful.
    My question is: when I extract the curves for the actual and predicted values from the code to plot them, the predicted values are shifted and delayed by one step! Do you think that is normal?
    Thank you

  241. Avatar
    Hoon June 14, 2021 at 4:14 pm #

    Hello Jason,
    it’s a really helpful article. Thank you.

    I have one question.
    I learned that LSTMs are commonly trained on long-term data to predict future values.
    For example, using house power consumption data from 2000 to 2020 to predict house power consumption for 2021.

    But how can I train an LSTM if I have 10 or more short time series?
    For example, to predict the flight trajectory of a UAV, I generated 10 short flight recordings.
    How can I train an LSTM with this data?

  242. Avatar
    Said June 14, 2021 at 6:54 pm #

    Thank you so much, Jason. This problem has been an obstacle for us during the last two months and you have enlightened us with your response. We really appreciate it.

  243. Avatar
    hatty June 15, 2021 at 11:44 pm #

    Hello sir,

    no train/test percentage is given
    (e.g. 80% train, 20% test).
    What are the percentages in this tutorial?

  244. Avatar
    Peter June 21, 2021 at 8:36 pm #

    I would be interested in how to train a single model on the time series of multiple products with a single feature (e.g. number of sales, as in e-commerce product sales) and then make predictions for each new product.

  245. Avatar
    Kola June 22, 2021 at 4:17 pm #

    Hello everyone.
    Can anyone explain this slicing? I think it is by row slicing but I do not understand because the csv dataset doesn’t have such.
    train, test = data[1:-328], data[-328:-6]

    Alternatively, I used iloc and it worked, but I am currently faced with a ValueError:
    ValueError: could not convert string to float: ‘MinTemp(degC)’
    “MinTemp(degC)” is one of my column headers. So I am of the opinion that iloc may have compatibility issues with the split_dataset(data) function. I may be wrong.
    Please, help!!!

  246. Avatar
    Irene June 24, 2021 at 2:28 pm #

    Thank you for your informative guide.
    I saw the post ‘how-to-develop-lstm-models-for-time-series-forecasting’ also.

    Is there any difference between “Encoder-Decoder LSTM Model With Multivariate Input” and “Multiple Input Multi-Step Output”, other than the time stamp?

    If I understood correctly, both deal with multivariate input and multi-step prediction.

    When I do multivariate time series forecasting over a 30-day period (long-term forecasting), which is the more appropriate option?

    Thank you in advance for taking the time to answer.

    • Avatar
      Jason Brownlee June 25, 2021 at 6:10 am #

      You’re welcome.

      Yes, I believe you are referring to two different types of models. Perhaps evaluate each on your dataset and discover what works well or best.

  247. Avatar
    Shebiimalik June 30, 2021 at 2:56 am #

    Hi, Jason!

    Can you explain the main difference between CNN-LSTM and Encoder-Decoder CNN-LSTM?
    I am a little confused about them. In both of them, a CNN is used for feature extraction and an LSTM uses the extracted features.
    What is the actual difference between them?
    I am waiting for your response.

    • Avatar
      Jason Brownlee June 30, 2021 at 5:22 am #

      They are different model architectures. They may have similar or different performance on a problem, it really depends.

      The encoder-decoder explicitly tries to encode the input to an internal representation, then decode it for an output. It is designed for seq2seq problems. The CNN-LSTM is more general: it does not use the same encode-decode approach and is not specific to seq2seq.

      Use whatever works best for your dataset.

  248. Avatar
    shebiimalik June 30, 2021 at 6:33 pm #

    Hi Jason,
    first of all, thanks for your answer.

    Here is the paper link. They used a CNN-LSTM on the Household dataset and got results.
    They are using a CNN-LSTM on seq2seq data like Household power consumption.
    Can you explain how they use a CNN-LSTM on this?
    I am confused about both of them.
    Please explain.
    I am waiting for your kind response.
    Thanks in advance.

    • Avatar
      Jason Brownlee July 1, 2021 at 5:02 am #

      I’m not familiar with that paper and don’t have the capacity to review it for you, sorry.

      There is an example of a CNN-LSTM above, perhaps you can start with that for your project.

  249. Avatar
    Abdelrahman Shouman July 5, 2021 at 11:08 pm #

    Hello Jason,

    Thanks for all your effort.

    I am trying the first example (LSTM Model With Univariate Input and Vector Output) and
    I was wondering if you have a list of package versions used in this example.

    When I tried to run the example code, I got the following error

    NotImplementedError: Cannot convert a symbolic Tensor (lstm_2/strided_slice:0) to a numpy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported
    (Stackoverflow ref:

    I looked it up and it seems downgrading python to 3.8.5 and numpy to version 1.19.5 (or lower) should help. However, doing so would mean losing access to numpy.split function which seems to have been introduced in the recent versions.

    Any chance I can know which versions you used to get the example to run?

  250. Avatar
    Deepak Verma July 22, 2021 at 9:23 pm #

    Hello Jason,

    In the Encoder-Decoder LSTM Model With Multivariate Input, how or where in the code are we telling the model which variable is the target (the one we are predicting) and which are independent features?

    • Avatar
      Jason Brownlee July 23, 2021 at 5:58 am #

      Inputs to the model are independent variables, the output is dependent.

      • Avatar
        Deepak Verma July 23, 2021 at 3:08 pm #

        Thank you Jason for reply.

        But where are we specifying that this column should be the output variable, i.e. the column for which we are making predictions?

  251. Avatar
    Deepak Verma July 27, 2021 at 11:36 pm #

    Hello Jason,

    For multivariate input and multi-step output, once the model is trained and saved,

    the model is loaded for prediction.

    Now, when predicting the next day, can we give the model the next day’s values of the other features (other than the target variable, as they are known to us) as input, together with the last day’s input (including all features), to predict the next day’s target variable?

    If yes, how?

    • Avatar
      Jason Brownlee July 28, 2021 at 5:27 am #

      The model must take the same features as input when making a prediction as were used when training the model.

      • Avatar
        Deepak Verma August 16, 2021 at 9:18 pm #

        Thank you Jason.

        But my doubt is this: suppose I’m predicting the target variable for the next day and I know tomorrow’s values of the other variables. How do I give the model the next day’s values of the other variables, as well as the previous day’s values, to predict the target variable?

        For example, I’m predicting the next day’s electricity demand and I know the weather forecast for the next day. How do I use that weather forecast to predict the next day’s electricity demand?

        I hope my doubt is clear to you.

        • Adrian Tam
          Adrian Tam August 17, 2021 at 7:46 am #

          Simply speaking, just align the data and feed into the model for training and you will get it fit. If you’re using pandas dataframe, you can check out the shift() function which can help you move the data one day forward. So try to do that in your preprocessing step before fitting your model.
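
A short pandas sketch of this kind of alignment, using made-up column names (`demand`, `temp`) and values: shifting a column by -1 places tomorrow's value on today's row, so a known next-day forecast can sit alongside today's observations as an extra input feature.

```python
import pandas as pd

# toy daily data; the column names and values are illustrative assumptions
df = pd.DataFrame(
    {'demand': [10.0, 12.0, 11.0, 13.0], 'temp': [20.0, 22.0, 19.0, 21.0]},
    index=pd.date_range('2021-01-01', periods=4, freq='D'),
)

# bring the next day's temperature (e.g. a weather forecast) onto the current row
df['temp_next_day'] = df['temp'].shift(-1)
# the value to predict: the next day's demand
df['demand_next_day'] = df['demand'].shift(-1)

# the final row has no "next day", so it is dropped before training
aligned = df.dropna()
print(aligned)
```

The same shifted column must then be supplied at prediction time, since a model needs the same input features it was trained on.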

  252. Avatar
    Kadhem Al-Daffaie July 29, 2021 at 3:09 am #

    Hi Jason,
    Many thanks for your great efforts.

    Could you please tell me how to get the “actual” & “predicted” values in the above codes?
    I need to calculate some other evaluating metrics.

    Many thanks in advance.

    • Avatar
      Jason Brownlee July 29, 2021 at 5:13 am #

      Actual values are the data from the test set; predicted values come from calling model.predict().

      Sorry, I cannot prepare custom code for you.

  253. Avatar
    Efrem Obsie August 7, 2021 at 1:38 am #

    Thank you for your great blog, Jason. I would like to know how to develop a time series forecasting model for irregularly sampled data, such as clinical data, in which the time steps are not uniform. Could you please suggest a deep learning or machine learning model that can handle this type of time series data?

  254. Avatar
    Anna Kaplan August 21, 2021 at 6:24 pm #

    Hi Jason,

    This is a great article and has helped me a lot with my Masters thesis I am working on.

    I would love to get your thoughts on how I should approach using a LSTM for a time series problem where I have repeated measures.

    In particular; I have approximately 2000 individuals in my dataset and for each individual I have 5 years worth of quarterly water consumption. How would you recommend I include the unique identifier into my LSTM model?

    I could group these households by suburbs for example and decrease the unique identifiers to about 200 suburbs. However, I am still not particularly sure how to include the unique suburb identifier as including it as a categorical variable would require one-hot-encoding of the variable, and this doesn’t seem like the best way to deal with the attribute.

    If any of your books touch on this please let me know and I am more than happy to purchase it.

    • Adrian Tam
      Adrian Tam August 23, 2021 at 4:55 am #

      If you consider the suburbs to be separate with no similarity at all, you may want to build 200 different LSTM models and train each of them separately. That should be the easiest way to proceed. But I don’t think there should be as many as 200. Maybe group them into some broader categories?

  255. Avatar
    NewbieData August 22, 2021 at 11:43 am #

    Hi Jason. May I just clarify that in this example, the final output (the graph) is a graph showing the RMSE? So in order for me to know what is my predicted value, I will need to call model.predict separately?

    • Adrian Tam
      Adrian Tam August 23, 2021 at 5:16 am #

      Yes, the graph is showing RMSE. Usually model.predict() can be used to find the prediction based on input, which can also be your training data.

  256. Avatar
    podi August 27, 2021 at 5:42 pm #

    Hi Jason, thank you for your great article.

    I would like to ask: in the function fill_missing(values), what does 60 refer to?

    I’m so sorry for the lack of my understanding. I really appreciate your time to answer.

    • Adrian Tam
      Adrian Tam August 28, 2021 at 4:03 am #

      There are 60 minutes in one hour.

  257. Avatar
    Heidy August 28, 2021 at 2:57 pm #

    Hi, Jason thank you for your efforts to make this great article.
    I tried to run this tutorial with a dataset that contains raw historical data from sensor network traffic. Unfortunately, I could not take as much as your mentioned dataset in this tutorial, so the dataset that I used was only an almost 3 month period.

    The daily resampled dataset has a shape of (86,6). I divided it into train (70,6) and test (14,6), and then split those into weeks: train (10,7,6) and test (2,7,6). I set n_input = 7 and n_output = 7.

    I am not sure whether what I did was correct, maybe not, because when I fit the univariate LSTM model I got the error “Input contains NaN, infinity or a value too large for dtype(‘float32’).”. I also checked whether the dataset contains NaN values and the check returned False.

    What should I do, Jason? Could you explain to me why NaN values have occurred?

    I am in the early stages of learning by myself in machine learning. Could you give me some hints to resolve it?

    Any thoughts I really appreciate it.
    Big thanks in advance.

    • Adrian Tam
      Adrian Tam August 28, 2021 at 11:01 pm #

      First you need to check if your data has any NaN. Some numpy or pandas function can do this. Then, you can think about how to fill in those NaN, e.g., fill it with zero? fill it with previous value? fill it with mean?
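
As an illustration of such a check, the sketch below counts both NaN and infinite entries (the error message mentions both) and then fills the gaps with the previous value. The column name `traffic` and the toy values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# toy data containing one NaN and one infinite value
values = pd.DataFrame({'traffic': [1.0, np.nan, 3.0, np.inf, 5.0]})

# treat infinities as missing too, since they trigger the same error
values = values.replace([np.inf, -np.inf], np.nan)
n_missing = int(values.isna().sum().sum())
print('missing or infinite entries:', n_missing)

# fill with the previous value; bfill covers a gap at the very start
filled = values.ffill().bfill()
print(filled['traffic'].tolist())
```

Forward filling is only one of the options mentioned above; filling with zero or with the column mean follows the same pattern with `fillna`.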

  258. Avatar
    Sandy November 19, 2021 at 4:12 am #

    Hello Jason,

    Thanks for a wonderful post. Can you please help me clear up some confusion?

    Can you explain the line for me,

    “The LSTM layer is followed by a fully connected layer with 200 nodes that will interpret the features learned by the LSTM layer.”

    Does this refer to,
    model.add(Dense(100, activation='relu'))
    If yes, then there are 100 units, right? Why did you say 200 nodes?


    “Finally, an output layer will directly predict a vector with seven elements, one for each day in the output sequence”

    Does this refer to,


    model.compile(loss='mse', optimizer='adam')

    What is this final line for,
    model.compile(loss='mse', optimizer='adam')?

    • Adrian Tam
      Adrian Tam November 19, 2021 at 10:35 am #

      Thanks Sandy, it should say 100 nodes. It is corrected now.

      The “Finally, …” line refers to Dense(n_outputs); n_outputs=7, meaning one output for each day in a week. The compile() line sets up the loss function and the gradient descent optimizer for this neural network.

  259. Avatar
    Mutasem Jarrah November 24, 2021 at 3:14 pm #

    Thanks a lot for your efforts Dr. Jason
    I have question about following:

    train = array(split(train, len(train)/7))
    train, test = data[1:-328], data[-328:-6]

    What do you mean by -328, -6?

    Best Regards

    • Adrian Tam
      Adrian Tam November 25, 2021 at 3:34 am #

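      That’s Python’s way of indexing arrays from the end. “a[-328:]” means the last 328 elements of the array “a”, and “a[-328:-6]” runs from the 328th element from the end up to, but not including, the 6th element from the end.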
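
A runnable sketch of the slicing in question, assuming for illustration an array of 1,442 daily rows with 8 columns (the array contents here are dummy numbers). It shows that both slices contain a whole number of weeks, which is what allows them to be split into [weeks, 7, features].

```python
import numpy as np

# assumption: 1,442 daily rows with 8 columns, filled with dummy values
data = np.arange(1442 * 8, dtype=float).reshape(1442, 8)

# rows 1 .. (end - 329) for training, rows (end - 328) .. (end - 7) for testing
train, test = data[1:-328], data[-328:-6]
print(train.shape, test.shape)

# both lengths are multiples of 7, so each can be split into whole weeks
train_weeks = np.array(np.split(train, len(train) // 7))
test_weeks = np.array(np.split(test, len(test) // 7))
print(train_weeks.shape, test_weeks.shape)

# a single negative index selects one row; a slice selects a range of rows
single_row = data[-328]
last_rows = data[-328:]
print(single_row.shape, last_rows.shape)
```

The slice works on row positions, not on labels, so it applies to a NumPy array such as `dataset.values`; text columns or header rows left in the array would cause conversion errors later.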

  260. Avatar
    Sandy December 24, 2021 at 7:09 am #

    Hello Jason,

    Computing the R-squared value resulted in a very low score for me. Can we say that the CNN-LSTM model is better than the others based only on a comparison of RMSE values?

    If R-squared is low anyway, can we say that this model is good enough?
    Or is R-squared not that important when dealing with time series forecasting?

    Thank you,

    • Avatar
      James Carmichael December 26, 2021 at 8:15 am #

      Hi Sandy…It is often recommended that when comparing model performance, that model RMSE be compared to naive and classical statistical performance as a baseline. If the RMSE is better than naive or classical statistical methods then the model performance is considered “good”.
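
A minimal sketch of such a baseline comparison, under stated assumptions: the data is random toy data, the naive method is weekly persistence (each week is forecast as a copy of the week before), and `model_forecast` is a stand-in for real model output rather than a result from this tutorial.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error over all elements."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# toy data: 10 weeks of 7 daily totals
rng = np.random.default_rng(0)
weeks = 1000.0 + 50.0 * rng.standard_normal((10, 7))

actual = weeks[1:]             # weeks 2..10 are the ones being forecast
naive_forecast = weeks[:-1]    # persistence: repeat the previous week
model_forecast = actual + 5.0  # stand-in for a model that is off by 5 everywhere

naive_rmse = rmse(actual, naive_forecast)
model_rmse = rmse(actual, model_forecast)
print('naive RMSE: %.3f, model RMSE: %.3f' % (naive_rmse, model_rmse))
print('model beats naive baseline:', model_rmse < naive_rmse)
```

The comparison only means something when both forecasts are scored on exactly the same test weeks with the same metric.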

      The following resources will provide more insight into establishing metrics of performance of training machine learning models:

      • Avatar
        Sarik Dhungel December 31, 2021 at 3:05 pm #

        Thank you James.

        So, even if R-squared is low, can we say that the model is good because its RMSE value is lower than that of classical approaches?

        Say, for instance, R-squared is 0.4 for the classical approach and 0.5 for the new approach. This means the new approach is better than the classical one, but 0.5 is still very low for an R-squared value.

        Can you please share some insights on this?

        • Avatar
          James Carmichael January 1, 2022 at 12:16 pm #

          You are very welcome Sarik! In general, the model with the lowest RMSE during training and testing may be considered a “better” model at least for the datasets used during training and testing, however there are also other ways to measure performance, such as convergence rate and most importantly how well a model performs on data never seen by the network.

          The following may be of some benefit in understanding how to use learning curves to compare model performance.


  261. Avatar
    Gilles January 11, 2022 at 6:01 am #

    Hi, great thread.

    Quick question – is there a reason why you are not doing any feature scaling? I have been reading various articles and it seems people scale the features (for instance between 0-1) as it allows the model to learn more “efficiently”
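
For reference, a sketch of what 0-1 scaling could look like for data shaped [weeks, days, features], written in plain NumPy. The helper names `fit_min_max` and `apply_min_max` and the toy arrays are hypothetical. The minimum and maximum are taken from the training data only and then reused for the test data, so test values can fall outside 0-1.

```python
import numpy as np

def fit_min_max(train):
    """Per-feature minimum and maximum, computed on the training data only."""
    flat = train.reshape(-1, train.shape[-1])
    return flat.min(axis=0), flat.max(axis=0)

def apply_min_max(data, lo, hi):
    """Scale each feature with the training range; constant features are left unscaled."""
    span = np.where(hi > lo, hi - lo, 1.0)
    return (data - lo) / span

# toy arrays shaped [weeks, days, features]
train = np.array([[[0.0, 10.0], [5.0, 20.0]], [[10.0, 30.0], [2.0, 40.0]]])
test = np.array([[[20.0, 25.0], [5.0, 10.0]]])

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(train_scaled.min(), train_scaled.max())
print(test_scaled)
```

If the target variable is scaled as well, predictions have to be mapped back with the inverse transform before computing RMSE in the original units.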

  262. Avatar
    Alex January 17, 2022 at 11:46 pm #

    Hey, very interesting topic.
    I’m trying to figure out all the steps in the Encoder-Decoder LSTM Model With Multivariate Input section. In particular, I don’t understand why you implemented walk-forward validation in that way. From reading this post and the one here, my understanding is that walk-forward validation trains several models, each time with different portions of the training and test sets (at each iteration the training set becomes one week larger while the test set is always one week ahead). In the code above it looks like you trained just one model and always computed the RMSE with that same trained model (just predicting the next week using the test set). So you didn’t train several models. I therefore want to ask what walk-forward validation actually is, and why you used it differently from how you explained it.
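
To make the evaluation loop under discussion concrete, here is a rough sketch of that variant of walk-forward validation: a fixed forecasting function is called once per test week, and the true week is added to the history afterwards, so the available input moves forward while nothing is refit. The names `walk_forward` and `repeat_last_week` are hypothetical, and the naive forecaster stands in for a trained model.

```python
import numpy as np

def walk_forward(train_weeks, test_weeks, forecast_fn):
    """Forecast each test week from the history seen so far, without refitting."""
    history = [week for week in train_weeks]
    predictions = []
    for week in test_weeks:
        # forecast the coming week using only data observed up to now
        predictions.append(forecast_fn(history))
        # the real observation then becomes available for the next forecast
        history.append(week)
    return np.array(predictions)

def repeat_last_week(history):
    """Stand-in forecaster: predict the coming week as a copy of the last one."""
    return np.array(history[-1])

train_weeks = np.array([[1.0] * 7, [2.0] * 7])
test_weeks = np.array([[3.0] * 7, [4.0] * 7, [5.0] * 7])
predictions = walk_forward(train_weeks, test_weeks, repeat_last_week)
print(predictions[:, 0])
```

A stricter variant would refit the model inside the loop each time the history grows; that costs one training run per test week, which can be expensive for neural networks.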

  263. Avatar
    Konstantinos February 13, 2022 at 11:43 am #

    Let’s say that I have a multivariate (8 variables), multi-step (24 steps ahead) problem. For step 1, I use the last 24 actual observations of the 8 variables to predict one of them (let’s say the first of the eight).
    In the 2nd step I will again use the last 24 observations, but 23 are actual and one is predicted (the first variable, which I predicted). What about the other 7 variables?

  264. Avatar
    kostas February 13, 2022 at 9:50 pm #

    Dear Jason

    I have a question about the Encoder-Decoder LSTM Model With Multivariate Input example.
    To predict the power consumption of the 2nd day in a specific week, is the forecasted power consumption value of the 1st day used by the algorithm?

    • Avatar
      James Carmichael February 14, 2022 at 12:33 pm #

      Hi Kostas…Yes, each previous time step is critical to the forecast of the future values. I would also recommend applying CNNs to the same prediction problem and comparing results to deepen your understanding and confidence in your models.

  265. Avatar
    frank February 16, 2022 at 6:18 am #

    Why do you compute the overall RMSE in this way?

    # calculate overall RMSE
    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col]) ** 2
    score = np.sqrt(s / (actual.shape[0] * actual.shape[1]))

    If I compute the RMSE in this way:
    y = 0
    for row in range(actual.shape[0]):
        y += np.sqrt(mean_squared_error(actual[row, :], predicted[row, :]))
    score3 = y / actual.shape[0]

    the result is different. Are both ways right?
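
A small worked example, with made-up numbers, of why the two calculations disagree. The first pools every squared error and takes one square root; the second takes a square root per row and then averages. Because the square root is concave, the average of per-row RMSEs can never exceed the pooled RMSE, and the two are equal only when every row has the same mean squared error.

```python
import numpy as np

# toy values: the first week is predicted perfectly, the second is not
actual = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
predicted = np.array([[1.0, 2.0, 3.0], [2.0, 5.0, 8.0]])

squared_error = (actual - predicted) ** 2

# pooled RMSE: one square root over all samples and days
pooled = float(np.sqrt(squared_error.mean()))

# per-row RMSE, then the average of those values
per_row = np.sqrt(squared_error.mean(axis=1))
mean_of_rows = float(per_row.mean())

print('pooled RMSE: %.4f' % pooled)
print('mean of per-row RMSE: %.4f' % mean_of_rows)
```

Both are legitimate summaries, but they answer different questions, so a score should only be compared with others computed the same way.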

  266. Avatar
    frank February 16, 2022 at 6:21 am #

    Hi, why do you compute the overall RMSE in this way?

    s = 0
    for row in range(actual.shape[0]):
        for col in range(actual.shape[1]):
            s += (actual[row, col] - predicted[row, col]) ** 2
    score = np.sqrt(s / (actual.shape[0] * actual.shape[1]))

    Can I compute it in this other way? Are both ways right?

    y = 0
    for row in range(actual.shape[0]):
        y += np.sqrt(mean_squared_error(actual[row, :], predicted[row, :]))
    score3 = y / actual.shape[0]

  267. Avatar
    frank February 16, 2022 at 6:21 am #

    the results are obviously different

    • Avatar
      James Carmichael February 16, 2022 at 11:21 am #

      Hi Frank…Thank you for the feedback. Do you have a specific question that I may help address?

  268. Avatar
    Maria April 3, 2022 at 8:39 am #

    Hi, many thanks for your useful article. My question is whether the walk-forward method implemented in this code is an expanding walk-forward or a sliding walk-forward model?

    • Avatar
      James Carmichael April 3, 2022 at 10:27 am #

      Hi Maria…It is a sliding walk-forward model.

  269. Avatar
    Chandra Sekhar Vorugunti April 6, 2022 at 1:04 am #

    Hi Jason, thanks a lot for a wonderful resource. I have an irregular time series. I need to predict the next 10, 20, 30 steps. Please suggest which technique or model will be useful.

    • Avatar
      James Carmichael April 15, 2022 at 7:22 am #

      Hi Chandra…

      Some time series data is discontiguous.

      This means that the interval between the observations is not consistent, but may vary.

      You can learn more about contiguous vs discontiguous time series datasets in this post:

      Taxonomy of Time Series Forecasting Problems

      There are many ways to handle data in this form and you must discover the approach that works well or best for your specific dataset and chosen model.

      The most common approach is to frame the discontiguous time series as contiguous and treat the observations at the newly added time steps as missing (e.g. a contiguous time series with missing values).

      Some ideas you may want to explore include:

      Ignore the discontiguous nature of the problem and model the data as-is.
      Resample the data (e.g. upsample) to have a consistent interval between observations.
      Impute the observations to form a consistent interval.
      Pad the observations to form a consistent interval and use a Masking layer to ignore the padded values.
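
      The resample/impute ideas in the list above can be sketched with plain NumPy. This is only an illustration under assumptions: the observation times are numeric and strictly increasing, linear interpolation is acceptable for the data, and the function name and sample values are hypothetical.

```python
import numpy as np

def to_regular_grid(times, values, step):
    """Linearly interpolate irregular observations onto an evenly spaced grid."""
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    # Grid runs from the first to the last observation time, inclusive
    grid = np.arange(times[0], times[-1] + step / 2, step)
    return grid, np.interp(grid, times, values)

# Hypothetical observation times (hours) and readings
times = [0.0, 1.0, 2.5, 6.0]
values = [10.0, 12.0, 15.0, 8.0]
grid, regular = to_regular_grid(times, values, step=1.0)
```

      The resulting evenly spaced series can then be windowed like any other. Long gaps are filled with straight lines, which may or may not be reasonable for a given problem.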

  271. Avatar
    mohamed mahmoud April 20, 2022 at 6:24 pm #

    I am working on this dataset with a CNN-LSTM. I have a serious issue with how to combine the CNN and the LSTM in a single model, and with the input shape.

  272. Avatar
    Ali Khan April 21, 2022 at 1:27 am #

    Hi, first of all, thank you so much for such a nice article. I was wondering whether it is possible to handle a variable-sized window for time series forecasting, such as:

    Predict t(1) given t(0) … **window size = 1**
    Predict t(2) given t(0) + t(1) … **window size = 2**
    Predict t(3) given t(0) + t(1) + t(2) … **window size = 3**
    Predict t(4) given t(0) + t(1) + t(2) + t(3)… **window size = 4**
    Predict t(N) given t(0) + t(1) + t(2) + t(3) … t(N-1) … **window size = N**

    So there isn’t any fixed window. Is it possible to do this and, if so, how?

    I would be grateful for any advice.
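
    One possible way to frame the growing window described above is to pad every sample to a common length, in line with the padding-plus-Masking idea mentioned in an earlier reply. The sketch below is an assumption about how that could look, not something taken from the tutorial; the function name, `max_len`, and `pad_value` are hypothetical.

```python
import numpy as np

def expanding_windows(series, max_len, pad_value=0.0):
    """One sample per step: the values before step t, left-padded to max_len."""
    series = np.asarray(series, dtype=float)
    X, y = [], []
    for t in range(1, len(series)):
        past = series[max(0, t - max_len):t]   # keep at most max_len past values
        padded = np.full(max_len, pad_value)
        padded[max_len - len(past):] = past    # real values sit at the end
        X.append(padded)
        y.append(series[t])
    # [samples, timesteps, features], the layout LSTM layers expect
    return np.array(X).reshape(-1, max_len, 1), np.array(y)

X, y = expanding_windows([1.0, 2.0, 3.0, 4.0], max_len=3)
```

    If a Keras Masking layer is used with `mask_value` equal to `pad_value`, choose a pad value that cannot occur in the real data, since genuine zeros would otherwise be masked too.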

  273. Avatar
    tiago April 23, 2022 at 1:30 am #

    Hi, thank you so much for this fantastic article!

    But I have a question.

    When you split the data into train and test sets, why do you use this:
    train, test = data[1:-328], data[-328:-6]

    Why 328? I don’t understand.

    • Avatar
      James Carmichael April 24, 2022 at 3:32 am #

      Hi Tiago…The value is used to specify how far back from the end of the full dataset to select. The following resource may help clarify:

      • Avatar
        tiago April 26, 2022 at 7:57 pm #

        Thanks for the answer!

        I read the link and now understand how indexing and slicing work in Python.
        But I used my dataset and when I call the split_dataset function I get this error:
        “array split does not result in an equal division”.
        How can I fix it?
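
        That NumPy error usually means the number of rows passed to `split` is not an exact multiple of the number of sections. A hedged sketch of one workaround is to trim leftover days before reshaping into weeks. It assumes a 2D array of daily rows and aligns weeks to the end of the data, not to calendar weeks as the tutorial's fixed offsets appear to do; the function name and sizes are hypothetical.

```python
import numpy as np

def split_into_weeks(data, n_test_weeks, days_per_week=7):
    """Drop the oldest leftover days, then split into train and test weeks."""
    data = np.asarray(data)
    usable = (len(data) // days_per_week) * days_per_week
    data = data[len(data) - usable:]   # trim so the rows divide evenly
    weeks = data.reshape(-1, days_per_week, data.shape[1])
    return weeks[:-n_test_weeks], weeks[-n_test_weeks:]

# Hypothetical dataset: 100 daily rows, 3 features
data = np.arange(300, dtype=float).reshape(100, 3)
train, test = split_into_weeks(data, n_test_weeks=4)
```

        With 100 rows, the two oldest days are dropped, leaving 14 whole weeks.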

  274. Avatar
    Golam Kibria April 25, 2022 at 6:40 pm #

    Hello! Can you show me how to plot actual vs. predicted values on a graph?

  276. Avatar
    SreeY April 26, 2022 at 9:20 pm #

    Hi Jason, thank you for the helpful tutorials. I am trying a multi-step, multivariate LSTM using a timeseries generator (TSG). The TSG generates X=(32, 32, 4) ==> Y=(32, 4) to represent 32 lag variables for 4 features used to predict 32 future values. Can you please review the LSTM model definition below and let me know if it is correct?

    I can’t seem to get past an error: tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes

    –Thanks much

    rnn = Sequential()
    rnn.add(LSTM(units=128, activation="tanh",
                 recurrent_activation="sigmoid", return_sequences=True,
                 batch_input_shape=(32, 32, 4), stateful=True))
    for k in [True, False]:
        rnn.add(LSTM(units=128, return_sequences=k, activation="tanh",
                     recurrent_activation="sigmoid", stateful=True))
    early_stopping = EarlyStopping(monitor='loss', patience=1)
    rnn.compile(loss='mean_squared_error', optimizer='adam'), epochs=1, verbose=1, shuffle=False,

  277. Avatar
    tiago April 27, 2022 at 3:03 am #

    Sorry, but I have another question.
    If I want to get the seven predicted values for the next week using the code from the section “LSTM Model With Univariate Input and Vector Output”, what should I do?

  278. Avatar
    tiago April 27, 2022 at 7:21 pm #

    How can I see the predicted values for the next week?

    I have already read this tutorial, and I have a question:

    What will my X be (using the data in this tutorial) when I call model.predict(X, verbose=0)?
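
    A hedged sketch of how X could be built for this: take the most recent `n_input` days of the variable being forecast and shape them as one sample. It assumes the history is stored as [weeks, days, features] with the target in column 0, as the tutorial's data preparation suggests; the function name and the toy history are hypothetical.

```python
import numpy as np

def last_window(history, n_input):
    """Return the most recent n_input daily values shaped [1, n_input, 1]."""
    data = np.array(history)
    # Flatten [weeks, days, features] into [days, features]
    data = data.reshape((data.shape[0] * data.shape[1], data.shape[2]))
    # Column 0 is assumed to hold the variable being forecast
    input_x = data[-n_input:, 0]
    return input_x.reshape((1, len(input_x), 1))

# Hypothetical history: 3 weeks, 7 days, 2 features
history = np.arange(42, dtype=float).reshape(3, 7, 2)
X = last_window(history, n_input=7)
```

    With a fitted univariate model, `model.predict(X, verbose=0)[0]` would then give the forecast for the week that follows the last day in `history`. To forecast beyond the dataset, the history should include every week observed so far.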

  279. Avatar
    Faisal Mohammad May 12, 2022 at 5:06 pm #

    Hope you are well.
    How do I get started with multivariate data when using a ConvLSTM?

  280. Avatar
    Belle August 13, 2022 at 9:14 pm #

    Hi Jason, Thank you for the helpful tutorials. I have 1 question.

    What is “history” in “def forecast(model, history, n_input)”?

  281. Avatar
    Belle August 13, 2022 at 9:16 pm #

    CNN-LSTM Encoder-Decoder Model With Univariate Input

  282. Avatar
    Gimba Usman September 16, 2022 at 7:46 am #

    This was a really great tutorial. My question is about small datasets where we do not have enough observations. For example, my dataset has monthly readings spanning 16 years, totaling 180 data points. Which machine learning model would best forecast the future for me, if possible 3 years ahead? I also saw in your code that you flatten your dataset to increase the number of samples during training.
    Code snippet:
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))

    Can you please explain this in more detail?
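
    On the flatten step quoted above: as far as I can tell, the reshape itself adds no data. It only merges the week and day axes so that windows can start on any day rather than only on week boundaries, and that is where the extra training samples come from. A small sketch with made-up sizes (the window-count formula assumes a stride of one day and that a window ending exactly at the last day is kept):

```python
import numpy as np

# Hypothetical training array: 4 weeks, 7 days per week, 8 features
train = np.arange(4 * 7 * 8, dtype=float).reshape(4, 7, 8)

# Same call as in the quoted snippet: merge the week and day axes
data = train.reshape((train.shape[0] * train.shape[1], train.shape[2]))

# Sliding a 7-day input / 7-day output window one day at a time
n_input, n_out = 7, 7
n_windows = len(data) - (n_input + n_out) + 1
```

    Here 4 non-overlapping weeks would give only 3 input/output pairs, while the flattened 28 days give 15 overlapping ones.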

  283. Avatar
    Jean October 13, 2022 at 3:35 am #

    Dear Jason, thank you for your article; it is really interesting. From your experience, which time scales can be predicted with these methods? I assume it depends on the training set. Let’s say we have 1 year of data for training. Do you know of any papers on this topic? Thank you. Best,

  284. Avatar
    Ernesto October 26, 2022 at 9:00 am #

    Hi Jason,
    Is it possible to predict several future points (t+1, t+2, …, t+n) of a single output variable based on input variables that are known only up to time t?

    I understand that there are models with one or more input variables that feed the single output variable back in as a new input (as in your article, where the prediction for one hour is re-entered to predict the next one). But how does this work when you have more than one input variable and you do not have their future values (t+1, t+2) to re-enter, only the predicted values of the single output variable? Can an LSTM predict N steps past step t=0, where it has the last known values of all the input variables?

  285. Avatar
    Karim October 28, 2022 at 2:19 am #

    Thanks for the great content! If you have daily data but would like to predict sum per month, how would you go forward?
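
    Two options that seem reasonable here are to aggregate the daily series to monthly totals first and model the monthly series directly, or to forecast daily values and sum the forecasts per month. Either way the aggregation step is the same. A NumPy-only sketch, with a hypothetical function name and made-up data:

```python
import numpy as np

def monthly_totals(dates, values):
    """Sum daily values into calendar months."""
    # Truncate each date to its month, e.g. 2023-01-17 -> 2023-01
    months = np.asarray(dates, dtype="datetime64[D]").astype("datetime64[M]")
    labels, index = np.unique(months, return_inverse=True)
    totals = np.bincount(index, weights=np.asarray(values, dtype=float))
    return labels, totals

# Hypothetical daily series: 1 unit per day for January and February 2023
dates = np.arange("2023-01-01", "2023-03-01", dtype="datetime64[D]")
labels, totals = monthly_totals(dates, np.ones(len(dates)))
```

    Which option forecasts better is an empirical question for the dataset at hand.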

  286. Avatar
    C R December 10, 2022 at 7:03 am #

    Thanks for taking the time to post this! I still don’t very well understand how well this works, though. What is an example of a naive method? A linear regression? An average of past data? And what kinds of other models do roughly as well as this? If I were to fit a quadratic equation to the data, for example, would that give me a bump of 100 on RMSE to give a “skillful” model, as you’ve gotten with this DNN? My experience with DNNs is that they are bad at regression tasks, good at classification tasks, which is a little surprising since both outputs of the model are some mix of continuous and discrete. I’d love to understand how well this model is doing at such a task.

    • Avatar
      James Carmichael December 10, 2022 at 10:50 am #

      Hi C R…You are very welcome! Perhaps you could devise a specific application that you can apply LSTMs to and we can discuss the results.
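
      On the question of what a naive method looks like: one common example is persistence, where next week's forecast simply repeats the last observed week. The sketch below evaluates it with walk-forward validation on made-up data. Whether this matches the baseline behind the tutorial's "skillful" threshold is an assumption on my part, and the function names are hypothetical.

```python
import numpy as np

def weekly_persistence(history):
    """Naive forecast: next week repeats the last observed week (column 0)."""
    return history[-1][:, 0]

def walk_forward(train, test, forecaster):
    """Forecast each test week in turn, then add the true week to the history."""
    history = [week for week in train]
    predictions = []
    for week in test:
        predictions.append(forecaster(history))
        history.append(week)   # the real observation becomes available
    predictions = np.array(predictions)
    actual = test[:, :, 0]
    return float(np.sqrt(np.mean((actual - predictions) ** 2)))

# Hypothetical data shaped [weeks, days, features]
train = np.ones((3, 7, 1))
test = np.stack([np.full((7, 1), 2.0), np.full((7, 1), 4.0)])
score = walk_forward(train, test, weekly_persistence)
```

      A learned model is only worth its complexity if it scores better than a baseline like this on the same test weeks.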

  287. Avatar
    focus January 26, 2023 at 6:20 pm #

    How can I split the data if I want to predict hourly demand rather than daily demand,
    assuming I use the original dataset?

  288. Avatar
    fah January 26, 2023 at 7:19 pm #

    how can you calculate this number ?

    thank you

    • Avatar
      James Carmichael January 27, 2023 at 10:52 am #

      Hi fah…Please elaborate and/or clarify your question so that we may better assist you.

  289. Avatar
    Gabe February 5, 2023 at 6:31 pm #


    Thank you for this tutorial and for the multiple posts on this website, they really taught me a lot about machine learning!

    I’m using the multistep model on stock price data, and my goal is to use 30 days to predict the next 15. Hence, my x data is shaped as (n_samples, 30, 1). Similarly, my y data is of shape (n_samples, 15, 1).
    When I use model.predict on validation data with shape (30, 1), it returns a list of shape (30, 15). Can I just consider the 1st vector as the prediction? What are the other ones?

    In your code, you do something similar on the forecast function:

    # forecast the next week
    yhat = model.predict(input_x, verbose=0)
    # we only want the vector forecast
    yhat = yhat[0]

    I believe that the shape of yhat is originally (7, 7), as you use 7 data points to predict the next 7. What are the other 6 vectors?

    Any link further explaining this would be of great help. Thank you once again!
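
    One plausible reading of the (30, 15) result, which I cannot confirm from the comment alone, is that the 30 rows of the (30, 1) array were treated as 30 separate samples, since Keras uses the first axis as the sample axis. The sketch below only illustrates the shapes; `toy_predict` is a stand-in I made up and is not a Keras call.

```python
import numpy as np

def toy_predict(x, n_out=15):
    """Stand-in for model.predict: maps [samples, timesteps, 1] to [samples, n_out]."""
    return np.repeat(x.mean(axis=1), n_out, axis=1)

window = np.arange(30, dtype=float).reshape(30, 1)   # one 30-day window

as_one_sample = window[np.newaxis, ...]        # [1, 30, 1]: one sample, 30 steps
as_thirty_samples = window[:, np.newaxis, :]   # [30, 1, 1]: 30 samples, 1 step

one = toy_predict(as_one_sample)
many = toy_predict(as_thirty_samples)
```

    If the input in the quoted forecast code holds a single sample, yhat has one row and `yhat[0]` simply selects that one forecast vector, so there would be no other six vectors to interpret.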

  290. Avatar
    Javier March 2, 2023 at 10:07 pm #


    Thank you for your article, it’s very useful!

    However, I have a couple of questions:

    – If we had another relevant variable for prediction that you knew in advance for the time you wanted to run your prediction (such as ‘wind generation forecast’), could you use it as input to the decoder in order to improve the prediction?

    – If you wanted to increase the depth of your model by adding more LSTM layers, what would the network architecture look like? That is, how would you decide the number of LSTM units in the decoder? Like this?

    ‘Define model’
    model = Sequential()
    input_shape = (n_timesteps, n_features)))
    ‘Internal representation’
    return_sequences = True))
    ‘Fully-connected & Output layer’
    model.add(TimeDistributed(Dense(100, activation = ‘relu’)))

    Or in this one?:

    ‘Define model’
    model = Sequential()
    input_shape = (n_timesteps, n_features)))
    ‘Internal representation’
    return_sequences = True))
    ‘Fully-connected & Output layer’
    model.add(TimeDistributed(Dense(100, activation = ‘relu’)))

    Thank you in advance for everything. I look forward to hearing from you soon!


  291. Avatar
    Russ March 21, 2023 at 12:55 pm #

    I ran the
    Encoder-Decoder LSTM Model With Multivariate Input
    and get the following results
    lstm: [1566.582] 1611.0, 1526.1, 1515.5, 1596.3, 1494.1, 1504.0, 1707.5

    which are significantly worse than the other approaches
    What am I doing wrong?

    • Avatar
      James Carmichael March 22, 2023 at 10:02 am #

      Hi Russ…Did you normalize or standardize the input data?

  292. Avatar
    Abdi March 24, 2023 at 7:03 am #

    I have a question here. Isn’t it required to consider some “zero” consumption in household energy consumption data as outliers and remove them to reach better results, or should they be kept for the correct forecasting trend? What is the best method here?

  293. Avatar
    Eva April 15, 2023 at 6:06 pm #

    Amazing work Jason! Thank you for sharing!

    • Avatar
      James Carmichael April 16, 2023 at 9:26 am #

      You are very welcome Eva! We appreciate your feedback!

  294. Avatar
    Yusuf Rufai July 9, 2023 at 6:36 pm #


    I found this tutorial useful, but I have some questions:

    1. I understand it is a multivariate dataset, but you did not mention which feature you are forecasting.
    In other words, what are your input features and your target feature?

    I am asking because I want to adapt the code to my dataset, which is for forecasting forest fires one month in advance.

    • Avatar
      James Carmichael July 10, 2023 at 1:37 am #

      Hi Yusuf…The total power is being forecasted based upon the other available measurements. The model learns from the data itself (autocorrelation) so it is not strictly based upon “input” and “target” features as is done in regression.

  295. Avatar
    Diego January 2, 2024 at 4:31 pm #


    Thanks for the tutorial. With this example I believe there is leakage of information from the train set to the test set. Another inquiry that came to my mind was if it is correct to forecast y only with x variables excluding the x variable that is autoregressive with y.


  296. Avatar
    Mei February 15, 2024 at 12:21 pm #

    Thank you for the great tutorial! I’m afraid I’m still not clear about some concepts.

    Can you explain why total power consumption is included in X in to_supervised() and then fed into build_model()? I’d assume it is a target feature and should only be in y. Similarly, why is it included in input_x in forecast()? I have read some above referenced articles on Walk Forward Validation, but it is still not clear to me.

    I am looking at the Encoder-Decoder LSTM Model With Multivariate Input example. If I know the feature I’m predicting is discrete (e.g., 1, 2, 3, 4, 5, 6, 7), is there a way I can tell my model this? If this is a classification problem, could you point me to a resource?

  297. Avatar
    Frederik March 2, 2024 at 2:07 am #

    Hey James

    Thanks for the great tutorial!

    I’m looking into making LSTM that, much like yours, predicts energy consumption.

    However, my data is hourly, and I want my model to predict the hourly demand for the entire next day, with the prediction made at 12 pm (in the middle of the day) the day before. So at 12 pm today I would have a forecast of my hourly demand for every hour of the next day. This amounts to a 36-hour prediction, but I’m only interested in the last 24 hours.

    If this makes sense to you, would you have any guidance as to how I should implement this using your code? My first attempts have been unsuccessful since I run into issues with mismatched array sizes. This probably happens because I’m still interested in the 00:00 – 00:00 demand, so I am, much like you, splitting my dataset into normal days of 24 hours. But the prediction is supposed to be made at 12 pm, which is a 12-hour shift from the normal days.

    Please let me know if you have any input on this problem.

    Best regards


    • Avatar
      James Carmichael March 2, 2024 at 1:10 pm #

      Hi Frederik…Please provide the exact verbiage of the errors you have encountered. This will enable us to better assist you.
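
      Without the exact error it is hard to say what went wrong, but the setup described above can be framed by leaving a gap between the end of the input window and the start of the target window. The sketch below is one possible framing under assumptions: the hourly series starts at 00:00, the input is the 24 hours before noon, and all names and sizes are hypothetical.

```python
import numpy as np

def noon_samples(hourly, n_input=24, gap=12, n_out=24):
    """Pair the n_input hours before 12:00 with the next calendar day's 24 hours."""
    hourly = np.asarray(hourly, dtype=float)
    X, y = [], []
    # Index 12 of each day is the forecast origin (series assumed to start at 00:00)
    for origin in range(12, len(hourly), 24):
        in_start, out_start = origin - n_input, origin + gap
        if in_start < 0 or out_start + n_out > len(hourly):
            continue   # not enough data on one side of the origin
        X.append(hourly[in_start:origin])
        y.append(hourly[out_start:out_start + n_out])
    return np.array(X).reshape(-1, n_input, 1), np.array(y)

hourly = np.arange(24 * 4, dtype=float)   # hypothetical: 4 days of hourly data
X, y = noon_samples(hourly)
```

      Because inputs and targets are cut from the flat hourly series rather than from pre-split days, the 12-hour offset no longer has to line up with day boundaries.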

  298. Avatar
    Tom March 13, 2024 at 6:43 am #

    Thank you very much for this tutorial. It is a great source of information!
    I have a question regarding the MSE and RMSE. If our data have a wide range (max/min > 10), the MSE and RMSE will barely penalize errors near the minimum. In this case, wouldn’t it be more appropriate to evaluate the MSE and RMSE in relative terms (%) rather than in absolute terms?
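
    To make the contrast concrete, here is a small sketch comparing an absolute and a relative error measure on made-up numbers. The relative version shown is root mean squared percentage error; it is one of several possible choices and is undefined wherever the actual value is zero, which matters for consumption data that can contain zeros.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error in the units of the data."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def rmspe(actual, predicted):
    """Root mean squared percentage error; undefined where actual == 0."""
    return float(np.sqrt(np.mean(((actual - predicted) / actual) ** 2)) * 100.0)

# Made-up values: both forecasts are off by 5 units
actual = np.array([10.0, 100.0])
predicted = np.array([15.0, 105.0])

absolute = rmse(actual, predicted)
relative = rmspe(actual, predicted)
```

    The absolute score treats both errors alike, while the relative score is dominated by the 50% miss on the small value.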

  299. Avatar
    Bhambho Moto April 4, 2024 at 8:30 pm #

    What is the R2 value of the prediction in this example? Please show that value, as it is more important from a research perspective.

    • Avatar
      James Carmichael April 5, 2024 at 8:32 am #

      Hi Bhambho…

      Certainly! To demonstrate how to calculate the R-squared (R²) value from an LSTM (Long Short-Term Memory) model for time series forecasting, let’s first outline the steps you need to follow:

      1. **Prepare the Time Series Data**: Split your time series data into training and testing datasets.
      2. **Normalize the Data**: LSTM models usually require input data to be normalized or standardized.
      3. **Define the LSTM Model**: Construct an LSTM model suitable for your time series data.
      4. **Train the Model**: Train the LSTM model using the training dataset.
      5. **Forecast**: Use the model to make predictions on the testing dataset.
      6. **Calculate R² Value**: Compare the predictions with the actual values in the testing dataset to calculate the R² value.

      Below is an example Python script that follows these steps. This example assumes you have a univariate time series data. Please adjust the input shape, model architecture, and preprocessing steps according to your specific dataset and problem.

      import numpy as np
      import pandas as pd
      from sklearn.preprocessing import MinMaxScaler
      from sklearn.metrics import r2_score
      from keras.models import Sequential
      from keras.layers import LSTM, Dense
      from keras.callbacks import EarlyStopping

      # Example time series data
      data = np.sin(np.linspace(0, 10*np.pi, 1000))

      # Preprocessing
      scaler = MinMaxScaler(feature_range=(0, 1))
      data_normalized = scaler.fit_transform(data.reshape(-1, 1))

      # Splitting data into training and testing
      train_size = int(len(data_normalized) * 0.8)
      test_size = len(data_normalized) - train_size
      train, test = data_normalized[0:train_size,:], data_normalized[train_size:len(data_normalized),:]

      # Convert an array of values into a dataset matrix
      def create_dataset(dataset, look_back=1):
          X, Y = [], []
          for i in range(len(dataset)-look_back-1):
              # input window of look_back values
              a = dataset[i:(i+look_back), 0]
              X.append(a)
              # target is the value right after the window
              Y.append(dataset[i + look_back, 0])
          return np.array(X), np.array(Y)

      # reshape into X=t and Y=t+1
      look_back = 1
      X_train, Y_train = create_dataset(train, look_back)
      X_test, Y_test = create_dataset(test, look_back)

      # reshape input to be [samples, time steps, features]
      X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))
      X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))

      # Define the LSTM model
      model = Sequential()
      model.add(LSTM(4, input_shape=(1, look_back)))
      model.add(Dense(1))  # single output value per sample
      model.compile(loss='mean_squared_error', optimizer='adam')

      # Fit the model
      model.fit(X_train, Y_train, epochs=100, batch_size=1, verbose=2, callbacks=[EarlyStopping(monitor='loss', patience=10)])

      # Making predictions
      train_predict = model.predict(X_train)
      test_predict = model.predict(X_test)

      # Invert predictions
      train_predict = scaler.inverse_transform(train_predict)
      Y_train_inv = scaler.inverse_transform([Y_train])
      test_predict = scaler.inverse_transform(test_predict)
      Y_test_inv = scaler.inverse_transform([Y_test])

      # Calculate R2 score
      r2_test = r2_score(Y_test_inv.flatten(), test_predict.flatten())
      print(f'Test R2 score: {r2_test:.3f}')

      This script performs the following actions:
      – Normalizes the time series data.
      – Splits the data into training and testing datasets.
      – Defines and trains an LSTM model on the training data.
      – Makes predictions on the testing data.
      – Calculates and prints the R² value, which quantifies the goodness of fit of the LSTM model predictions compared to the actual values in the testing dataset.

      Note: Ensure you have the necessary libraries installed (numpy, pandas, sklearn, keras) to run this script. Adjust the look_back parameter and the LSTM model architecture as needed for your specific dataset.

  300. Avatar
    DR Sena May 20, 2024 at 10:02 pm #

    I am using an LSTM for salinity prediction in a river system, and I am trying to use a multivariate LSTM. I have rainfall (P) data at a daily time step, and discharge (Q) and water level (WL) data every 30 minutes, covering over 20 years until December 2023. As a first step, I’m trying to predict water level (30 min) from the P and Q values, and then salinity (using P, Q and WL first forecast for my prediction dates, which span around 4 months starting June 1, 2024). How should I handle data with different time steps? Thank you in advance.


    • Avatar
      James Carmichael May 21, 2024 at 4:32 am #

      Hi DR Sena…Handling data with different time steps in a multivariate LSTM model requires careful preprocessing to ensure that the input data is aligned correctly. Here’s a structured approach to tackle this issue:

      ### Step-by-Step Approach

      1. **Resampling Data:**
      – **Rainfall (P):** Resample the daily rainfall data to match the 30-minute interval of the discharge (Q) and water level (WL) data. You can achieve this by forward-filling the daily data or using interpolation.
      – **Discharge (Q) and Water Level (WL):** Since these are already at a 30-minute interval, no resampling is needed.

      import pandas as pd

      # Example of resampling daily rainfall data to 30-minute intervals
      rainfall_data = pd.read_csv('rainfall.csv', parse_dates=['timestamp'])
      rainfall_data.set_index('timestamp', inplace=True)
      rainfall_data_resampled = rainfall_data.resample('30min').ffill() # or .interpolate()

      2. **Aligning Data:**
      – Ensure that all datasets (rainfall, discharge, and water level) are aligned on the same timestamp index after resampling.

      discharge_data = pd.read_csv('discharge.csv', parse_dates=['timestamp'])
      discharge_data.set_index('timestamp', inplace=True)

      water_level_data = pd.read_csv('water_level.csv', parse_dates=['timestamp'])
      water_level_data.set_index('timestamp', inplace=True)

      # Merge datasets
      data = pd.concat([rainfall_data_resampled, discharge_data, water_level_data], axis=1).dropna()

      3. **Creating Lag Features:**
      – Generate lag features for each variable to capture temporal dependencies.

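      def create_lag_features(df, lags):
          # Work on a copy so the caller's frame is not modified
          df = df.copy()
          for col in list(df.columns):
              for lag in range(1, lags + 1):
                  df[f'{col}_lag_{lag}'] = df[col].shift(lag)
          # The first `lags` rows have no full history, so drop them
          return df.dropna()

      data_with_lags = create_lag_features(data, lags=6) # Example with 6 lags (3 hours)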

      4. **Train-Test Split:**
      – Split the data into training and testing sets based on the timestamp.

      train_data = data_with_lags[data_with_lags.index < '2024-06-01']
      test_data = data_with_lags[data_with_lags.index >= '2024-06-01']

      5. **Preparing Data for LSTM:**
      – LSTM models expect 3D input in the form of (samples, timesteps, features). Prepare the data accordingly.

      import numpy as np

      def prepare_lstm_data(df, target_col, timesteps):
          X, y = [], []
          for i in range(len(df) - timesteps):
              X.append(df.iloc[i:(i + timesteps)].drop(columns=[target_col]).values)
              y.append(df.iloc[i + timesteps][target_col])
          return np.array(X), np.array(y)

      timesteps = 6 # Example with 3 hours
      X_train, y_train = prepare_lstm_data(train_data, target_col='water_level', timesteps=timesteps)
      X_test, y_test = prepare_lstm_data(test_data, target_col='water_level', timesteps=timesteps)

      6. **Building and Training the LSTM Model:**
      – Define and compile the LSTM model.

      from keras.models import Sequential
      from keras.layers import LSTM, Dense

      model = Sequential()
      model.add(LSTM(50, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
      model.add(LSTM(50, return_sequences=False))
      model.add(Dense(1))  # single output: the next water level

      model.compile(optimizer='adam', loss='mse')

      # Train the model
      model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test), verbose=2)

      7. **Predicting Water Levels and Salinity:**
      – Use the trained model to predict water levels, then use these predictions along with P and Q to predict salinity.

      # Predict water levels
      predicted_water_levels = model.predict(X_test)

      # Use predicted water levels along with P and Q for salinity prediction
      # Prepare data for the second LSTM model or other regression model
      salinity_data = test_data.copy()
      salinity_data['predicted_water_level'] = np.nan
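      # Assign by position in one step to avoid chained indexing; predictions are flattened to 1D
      salinity_data.iloc[timesteps:, salinity_data.columns.get_loc('predicted_water_level')] = predicted_water_levels.ravel()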

      X_salinity, y_salinity = prepare_lstm_data(salinity_data, target_col='salinity', timesteps=timesteps)

      # Train a second model for salinity prediction
      # Define, compile, and train the second model similar to the first

      ### Tips
      – **Scaling:** Ensure to scale the features appropriately using MinMaxScaler or StandardScaler.
      – **Regularization:** Consider adding dropout layers to prevent overfitting.
      – **Model Tuning:** Experiment with different architectures, number of layers, neurons, and hyperparameters.

      By following this approach, you can effectively handle the differential time steps in your data and build a robust LSTM model for your predictions. If you need further assistance with any of these steps, feel free to ask!

