Last Updated on October 21, 2020

Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables.

This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In this tutorial, you will discover how you can develop an **LSTM model for multivariate time series forecasting** with the Keras deep learning library.

After completing this tutorial, you will know:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.


Let’s get started.

- **Update Aug/2017**: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.
- **Update Oct/2017**: Added a new example showing how to train on multiple prior time steps due to popular demand.
- **Update Sep/2018**: Updated link to dataset.
- **Update Jun/2020**: Fixed missing imports for LSTM data prep example.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- Air Pollution Forecasting
- Basic Data Preparation
- Multivariate LSTM Forecast Model
  - LSTM Data Preparation
  - Define and Fit Model
  - Evaluate Model
  - Complete Example
- Train On Multiple Lag Timesteps Example

### Python Environment

This tutorial assumes you have a Python SciPy environment installed. I recommend that you use Python 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend, ideally Keras 2.3 and TensorFlow 2.2 or higher.

The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

If you need help with your environment, see this post:

## 1. Air Pollution Forecasting

In this tutorial, we are going to use the Air Quality dataset.

This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.

The data includes the date-time, the PM2.5 pollution concentration, and weather information including dew point, temperature, pressure, wind direction, wind speed, and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:

- **No**: row number
- **year**: year of data in this row
- **month**: month of data in this row
- **day**: day of data in this row
- **hour**: hour of data in this row
- **pm2.5**: PM2.5 concentration
- **DEWP**: Dew Point
- **TEMP**: Temperature
- **PRES**: Pressure
- **cbwd**: Combined wind direction
- **Iws**: Cumulated wind speed
- **Is**: Cumulated hours of snow
- **Ir**: Cumulated hours of rain

We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.

This dataset can be used to frame other forecasting problems.

Do you have good ideas? Let me know in the comments below.

You can download the dataset from the UCI Machine Learning Repository.

**Update**, I have mirrored the dataset here because UCI has become unreliable:

Download the dataset and place it in your current working directory with the filename “*raw.csv*”.

## 2. Basic Data Preparation

The data is not ready to use. We must prepare it first.

Below are the first few rows of the raw dataset.

```
No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0
```

The first step is to consolidate the date-time information into a single date-time so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.

```python
from pandas import read_csv
from datetime import datetime

# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')

dataset = read_csv('raw.csv', parse_dates=[['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0
dataset['pollution'].fillna(0, inplace=True)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
```
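Newer pandas releases deprecate the `date_parser` argument and the nested-list form of `parse_dates`, so the loading step may warn or fail depending on your version. The sketch below is one possible alternative, not the original method: it builds the index with `pandas.to_datetime()` from the year, month, day and hour columns. The inline three-row sample and the variable names `sample` and `raw` are assumptions for illustration; in practice you would read “raw.csv” instead.

```python
from io import StringIO
import pandas as pd

# Tiny inline sample in the raw.csv layout (rows copied from the listing above).
sample = StringIO(
    "No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir\n"
    "1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0\n"
    "2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0\n"
    "3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0\n"
)
raw = pd.read_csv(sample)

# to_datetime assembles timestamps from columns named year/month/day/hour
raw.index = pd.to_datetime(raw[["year", "month", "day", "hour"]])
raw.index.name = "date"

# drop the row number and the now-redundant date parts, then rename
raw = raw.drop(columns=["No", "year", "month", "day", "hour"])
raw.columns = ["pollution", "dew", "temp", "press", "wnd_dir", "wnd_spd", "snow", "rain"]

# mark NA pollution values with 0 (assignment form avoids in-place chained updates)
raw["pollution"] = raw["pollution"].fillna(0)
print(raw.head())
```

Dropping the first 24 hours and saving to “pollution.csv” would then proceed as in the listing above.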

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “*pollution.csv*”.

```
                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     0
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2     0
```

Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.

The code below loads the new “*pollution.csv*” file and plots each series as a separate subplot, except wind direction, which is categorical.

```python
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
```

Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

## 3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

### LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.

This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:

- Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
- Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.

We can transform the dataset using the *series_to_supervised()* function developed in the blog post:

First, the “*pollution.csv*” dataset is loaded. The wind direction feature is label encoded (integer encoded). This could further be one-hot encoded in the future if you are interested in exploring it.

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.

The complete code listing is provided below.

```python
# prepare data for lstm
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
```

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).

```
   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912

   var7(t-1)  var8(t-1)   var1(t)
1   0.000000        0.0  0.148893
2   0.000000        0.0  0.159960
3   0.000000        0.0  0.182093
4   0.037037        0.0  0.138833
5   0.074074        0.0  0.109658
```

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

- One-hot encoding wind direction.
- Making all series stationary with differencing and seasonal adjustment.
- Providing more than 1 hour of input time steps.

This last point is perhaps the most important given the use of Backpropagation through time by LSTMs when learning sequence prediction problems.
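As a starting point for the first idea in the list above, one-hot encoding can be sketched with `pandas.get_dummies()`. The small frame below uses made-up values and an assumed variable name `toy`; it is an illustration of the encoding only, not a drop-in replacement for the integer encoding used in this tutorial.

```python
import pandas as pd

# Made-up sample with a categorical wind direction column.
toy = pd.DataFrame({
    "wnd_dir": ["SE", "NW", "cv", "SE"],
    "wnd_spd": [1.79, 4.92, 0.89, 2.68],
})

# Replace wnd_dir with one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=["wnd_dir"], dtype=float)
print(encoded)
```

Note that one-hot encoding changes the number of features, so any hard-coded column indexes and the input shape given to the LSTM would need to be updated to match.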

### Define and Fit Model

In this section, we will fit an LSTM on the multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up the training of the model for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits the train and test sets into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, namely [samples, timesteps, features].

```python
...
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```

Running this example prints the shape of the train and test input and output sets with about 9K hours of data for training and about 35K hours for testing.

```
(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)
```

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 72. Remember that the internal state of the LSTM in Keras is reset at the end of each batch, so an internal state that is a function of a number of days may be helpful (try testing this).

Finally, we keep track of both the training and test loss during training by setting the *validation_data* argument in the fit() function. At the end of the run both the training and test loss are plotted.

```python
...
# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```

### Evaluate Model

After the model is fit, we can forecast for the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert scaling on the test dataset with the expected pollution numbers.

With forecasts and actual values in their original scale, we can then calculate an error score for the model. In this case, we calculate the Root Mean Squared Error (RMSE) that gives error in the same units as the variable itself.

```python
...
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```
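For readers less familiar with the two error measures used in this tutorial, the sketch below computes MAE (the training loss) and RMSE (the final score) by hand with NumPy. The `actual` and `predicted` arrays are made-up values chosen only to illustrate the arithmetic.

```python
import numpy as np

# Made-up observations and forecasts in original units.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

errors = predicted - actual

# Mean Absolute Error: average size of the errors.
mae = np.mean(np.abs(errors))

# Root Mean Squared Error: square, average, then take the root,
# which weights large errors more heavily than MAE does.
rmse = np.sqrt(np.mean(errors ** 2))

print('MAE: %.3f, RMSE: %.3f' % (mae, rmse))
```

Both measures are in the same units as the variable being predicted, and RMSE is never smaller than MAE on the same errors.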

### Complete Example

The complete example is listed below.

**NOTE**: This example assumes you have prepared the data correctly, e.g. converted the downloaded “*raw.csv*” to the prepared “*pollution.csv*”. See the first part of this tutorial.

```python
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```

Running the example first creates a plot showing the train and test loss during training.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

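Interestingly, we can see that test loss drops below training loss. This is the opposite of the usual overfitting pattern, in which test loss rises above training loss, so it may instead reflect differences between the training and test periods. Measuring and plotting RMSE during training may shed more light on this.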

The train and test loss are printed at the end of each training epoch. At the end of the run, the final RMSE of the model on the test dataset is printed.

We can see that the model achieves a respectable RMSE of 26.496, which is lower than an RMSE of 30 found with a persistence model.

```
...
Epoch 46/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 47/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 48/50
0s - loss: 0.0144 - val_loss: 0.0133
Epoch 49/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 50/50
0s - loss: 0.0144 - val_loss: 0.0133
Test RMSE: 26.496
```
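The persistence model mentioned above simply predicts that the pollution at the next hour equals the pollution at the current hour. The sketch below shows how such a baseline can be scored. It uses only the first five pollution readings printed earlier in this tutorial, so the number it produces is an illustration of the calculation and is not comparable to the persistence RMSE of 30 quoted for the full test set.

```python
import numpy as np

# First five hourly pollution readings from the prepared dataset.
series = np.array([129.0, 148.0, 159.0, 181.0, 138.0])

# Persistence forecast: the value at t-1 is the prediction for t.
persistence_pred = series[:-1]
actual = series[1:]

# Score the baseline with RMSE, as for the LSTM.
rmse = np.sqrt(np.mean((actual - persistence_pred) ** 2))
print('Persistence RMSE: %.3f' % rmse)
```

Any forecasting model should beat this baseline on the same test data to be considered skillful.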

This model is not tuned. Can you do better?

Let me know your problem framing, model configuration, and RMSE in the comments below.

## Train On Multiple Lag Timesteps Example

There have been many requests for advice on how to adapt the above example to train the model on multiple previous time steps.

I had tried this and a myriad of other configurations when writing the original post and decided not to include them because they did not lift model skill.

Nevertheless, I have included this example below as a reference template that you could adapt for your own problems.

The changes needed to train the model on multiple previous time steps are quite minimal, as follows:

First, you must frame the problem suitably when calling series_to_supervised(). We will use 3 hours of data as input. Also note that we no longer explicitly drop the columns from all of the other fields at obs(t).

```python
...
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
```

Next, we need to be more careful in specifying the column for input and output.

We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

```python
...
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
```
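The column arithmetic above can be checked without loading any data. The sketch below rebuilds only the column names, following the naming pattern used by series_to_supervised(), and confirms which columns the slices select. The `names` list is an assumption made for illustration.

```python
n_hours = 3
n_features = 8

# Lag columns come first, oldest hour first: var1(t-3) ... var8(t-1).
names = ['var%d(t-%d)' % (j + 1, i)
         for i in range(n_hours, 0, -1)
         for j in range(n_features)]
# Then the 8 columns for the hour being predicted: var1(t) ... var8(t).
names += ['var%d(t)' % (j + 1) for j in range(n_features)]

n_obs = n_hours * n_features
input_names = names[:n_obs]          # columns selected by [:, :n_obs]
output_name = names[-n_features]     # column selected by [:, -n_features]

print(len(names), input_names[0], input_names[-1], output_name)
```

The output column is the first of the final 8 columns, which is pollution at time t, and none of the other observations at time t leak into the inputs.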

Next, we can reshape our input data correctly to reflect the time steps and features.

```python
...
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
```
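It can help to see why this reshape is safe. Because the framed columns are ordered hour by hour, with all features for one hour stored together, a row-major reshape places each hour in its own time step. The sketch below demonstrates this on a single made-up row with 2 features instead of 8; the values are labels chosen so the layout is easy to read.

```python
import numpy as np

n_hours = 3
n_features = 2

# One sample laid out as [f1(t-3), f2(t-3), f1(t-2), f2(t-2), f1(t-1), f2(t-1)].
# The tens digit marks how many hours back the value is, the units digit marks the feature.
flat = np.array([[31, 32, 21, 22, 11, 12]])

# Reshape to [samples, timesteps, features].
cube = flat.reshape((flat.shape[0], n_hours, n_features))
print(cube.shape)
print(cube[0])
```

The last time step of each sample holds the most recent hour (t-1), which is the ordering the LSTM reads from first to last.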

Fitting the model is the same.

The only other small change is in how we evaluate the model, specifically in how we reconstruct the rows with 8 columns suitable for reversing the scaling operation, so that y and yhat are back in the original scale and we can calculate the RMSE.

The gist of the change is that we concatenate the y or yhat column with the last 7 features of the test dataset in order to inverse the scaling, as follows:

```python
...
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
```
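The padding with 7 extra columns is needed only because the scaler was fit on 8 columns and expects 8 columns back; with a different number of features you would pad with `n_features - 1` columns instead. The arithmetic being reversed is simple. For a (0, 1) feature range, min-max scaling computes (x - min) / (max - min) per column, so the inverse for one column is x = scaled * (max - min) + min. The sketch below shows this by hand in NumPy on a made-up two-column array; it illustrates the calculation and is not the approach used in the listing above.

```python
import numpy as np

# Made-up data: column 0 plays the role of pollution, column 1 of dew point.
data = np.array([
    [129.0, -16.0],
    [148.0, -15.0],
    [159.0, -11.0],
    [181.0, -7.0],
])

# Per-column min-max scaling to the range 0-1.
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)

# Pretend these are scaled forecasts for column 0 only.
yhat_scaled = np.array([0.5, 1.0])

# Invert using only the statistics of column 0.
yhat = yhat_scaled * (col_max[0] - col_min[0]) + col_min[0]
print(yhat)
```

Only the minimum and maximum of the target column affect the result, which is why the values in the padded columns do not change the recovered forecasts.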

We can tie all of these modifications to the above example together. The complete example of multivariate time series forecasting with multiple lag inputs is listed below:

```python
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# specify the number of lag hours
n_hours = 3
n_features = 8
# frame as supervised learning
reframed = series_to_supervised(scaled, n_hours, 1)
print(reframed.shape)

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, -n_features]
test_X, test_y = test[:, :n_obs], test[:, -n_features]
print(train_X.shape, len(train_X), train_y.shape)
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], n_hours*n_features))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, -7:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The model is fit as before in a minute or two.

```
...
Epoch 45/50
1s - loss: 0.0143 - val_loss: 0.0154
Epoch 46/50
1s - loss: 0.0143 - val_loss: 0.0148
Epoch 47/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 48/50
1s - loss: 0.0143 - val_loss: 0.0151
Epoch 49/50
1s - loss: 0.0143 - val_loss: 0.0152
Epoch 50/50
1s - loss: 0.0144 - val_loss: 0.0149
```

A plot of train and test loss over the epochs is created.

Finally, the Test RMSE is printed, not really showing any advantage in skill, at least on this problem.

```
Test RMSE: 27.177
```

I would add that the LSTM does not appear to be suitable for autoregression-type problems and that you may be better off exploring an MLP with a large window.

I hope this example helps you with your own time series forecasting experiments.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- Beijing PM2.5 Data Set on the UCI Machine Learning Repository
- The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras
- Time Series Forecasting with the Long Short-Term Memory Network in Python
- Multi-step Time Series Forecasting with Long Short-Term Memory Networks in Python

## Summary

In this tutorial, you discovered how to fit an LSTM to a multivariate time series forecasting problem.

Specifically, you learned:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

except wind *dir*, which is categorical.

Thanks, fixed!

how to use grid search for neurons

I want to apply grid search in this to tune neurons and add layers

and to find best parameters

See this post:

https://machinelearningmastery.com/tune-lstm-hyperparameters-keras-time-series-forecasting/

hello Jason,

I have run the code in my spyder and I know the RMSE index is good enough for this model. However, I added the accuracy index in this code, that is

model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])

and the accuracy is totally the same in each epoch and is very low (0.0761). I also use my own data to run your code, and the result is the same, with good RMSE values but bad accuracy. I have troubled by this for several days and looking forward to your reply.

You cannot measure accuracy for regression.

Learn more here:

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

Hi，Jason.

I have the same problem as qing.I don‘t know why we cannot measure accuracy for regression.And the website you provided cannot be opened.

Could you please help me with that?

Accuracy summarizes correct predictions for class labels. It cannot be used for regression. Instead you must calculate an error metric, like RMSE.

Learn more here:

https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-classification-and-regression

And here:

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

Thank you very much!

You’re welcome.

That is correct! You can only use accuracy for class labels. You could calculate RMSE or R^2 instead

hi，Jason，I‘m a new learner. There is no real curve and predicted curve in your tutorial.

I want to know how can I get it? I mean how to write it in the code?

Sorry, I don’t understand your question Mike, can you elaborate?

I guess he means the predicted value vs ground truth chart.

I see.

You can call model.predict() to get yhat and create a line plot with y and yhat.

I have done this in some other tutorials, for example:

https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/

If this is a challenge for you, I would suggest this tutorial is too advanced for you and I would encourage you to start with intro to time series here:

https://machinelearningmastery.com/start-here/#timeseries

Hi Jason, in all this implementation, how does thw feedback implementation occur? How do we account for lags in predicted time series?

Lags are accounted for as input time steps to the model.

Perhaps read this:

https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input

Many thanks for this incredibly useful example!

I think I might have a small suggestion: I’ve downloaded the “pollution” data set from the Github link provided, and I found out that maybe the column to be encoded is now column 8 and not 4 like in the original code, so I made this amendment and it all worked: (apologies if I’m missing something):

# I’ve replaced this line:

#values[:,4] = encoder.fit_transform(values[:,4])

# … with this line:

values[:,8] = encoder.fit_transform(values[:,8])

Thanks for your help!

Perhaps you downloaded the wrong dataset?

Here it is:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pollution.csv

good afternoon,i m new to machine learning and trying to run ur code on google colabs,but i getting the following error.

2003

2004 if not is_integer(x):

-> 2005 x = names.index(x)

2006

2007 self._reader.set_noconvert(x)

ValueError: ‘year’ is not in list

pls help me to slove out

Sorry, I don’t know about colab.

Try running the example on your workstation.

Hi Jason. Do you know why i can’t inverse scaler transform in inv_yhat and why appear this error?

operands could not be broadcast together with shapes (157,13) (7,) (157,13)

Perhaps this will help:

https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/

I know how I can help you! In Jason’s code it is as follows:

inv_yhat = concatenate((yhat, test_X[:, -7:]), axis=1)

But make sure that instead of 7 you use number_of_features - 1, otherwise you get the ValueError.

So in my case, I use 31 features (including the one I want to predict), and the code is as follows:

inv_yhat = concatenate((yhat, test_X[:, -30:]), axis=1)

as well as for inv_y:

inv_y = concatenate((test_y, test_X[:, -30:]), axis=1)

Hope this helps!
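The rule in the comment above (pad the prediction with number_of_features - 1 columns before inverting) can be checked with a small sketch. It uses plain NumPy min-max scaling as a stand-in for the tutorial's scaler, and the sizes, the random data and the names `col_min` and `col_max` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 4          # hypothetical sizes; column 0 is the target

# Fake "raw" data and a min-max scaling learned on all n_features columns
raw = rng.uniform(10.0, 50.0, size=(n_samples, n_features))
col_min, col_max = raw.min(axis=0), raw.max(axis=0)
scaled = (raw - col_min) / (col_max - col_min)

# Pretend the model predicted the scaled target perfectly
yhat = scaled[:, :1]                  # shape (n_samples, 1)
test_X = scaled                       # 2D inputs, shape (n_samples, n_features)

# Pad yhat with the last n_features - 1 columns so the width matches the scaling
padded = np.concatenate((yhat, test_X[:, -(n_features - 1):]), axis=1)

# Invert the scaling and keep only the target column
inv_yhat = (padded * (col_max - col_min) + col_min)[:, 0]
print(padded.shape, np.allclose(inv_yhat, raw[:, 0]))
```

If the padded array had any other width, the element-wise inversion would fail with a broadcasting error like the one reported above.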

Great post Jason. Thank you so much for making this material available for the community..

Thanks Francois, I’m glad it helped!

Hi Jason. I ran into a problem in my environment, which uses Keras 2.0.4 and tensorflow-gpu 0.12.0rc0.

The error was “TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.”

It was raised at the line “model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))”.

Could you please help me with that?

Regards,

yao

I would recommend this tutorial for setting up your environment:

http://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Thx a lot, doctor, it works! fabulous! 🙂

I’m glad to hear that.

Dr. Jason, I updated TensorFlow and now it works!

Sorry to bother you.

Thank you very much !

Best wishes !

I’m glad to hear that!

I met the same problem .

Did you uninstall all the programs previously installed or just set up the environment again?

Thx a lot!

Hi Jason, I set up my environment as in your tutorial.

scipy: 0.19.0

numpy: 1.12.1

matplotlib: 2.0.2

pandas: 0.20.1

statsmodels: 0.8.0

sklearn: 0.18.1

theano: 0.9.0.dev-c697eeab84e5b8a74908da654b66ec9eca4f1291

tensorflow: 0.12.1

Using TensorFlow backend.

keras: 2.0.5

But the bug still exists. Is the version of TensorFlow too old? What should I do?

Thanks!

It might be, I am running v1.2.1.

Perhaps try running Keras off Theano instead (e.g. change the backend in the ~/.keras/keras.json config)

It seems that inv_y = scaler.inverse_transform(test_X)[:,0] is not the actual value. Should inv_yhat be compared with test_y rather than pollution(t-1)? I think inv_y here means pollution(t-1). Does this prediction amount to only a time shift of the currently known pollution value (meaning the model just takes pollution(t) as the prediction of pollution(t+1))?

Sorry, I’m not sure I follow. Can you please restate your question, perhaps with an example?

Sorry for the confusing expression. In fact, the series_to_supervised() function would create a DataFrame whose columns are: [ var1(t-1), var2(t-1), …, var1(t) ] where ‘var1’ represents ‘pollution’. Therefore, the first column of test_X (that is, test_X[:,0]) would be ‘pollution(t-1)’. However, in the code you calculate the RMSE between inv_yhat and test_X[:,0]. Even though the RMSE is low, it only shows that the model’s prediction for t+1 is close to what it already knew at t.

I am asking this question because I’ve run the code and saw that the model’s prediction for pollution(t+1) looks just like pollution(t). I’ve also tried using t-1, t-2 and so on for training, but nothing changed.

Do you think the model tends to learn to just take the pollution value at the current moment as the prediction for the next moment?

thanks 🙂

If we predict t for t+1 that is called persistence, and we show in the tutorial that the LSTM does a lot better than persistence.

Perhaps I don’t understand your question? Can you give me an example of what you are asking?
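For reference, the persistence forecast mentioned above (predicting the value at t+1 with the observation at t) and its RMSE can be sketched as follows. The function name `persistence_rmse` and the sample readings are hypothetical and are not taken from the tutorial.

```python
import numpy as np

def persistence_rmse(series):
    """RMSE of predicting each value with the value one step earlier."""
    series = np.asarray(series, dtype=float)
    predicted = series[:-1]   # observation at t used as the forecast for t+1
    actual = series[1:]       # observation at t+1
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical hourly pollution readings, for illustration only
pollution = [129.0, 148.0, 159.0, 181.0, 138.0, 109.0]
baseline = persistence_rmse(pollution)
print('Persistence RMSE: %.3f' % baseline)
```

A model's RMSE on the same test period, in the same units, can then be compared against this baseline.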

Hmm, it’s difficult to explain without a graph.

In a word, and also it’s an example, I want to ask two questions:

1. In the “make a prediction” part of your code, why does it compute the RMSE between predicted t+1 and real t, rather than between predicted t+1 and real t+1?

2. After the “make a prediction” part of your code runs, the RMSE between predicted t+1 and real t turns out to be small. Is that evidence that the LSTM is learning persistence?

RMSE is calculated for y and yhat for the same time periods (well, that was the intent), why do you think they are not?

Is there a bug?

I think Songbin Xu is right. By executing the statement at line 90: inv_y = inv_y[:,0], you compare inv_yhat with inv_y. inv_y is pollution(t-1) and inv_yhat is the predicted pollution(t).

On line 50, the second parameter of the series_to_supervised function can be changed to 3 or 5 so that more history is used. If you do so, an error occurs in scaler.inverse_transform (line 89).

No worries, great tutorial and I learned a lot so far!

I see now, you guys are 100% correct. Thank you!

I have updated the calculation of RMSE and the final score reported in the post.

Note, I ran a ton of experiments on AWS with many different lag values > 1 and none achieved better results than a simple lag=1 model (e.g. an LSTM model with no BPTT). I see this as a bad sign for the use of LSTMs for autoregression problems.

Hi Dr. Jason,

As for this:

Updated Aug/2017: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.

It seems there was an error in calculating the RMSE across different time steps, (t-1) vs (t). I’m just curious how it was corrected. Can you elaborate a little more? To me, it still looks like the RMSE is based on (t-1) vs (t).

Thanks

I have updated tutorials that I think have better code and are easier to follow, you can get started here:

https://machinelearningmastery.com/start-here/#deep_learning_time_series

Hey, Jason. The RMSE before you updated it was 3.386. Is the RMSE of 26.496 in this article the correct answer after the update? In other words, inv_y = scaler.inverse_transform(test_X)[:,0] is not correct, and

test_y = test_y.reshape((len(test_y), 1))

inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)

inv_y = scaler.inverse_transform(inv_y)

is the correct code, is that right? I find that many people use the incorrect code.

I don’t recall.

I recommend starting with a more recent tutorial using modern methods:

https://machinelearningmastery.com/start-here/#deep_learning_time_series

Hi Jason, great post!

Is it necessary to remove seasonality (by seasonal differencing) when we are using an LSTM?

No, but results are often better.

Good article, thanks.

Two questions:

What changes will be required if your data is sporadic? Meaning sometimes it could be 5 hours without the report.

And how do you add more timesteps into your model? Obviously you have to reshape it properly but you also have to calculate it properly.

You could fill in the missing data by imputing or ignore the gaps using masking.

What do you mean by “add more timesteps”?

But what should I do if all data is stochastic time sequence?

For example predicting time till the next event – when events frequency is stochastically distributed on the timeline.

Good question, this sounds like survival analysis to me, perhaps see if it applies:

https://en.wikipedia.org/wiki/Survival_analysis

Dr.Jason,

Thank you for an awesome post.

(I was practicing on load forecast using MLP and SVR (You also suggested on a comment in your other LSTM tutorials). I also tried with LSTM and it did almost perform like SVR. However, in LSTM, I did not consider time lags because I have predicted future predictor variables that I was feeding as test set. I will try this method with time lags to cross validate the models)

Nice Jack, let me know how you go.

Hi Jason,

Can I use ‘look back’ (using data from steps t-2 and t-1 to predict air pollution at step t) in this case?

If so, my input data shape will be [samples, look back, features], won’t it?

You can Adam, see the series_to_supervised() function and its usage in the tutorial.

Hi Jason,

If I used n_in=5 in the series_to_supervised() function, the input shape in your tutorial would be [samples, 1, features*5]. Can I reshape it to [samples, 5, features]? If I can, what is the difference between these two shapes?

The second dimension is time steps (e.g. BPTT) and the third dimension is the features (e.g. observations at each time step). You can use features as time steps, but it would not really make sense and I expect performance to be poor.

Here’s how to build a model with multiple time steps for multiple features:

And that’s it. I just tested and it looks good. The RMSE calculation will blow up, but you guys can fix that up I figure.
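To make the difference between the two shapes discussed above concrete, here is a minimal NumPy sketch. It assumes the framed columns are ordered oldest lag first, with all features of one time step grouped together; the sizes are hypothetical.

```python
import numpy as np

n_samples, n_hours, n_features = 3, 5, 2   # hypothetical sizes

# Each row mimics a framed sample: all features at t-5, then t-4, ... then t-1
flat = np.arange(n_samples * n_hours * n_features).reshape(n_samples, n_hours * n_features)

# One time step that carries every lagged value as a separate feature
as_features = flat.reshape((n_samples, 1, n_hours * n_features))

# Five time steps with n_features observations each
as_timesteps = flat.reshape((n_samples, n_hours, n_features))

print(as_features.shape)    # (3, 1, 10)
print(as_timesteps.shape)   # (3, 5, 2)
print(as_timesteps[0])      # rows are time steps, columns are features
```

Both arrays hold the same numbers. Only the second layout tells the LSTM that the values belong to five consecutive time steps.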

Jason, great post, very clear, and very useful!! I’m about 90% with you and think a few folks may be stuck on this final point if they try to implement multi-feature, multi-hour-lookback LSTM.

Seems like by making adjustments above, I’m able to make a prediction, but the scaling inversion doesn’t want to cooperate. The reshape step now that we have multiple features and multiple timesteps has a mismatch in the shape, and even if I make the shape work, the concatenation and inversion still don’t work. Could you share what else you changed in this section to make it work? I’m not so concerned about the RMSE as much as that I can extract useful predictions. Thank you for any insight since you’ve been able to do it successfully.

# make a prediction

yhat = model.predict(test_X)

test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

# invert scaling for forecast

inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)

inv_yhat = scaler.inverse_transform(inv_yhat)

inv_yhat = inv_yhat[:,0]

…

Hi Jason,

Great and useful article.

I am somewhat puzzled by the number of features you specify to forecast the pollution rate based on data from the previous 24 hours.

Don’t we have 8 features for each time step, not 7?

After generating the supervised data with the function series_to_supervised(scaled, 24, 1), the resulting array has a shape of (43800, 200), which is 25 * 8.

To invert the scaling for the forecast I made a few modifications. I used scaled.shape[1] below, but in my opinion it could be n_features. Moreover, I don’t know if the values concatenated to yhat and test_y really matter, as long as they have been scaled with fit_transform and the array has the right shape.

yhat = model.predict(test_X)

test_X = test_X.reshape((test_X.shape[0], n_obs))

# invert scaling for forecast

inv_yhat = concatenate((yhat, test_X[:, 1:scaled.shape[1]]), axis=1)

inv_yhat = scaler.inverse_transform(inv_yhat)

inv_yhat = inv_yhat[:,0]

# invert scaling for actual

test_y = test_y.reshape((len(test_y), 1))

inv_y = concatenate((test_y, test_X[:, 1:scaled.shape[1]]), axis=1)

inv_y = scaler.inverse_transform(inv_y)

inv_y = inv_y[:,0]

The model has 4 layers with dropout.

After 200 epochs I have got

loss: 0.0169 – val_loss: 0.0162

And a rmse = 29.173

Regards.

We have 7 features because we drop one in section “2. Basic Data Preparation”.

Hi Jason,

It’s really weird to me :(, as I used your code to prepare the data (pollution.csv) and I have 9 fields in the resulting file.

[date, pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain]

😯

Date and wind direction are dropped during data preparation, perhaps you accidentally skipped a step or are reviewing a different file from the output file?

Hi Jason,

So that’s fine, in my case I have 8 features.

When reading the file, the field ‘date’ becomes the index of the dataframe and the field ‘wnd_dir’ is later label encoded, as you do above in “The complete example” lines 42-43.

It is now much clearer for me. I am not puzzled anymore. 😉

Thanks a lot for all the information contained in your articles and your e-books.

They are really very informative.

🙂

I’m glad to hear that!

Hi Jason,

I think the output is column var1(t), that means:

train_X, train_y = train[:, 0:n_obs], train[:, -(n_features+1)]

am I right?

In case the “pollution” is in the last column, it is easy to get train[:, -1]

am i right?

I just want to verify that I understand your post.

Thank you, Jason

I am somewhat confused about this problem.

I want to use a bigger window (I want to go back further in time, for example to t-5, to include more data when predicting time t) and use all of this to predict one variable (such as just the pollution), like you did. I think predicting one variable will be more accurate than predicting many, such as pollution and temperature.

What should I do to apply more shift?

I show in another comment how to update the example to use lag obs as input.

I will update the post and add an example to make it clearer.

First of all, thanks for your work and the effort you put in!

I tried to implement your suggestion for increasing the timesteps (BPTT). I have integrated your code but I keep getting this error when reshaping test_X in the prediction step:

test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

ValueError: cannot reshape array of size 490532 into shape (35038,7)

Do you have any tips on how to proceed?

I will update the post with a worked example. Adding to trello now…
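The error reported above is consistent with each test sample holding more than 7 values: 490532 / 35038 = 14, so a reshape to 7 columns cannot succeed. A minimal sketch of flattening a 3D input back to 2D, assuming hypothetical sizes of 2 lag time steps and 7 features:

```python
import numpy as np

n_samples, n_hours, n_features = 4, 2, 7   # small hypothetical sizes
test_X = np.zeros((n_samples, n_hours, n_features))

# test_X.shape[2] is only n_features, so this target shape is too small
try:
    test_X.reshape((test_X.shape[0], test_X.shape[2]))
    reshape_failed = False
except ValueError:
    reshape_failed = True

# Flatten every time step into one row per sample instead
flat_X = test_X.reshape((test_X.shape[0], n_hours * n_features))
print(reshape_failed, flat_X.shape)
```

From the flattened array, the columns needed to pad the prediction can then be sliced out before inverting the scaling.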

Hi Jason.

In the code you wrote above, should the following code:

train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))

be actually

train_X = train_X.reshape((train_X.shape[0]/n_hours, n_hours, n_features))

Why is that?

Hi, Jason. I am a new learner. First, thank you for sharing! But when I run the complete code, I get an error: pyplot.plot(history.history['val_loss'], label='test')

KeyError: 'val_loss'

How can I solve it?

Perhaps you did not use a validation dataset when fitting the model. In that case you cannot plot validation loss.
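Keras only records `val_loss` when validation data (or a validation split) is passed to `fit()`. One way to avoid the KeyError is to plot only the curves that exist. The helper below is a hypothetical sketch that works on a plain dict shaped like `history.history`; it is not part of the tutorial code.

```python
def curves_to_plot(history_dict):
    """Return (label, values) pairs for the loss curves that were recorded."""
    curves = [('train', history_dict['loss'])]
    # 'val_loss' exists only when validation data was given to fit()
    if 'val_loss' in history_dict:
        curves.append(('test', history_dict['val_loss']))
    return curves

# Hypothetical histories, shaped like the dict Keras stores in history.history
without_validation = {'loss': [0.05, 0.03, 0.02]}
with_validation = {'loss': [0.05, 0.03, 0.02], 'val_loss': [0.06, 0.04, 0.03]}

print([label for label, _ in curves_to_plot(without_validation)])
print([label for label, _ in curves_to_plot(with_validation)])
```

Each returned pair can be passed to `pyplot.plot(values, label=label)`.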

Hi Jason,

Thank you for this excellent tutorial. I recently started working on LSTM methods. I have a question regarding the input shape. If n_hours > 1, how do I inverse transform the scaled values? Thanks in advance.

You’re welcome.

This will help with the input shape:

https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input

Hi Jason, I get the following error from line # 82 of your ‘Complete Example’ code.

ValueError: Error when checking : expected lstm_1_input to have 3 dimensions, but got array with shape (34895, 8)

I think LSTM() is looking for (sequences, timesteps, dimensions). In your code, line # 70, I believe 50 is timesteps while input_shape (1,8) represents the dimensions. Maybe it’s missing ‘sequences’?

Appreciate your response.

Ensure that you first prepare the data (e.g. convert “raw.csv” to “pollution.csv”).

I have the same error too. Cannot figure out what’s wrong

Something changed, the problem is on the model evaluation section, specifically the reshape line

test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

as it is, is 2 dimensions (34895, 8)

we need to add one dimension but I can’t figure out how (noob here)

tried this: test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))

but didn’t work (IndexError: tuple index out of range)

any ideas anyone?

You can use the reshape() function or the expand_dims() function in NumPy.

https://docs.scipy.org/doc/numpy/

Does that help?
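A minimal sketch of both options, assuming a hypothetical 2D array of shape (samples, features) that needs a time-step axis of length 1:

```python
import numpy as np

test_X = np.zeros((5, 8))   # hypothetical 2D array: (samples, features)

# Option 1: reshape to (samples, time steps, features) with one time step
reshaped = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

# Option 2: insert a new axis of length 1 at position 1
expanded = np.expand_dims(test_X, axis=1)

print(reshaped.shape, expanded.shape)
```

Note that `test_X.shape[2]` does not exist for a 2D array, which is why the attempt quoted above raised an IndexError.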

Greetings Sir..

I’ve run into the same problem as well. And I’m confident that I’m using “pollution.csv” data.. How can I rectify this?

I have some suggestions here:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

Hi Jason, I am wondering what causes the issue I’m getting, maybe a different type of dataset than the example one. Basically, when I fit the model and check history.history.keys(), I only get back ‘loss’ as my only key.

You must specify the metrics to collect when you compile the model.

For example, in classification:

Hi Jason,

If you replace the target in this example with a binary target, say one that indicates whether var1 goes up or not in the next move, thus:

reframed['var1(t)_diff']=reframed['var1(t)'].diff(1)

reframed['target_diff']=reframed['var1(t)_diff'].apply(lambda x : (x>0)*1)

it gives this error:

You are passing a target array of shape (8760, 1) while using as loss `categorical_crossentropy`. `categorical_crossentropy` expects targets to be binary matrices (1s and 0s) of shape (samples, classes). If your targets are integer classes, you can convert them to the expected format via:

I have :

test_y.shape as (35038,)

but if we follow another example from you with the PIMA dataset on a simple classification : https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/

which was :

X = dataset[:,0:8]

Y = dataset[:,8]

model = Sequential()

model.add(Dense(12, input_dim=8, activation='relu'))

model.add(Dense(8, activation='relu'))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X, Y, epochs=150, batch_size=10)

it gives no error even though Y has the same shape. Why?

How can we make it work for LSTM classification, please?

Thanks

I have an example of LSTMs for time series classification here:

https://machinelearningmastery.com/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/

Yes thanks I looked at it:

if you do one example inside :

trainX, trainy = load_dataset_group('train', path + 'HARDataset/')

trainy = trainy - 1

Note :

set(list(pd.DataFrame(trainy)[0]))

Out[217]: {0, 1, 2, 3, 4, 5}

But

trainy_postcategorical = to_categorical(trainy)

trainy_postcat.shape

gives

print(trainy_postcat.shape)

(7352, 7)

which means one additional variable has been created while we were expecting 6 dummies only.

pd.DataFrame(trainy_postcat)[0].sum() gives 0 so empty column for 1st one

Coming back to the shape for the LSTM.

the output of your pre process work gives :

trainy_postcat.shape

Out[219]: (7352, 7)

which for a single dummy (the case of this article and my original question)

is the analogy of

”’ You are passing a target array of shape (8760, 1) ”

which should be good.

Any idea ? the activity recognition analogy does not solve the shape issue.

Sorry, I don’t have the capacity to review/debug your code, more here:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

Hello Jason,

Thank you for such a nice tutorial.

Since you have published a similar topic and few other related topics in one of your paid books (LSTM networks), should the reader also expect some different topics covered in it?

I’m an ardent fan of your blog since it covers most of the learning material, which makes me wonder what will be different in your book.

Thanks Arman.

The book does not cover time series, instead it focuses on teaching you how to implement a suite of different LSTM architectures, as well as prepare data for your problems.

Some ideas were tested on the blog first, most are only in the book.

You can see the full table of contents here:

http://machinelearningmastery.com/lstms-with-python/

The book provides all the content in one place, code as well, more access to me, updates as I fix bugs and adapt to new APIs, and it is a great way to support my site so I can keep doing this.

Thank you for accepting my opinions, such a pleasure!

Running the code you modified, something still puzzles me:

1. Have you drawn the waveforms of inv_y and inv_yhat in the same plot? I think they look quite like persistence.

2. Curiously, I computed the RMSE between pollution(t) and pollution(t-1) in test_X; it’s 4.629, much lower than your final score of 26.496. Does that mean the LSTM performs even worse than persistence?

3. I’ve tried removing var1 at t-1, t-2, …, using lag values > 1, and assigning different weights to the inputs at different timesteps, but none of these improved the results; they performed even worse.

Do you have any other ideas to keep the model from learning persistence?

Looking forward to your advice 🙂

Thank you for pointing out the fault!

The final line plot shows loss on the transformed train and test sets.

Yes, LSTMs are no good at autoregression, yet I keep getting asked to develop examples (tens of emails per day)… See here:

http://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Consider developing a baseline with an MLP, you’ll find it tough to beat it with an LSTM!

Why are you only training with a single timestep (or sequence length)? Shouldn’t you use more timesteps for better training/prediction? For instance in https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py they use 40 (maxlen) timesteps

Yes, it is just an example to help you get started. I do recommend using multiple time steps in order to get the full BPTT.

Hi Jason and Varuna,

When timesteps = 1 as you mentioned, does it mean the value at time t-1 was used to predict the value at time t? Is a moving window a way to use multiple time steps? Is there any other way? Does Keras have any functions for a moving window?

Thank you very much.

Keras treats the “time steps” of a sequence as the window, kind of. It is the closest match I can think of.

Hi Jason,

I ran into a problem when working through your code.

dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)

Traceback (most recent call last):

File “”, line 1, in

dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)

NameError: name ‘parse’ is not defined

>>>

It looks like you have specified a function “parse” but not defined it.
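A minimal sketch of such a function, which must be defined before the read_csv call. The format string is an assumption: it expects the year, month, day and hour values joined by single spaces.

```python
from datetime import datetime

def parse(x):
    # Convert a string such as '2010 1 2 0' (year month day hour) to a datetime
    return datetime.strptime(x, '%Y %m %d %H')

print(parse('2010 1 2 0'))
```

Note also that the call quoted above spells the keyword as data_parser, whereas the pandas argument is named date_parser.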

Hi Jason,

Can I use “keras.layers.normalization.BatchNormalization” as a substitute for “sklearn.preprocessing.MinMaxScaler”?

No, they do very different things.

Hi Jason, it’s a very informative article. Thanks. I have a question regarding forecasting in time series. You used the training data with all the columns after the variable transformations, and the same was done for the test data. The test data, along with all the variables, was used during prediction. For instance, if I want to predict the pollution for a future date, do I need to know the other inputs like dew, pressure, wind direction etc. for that future date, which I am not aware of? Another question: suppose we have the same data for multiple regions (let us consider that the pollution among these regions is not negligible). How can we model this so that the input at prediction time is the region name along with the time to forecast, for just that one region?

It depends on how you define your model.

The model defined above uses the variables from the prior time step as inputs to predict the next pollution value.

In your case, maybe you want to build a separate model per region, perhaps a model that improves performance by combining models across regions. You must experiment to see what works best for your data.

Thanks! I missed the trick of converting the time-series to supervised learning problem. That alone is sufficient even for multiple regions I guess. We just have to submit the input parameters of the previous time stamp for the specific region during prediction. We may also try one-hot encoding on the region variable too during data preprocessing.

Thank you for your excellent blog, Jason. I’ve really learnt a lot from your nice work recently. After this post, I now know how to transform data into the format an LSTM expects and how to construct an LSTM model.

Like the question asked by Naveen Koneti, I have the same puzzle.

Recently I’ve worked on some clinical data. The data is not like the one used in this demo. It consists of hundreds of patients, and each patient has several vital sign records. If it were one individual’s records over many years, I could process the data as you showed us. I wonder how I can handle this kind of data. Could you give me some advice, or tell me where I can find any solutions for it?

If I didn’t state my question clearly and you’re interested it, pls let me know.

Thanks in advance.

PS. the data set in my situation is like this

[ID date feature1 feature2 feature3 ]

[patient1 date1 value11 value12 value13 ]

[patient1 date2 value21 value22 value23 ]

[patient2 date1 value31 value32 value33 ]

[patient2 date2……………………………………..]

[patient3 ……………………………………………..]

You could model one patient at a time, or groups or all of them. Try different approaches and see what works best.

I cannot tell you what would work best – I have no idea – you must discover it.

See this post:

http://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/

Hi Naveen, I have the same question as you. The model is defined such that if you know the input features at time t, then you can predict the target value at time t+1. If you want to predict the target variable at time t+2, though, you would need to know the input features at time t+1. If a feature does not change over time, that is no problem; but if a feature changes over time, then its value at time t+1 is not known and may be different from its value at time t.

I am thinking that to solve this, you would need to define such features as output of the model as well as the target variable. In this way, at time t, you can predict the target variable for time t+1, but also the feature for time t+1, so that this predicted value can be used as input to predict the target variable for time t+2.

What do you think about that? Did you think of a different solution?

Many thanks
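The idea described in this comment (predict every feature for t+1, then feed that prediction back in to reach t+2) can be sketched as a simple loop. Everything here is hypothetical: `predict_next` stands in for a fitted model that outputs the full feature vector, and the toy model is made up for illustration.

```python
import numpy as np

def recursive_forecast(predict_next, last_obs, n_steps):
    """Forecast n_steps ahead by feeding each prediction back in as input.

    predict_next is a hypothetical stand-in for a fitted model that maps the
    full feature vector at time t to the full feature vector at time t+1.
    """
    current = np.asarray(last_obs, dtype=float)
    forecasts = []
    for _ in range(n_steps):
        current = predict_next(current)   # predicted features for the next step
        forecasts.append(current)
    return np.array(forecasts)

# Toy "model": pollution decays by 10%, temperature rises by 1 each step
def toy_model(obs):
    return np.array([obs[0] * 0.9, obs[1] + 1.0])

path = recursive_forecast(toy_model, [100.0, 20.0], n_steps=3)
print(path[:, 0])   # forecast target (column 0) for t+1, t+2, t+3
```

One caveat with this design is that prediction errors are fed back as inputs, so they can accumulate over longer horizons.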

Hi,

again a nice post for the use of lstm’s!

I had the following idea when reading.

I would like to build a network, in which each feature has its own LSTM neuron/layer, so that the input is not fully connected.

My idea is adding a lstm layer for each feature and merge it with the merge layer and feed these results to the output neurons.

Is there a better way to do this? Or would you recommend to avoid this because the features are poorly abstracted? On the other hand, this might also be interesting.

Thank you!

Try it and see if it can out-perform a model that learns all features together.

Also, contrast to an MLP with a window – that often does better than LSTMs on autoregression problems.

Hi Jason,

I have two questions:

1) I have a question regarding the scaling of the Y variable (pollution). The way you implement the rescaling to [0, 1], you consider the entire length of the array (all 43799 observations after the dropna).

Is it right to rescale it that way? By doing so we are incorporating information from the future (test set) into the past (train set), because the scaler is “exposed” to both of them, and therefore we introduce bias.

If you agree with my point what could be a fix?

2) Also the activation function of the output (Y variable) is sigmoid, that’s why we rescale it within the [0,1] range. Am I correct?

Thanks for sharing the article!

No, ideally you would develop a scaling procedure on the training data and use it on test and when making predictions on new data.

I tried to keep the tutorial simple by scaling all data together.

The activation on the output layer is ‘linear’, the default. This must be the case because we are predicting a real-value.
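The scaling procedure described in the reply above (learn the scaling on the training data, then reuse it on test and new data) can be sketched with plain NumPy. The data and the split point are hypothetical, and the manual min-max arithmetic is a stand-in for a fitted scaler object.

```python
import numpy as np

# Hypothetical data: rows are time steps, columns are features
values = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0], [40.0, 9.0]])
n_train = 3
train, test = values[:n_train], values[n_train:]

# Learn the min and max from the training rows only
train_min, train_max = train.min(axis=0), train.max(axis=0)

def scale(x):
    # Apply the training-set scaling to any data (train, test or new)
    return (x - train_min) / (train_max - train_min)

train_scaled, test_scaled = scale(train), scale(test)
print(train_scaled.min(), train_scaled.max())
print(test_scaled)   # may fall outside [0, 1], since test was never seen
```

With a scikit-learn scaler, the same idea is to call fit_transform on the training rows and transform on the test rows.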

Hi,

First, I want to thank you for your helpful and practical blog.

I tried to separate the train and test sets so that normalization is fit on the training set, but I got an error related to the test set shape, something like “ValueError: cannot reshape array of size 136 into shape (34,2,4)”, which I don’t know how to fix!

Do you have an example of an LSTM that fits normalization on train and applies it to test, or do you explain that in your book?

Thanks

This post will help you learn how to reshape your input data:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

Hi,

I made some changes and just use the transform method on the test set, is that correct?

First, I divided my dataset into two different sets (train and test).

Second, I ran fit_transform on the train set and transform on the test set.

But I get rmse=0, which seems weird. Am I correct?

Sounds correct.

An RMSE of zero suggests a bug or a very simple modeling problem.

Thank you very much for your tutorial.

I have one question:

I failed to read the value NW in pollution.csv (cbwd column).

values = values.astype('float32')

ValueError: could not convert string to float: NW

How do you fix it?

sorry, I saw the text above and solved it.

Glad to hear it!

Hi, I would like to know how you fixed it. I still have that problem; I tried to find the solution above but didn’t find one. Thank you!

You have to prepare the data before you convert it (see “Basic Data Preparation”). In Jason’s complete example of the LSTM, this preparation step is missing (more likely left out).

Yes the note above the complete example says clearly:

Hi Jason!

I assume there is a little mistake when you calculate RMSE on the test data.

You must add this code before calculating RMSE:

inv_y = inv_y[:-1]

inv_yhat = inv_yhat[1:]

Thus, RMSE equals 10.6 (on the same data, in my case), that is much less than 26.5 in your case.

Sorry, I don’t understand your comment and snippet of code, can you spell out the bug you see?

This beats further exploration

I agree with @Dmitry here. The prediction “inv_yhat” is one index ahead of real output “inv_y”.

It can be seen by plotting predicted output v/s real output:

pyplot.plot(inv_y[:-1,], color='green', marker='o', label='Real Screening Count')

pyplot.plot(inv_yhat[1:,], color='red', marker='o', label='Predicted Screening Count')

pyplot.legend()

pyplot.show()

Compute RMSE by skipping the first element of inv_yhat, and a better RMSE score is obtained:

rmse = sqrt(mean_squared_error(inv_y[:-1,], inv_yhat[1:,]))

print('Test RMSE: %.3f' % rmse)

rmse = sqrt(mean_squared_error(inv_y, inv_yhat))

print('Test RMSE: %.3f' % rmse)

Hi Jason,

great post! I was waiting for meteo problems to infiltrate the machinelearningmastery world.

Could you write something about the changed scenario where, given the weather conditions and pollution for some time, we can predict the pollution for another time or place with given weather conditions?

For example: We have the weather conditions and pollution given for Beijing in 2016, and we have the weather conditions given for Chengde (a city close to Beijing) also in 2016. Now we want to know what the pollution was in Chengde in 2016.

Would be great to learn about that!

Great suggestion, I like it. An approach would be to train the model to generalize across geographical domains based only on weather conditions.

I have tried not to use too many weather examples – I came from 6 years of work in severe weather, it’s too close to home 🙂

Hi Jason,

I have read many of your posts about LSTM. I am still not completely clear on the difference between the parameters batch_size and time_steps. Batch_size means when the memory is reset (right?), but shouldn’t this have the same value as time_steps which, if I have understood correctly, means how often the system makes a prediction?

Great question!

Batch size is the number of samples (e.g. sequences) that are used to estimate the gradient before the weights are updated. The internal state is reset at the end of each batch after the weights are updated.

One sample is comprised of 1 or more time steps that are stepped over during backpropagation through time. Each time step may have one or more features (e.g. observations recorded at that time).

Time steps and batch size are generally not related.

You can split up a sequence to have one time step per sequence. In that case you will not get the benefit of learning across time (e.g. BPTT), but you can reset state at the end of the time steps for one sequence. This is an odd config though and really only good for showing off the LSTM’s memory capability.

Does that help?
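As a small illustration of why the two settings are independent, the sketch below builds a 3D input and counts weight updates per epoch. All sizes are hypothetical.

```python
import math
import numpy as np

n_samples, n_timesteps, n_features = 100, 3, 8   # hypothetical dataset
batch_size = 32                                   # hypothetical setting

# LSTM input is 3D: one sample is a sequence of time steps of features
X = np.zeros((n_samples, n_timesteps, n_features))

# The weights are updated once per batch, so an epoch has this many updates
updates_per_epoch = math.ceil(n_samples / batch_size)

print(X.shape, updates_per_epoch)
```

Changing n_timesteps alters the shape of each sample but not the number of updates, which depends only on the number of samples and the batch size.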

Thanks, now it’s more clear!

Hi, I get this error at this step, could you help me please?

model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

—————————————————————————

TypeError Traceback (most recent call last)

in ()

—-> 1 model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

C:\Anaconda3\lib\site-packages\keras\models.py in add(self, layer)

431 # and create the node connecting the current layer

432 # to the input layer we just created.

–> 433 layer(x)

434

435 if len(layer.inbound_nodes) != 1:

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in __call__(self, inputs, initial_state, **kwargs)

241 # modify the input spec to include the state.

242 if initial_state is None:

–> 243 return super(Recurrent, self).__call__(inputs, **kwargs)

244

245 if not isinstance(initial_state, (list, tuple)):

C:\Anaconda3\lib\site-packages\keras\engine\topology.py in __call__(self, inputs, **kwargs)

556 '`layer.build(batch_input_shape)`')

557 if len(input_shapes) == 1:

–> 558 self.build(input_shapes[0])

559 else:

560 self.build(input_shapes)

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in build(self, input_shape)

1010 initializer=bias_initializer,

1011 regularizer=self.bias_regularizer,

-> 1012 constraint=self.bias_constraint)

1013 else:

1014 self.bias = None

C:\Anaconda3\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)

86 warnings.warn('Update your `' + object_name +

87 '` call to the Keras 2 API: ' + signature, stacklevel=2)

—> 88 return func(*args, **kwargs)

89 wrapper._legacy_support_signature = inspect.getargspec(func)

90 return wrapper

C:\Anaconda3\lib\site-packages\keras\engine\topology.py in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint)

389 if dtype is None:

390 dtype = K.floatx()

–> 391 weight = K.variable(initializer(shape), dtype=dtype, name=name)

392 if regularizer is not None:

393 self.add_loss(regularizer(weight))

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in bias_initializer(shape, *args, **kwargs)

1002 self.bias_initializer((self.units,), *args, **kwargs),

1003 initializers.Ones()((self.units,), *args, **kwargs),

-> 1004 self.bias_initializer((self.units * 2,), *args, **kwargs),

1005 ])

1006 else:

C:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in concatenate(tensors, axis)

1679 return tf.sparse_concat(axis, tensors)

1680 else:

-> 1681 return tf.concat([to_dense(x) for x in tensors], axis)

1682

1683

C:\Anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in concat(concat_dim, values, name)

998 ops.convert_to_tensor(concat_dim,

999 name=”concat_dim”,

-> 1000 dtype=dtypes.int32).get_shape(

1001 ).assert_is_compatible_with(tensor_shape.scalar())

1002 return identity(values[0], name=scope)

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)

667

668 if ret is None:

–> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)

670

671 if ret is NotImplemented:

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)

174 as_ref=False):

175 _ = as_ref

–> 176 return constant(v, dtype=dtype, name=name)

177

178

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name, verify_shape)

163 tensor_value = attr_value_pb2.AttrValue()

164 tensor_value.tensor.CopyFrom(

–> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))

166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

167 const_tensor = g.create_op(

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)

365 nparray = np.empty(shape, dtype=np_dt)

366 else:

–> 367 _AssertCompatible(values, dtype)

368 nparray = np.array(values, dtype=np_dt)

369 # check to them.

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in _AssertCompatible(values, dtype)

300 else:

301 raise TypeError(“Expected %s, got %s of type ‘%s’ instead.” %

–> 302 (dtype.name, repr(mismatch), type(mismatch).__name__))

303

304

TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.

Perhaps check that your environment is set up correctly:

http://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Also, ensure that you have copied all of the code.

Hi Jason,

I was curious if you can point me in the right direction for converting data back to the actual values instead of scaled.

Yes, you can invert the scaling.

This tutorial demonstrates how to do that Neal.

Hi Jason, I did have an issue converting back to actual values, but was able to get past it by dropping columns on the reframed data.

When looking at my predicted values vs actual values, I’m noticing that my first column has a prediction and a true value, but for every other variable, I only see what I can assume is a prediction. Does this make a prediction on every column, or just one particular one?

I’m sorry for asking a question such as this, I just think I’m confusing myself looking at my results.

The code in the tutorial only predicts pollution.

Dr. Jason,

I have been trying with my own dataset and I am getting an error “ValueError: operands could not be broadcast together with shapes (168,39) (41,) (168,39)” when I try to do `inv_yhat = scaler.inverse_transform(inv_yhat)` as you have in line 86 in your script. I still can not figure out where my issue is. I have `yhat.shape` as (168,1) and `test_X.shape` as (168,38). When I do `inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)`, my `inv_yhat.shape` is (168,39). I still can not figure out why `inverse_transform` gives that error.

The shape of the data must be the same when inverting the scale as when it was originally scaled.

This means, if you scaled with the entire test dataset (all columns), then you need to tack the yhat onto the test dataset for the inverse. We jump through these exact hoops at the end of the example when calculating RMSE.
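A minimal sketch of this idea follows. It uses a toy three-column array rather than the pollution data, assumes the target sits in the first column, and the values in `yhat` are invented stand-ins for model output.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the real dataset: 5 rows, 3 columns,
# with the target (e.g. pollution) assumed to be the first column
data = np.array([[10.0, 1.0, 100.0],
                 [20.0, 2.0, 200.0],
                 [30.0, 3.0, 300.0],
                 [40.0, 4.0, 400.0],
                 [50.0, 5.0, 500.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# Pretend the model predicted the scaled target for the last two rows
test_X = scaled[-2:, :]
yhat = np.array([[0.70], [0.95]])

# Rebuild an array with as many columns as the scaler was fit on:
# the prediction replaces the first column, the rest come from test_X
inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)
if inv_yhat.shape[1] != data.shape[1]:
    raise ValueError('column count must match the data the scaler was fit on')

# Invert the scaling and keep only the target column
inv_yhat = scaler.inverse_transform(inv_yhat)[:, 0]
print(inv_yhat)
```

The broadcast error reported above appears when the column check would fail, that is, when the array passed to `inverse_transform()` has a different number of columns than the array passed to `fit_transform()`.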

This seems to be the same issue I am having at the moment also. I concatenate my inv_yhat with my test_X like you said, but the shape of inv_yhat afterwards still does not match the second shape in the error (in the post’s case, (41,)).

Ask a question in stackoverflow and post the link, I should be able to help. I spent lots of time on this and have a decent idea now.

Yes, you’re right! I did that and it worked, nice! Thank you for your comment!

Glad to hear that Jack.

How did you solve the problem??

here’s link to solution on stackoverflow:

https://datascience.stackexchange.com/questions/22488/value-error-operands-could-not-be-broadcast-together-with-shapes-lstm

Nice!

I am having the same problem, but cannot solve the issue. Every time I try to concatenate them together, there is no change to my inv_yhat variable. I am still unable to understand this issue; if you can expand a bit more that would be amazing.

@John Regilina,

Check the shape of the data after you scale it, and then check the shape again after you do the concatenation. Remember, your `yhat` shape will be (rowlength,1), and after concatenation `inv_yhat` should have the same shape as the data had after you scaled it. Look at Dr. Jason’s answer to my comment/question. Hope that will help. (Thanks to Dr. Jason, this saved a lot of my time.)

Hello Sir, thank you for the awesome tutorial. But I still couldn’t understand what exactly needs to be done. I am getting the error:

> operands could not be broadcast together with shapes (12852,27) (14,) (12852,27) ”

This is the line which generates the error:

inv_yhat = scaler.inverse_transform(inv_yhat).fit()

Could you please give me a small example to understand what went wrong. Thanks in advance Sir.

I am also stuck with same thing. How did you fix it?

Same question here, how did everyone fix this? From your answers I cannot deduce what exactly went wrong in your case, and what you did to solve it.

Hi Jason, in dataset.drop('No', axis=1, inplace=True), what is the purpose of 'axis' and 'inplace'?

Great question.

We specify to remove the column with axis=1 and to do it on the DataFrame in memory with inplace=True, rather than return a copy of the DataFrame with the column removed.
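A small pandas sketch of the difference, using a made-up two-column frame in place of the real dataset:

```python
import pandas as pd

# Toy frame with a row-number column 'No', as in the raw dataset
dataset = pd.DataFrame({'No': [1, 2, 3], 'pm2.5': [129.0, 148.0, 159.0]})

# Without inplace, drop() returns a new DataFrame and leaves the original alone
dropped = dataset.drop('No', axis=1)
print(list(dataset.columns))  # ['No', 'pm2.5']
print(list(dropped.columns))  # ['pm2.5']

# With inplace=True, the original DataFrame is modified and None is returned
result = dataset.drop('No', axis=1, inplace=True)
print(result)                 # None
print(list(dataset.columns))  # ['pm2.5']
```

With axis=0 the same call would look for a row labelled 'No' instead of a column.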

Fabulous tutorials Jason!

Thanks Lizzie.

Can you show what the multivariate forecast looks like?

Looks like you missed it in the article.

Sure,

You can plot all predictions, but it’s a mess. You can instead plot only the last 100 time steps.

The predictions look like persistence.

Jason, what am I missing, looking at your plot of the most recent 100 time steps, it looks like the predicted value is always 1 time period after the actual? If on step 90 the actual is 17, but the predicted value shows 17 for step 91, we are one time period off, that is if we shifted the predicted values back a day, it would overlap with the actual which doesn’t really buy us much since the next hour prediction seems to really align with the prior actual. Am I missing something looking at this chart?

This is what a persistence forecast looks like, that is, value(t) = value(t-1).
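As a rough illustration, a persistence baseline can be scored in a few lines. The readings below are invented; with real data you would use the unscaled test series.

```python
import numpy as np

# Made-up hourly pollution readings
series = np.array([17.0, 20.0, 26.0, 24.0, 30.0, 28.0])

# Persistence: the forecast for time t is the observation at t-1
actual = series[1:]
predicted = series[:-1]

rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print('Persistence RMSE: %.3f' % rmse)
```

If a model's test RMSE is not clearly below this baseline, it has probably learned little beyond copying the previous value.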

So how would you get the true predicted value(t)? I am thinking of the last record in the time series where we are trying to predict the value for the next hour.

Sorry, I don’t follow. Perhaps you can restate your question?

Hello Jason Brownlee

Thank you for your great posts. I ran the model above on my data and it works perfectly. However, when I plot the real data (the blue one – inv_y) and the prediction (the orange one – inv_yhat), the result shows the prediction is delayed by 1 step. It should be predicted one step before, as in your graph. Your model is the same as the MATLAB tool:

https://nl.mathworks.com/videos/maglev-modeling-with-neural-time-series-tool-68797.html

After running the model, I applied it in real time to my problem to compute inv_yhat at every step. The result was really bad, since I never have the real inv_y. I fed the prediction back in as input (instead of the real data inv_y).

My problem is: I received some signals as inputs, then I labeled offline to have output (real data inv_y or the first column in train_X)

Do you have a model that trains without the real data in the first column? Thank you.

Your model may have low skill and be simply predicting the input as the output (e.g. persistence).

You may need to continue to develop your model, I list some ideas for lifting model skill here:

http://machinelearningmastery.com/improve-deep-learning-performance/

Hi, I have the same confusion as you. I think the prediction problem should be value_predict(t-1) = value_real(t). The label “train_y” indicates value_real(t+1). We input train_x(t) into the model to get the prediction, and the prediction should match “train_y”, not one step after “train_y”. Did you solve this problem?

It’s definitely similar to a persistence model since we trained the model using the `var1(t-1)` feature (i.e. the lagged pollution feature). The model certainly found that to be the strongest predictor. This would be ok if we were doing predictions later on an hour-by-hour basis. But if, say, we want to predict the pollution 20 hours from now, we aren’t yet going to know what the hour-19 pollution is. So it seems like cheating to include this variable in the training and prediction sets.

I removed this variable to train the model, leaving other parameters about the same, and was then only able to get a minimum validation loss of 0.55 and test RMSE of 87.02.

Nice work.

It’s not cheating, it comes down to different framings of the problem based on the requirements of the problem.

This post can help if you want to explore direct multi-step forecasting:

https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

It looks like the prediction is pretty good. Can we say the LSTM model is good?

I think LSTMs are poor at autoregression.

Hi, Jason. I have a question on the transform: I found the predicted data after inverse_transform() were not the same as the original values. For example, my original data is in the range from 0 to 850, but the prediction data is in the range 0 to 8. Is there any problem?

Perhaps there is a bug in your implementation?

Hi Jason

I have two questions:

(a) based on the graphs that you have shown for the y_inv and yhat_inv, it looks like your model has overfit on the test set. Don’t you agree ?

(b) In all time series prediction posts I have seen, the validation part uses the tail of the data to do validation (predict(yhat)). How can we modify the code in order to predict the future which is not covered in the dataset.

The model in this tutorial is probably underfit – e.g. it learned a persistence model.

Fit the model on all available data then call model.predict() to predict out of sample.

Wind dir is label encoded not wind speed!!!

Yes.

First of all, thanks. All of this material on the blog is super interesting, and helpful and making me learn a lot.

Of course… I have a question.

I’m surprised by the use of LSTMs here. The property of them being “stateful” I guess is being used. But is there “sequence” information flowing?

So when I used LSTMs in Keras for text classification tasks (sentence, outcome), each “sentence” is a sequence. Each observation is a sequence. It’s an ordered array of the words in the sentence (and its outcome).

In this example, I could not see a sense in which var1(t-1) is linked to var1(t-2). Aren’t they being treated as independent Xs in a regression problem? (predicting var8(t))

Correct, we are not providing a sequence of observations and therefore not getting good BPTT.

Based on my tests, I have found LSTMs to be poor at autoregression, and in this case, as I added more history to the model (longer sequences), performance degraded.

I would strongly encourage you to use an MLP as a baseline that any LSTM would have to out-perform.

See this post for more on the limitations of LSTM for time series:

https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Awesome article, as always.

Btw, what is your view on using an autoencoder/restricted Boltzmann layer to compress features before feeding an LSTM network? For example, if one has a financial time series to forecast, e.g. a classifier trying to predict increase or decrease in a look-ahead time window, via numerous technical indicators and/or other candidate exogenous leading indicators…

Could you write an article based on that idea?

I have seen better results from large MLPs, nevertheless, try it and see how you go.

Autoencoder/restricted Boltzmann layers also deal with multicollinearity issues… do MLPs also deal with multicollinearity in the features?

MLPs are more robust to multicollinearity than linear models.

Hi, I am always amazed at your article. Thank you.

I have a question.

Is this LSTM code now weighted for each feature?

Nowadays, I’m predicting precipitation; the trend is correct, but the amount is not right.

What’s wrong with that?:(

Thanks!

Sorry, I’m not sure I understand the question, perhaps you could rephrase it?

I can say that I would expect better skill if the data was further prepared – e.g. made stationary.

Hi Jason,

Thanks for wonderful explanation!

Could you please help me to understand dimensionality reduction concept. Should PCA or statistical approach be used before feeding the data to LSTM OR LSTM will learn correlation with the inputs provided on its own? how to approach regression problem in LSTM when we have large set of features?

Your reply is greatly appreciated!

Generally, if you make the problem simpler using data preparation, the LSTM or any model will perform better.

How can I predict a single input ?

for example :

[0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001]

How do I reshape and do a model.predict()?

Thank you

Perhaps this post will make it clearer:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

Thank you, Jason.

I applied:

my_x = np.array([0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001])

print(my_x.shape) # (8,)

my_x = my_x.reshape((1, 1, 8))

my_pred = model.predict(my_x)

print(my_pred)

The answer is the “scaled” answer which is 0.03436

I tried applying the scaler.inverse_transform(my_pred) to GET the actual number

But I get the following error:

non-broadcastable output operand with shape (1,1) doesn't match the broadcast shape (1,8)

Thank you

Yes, the transform requires data in the same form as when you “fit” it.
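Two ways to get a single prediction back into original units are sketched below. The training array is random toy data, the target is assumed to be column 0, and the scaler is assumed to be a `MinMaxScaler` with a (0, 1) range; none of this comes from the commenter's actual setup.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy training data with 8 columns; the target is assumed to be column 0
rng = np.random.default_rng(1)
train = rng.uniform(0.0, 850.0, size=(20, 8))
scaler = MinMaxScaler(feature_range=(0, 1)).fit(train)

# A single scaled prediction with shape (1, 1), like the output of model.predict()
my_pred = np.array([[0.03436]])

# Option 1: pad the prediction out to 8 columns, invert, keep column 0
padded = np.zeros((1, 8))
padded[:, 0] = my_pred[:, 0]
value_a = scaler.inverse_transform(padded)[0, 0]

# Option 2: invert column 0 by hand from the scaler's stored min and max
col_min, col_max = scaler.data_min_[0], scaler.data_max_[0]
value_b = my_pred[0, 0] * (col_max - col_min) + col_min

print(value_a, value_b)
```

Both options give the same number because min-max scaling is applied to each column independently; the padding values in the other columns do not affect column 0.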

Then what if I use multi-time step prediction? (use several lags for prediction)

The y_hat and X_test can not have the same dimension.

If the size of X or y must vary, you can use padding.

Hi Jason,

Thanks for the tutorial!

Maybe I missed something, but it seems that you provided the model with all of remaining data as ‘testdata’ and then tried predicting it? Isn’t that kind of pointless, since we should be interested in predicting unknown data in the future, instead of data that the model has already seen? Wouldn’t it make more sense to try the model to predict a first timestep into the future that neither the training nor the test data knew anything about? (Perhaps only give the model training data, but no test data, and afterwards ask it to predict first time step after training data?) How would I have to change the code to achieve that?

The model is fit on the training data, then makes a prediction for each step in the test data. The model did not “know” the answer to the test data prior to making each prediction.

Normally we would use walk-forward validation:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

I did use walk forward validation on other LSTM examples (use the blog search) but it confuses readers more than helps it seems.

Hi Jason.

I am digging into your example and maybe missing something because I agree with Fejwin.

I mean, as long as real Pollution in t-1 is introduced in the test_X set, instead of predicted Pollution in t-1, when you run model.predict(test_X) each output is not considered for future prediction.

That is, with all the features, including real Pollution(t-1), the model predicts an output: predicted Pollution(t). But on the next step, when the model predicts Pollution(t+1), it doesn’t take predicted Pollution(t), it takes real Pollution(t) instead.

Can you clarify this point please?

Thank you.

Yes, the assumption in the setup of the problem is that each prior hour’s pollution is available when predicting t+1.

You could change the framing of the problem if you wish.
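One alternative framing feeds each prediction back in as the lagged pollution value, so observed pollution is only needed for the first step. The sketch below is an illustration only: `predict_one` is a hypothetical stand-in for a trained model, and the two-feature input (lagged pollution plus a temperature value) is invented to keep the example small.

```python
import numpy as np

def predict_one(features):
    # Hypothetical stand-in for model.predict(): a fixed linear rule on
    # [pollution(t-1), temp(t-1)] so the example runs without Keras
    return 0.9 * features[0] + 0.1 * features[1]

def recursive_forecast(first_pollution, weather, predict_fn):
    # weather holds the known exogenous input for each forecast step
    forecasts = []
    last_pollution = first_pollution
    for temp in weather:
        yhat = predict_fn(np.array([last_pollution, temp]))
        forecasts.append(yhat)
        # Feed the prediction back in instead of the real observation
        last_pollution = yhat
    return np.array(forecasts)

forecasts = recursive_forecast(100.0, [10.0, 12.0, 11.0], predict_one)
print(forecasts)
```

The trade-off is that any error in one step is carried into every later step, so skill usually degrades as the forecast horizon grows.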

Hi Jason,

I applied your code to my real dataset and it worked fine all the way to getting predictions for the test dataset. But I’m stuck on how to get predicted values for the future beyond the max timestamp in the actual input dataset. I know one way is iteratively feeding each prediction back in as input, but I am concerned about the error getting bigger and bigger by continuing to use predicted values as the input.

Perhaps this will help:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

And this:

https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/

Can I use part of trainX to predict testY ? (lags needed to predict testY is in trainX) Not sure if it is a logical way to do it.

Yes.

Dear Jason Brownlee,

I have a little different question, Actually I have a sequence of characters as input and I want to project it into a multidimensional space.

I mean I want to project each sequence of chars (let’s say a word) to a vector of 100 real numbers along my corpus, so my input is a sequence of chars (any char-embedding is welcome) and my output is a vector for each sequence (which is a word), and I’m really confused about how to define the model.

I would appreciate if you give any clue help or sample code to define my model.

Thanks a lot in advance.

Keras provides an Embedding layer that you can use directly:

https://keras.io/layers/embeddings/

Hi,

I am also having trouble understanding the difference between the walk-forward validation (prediction) method, and the “simple” prediction method being carried out here in the example.

Why does the walk-forward prediction (with an appended history) give different predictions than simply calling predict on the test set, if the model is not re-fitted (that is, including the newly available observations and training again)?

Has the cumbersome walk-forward any advantage over this approach here in the example?

Can the walk-forward be carried out also for multivariate-multistep forecasting ?

Thanks,

Balint

Walk-forward validation simulates how we expect to use the model in practice, it evaluates the model under those conditions.

The procedure can be adapted based on how you want to use the model, e.g. when to refit, when new obs are available, how many steps to predict, etc.

You can learn more about walk-forward validation here:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
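The loop can be sketched as follows. This is a simplified illustration, not the code from the linked post: `fit_mean` and `predict_blend` are hypothetical toy functions standing in for fitting and querying a real model.

```python
import numpy as np

def walk_forward(train, test, fit_fn, predict_fn, refit=False):
    history = list(train)
    model = fit_fn(history)
    predictions = []
    for obs in test:
        predictions.append(predict_fn(model, history))
        # The true observation becomes available and joins the history
        history.append(obs)
        if refit:
            model = fit_fn(history)
    return np.array(predictions)

# Hypothetical "model": the mean of the history at fit time,
# blended with the latest observation at prediction time
def fit_mean(history):
    return float(np.mean(history))

def predict_blend(model, history):
    return 0.5 * model + 0.5 * history[-1]

train = [10.0, 12.0, 14.0]
test = [16.0, 18.0]
no_refit = walk_forward(train, test, fit_mean, predict_blend)
with_refit = walk_forward(train, test, fit_mean, predict_blend, refit=True)
print(no_refit, with_refit)
```

Without refitting, each prediction depends only on the fixed model and the observed history, which is the same information a single predict call over lagged test inputs would use. The results diverge once the model is refit as new observations arrive.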

Hey, thanks for the quick answer.

So as far as I see your point, the walk forward approach, without refitting the model at each iteration, is the same as calling model.predict(X_test) at once.

And the reason why you still implement it without refitting, is to provide the framework properly, and make it easier for us to work further with it, right ?

If I am wrong, and it is not the same, why is it not the same? I went through many of your posts, including the one you posted, but I didn’t manage to comprehend the difference, if there is any, so far.

For example: https://machinelearningmastery.com/update-lstm-networks-training-time-series-forecasting/

Here you explain the updating, which is awesome, but in the baseline part, where you do not apply updating (so no iterative re-fit), you still do iterative walk-forward predicting instead of calling model.predict() on the test set as a whole. Would that be the same in the no-update case?

Sorry for being annoying. I really appreciate your help, and time.

Many thanks

Balint

Probably.

Sometimes I like to drive the epochs manually for lots of reasons – e.g. so I have more control over the process/do things in between epochs.

We use walk-forward validation as it is the only valid approach for evaluating models on sequence data:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

Hi Jason,

Thanks for the wonderful tutorial!

Could you please explain how to deal the problem when situation is “Predict the pollution for the complete month (assume month has 30 days. t+1…t+30) and given the “expected” weather features for that month…assuming we have been provided historic data of pollution and weather data on daily basis”

How should the data be prepared and how should it be fed into the LSTM?

As I am new to LSTM models, I have problems understanding the data preparation and feeding to the LSTM.

Thanks in advance for your response

Predicting for a month is called multi-step forecasting.

Here is a post on the general approach:

https://machinelearningmastery.com/multi-step-time-series-forecasting/

Here is an example of doing multi-step forecasting with an LSTM:

https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

Hi Jason,

Thanks for sharing. I added accuracy info to the model while training using metrics=['accuracy'].

So model.compile(loss='mae', optimizer='adam') becomes:

model.compile(loss='mae', optimizer='adam', metrics=['accuracy'])

This adds acc & val_acc to output. After 100 epochs the acc value appears quite low : (0.0761) :

Epoch 100/100

1s – loss: 0.0143 – acc: 0.0761 – val_loss: 0.0132 – val_acc: 0.0393

The accuracy of the model appears very low ? Is this expected ?

Further info on acc & val_acc values : https://github.com/tflearn/tflearn/issues/357 “acc is the accuracy of a batch of training data and val_acc is the accuracy of a batch of testing data.”

This is a regression problem. Accuracy does not make sense.
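A small illustration of why a match-based score says little about real-valued predictions. The numbers are made up, and the exact-match rule below is a simplification for illustration, not Keras's implementation of its accuracy metric.

```python
import numpy as np

# Made-up scaled targets and predictions
y_true = np.array([0.120, 0.250, 0.310, 0.400])
y_pred = np.array([0.118, 0.255, 0.310, 0.390])

# A match-based score: only exactly equal values count as correct
accuracy = np.mean(y_true == y_pred)

# Error-based metrics measure how far off the predictions are
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print('accuracy=%.2f mae=%.4f rmse=%.4f' % (accuracy, mae, rmse))
```

The predictions here are close to the targets, which MAE and RMSE reflect, while the match-based score stays low.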

Hi Jason, I’ve recently discovered your site and have been so pleased with your information – thank you. I’ve been trying to model data which is much like the air quality data described here, but every few time steps there will be a change in the number of features present.

Example: in my data a time step = 1 day and a sequence can be 800 – 1200 days long. Normally the data consists of features

– pm2.5: PM2.5 concentration

– DEWP: Dew Point

– TEMP: Temperature

– PRES: Pressure

– cbwd: Combined wind direction

– Iws: Cumulated wind speed

– Is: Cumulated hours of snow

– Ir: Cumulated hours of rain

But then every (random-ish amount of time) there will be an additional number of features for a day and then back to the baseline number of features.

I’ve no idea on how to handle variable feature length. I’ve seen and played with plenty of variable sequence length examples, but I have both variable sequenceS and features. I’d love your input!

Thanks!

-Eric

You will need to normalize the number of features to be consistent for all time.

Is it possible to use (what in TensorFlow – land is called) SparseFeatures or SparseTensors to represent sparse datasets, or is there a fundamental issue with handling sparse datasets within RNNs?

Good question, I’m not sure off the cuff. Keras may support sparse numpy arrays – try it and see?

Hi Jason,

Thanks for the amazing articles. They are really helpful.

Let’s say I want to forecast with lead 2. By that I mean forecasting values at time t using t-2 values, without using t-1 elements. I have to remove columns from reframed after running the function series_to_supervised, right? To remove all columns with values at t-1?

reframed.drop(reframed.columns[…])

Thanks

Yep, looks good.
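One way to select those columns by name rather than by position is sketched below. The toy frame and its values are invented, and the `varN(t-k)` column naming is assumed to match what series_to_supervised produces.

```python
import pandas as pd

# Toy frame using the assumed 'varN(t-k)' column naming
reframed = pd.DataFrame({
    'var1(t-2)': [0.1, 0.2, 0.3],
    'var2(t-2)': [1.0, 1.1, 1.2],
    'var1(t-1)': [0.2, 0.3, 0.4],
    'var2(t-1)': [1.1, 1.2, 1.3],
    'var1(t)':   [0.3, 0.4, 0.5],
})

# Select every column whose name ends with '(t-1)' and drop it
lag1_cols = [c for c in reframed.columns if c.endswith('(t-1)')]
lead2 = reframed.drop(columns=lag1_cols)
print(list(lead2.columns))  # ['var1(t-2)', 'var2(t-2)', 'var1(t)']
```

Selecting by name avoids off-by-one mistakes when the number of variables or lags changes.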

Hello!

Thanks for articles.

I have a question related with time series. Is it possible to forecast all variables? For example, I have ‘pollution’, ‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’ and want to predict all of them for the next hour. We know about trends and common rules (because of data amount: few years), so we can do forecasting. Where can I find more info about it?

Yes, this example can be modified to predict each variable.

Thank you Jason for the great tutorial! I’m adapting it for different data, and I’m trying to use >1 time step. However, I noticed something strange in series_to_supervised: since the first loop ends at 0 and the last loop starts at 0, won’t there be two columns that are the same?

No, try it with the data and see.
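A quick check of the loop bounds shows why. This sketch assumes the two loops are `range(n_in, 0, -1)` for the input lags and `range(0, n_out)` for the forecast steps; the function itself is not reproduced here.

```python
n_in, n_out = 3, 2

# Input sequence: lags t-n_in ... t-1 (the range stops before reaching 0)
input_lags = ['t-%d' % i for i in range(n_in, 0, -1)]

# Forecast sequence: t, t+1, ... (the range starts at 0)
output_steps = ['t' if i == 0 else 't+%d' % i for i in range(0, n_out)]

print(input_lags)    # ['t-3', 't-2', 't-1']
print(output_steps)  # ['t', 't+1']
```

Because the stop value of a Python range is exclusive, the first loop never produces a shift of 0, so no column is generated twice.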

Hi Jason,

Thanks for the tutorial. I had just one question though.

I’ve seen tutorials using multivariate time series to train on a lot of datasets (all correlated with each other) at the same time, which were able to predict for each dataset used.

For the sake of argument, let’s say that one of the datasets is broken: the sensor that gets the information to feed it is out of service (let’s say at some point one of the columns of data only has 0 instead of the real value). Do you think that we could use the other spots to continue to predict the broken one? (There is correlation between them and there would be a lot of non-broken data from before the bug.)

Best regards,

Yes, you could try it and see. Or impute the missing data and see if that is better.
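As one possible way to impute, the sketch below marks the zero readings as missing and fills them by linear interpolation with pandas. The sensor names and values are invented, and it assumes 0 is never a valid reading for the broken sensor.

```python
import numpy as np
import pandas as pd

# Toy sensor readings where 0 marks the outage
df = pd.DataFrame({'sensor_a': [5.0, 6.0, 0.0, 0.0, 9.0],
                   'sensor_b': [50.0, 61.0, 69.0, 81.0, 90.0]})

# Mark the outage as missing, then fill it by linear interpolation
df['sensor_a'] = df['sensor_a'].replace(0.0, np.nan).interpolate()
print(df['sensor_a'].round(3).tolist())
```

Interpolation only uses the broken sensor's own neighbours in time; a model that predicts the missing column from the correlated sensors is another option worth comparing.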

Thank you Jason,

I shall try that as soon as possible. I guess that the overall accuracy will be lower for every set prediction (since my goal is to use multivariate, feed it every spot’s data set and predict each of them (with the possibility to predict a broken one)), so one spot being fed “wrong” data should lower each spot’s accuracy, no?

Best regards,

It will.

Is there any time parser like date parser? I am working with data which is in milliseconds.

It can handle parsing dates and times I believe.
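For example, pandas can parse millisecond timestamps either from strings or from integer epoch values. The timestamps below are made up.

```python
import pandas as pd

# Made-up string timestamps with millisecond precision
raw = pd.Series(['2014-01-01 00:00:00.250', '2014-01-01 00:00:00.750'])
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S.%f')
print(parsed.dt.microsecond.tolist())  # [250000, 750000]

# Integer epoch timestamps in milliseconds can be parsed with unit='ms'
epoch_ms = pd.Series([0, 1500])
from_epoch = pd.to_datetime(epoch_ms, unit='ms')
print(from_epoch.tolist())
```

The parsed column can then be set as the index, in the same way the tutorial uses a date-time index.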

I got this error when I tried to run the program:

pyplot.plot(history.history['val_loss'], label='test')

KeyError: 'val_loss'

Ensure you copy all of the code.
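A common cause of this error is calling model.fit() without validation data, in which case Keras records no 'val_loss' entry. The sketch below uses a plain dictionary as a stand-in for history.history to show a defensive check; the loss values are invented.

```python
# Stand-in for history.history from model.fit() called WITHOUT validation_data
history_dict = {'loss': [0.05, 0.03, 0.02]}

# Only use the curves that were actually recorded
recorded = []
for key, label in (('loss', 'train'), ('val_loss', 'test')):
    if key in history_dict:
        recorded.append(key)
        print(label, history_dict[key])
    else:
        print("'%s' was not recorded; pass validation_data to model.fit()" % key)
```

In the tutorial's code, validation_data=(test_X, test_y) is passed to model.fit(), which is what makes the 'val_loss' key available.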

Hi Jason,

Wouldn’t it be better to scale the data after you run the series_to_supervised function? As it stands now, the inverse scaling doesn’t work if n_in > 1 since the dimensions don’t line up anymore.

It would, but the scaling would be column-wise and incorrect.

Could you expand more on this and how the code might be modified to incorporate multi-step? I’m also playing around with turning this into a classification problem, would it still work if the feature we are trying to predict is a classifier?

I give the code to do this in another comment.

For classification, you will need to change the number of neurons in the output layer, the activation function in the output layer and the loss function.

I have a little question. I’ve successfully built my own LSTM multivariate NN using your code as a basis (thanks!). It forecasts export growth for the UK using past export growth and GDP. It performs decently but the financial crisis kinda messes things up.

Now I want to add data to this model, but I can’t go further back than 1980 for the time series (not for now at least). So what I want to do is add the GDP growth rate of all the UK’s major trading partners. Should I be worried about adding another 20 input neurons (e.g. countries)? Do you have a post talking about the risks of using data that is low in rows (e.g. years) but high in columns (e.g. inputs)?

I hope my question makes sense.

Cheers

I don’t have posts on the topic of more columns than rows. It does require careful handling.

As a start, I would recommend developing a strong test harness, then try adding data and see how it impacts the model skill. Experiment.

Jason

Thanks a lot for your tutorial!

Is there a feature importance plot for cases like this?

Sometimes it is very important to know.

Good question. I’m not sure about feature importance plots for LSTMs. I would expect that if feature importance can be calculated for MLPs, then it could be calculated for LSTMs, but this is not something I have looked into, sorry.

Thanks a lot, Jason!

No problem.

Hi Jason,

Great post as always!

I have a question regarding scaling. My problem is quite different as I have to apply series to supervised function first on the data coming from different source and then combine the data… my question is, can I apply scaling at the end? Should scaling be applied column wise or on complete matrix/array?

The key is being able to scale the data consistently. The place in the pipeline is less important.

Hi Jason thank you very much for your tutorials!

I’m trying to develop an LSTM for time prediction having as input 3 features (2 measurements and a third one is a sort of control of the system) and the output (value to predict) is not a single value but a vector of 6 values. So, at every time step my network should be able to predict this entire vector. Two questions:

1. Since my inputs are not correlated between them, their order in the input array will not influence my predictions?

2. How can I shape my output in order to estimate all the 6 values of the vector for each time step?

Thanks for any kind of help!

This post will help you understand how to prepare data for multi-step forecasting:

https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

I replicated the example described on this page, and saved my test_y and yhat vectors to csv so that I could manually check how my prediction compared with the true values. However, when I did this, I discovered that every yhat value in my array is the exact same value (~34). I was expecting a unique yhat value for each input vector. Do you have any suggestions to help fix this?

Follow up on this — when this error arose, I was using my own data set that I want to perform time series forecasting on. When I duplicated the guide exactly as described above, the issue goes away. Do you have any idea why this issue comes up (where every predicted yhat value is the exact same) when I use a different data set?

Perhaps the model needs to be tuned to your specific dataset?

Hi Jason, thank you very much for your tutorials! I tried deleting the columns [‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’] from the train_X data, and I get almost the same test RMSE: 26.461. This seems to show that the 8 weather conditions have no effect on the prediction result. The code is below.

# fit an LSTM network to training data

def fit_lstm(train, test, batch_size, neurons):

# split into input and outputs

train_X, train_y = train[:, 0:1], train[:, -1]

test_X, test_y = test[:, 0:1], test[:, -1]

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network

model = Sequential()

model.add(LSTM(neurons, input_shape=(train_X.shape[1], train_X.shape[2])))

model.add(Dense(1))

model.compile(loss='mae', optimizer='adam')

# fit network

history = model.fit(train_X, train_y, epochs=50, batch_size=batch_size, validation_data=(test_X, test_y), verbose=2, shuffle=False)

#history = model.fit(train_X, train_y, epochs=50, batch_size=72, verbose=2, shuffle=False)

return model

# make a prediction

def make_forecasts(model, test_X):

test_X = test_X[:, 0:1]

test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

forecasts = model.predict(test_X)

return forecasts

Nice one!

The real motivation for me writing this post was to help the 100s of people asking how to develop a multivariate LSTM.

This is more substantial than I think is being acknowledged. What is the point of creating a multivariate LSTM if all of the other variables don’t have an impact on the outcome? Has this been attempted with other datasets?

It is an example for those who want to explore the approach.

I don’t have more examples because it turns out the method is outperformed by MLPs for autoregression problems. At least in my experience.

Even when we are looking at multivariate time series forecasting?

It really depends.

I recommend this framework:

https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/

Hi Dr. Brownlee,

You mentioned that MLPs usually perform well on autoregression problems. Do you have any post with example code for that? Thanks.

Yes, many examples – use the search box.

Perhaps start here:

https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/

Can you explain why the train_X and test_X data sets are reshaped to this?

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

The shape is: samples, time steps, features.
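A small sketch of that reshape, using a made-up array rather than the tutorial's data: the 2D array of [samples, features] gains a middle axis of length 1, giving [samples, time steps, features].

```python
import numpy as np

# Hypothetical 2D input: 5 samples (rows), 8 features (columns).
train_X = np.arange(40, dtype="float32").reshape(5, 8)

# Insert a time-step axis of length 1: [samples, features] -> [samples, 1, features].
train_X_3d = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

print(train_X.shape)     # (5, 8)
print(train_X_3d.shape)  # (5, 1, 8)
```

The values are unchanged; only the array's dimensions are rearranged to match what the LSTM layer expects.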

Hi Jason

Great post.

Suppose I want to predict the next 24 hours using the previous year of data. How can we do it?

Thanks

I give an example in another comment.

Also, generally, see this post on multi-step forecasting with LSTMs: https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

I think I’m missing something fundamental in my understanding of LSTM/s and BPTT. I’ve read through many of your posts and have come to understand RNN’s and LSTM in particular much better because of them, so thank you for that!

My question that I hope you can shed some light on is what is the difference between passing the past information, i.e. var(t-n)…var(t-1) in the input vector for a single sample, and passing multiple sequences, of length n as a single sample?

To help clarify, using timesteps of length N, I have a configuration that looks like this:

Input to LSTM is [samples, timesteps, features].

Each sample/observation consists of a vector of timesteps (of size N+1), where each entry holds the input features’ values at that timestep, i.e.

Observations for each time t, with features f and r

[

time t

[

[ f(t-N) r(t-N) ]

[ f(t-N+1) r(t-N+1) ]

[ f(t-N+2) r(t-N+2) ]

. .

. .

. .

[ f(t) r(t) ]

]

]

And for each observation/sequence the target is Y(t).

Or, as many of your examples do, you can include the past information in the form of a windowed input, with a single time step, so something like:

Input is [samples, 1, features]. So for every observation, we include previous time values as features

Observations for each time t, with features f and r

[

time t

[

[ f(t-N), r(t-N), f(t-N+1), r(t-N+1), f(t-N+2), r(t-N+2), f(t), r(t) ]

]

]

And again, for each observation, the target is Y(t).

I understand that having sequences longer than 1 allows BPTT to work over the length of those sequences, but I don’t think I really understand the difference in these two methods.

I have tried the two options described, and I find that the latter performs better based on preliminary tests. I can use a window size of 3 and a sequence length of 1 and get good results, but if I use the first approach and a window size of 12, the model actually fails to learn within the same amount of time.

Hence, I wonder if I have a fundamental misconception. If you have some time, I would like to hear your explanation of this difference and how the LSTM responds in terms of “memory” based on these two different types of input setup. (I have read a lot of articles, blogs, GitHub issues, and Stack Overflow posts trying to wrap my head around this, but I haven’t found anything that addresses this directly.)

Thanks!

Generally, the multiple steps for one sequence are required for BPTT:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

Without the history, the training will not have sufficient context to estimate the error gradient and your model will learn a function mapping rather than a sequence prediction problem.

Does that help?
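To make the two framings from the question concrete, here is a rough sketch with made-up numbers. It assumes each flat row is ordered oldest to newest, with all features for one time step kept together, so the same row can be reshaped either way.

```python
import numpy as np

n_steps, n_features = 3, 2  # hypothetical window length and feature count

# Four samples, each a flat row: [f(t-2), r(t-2), f(t-1), r(t-1), f(t), r(t)].
flat = np.arange(4 * n_steps * n_features, dtype="float32")
flat = flat.reshape(4, n_steps * n_features)

# Framing 1: a single time step whose feature vector holds the whole window.
as_features = flat.reshape((flat.shape[0], 1, n_steps * n_features))

# Framing 2: one row per time step, so the LSTM steps through the window.
as_timesteps = flat.reshape((flat.shape[0], n_steps, n_features))

print(as_features.shape)   # (4, 1, 6)
print(as_timesteps.shape)  # (4, 3, 2)
print(as_timesteps[0])     # rows are (f, r) pairs ordered oldest to newest
```

Only the second framing gives the network a sequence longer than one step to backpropagate through; the first presents the window as one wide feature vector.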

With this line…

# drop columns we don’t want to predict

reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

I don’t understand the numbers used here. The data doesn’t even have that many columns, does it? There are 8 feature columns and 1 index column.

I’m adapting this code for my own use and have very different features but I’m not sure I’m getting that line adapted right.

Thanks for the great post!

Nevermind! I figured it out.

Glad to hear it Paul.

It does have that many columns after we reshape it to be a supervised learning problem.
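A sketch of where the indices 9 to 15 come from, assuming 8 variables and 1 lag step. The framing below uses plain pandas `shift` as a stand-in for the tutorial's series_to_supervised function, and the data is made up. Computing the indices from `n_features` and `n_lag` may be easier to adapt than hard-coding them.

```python
import numpy as np
import pandas as pd

n_features = 8  # variables in the raw dataset
n_lag = 1       # lagged time steps used as input

# Stand-in for the scaled dataset: 5 rows of 8 variables.
data = pd.DataFrame(np.arange(5 * n_features).reshape(5, n_features))

# Frame as supervised learning: all variables at t-1, then all variables at t.
lagged = data.shift(1)
lagged.columns = ["var%d(t-1)" % (i + 1) for i in range(n_features)]
current = data.copy()
current.columns = ["var%d(t)" % (i + 1) for i in range(n_features)]
reframed = pd.concat([lagged, current], axis=1).dropna()
print(reframed.shape[1])  # 16 columns: 8 inputs + 8 candidate outputs

# Keep var1(t) (the target) and drop the other variables at time t.
first_output = n_lag * n_features  # index 8 is var1(t)
drop_idx = list(range(first_output + 1, reframed.shape[1]))
print(drop_idx)  # [9, 10, 11, 12, 13, 14, 15]

reframed = reframed.drop(reframed.columns[drop_idx], axis=1)
print(list(reframed.columns))
```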

This is awesome!

Helping me a lot in my real work!

Thanks, I’m glad to hear that.

Hi Dr. Jason, I am working on a project for sleep stage classification where the number of timesteps (observations) in the input series (ECG signal) is different than the number of timesteps in the output series (sleep stage scores).

The issue here is that the input and output time series are not equal in terms of timesteps as the examples you have shown in your problems.

I have tried to frame the problem in different ways without getting results that make sense. Could you please provide guidance on how to approach this problem?

Thanks,

Vilmara

Generally, I would recommend an encoder-decoder model:

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

Hi Jason,

How can we predict multiple features as output when we also have multiple features as input? For example, the input variables are temperature and humidity, and we want to predict both temperature and humidity. Can we solve this with a single LSTM model?

Thanks for your anticipated response.

Yes you can. Change the multivariate input model to output more than one value in the output layer.

Hi Jason,

Thank you for taking the time to write such an excellent post and follow up with questions. The mechanics of the data conversion & training work great.

However, my first reaction is that the LSTM doesn’t seem to have learned anything more than to copy the previous value. As BECKER states:

> it looks like the predicted value is always 1 time period after the actual?

These are the same results as in your Shampoo example: the predicted value appears to be equal to the previous value (possibly with some constant offset).

Have you found a different network architecture that performs better than a DNN without LSTM layers?

Agreed, LSTMs do not seem to be very good for autoregression. I would generally recommend using an MLP with a window for time series forecasting instead.

See this post:

https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Hi Jason,

Would like to understand how to go about when the problem statement is framed like below.

Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour. And this is to be done for next n days at hourly level, ie n * 24 time steps in the future with other variables given at those time steps.

Hope you can point out to some resources and if LSTM would be a good way to go for this formulation.

Thanks,

Avinish

You may need a multi-input model, e.g. one input for the sequence, and one for the static data, this will help:

https://machinelearningmastery.com/keras-functional-api-deep-learning/

Thank you so much Jason for the wonderful article, I learnt a lot. I wanted to show a comparison between multivariate statistical methods and neural networks, and I was looking for a post or article on multivariate time series modeling using ARIMA. I would be glad to hear of anything you know on this.

Thank you

You will need to look into using SARIMAX, sorry I do not have an example at this stage.

Hi Jason, is there any library available to perform feature extraction/dimensionality reduction for a sequential LSTM model?

Often an embedding layer is used to project observations at each time step prior to feeding them into the LSTM.

How does multivariate LSTM compare to Multivariate ARIMAX? Are there use cases where one model outperforms the other?

I would recommend using a linear model first and only moving to a neural net if it delivers better results on your specific problem.

Hello,

There is a problem with scaling back when we use more than one shift in time. I mean something like this:

reframed = series_to_supervised(scaled, 6, 1)

I can train and test the model, but some errors appear in the scaling-back section which I couldn’t fix.

Please have a look. I really appreciate it.

Hi Jason, thanks for the great series of articles. How should I modify the LSTM code to change it from prediction to classification?

One sample of input data is 60 time steps over 2 features, and I want to classify the 60-step input sequence into 3 classes. To start with, is an LSTM the right approach?

Hoping that you take requests, I would definitely love to see an article on multivariate classification in Keras using LSTM/GRU. It would be really helpful for analyzing sensor data. You could look at the Human Activity Recognition dataset.

Change the loss function and the activation function of the output layer to categorical_crossentropy and softmax respectively.

Hi Jason, thanks for your nice article.

I have a question!

That algorithm is many-to-one, right?

How can I solve many-to-many? For example, I want to predict pollution and rain.

It is many-to-one in terms of features.

You can change it to be many-to-many by outputting multiple features.

3 Things:

1) Thanks so much for this. I’ve used this as a basis for some code I’m writing and it gave me a great head start.

2) One thing that would be great to help with understanding the meanings of variables you’re using is to first put them into variables rather than using the integers. For example,

x_size = 1

train_X, train_y = train[:, :-x_size], train[:, -x_size:]

test_X, test_y = test[:, :-x_size], test[:, -x_size:]

This way, as people are reading the code they understand why it’s “-1” in case their adapted usage has different dimensions, they can change one variable and have it used everywhere it’s needed.

3) For instance, I’m trying to make this code output multiple predictions and am having a bit of trouble figuring out all the variables I need to change.

I have 368 columns of data, the first 168 are what will be predicted based on the other 200 points.

x_size = 200

# split into input and outputs

train_X, train_y = train[:, :-x_size], train[:, -x_size:]

test_X, test_y = test[:, :-x_size], test[:, -x_size:]

# reshape input to be 3D [samples, timesteps, features]

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network

model = Sequential()

model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

model.add(Dense(1))

I get the error:

ValueError: Error when checking target: expected dense_1 to have shape (None, 1) but got array with shape (659, 200)

Should the Dense(1) be Dense(x_size) where for me that is 200? (this is why it would be great to use variables so I know what that 1 means). When I try it as 168 (which is what it seems like it should be), I get an error.

When I switch to x_size, it actually runs without errors, but I’m not sure if that means I’m correct or not.

I’m so confused.

Thanks!

I have an example of multiple timestep outputs here that you could use as a starting point: https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/
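For the layout described in the comment above (368 columns, the first 168 to be predicted from the other 200), a rough sketch of the split with named sizes follows. The array is random stand-in data and the names `n_outputs` and `n_inputs` are hypothetical. Note that slicing with `-x_size` where `x_size = 200` takes the last 200 columns as the target, which is why the error reports a target of shape (659, 200).

```python
import numpy as np

n_outputs = 168  # columns to predict (the first block, per the comment above)
n_inputs = 200   # columns used as input (the remaining block)

# Stand-in for the prepared array: 659 rows, 368 columns.
train = np.random.rand(659, n_outputs + n_inputs)

# Targets are the first 168 columns; inputs are the last 200.
train_y = train[:, :n_outputs]
train_X = train[:, n_outputs:]

# Reshape input to be 3D [samples, timesteps, features].
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

print(train_X.shape, train_y.shape)  # (659, 1, 200) (659, 168)
```

With targets shaped this way, the output layer would need one unit per target column, e.g. `Dense(n_outputs)` rather than `Dense(1)`, so that the prediction width matches the second dimension of `train_y`.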

Rather than trying to predict many timestep outputs, I’m looking to output multiple predicted values per timestep.

One thing I don’t understand is this section:

# invert scaling for forecast

inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)

inv_yhat = scaler.inverse_transform(inv_yhat)

inv_yhat = inv_yhat[:,0]

Why is it inserting the yhat values as the *first* column? The scaler has a different scale per column so positioning is important, and the Y data had been the last column in the row, hadn’t it? So won’t it get scaled incorrectly?

The first column is the pollution value, we remove it from the test data, concat our prediction so we have enough columns for the transform’s expectations, then invert the transform and get the predicted pollution values in the correct scale.

Does that help?
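A toy sketch of that inversion step, with a made-up 3-column dataset in which column 0 plays the role of pollution. The fitted scaler expects the same column order it was fit on, so the prediction goes back into column 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw data: column 0 is pollution, columns 1-2 are other variables.
raw = np.array([[10.0, 1.0, 100.0],
                [20.0, 2.0, 200.0],
                [30.0, 3.0, 300.0],
                [40.0, 4.0, 400.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)

# Pretend the model predicted these scaled pollution values for the last two rows.
yhat = np.array([[0.5], [1.0]])
test_X = scaled[-2:, :]  # scaled rows with pollution still in column 0

# Put the prediction where pollution belongs (column 0) and keep the other columns.
inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)[:, 0]
print(inv_yhat)  # approximately [25. 40.]
```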

First of all ,thanks a lot for the great tutorial Jason.

I just have one question regarding the achieved predictions using the LSTM network.

I just don’t understand why are you making “trainPredict = model.predict(trainX)” .

I understand calling the predict method on the test set testX, but isn’t calling it on trainX cheating in some way? I say this because we train the network using trainX and trainY, and trainY holds the labels you are trying to predict when calling predict on trainX.

Is it performed for validation purposes only?

I’m still learning to work with the Keras API so I might be confused with the syntax of it

Many thanks

Where am I doing that exactly?

Jason

Thanks a lot for your tutorial!

I still have a question and look forward to your answer.

If I want to use feature(t), feature(t-1) and pollution(t-1) to predict pollution(t), how should I reshape my input?

Hi Jason, Thank you very much for the wonderful post. I have a few questions.

1. You did not de-trend using diff in the above example. The diff approach from the multi-step post only works for a single series. Can you please share how we can de-trend a multivariate time series?

2. I’d like to use past 3 days of above data to predict 3 time steps for multivariate data as above. Can you please let me know how I can do that with the example above?

Thanks for your help.

You could de-trend each input series separately. Here is an example of using diff to detrend:

https://machinelearningmastery.com/remove-trends-seasonality-difference-transform-python/

I give an example in another comment of how to use multiple lag obs as input.
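As a minimal sketch of de-trending each input series separately, `np.diff` along the time axis differences every column independently. The data is made up, and keeping the first observation is what allows the transform to be inverted later.

```python
import numpy as np

# Hypothetical multivariate series: 5 time steps, 2 variables, each with a trend.
data = np.array([[1.0, 10.0],
                 [2.0, 12.0],
                 [4.0, 14.0],
                 [7.0, 16.0],
                 [11.0, 18.0]])

# Difference along the time axis; each column is handled independently.
diffed = np.diff(data, axis=0)
print(diffed)

# Invert: add the cumulative differences back onto the first observation.
restored = np.vstack([data[:1], data[0] + np.cumsum(diffed, axis=0)])
print(np.allclose(restored, data))  # True
```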

Hi, Jason. First of all, many thanks for your post. I have some problems.

1. I don’t really get the meaning of hidden_units? Can you please explain a little bit.

2. I am building an LSTM network as you do. I followed your approach to build the network but got an error, as described here: https://stackoverflow.com/questions/46811085/dimension-error-building-lstm-with-keras. Could you please help me?

Thanks!!

A hidden unit is a neuron or cell in a hidden layer.

A hidden layer is a layer that is not the output or the input layer.

Change your code to set “return_sequences” to be “False”.

So in your example you are using the data this way:

No,year, month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir

1, 2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0

Is it possible to use the data in a way that, let’s say, we could have multiple input numbers in one of the columns? For example, having

No, year, month, day, hour, pm2.5, newVariable

and in the new variable position instead of having just one integer like 20

to have a sequence of integers like (5,10,3,50,23)

Would that be possible using it on the same context, or is there any scenario that we could

use the data the way I mentioned ?

If you mean, can you predict a sequence output, then yes. Here is an example: https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

I might have not been clear enough, and sorry for that.

What I mean is that as input I will have 4 different categories of data, let’s call them A, B, C, and D, and each one of them will have more than one integer; to be exact, they will have 10 integers.

so for example:

A = {3,4,6,8,34,65,43,1,54} and so on with the other three categories.

The sequence of numbers within the four categories belong on different time stamps, for example 3 -> t0 , 4-> t1 and so on.

So what I need is to classify them for different data samples.

These would be parallel series (columns) that could be all fed to one LSTM model like the example in the above tutorial.

The model will process the parallel series one time step at a time.

If the series extends beyond 200-400 time steps, then they could be split into multiple samples (e.g. multiple sub-parallel series).

Does that help?
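A rough sketch of splitting long parallel series into multiple samples, using random stand-in data for four series and an assumed limit of 200 time steps per sample:

```python
import numpy as np

n_steps = 200  # assumed maximum number of time steps per sample

# Hypothetical data: 4 parallel series (A, B, C, D), 1,000 time steps each.
series = np.random.rand(1000, 4)

# Drop any leftover rows so the length divides evenly, then cut into subsequences.
n_samples = series.shape[0] // n_steps
samples = series[:n_samples * n_steps].reshape((n_samples, n_steps, series.shape[1]))

print(samples.shape)  # (5, 200, 4) -> [samples, timesteps, features]
```

Each sample is a contiguous block of 200 time steps with all four series kept side by side as features.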

So so helpful, I tried it and worked like a charm.

Great job, and so helpful all the material you provide, and the way you do it !!

Thanks a lot Jason !!

I’m glad to hear that, well done!

Really appreciate all the work you have done!

Thanks Tim.

Hi Dr Brownlee. Thank you for this tutorial.

inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)

inv_yhat = scaler.inverse_transform(inv_yhat)

What do these steps do?

Because I am getting a ValueError: operands could not be broadcast together with shapes (1822,11) (6,) (1822,11) on this step.

I am applying on my own dataset

These steps add the prediction to the test input data so that we can inverse the transform and get the prediction back into the scale we care about.
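The broadcast error in the question (shapes (1822,11) and (6,)) suggests the array passed to `inverse_transform` has 11 columns while the scaler was fit on 6. One possible workaround, which is not the tutorial's code, is to pad the prediction to the width the scaler expects. MinMaxScaler inverts each column independently, so placeholder values in the other columns do not affect column 0. The helper name `invert_first_column` and the data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def invert_first_column(scaler, values):
    """Invert scaling for values that belong to column 0 of the fitted data."""
    n_cols = scaler.scale_.shape[0]           # width the scaler was fit on
    padded = np.zeros((len(values), n_cols))  # other columns are placeholders
    padded[:, 0] = np.ravel(values)
    return scaler.inverse_transform(padded)[:, 0]

# Hypothetical data with 6 columns; the target is column 0 (range 0 to 50).
raw = np.column_stack([np.linspace(0.0, 50.0, 11)] + [np.arange(11.0)] * 5)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(raw)

yhat = np.array([[0.0], [0.5], [1.0]])  # scaled predictions from a model
restored = invert_first_column(scaler, yhat)
print(restored)  # values back on the 0-50 scale
```

This assumes the target was the first column when the scaler was fit; a different target position would need a different column index.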

Hi Abhinav,

I am facing a similar problem. What did you do to rectify it ?

Thanks

Hi Jason,

Thanks for sharing your awesome work, I’ve been learning a lot from you!

I have been struggling with increasing the second dimension to fully benefit from BPTT, though. I keep getting lost in the shapes. Would you mind sharing your code for multiple time steps as well?

That would be awesome!

Keep up the good work!

This post might help clear things up:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

Awesome work, thanks for sharing it!

Could it be possible that you switched up the chronological order of your predictions?

It looks to me that you predict the pollution of the previous hour, instead of predicting the future.

That is what a persistence model looks like exactly.

Hi Jason, I’m new to Deep Learning, so sorry if this is a fundamental question. I am trying to use an LSTM NN to create a super fast surrogate for a coastal circulation model (something sort of similar to this, but with time dependency: https://arxiv.org/pdf/1709.08725.pdf)

My training set looks something like this:

-samples: 2000 – (I modeled a year with hourly output)

-timesteps: 7 – (t-6, t-5, …, t)

-features: 4 – (offshore boundary tide, 1st derivative of offshore boundary tide, boundary river discharge for river-1, and boundary river discharge for river-2)

Currently, my target is velocity magnitude for one node in my model domain ([2000,1]).

My question is: When you do this tutorial, you assign the time steps as additional features (i.e. for my problem, our train_X = [2000,1,28]). I did this and it works fine, but eventually I’d like to scale this, and I thought I’d try to reshape my data to its intended shape for the model (i.e. [2000,7,4]). However, when I do this, my training time goes way up (it’s probably 3-4x slower).

Does the model treat these two shapes differently? If not, why does it take so much longer to train with the latter shape?

More time steps means slower training.

Perhaps this post will clear things up re input shapes:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

Hi Jason,

Great article.

I have a small question:

In a previous article you pointed out that we need to make the data stationary.

Do we need to do it for multivariate data as well?

Ideally, yes.

Nice article! I think one question remains unanswered. Why use RNNs if we only use one previous step to predict the next step? Why not SVM for example?

No reason at all, we cannot know what will work best for a given problem.

Try it and compare the results!

Hi Jason,

Thanks for this very informative post! Before applying to my financial dataset, I would like to consult you about my case. The type of my data is almost the same. I have financial risk factors like equity values, interest rates, foreign exchanges etc. values on daily basis and their corresponding dependent variable which is profit or loss of a portfolio. My goal is to detect the patterns and features (if any) responsible for the highest profits or lowest losses. So my question is can I convert your code above to a classification problem if I label my classes as 0 for the lowest losses and 1 for the highest profits?

Thanks in advance!

Sure.

Great! One more small thing. When dealing with tails (let’s say 0 for lower, 1 for other than tail, 2 for upper tail), the classes and the features of course will be highly imbalanced. What would your approach be?

You might need to adjust the distribution via rescaling to make the least represented classes better represented.

Hello

What should we do if the time itself is the value that we must predict, such as predicting the time and date of the next rainfall?

You could predict the likelihood of rainfall for each hour and then use code (an if statement) to interpret those predictions and only output the predictions with a probability above a given threshold.
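A small sketch of that post-processing step, assuming a model has already produced an hourly rain probability. The probabilities and the 0.5 threshold are made up for illustration.

```python
import numpy as np

# Hypothetical model output: predicted probability of rain for the next 6 hours.
rain_prob = np.array([0.05, 0.10, 0.35, 0.80, 0.90, 0.20])
threshold = 0.5  # assumed cut-off; tune it for your problem

# Hour offsets (0-based) where the predicted probability exceeds the threshold.
rainy_hours = np.where(rain_prob > threshold)[0]

if rainy_hours.size > 0:
    print("Next rainfall expected in %d hours" % (rainy_hours[0] + 1))
else:
    print("No rainfall expected in the forecast window")
```

The first hour above the threshold then gives the predicted time of the next rainfall.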

Hello Jason,

Could you perhaps show me exactly where to change as to predict the temperature instead of pollution?

You can change the column used as the output variable when fitting the model.

Around line 52 in the full example where we drop columns we don’t care about. Change it to drop the pollution as well and not drop temperature.

Can you please help me further, as I can’t manage to find where to change the code to predict the temperature instead of pollution?

“Next, we need to be more careful in specifying the column for input and output.

We have 3 * 8 + 8 columns in our framed dataset. We will take 3 * 8 or 24 columns as input for the obs of all features across the previous 3 hours. We will take just the pollution variable as output at the following hour, as follows:

# split into input and outputs

n_obs = n_hours * n_features

train_X, train_y = train[:, :n_obs], train[:, -n_features]

test_X, test_y = test[:, :n_obs], test[:, -n_features]

print(train_X.shape, len(train_X), train_y.shape)

Where and how should I change this to choose the temperature column?

Sorry, I cannot prepare an example for you.

You might want to explore getting more familiar with NumPy arrays first:

https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-learning-python/

Thanks Jason

Can you at least point out to me where in these lines the clue is?

train_X, train_y = train[:, :n_obs], train[:, -n_features]

test_X, test_y = test[:, :n_obs], test[:, -n_features]
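As a hedged sketch of how those two lines select the output column: with 3 * 8 + 8 = 32 columns, `train[:, -n_features]` is column 24, the first variable at time t (pollution). Assuming the column order pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain, temperature is the third variable, so it sits two columns further along. The array below is a stand-in in which each value equals its column index, and `target_idx` is a hypothetical name.

```python
import numpy as np

n_hours, n_features = 3, 8
n_obs = n_hours * n_features  # 24 input columns

# Stand-in for the framed dataset: 32 columns, each value equal to its column index.
train = np.tile(np.arange(n_obs + n_features, dtype="float32"), (5, 1))

# Assumed column order: pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain.
target_idx = 2  # temp; 0 would select pollution as in the tutorial

train_X = train[:, :n_obs]
train_y = train[:, n_obs + target_idx]  # the chosen variable at time t

print(train_X.shape, train_y.shape)  # (5, 24) (5,)
print(train_y[0])                    # 26.0, i.e. column 26 = var3(t)
```

If the target column changes, the inverse-scaling step would also need the prediction placed in the matching column position before calling `inverse_transform`.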

Hi Jason,

Thanks for sharing your awesome work, I’ve been learning a lot from you!

I have a small question:

In a previous comment the following problem was raised: “Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour”, e.g. “pollution, dew, temp”.

What would your approach be?

For the case: “Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.”

You would not need to transform the dataset, you would simply pretend that the actual weather conditions for the next hour are a forecast and predict the pollution value at that time.

First, thanks for the post, I learned a lot. I have a fundamental question about LSTMs. Let’s say I have 3 variables X, Y, and Z, and I want to predict Z.

if I make the input(train_X in example above) time lagged. So I pass it x(t), x(t-1), x(t-2), x(t-3) etc…. then will the time component of LSTM matter or not? For example we have:

t, x, y, x-1, x-2, y-1, y-2, z-1, z-2, z

1, 1, 2, 0, 0, 0, 0, 0, 0, 3

2, 2, 4, 1, 0, 2, 0, 3, 0, 3

3, 3, 6, 2, 1, 4, 2, 3, 3, 6

4, 4, 8, 3, 2, 6, 4, 6, 3, 6

5, 5, 10, 4, 3, 8, 6, 6, 6, 9

Traditionally we would train on variables (x, y, x-1, x-2, y-1, y-2, z-1, z-2) on the first 4 time steps, then evaluate on the 5th.

My question is: if I train it on time steps (1, 2, 4, 5) and evaluate on step 5, will I have the same result? Mainly, if I add the time lag as an input, can I reshuffle the data?

If you reshuffle the data and the result is better/same then the LSTM is probably not the right method to use. I would recommend using an MLP. See this post:

https://machinelearningmastery.com/get-the-most-out-of-lstms/

Hi Jason,

If we pass in the previous time lags as input, can we shuffle the data around in the model? In other words, make the input timeless?

sorry when I refreshed my question didn’t appear, I thought it did not go through….did not mean to impatiently spam. apologies.

No problem, I moderate comments so there is some delay before they appear.

Thanks for this great post.

So how do you assess graphically your forecast with the actual?

You could plot both with matplotlib.
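A minimal sketch of such a plot. The arrays are made-up stand-ins for the rescaled actual values and forecasts, and the file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to show a window
import numpy as np
from matplotlib import pyplot

# Hypothetical stand-ins for the rescaled actual values and forecasts.
inv_y = np.array([30.0, 35.0, 50.0, 45.0, 40.0, 60.0])
inv_yhat = np.array([28.0, 31.0, 36.0, 49.0, 44.0, 42.0])

# Draw both series on the same axes so they can be compared step by step.
fig, ax = pyplot.subplots()
ax.plot(inv_y, label="actual")
ax.plot(inv_yhat, label="forecast")
ax.set_xlabel("time step")
ax.set_ylabel("pollution")
ax.legend()
fig.savefig("forecast_vs_actual.png")
```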

Hello, I have a problem that’s highly related to this guide.

I have a time series where the predicted variable is (allegedly) in part dependent on some features from that time step, and these features are known before it (they are “planned prices” and “expected value” for different features). I would like to include them as input into the LSTM.

For one output, this turned out to be easy (just keep them in), but if I try to predict several outputs, I am having trouble formatting the input correctly.

For better understanding, the desired input would be features x1 through x8 for t-1,t-2…etc and then x1 through x7 for t,t+1,t+2…etc.

Is this even possible with the example given here?

I believe you could adapt the example for your problem.

Spend some time with this post:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

PM2.5 is just one time series to predict, clearly. Predicting say 3 (or even 100,000) time series would be nice to look at too. A real-life example where it’s useful is inventory management in retailing businesses. How many units will be sold in the next day of eggs, mascara, paper plates, frozen corn, 2% milk, skim milk, etc.? Many of these TS will be correlated. Might need multi-tasking neural network outputs. LSTM would offer more automatic feature engineering than, say, using a boosted tree traditional machine learning algorithm which is natively unaware of time series. The latter needs manual feature creation of time-windowed aggregates by the data scientist. The LSTM, by contrast, just inputs the raw time series values directly, finding its own features. A bonus when using the LSTM is there may be some time-window or other features the human didn’t know about in advance. Another bonus is multiple-output (multitasking) that neural networks can naturally provide, unlike boosted trees for example. I’d suggest starting with only 2 or 3 TS at first, because a whole grocery store’s worth of items for even just a one-day example is way too cumbersome to look at and manipulate easily on one small monitor screen. Just a warning: This may be frontier research, believe it or not.

Thanks for the suggestion Geoffrey. I hope to spend more time on this soon.

I plotted inv_yhat and inv_y in the same figure and found an interesting fact: the predicted result is shifted to the right by one hour compared with the ground truth. That is to say, the predicted result is almost the data from one hour earlier, or X_t = X_{t-1} approximately.

Actually, the best estimate for the RNN is to output the latest result, without doing any prediction. What do you think about this?

When a prediction looks like a shifted input it means the model has no skill because it is predicting the input as output, e.g. a persistence model:

https://machinelearningmastery.com/persistence-time-series-forecasting-with-python/
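One way to check for this, sketched with made-up numbers, is to compute the RMSE of a persistence forecast (predict the previous observation) and compare it with the model's RMSE on the same targets. The `model_forecast` values are invented for illustration.

```python
import numpy as np

def rmse(expected, predicted):
    """Root mean squared error between two equal-length arrays."""
    expected, predicted = np.asarray(expected), np.asarray(predicted)
    return float(np.sqrt(np.mean((expected - predicted) ** 2)))

# Hypothetical hourly observations in original units.
series = np.array([30.0, 32.0, 31.0, 40.0, 42.0, 41.0, 39.0])

# Persistence forecast: the prediction for time t is the observation at t-1.
actual = series[1:]
persistence = series[:-1]
baseline = rmse(actual, persistence)
print("Persistence RMSE: %.3f" % baseline)

# A model forecast (made up here) only shows skill if it beats the baseline.
model_forecast = np.array([31.0, 31.5, 38.0, 41.0, 41.5, 40.0])
print("Model RMSE: %.3f" % rmse(actual, model_forecast))
```

A model whose RMSE is no better than the persistence RMSE has effectively learned to copy its input.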

I’m using my own dataset and I’m not using the series_to_supervised method because I already have the dataset prepared in 2 files, train and test files. I still have the error:

Traceback (most recent call last):

File “teste.py”, line 64, in

inv_yhat = scaler.inverse_transform(inv_yhat)

File “C:\Users\rafae\AppData\Local\Programs\Python\Python35\lib\site-packages\sklearn\preprocessing\data.py”, line 385, in inverse_transform

X -= self.min_

ValueError: operands could not be broadcast together with shapes (52,12585) (12586,) (52,12585)

To load the datasets

#Train dataset

dataset = read_csv('trainning_small.csv', header=None, index_col=None)

dataset.drop(dataset.columns[[0]], axis=1, inplace=True)

train = dataset.values

encoder = LabelEncoder()

train[:,-1] = encoder.fit_transform(train[:,-1])

train = train.astype('float32')

scaler = MinMaxScaler(feature_range=(0, 1))

train = scaler.fit_transform(train)

#Test dataset

dataset_test = read_csv('test_passare.csv', header=None, index_col=None)

dataset_test.drop(dataset_test.columns[[0]], axis=1, inplace=True)

test = dataset_test.values

encoder = LabelEncoder()

test[:,-1] = encoder.fit_transform(test[:,-1])

test = test.astype('float32')

test = scaler.fit_transform(test)

train_x, train_y = train[:, :-1], train[:, -1]

test_x, test_y = test[:, :-1], test[:, -1]

train_x = train_x.reshape((train_x.shape[0], 1, train_x.shape[1]))

test_x = test_x.reshape((test_x.shape[0], 1, test_x.shape[1]))

print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)

THE RESULT FOR THE PRINT:

(838, 1, 12585) (838,) (52, 1, 12585) (52,)

Dr. Brownlee,

First of all, thanks for this wonderful post. I have applied your code with the following parameters:

lags=8, features=8, epochs=50, batch=104, neurons=150

And got almost perfect match between train and test. The test RMSE is 26.526.

My question is: what does this result stand for?

Well done. The result is a summary of the error between predicted and expected values.
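For reference, the root mean squared error is usually defined as below, where y_i are the expected values (inv_y), \hat{y}_i are the predictions (inv_yhat) and n is the number of test samples. Assuming it is computed on the inverse-transformed values, as in the tutorial, the score is in the same units as the pollution variable.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```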

I launched this example on my notebook (AMD FX-8800P Radeon R7, 8GB RAM). It has already been running for 4 hours and I can’t even see what is going on with the model training or how long it will run. Is it possible to include some monitoring and visualization of the training process in the example, e.g. using callbacks.RemoteMonitor?

P.S. Previously I worked with Matlab, and it was so nice to see the number of epochs, accuracy, error, and many other parameters during the training process. It helped a lot in understanding whether I should continue training or change the model.

You should see the progress for each epoch and across epochs as output on the command line.

Hm, I relaunched the example step by step and found that it is stuck not at training but at model compilation. It has been working for hours at 100% CPU load on this block:

# design network

model = Sequential()

model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

model.add(Dense(1))

model.compile(loss='mae', optimizer='adam')

What’s wrong?

Ubuntu 16.4, Keras 2.0.6, Theano 0.9.0, Python 3.6.2, Anaconda custom

Are you running on the command line? If you run in a notebook, you may hide error or verbose messages.

I updated all libraries, Anaconda and Python, and now it works! Sorry for the disturbance 🙂 BTW, a monitoring tool that can be used with callbacks.RemoteMonitor is hualos-master.

I’m glad to hear that, well done!

Thanks for the very well written article. I really appreciate the detailed walkthrough.

I have been looking for a way to apply multivariate input to a machine learning prediction model of any sort. I’m doing this in order to predict the growth of compute systems in excess of hundreds of thousands of nodes based on 6 years of daily samples. Simply looking at the Y growth over time and feeding that into something like Facebook Prophet has proved somewhat insufficient because it only looks at the problem as a function of past behavior.

In reality there are more variables at play that control or affect that line of growth. As such, simple univariate approaches fall short and the predictions can be very good or very bad.

When I found this article I thought to myself, Eureka! I will be able to use this approach in order to feed in multivariate data along with the growth of my systems in order to get better predictions. However I was somewhat crestfallen at the revelation of 2 key problems discussed over the last several months here in the comments…

One problem you acknowledged as a potential/known issue and linked to another article explaining why autoregression time series problems may not be best solved with lstm neural networks. The article posits that better results might be obtained by stacking or using more layers. Have you tried this? If so, what did it look like and what results did you get?

The second and more concerning problem was when one commenter performed the same exercise as laid out in this article, but removed all of the multivariate data and still obtained the same rmse rate as you did. It was as if none of the other variables had any bearing on the prediction. This is deeply concerning, because as I see it, either this event was anomalous and driven by the input data, or the overall approach itself may be flawed, or the implementation thereof is broken. I’m not sufficiently versed in the technology to make a value statement on any of those points.

I’m hoping that you would be willing to share your thoughts on possible answers to these questions.

The tutorial is a demonstration of a method, not the best way of solving or even framing the presented problem.

I should have made that clearer, but that is the philosophy behind every single blog post on my site. I show how to use the methods, not how to get the best results (for a specific problem). The former problem is tractable; the latter is not.

Thanks for the clarity and candor! As a long-time comp-sci person, I find it very strange to run these tensorflow sessions and get different results for the same inputs (I’ve been putting your code through the paces) … I found I needed to add this, or every subsequent run would result in predictions that seemed to augment each previous run:

try:
    keras.backend.clear_session()
except:
    pass

For what it’s worth, I zeroed out all the other variables (instead of eliminating them) and it /did/ have bearing on the output. I don’t think this methodology can be dismissed as ineffective. It seems to be approximating a workable solution. More exploration is necessary.

Thank you for setting me on the path!

Damn.

Well, these are stochastic algorithms in general, but a single trained model should be deterministic and when it’s not, we’re in trouble.

Have you tried running multiple iterations and examining yhat_inv?

I keep getting different output, and I didn’t expect that. Am I looking in the wrong place?

I can send a catalog of my results if that helps…

I have not.

In general, we do expect different results across different runs given the stochastic nature of neural networks (forgive me if I am missing the point):

https://machinelearningmastery.com/randomness-in-machine-learning/

Hi Jason,

Is multivariate time series forecasting possible for multi-step forecasts?

Sure.

Hi,

Jason, can you please explain how to prepare a dataset for training a model? Let’s suppose I have 5 features and I want to predict the value at t + 5.

For example..

x1 = (2,3,4,3,1,6,8,9,4,1)

x2 = (5,2,5,7,9,9,6,3,1,3)

x3 = (2,3,4,8,1,6,8,9,1,1)

x4 = (5,1,5,7,9,9,6,3,1,7)

x5 = (2,3,4,6,8,3,1,3,5,7)

y = (8,7,6,5,4,3,2,8,9,7)

Thanks,
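One possible framing of the example above, assuming "t + 5" means predicting y five steps ahead from the five feature values at time t, is to pair each feature row with the target shifted by the horizon. The name `horizon` and the single-time-step layout are assumptions for illustration, not the tutorial's method.

```python
import numpy as np

x1 = (2, 3, 4, 3, 1, 6, 8, 9, 4, 1)
x2 = (5, 2, 5, 7, 9, 9, 6, 3, 1, 3)
x3 = (2, 3, 4, 8, 1, 6, 8, 9, 1, 1)
x4 = (5, 1, 5, 7, 9, 9, 6, 3, 1, 7)
x5 = (2, 3, 4, 6, 8, 3, 1, 3, 5, 7)
y = (8, 7, 6, 5, 4, 3, 2, 8, 9, 7)

horizon = 5  # predict y five steps ahead (assumed reading of "t + 5")

features = np.column_stack([x1, x2, x3, x4, x5])  # shape (10, 5)
targets = np.asarray(y)

# Row i of X holds the features at time i; row i of Y holds y at time i + horizon.
X = features[:-horizon]
Y = targets[horizon:]

# Keras LSTMs expect [samples, timesteps, features]; here one time step per sample.
X = X.reshape((X.shape[0], 1, X.shape[1]))
print(X.shape, Y.shape)
```

With only 10 observations and a horizon of 5, just 5 usable samples remain, so a real dataset would need to be much longer.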

What do you think about putting a dropout layer between the LSTM and Dense layers to address the overfitting phenomenon?

Try it and see, I’d love to hear how it goes.

Hi, Jason, we need a similar tutorial of Multivariate time series using the Recurrent neural network in R.

Thanks for the suggestion.

Hello Jason!

You say in your post:

“We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.”

Is it possible to do the same without prior knowledge of the pollution levels?

I am working on a very similar time series forecasting problem. However, in my case, I don’t have access to intermediate level of pollution.

Thank you

Yes, but it is important to spend time exploring different framings of the problem.

Hi,

I have a question about splitting the data.

I have the data month wise for around 20 years.

How should I split it?

Thanks.

See this post:

https://machinelearningmastery.com/prepare-univariate-time-series-data-long-short-term-memory-networks/

Hi Jason,

Thank you for this excellent tutorial!

This may or may not be a slight variation of your “Train On Multiple Lag Timesteps Example”, but I was wondering how I should modify your example to do a multivariate one to multiple time step prediction i.e. look at one time step of 8 dimensional data and predict 10 time steps of 8 dimensional data. Or a multivariate seq2seq prediction i.e. show 10 time steps of 8 dimensional data and predict 10 time steps of 8 dimensional data.

Thanks

Hmmm, I have to think about that. It might be best to do a multiple output model:

https://machinelearningmastery.com/keras-functional-api-deep-learning/

Hi Jason,

First of all, thank you very much for this excellent post. I would be grateful if you can show how to do multivariate time series forecasting per group. In other words, lets say we have data for many cities and we would like to add the forecasting per city ? How we can feed the data to LSTM for a given city and get inv_y, inv_yhat to compare to see how model does ?

Thanks again,

Sammy

You could model each city separately or combine all cities into a single dataset, or do both and ensemble the result.

Hi Jason.

I have a dataset of 169307 rows and 41 features. I want to use a timestep of 5. So, when I use X=np.reshape(X, (169307, 5, 41)), I get an error: “cannot reshape array of size 6941587 into shape (169307,5,41)”. Does this mean that n_samples*n_features in the original dataset should be divisible by n_timesteps? If so, how can I use a timestep of my choice?

Perhaps this post will help:

https://machinelearningmastery.com/prepare-univariate-time-series-data-long-short-term-memory-networks/
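A reshape only rearranges existing values, so 169307 * 41 = 6941587 elements cannot fill a (169307, 5, 41) array. One common alternative is to build overlapping windows of consecutive rows. The sketch below uses a small stand-in array and a hypothetical helper named `make_windows`; it is one possible framing, not the only one.

```python
import numpy as np

def make_windows(data, n_timesteps):
    """Stack overlapping windows of n_timesteps consecutive rows.

    data has shape (n_rows, n_features); the result has shape
    (n_rows - n_timesteps + 1, n_timesteps, n_features).
    """
    n_rows = data.shape[0]
    windows = [data[i:i + n_timesteps] for i in range(n_rows - n_timesteps + 1)]
    return np.stack(windows)

# Small stand-in for the 169307 x 41 dataset described above.
X = np.arange(20 * 3, dtype='float32').reshape(20, 3)
X_windows = make_windows(X, n_timesteps=5)
print(X_windows.shape)  # (16, 5, 3)
```

The number of samples shrinks by n_timesteps - 1, so the targets need to be trimmed to match.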

Hi Jason.

I referred to this post, but it explains data preprocessing where only 1 feature is present, and my dataset has multiple features. I am confused about how to reformulate the data and then reshape it. For example, let us say the following is my dataset:

Slno  f1  f2  f3  target

1     2   3   1   0

2     1   7   9   1

3     3   3   1   1

……

Here there are three features (f1, f2, f3) and a target label with two classes. The classification cannot be done on the current feature vector alone, since the output depends on previous feature vectors. Can you please explain how to formulate the data for this case in the format [n_samples, time steps, n_features], where n_samples is the same as the number of samples in the original dataset X and n_features is the number of features, i.e. 3? Let’s say the time step is 5. Please help with this.

This post will help you frame your data as a supervised learning problem:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Hi Jason,

I’m a little confused about the range of scaling.

In many other posts you mentioned the following:

“Transform the observations to have a specific scale. Specifically, to rescale the data to values between -1 and 1 to meet the default hyperbolic tangent activation function of the LSTM model.”

Is there a reason for the use of 0 to 1 ?

Isn’t -1 to 1 better for scaling, since the activation function is tanh?

Thank you,

Chris

Great question, a scale of 0-1 results in better skill in my experience.
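Which range works better is an empirical question, so it may be worth trying both. The sketch below rescales the same illustrative series to [0, 1] and to [-1, 1] with a hypothetical helper `min_max_scale`, written in plain NumPy as a stand-in for MinMaxScaler with different feature_range values.

```python
import numpy as np

def min_max_scale(x, low, high):
    """Linearly rescale x so its minimum maps to low and its maximum to high."""
    x = np.asarray(x, dtype=float)
    unit = (x - x.min()) / (x.max() - x.min())
    return unit * (high - low) + low

series = [20.0, 35.0, 50.0, 80.0]  # illustrative values only
scaled_01 = min_max_scale(series, 0, 1)
scaled_11 = min_max_scale(series, -1, 1)
print(scaled_01)
print(scaled_11)
```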

Hi Jason,

Thank you so much for the wonderful tutorial! That was so helpful for me.

When I read your post, my question about how to predict a multi-input, multi-output system in a multi-step time series was answered thanks to your great illustration.

But I have another question. In my problem, we have many observations for some cases at each time (about 500), so we have multiple input and output series at each time.

Could you please help me with how to solve this issue?

Any help would be useful. I would really appreciate it.

Thank you,

Somayeh

I would recommend exploring many different framings of the problem to see what works best and consider a baseline MLP model.

May I ask how you solved your problem of multiple outputs? I am having trouble implementing it.

I see this question has been raised before, I’m sorry for beating a dead horse. I’ve been struggling with the inverse_transform step.

I tried to implement this algorithm using my own dataset and had trouble with it. Then I tried to run the example with the example dataset as in the tutorial and also had an error on the inverse_transform step.

inv_yhat = scaler.inverse_transform(inv_yhat)

(on my data)

ValueError: operands could not be broadcast together with shapes (15357,287) (8,) (15357,287)

on the tutorial data set:

ValueError: operands could not be broadcast together with shapes (35037,24) (8,) (35037,24)

PS. Your blog is great. Keep up the good work!

Generally, you must make sure that the data has the same shape and that columns have the same index when transforming and inverse transforming.

Confirm this before performing each operation.

Does that help? Let me know how you go.
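As an illustration of that check, the sketch below reproduces the situation in the error messages above: a scaler fitted on 8 columns cannot invert an array with 24 columns. It uses plain NumPy min-max scaling as a stand-in for MinMaxScaler, random data, and hypothetical values for `n_lags`; it also assumes the predicted variable is the first of the scaled columns.

```python
import numpy as np

n_features = 8   # columns the scaler was fitted on
n_lags = 3       # lag time steps used as input (hypothetical value)

rng = np.random.default_rng(1)
raw = rng.uniform(0.0, 100.0, size=(50, n_features))

# Per-column min-max statistics, standing in for a fitted MinMaxScaler.
col_min = raw.min(axis=0)
col_range = raw.max(axis=0) - col_min

def inverse_transform(scaled):
    """Undo the scaling; expects exactly n_features columns."""
    return scaled * col_range + col_min

# Stand-ins for the flattened model input and the model output.
test_X = rng.uniform(0.0, 1.0, size=(10, n_lags * n_features))
yhat = rng.uniform(0.0, 1.0, size=(10, 1))

# Passing test_X (24 columns) directly cannot broadcast against 8 columns.
# Instead, pad the prediction with n_features - 1 columns so the width matches.
padded = np.concatenate((yhat, test_X[:, -(n_features - 1):]), axis=1)
inv_yhat = inverse_transform(padded)[:, 0]
print(padded.shape, inv_yhat.shape)
```

Only the first column of the result is kept; the padding columns are there just to match the width the scaler expects.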

Hi Jason,

I am unable to fix a similar valueerror. Initially when the data is normalized the shape is different. Can you give an example of what needs to be done from your tutorial?

First of all, a lot of people are getting this same mistake, I am not an exception, and I followed the exact code. There might be some problems in the code itself. This answer is so general and does not help at all.

Sorry, here are some specific things to check:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

This error occurs because he applied scaler.fit_transform on the dataframe that only had 8 columns (the original dataframe), but then applied scaler.inverse_transform on the test_X dataframe, which had 16 columns; hence, the mismatch. I don’t know why he was able to upload the full code without reproducing this error.

The code works as is.

Ensure you have copied the code from the complete example.

The code doesn’t work, and you don’t help. Is it so hard to answer: what can I do about this error? I have copied the code from the example correctly.

Perhaps these tips will help:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

Hi Jason,

Thanks for the great tutorial. I have a question: how do I choose the number of timesteps, since you always use 1 timestep? Where can I see the predicted values as a graph, rather than just the plot of model training, and how can I predict the value for different time intervals (e.g. if I want to predict the value for the next 1, 2 or 4 hours)?

I recommend experimenting with different numbers of time steps on your problem to see what works best.

You can collect predicted values and plot your own graph using matplotlib. I provide examples on other posts, for example: https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

Hello Mr Jason Brownlee, Your tutorial is awesome, it helped me in my project. I have been really interested in machine learning and this place has given me a lot.

My next move was to find a way to input data to my code and predict a future value. For example, for predicting air pollution, a user would enter today’s data, such as NO2 and wind speed, and the code would output tomorrow’s air pollution. In other words, how do I apply the code in practice?

Thank you.

I think “yhat” is the predicted value regarding “test_X” actual value because we are providing test_X as input to predict.

Sounds correct.

Here is an example:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

Hi Jason,

In the series_to_supervised() function, when we change the value of the variable “n_in” (e.g. to 2 in this example), does it mean we are now predicting the next two hours, because the dataframe will now have 16 columns instead of 8? Please also explain the effect of the value of “n_out”.

Best Regards,

You can learn more about that function here:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

Hi Jason,

I took the “yhat” array as my predicted values and the “test_X” array as the actual values, because we predicted on the test_X array, and drew a plot using matplotlib. Did I do this right?

Hi Jason,

I wanted to have n_in: Number of lag observations as input (X) set to 3 (using my own data) as can be seen below

# frame as supervised learning

reframed = series_to_supervised(scaled, 3, 1)

I make the data samples

inv_yhat = scaler.inverse_transform(inv_yhat)

and I get the following error:

File “/usr/local/Cellar/python3/3.6.2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/sklearn/preprocessing/data.py”, line 385, in inverse_transform

X -= self.min_

ValueError: operands could not be broadcast together with shapes (67112,57) (19,) (67112,57)

I initially have 19 variables and the number of lag observations set to 3, so test_X has the following shape:

>>> test_X.shape

(67112, 1, 57)

yhat = model.predict(test_X) and

>>> yhat.shape

(67112, 1)

I don’t understand the error above. I would be grateful if you can help me see what I am doing wrong.

Again, thanks a lot. You are awesome !

Sammy

Hi Sammy, did you try the section “Update: Train On Multiple Lag Timesteps Example”?

No as I didn’t see the update before. I will try it now. Thanks a lot

No problem.

Hi Jason,

First of all, many thanks for this great tutorial!

I’m trying to apply this to my own problem. However, I’m facing some problems.

Let’s say we have the time series of multivariate data structured like this:

x1,x2,x3,…x30, y1

x1,x2,x3,…x30, y2

….

where x1 – x30 are numeric (continuous) values and y1 – yn are labels which I want to predict.

Y can only be 1 (on) or 0 (off). Some of these parameters are raw sensor data, which increase or decrease over n samples, so I know that this problem is ideal for RNN.

But I am not sure if my approach is ok.

Is it ok to re-factor the data in a way, that I take the first 10 samples (without y values of course), create the 2D array of them and try to predict the output of sample n10 and then move for 1 place and take next 10 samples and predict sample n11 and so on… So not to combine them into one vector like you did.

For example, if I have 10,000 samples, each for 100ms and I want to look at the last 10 samples (1 second) I train the data with samples of shape (99990, 10, 30 ) where 99990 represent the number of samples, each containing 10 readings (1 second) with the dimension of 30.

My current model looks like this, but it is not as successful as I want it to be (I think it can be a lot better):

model = Sequential()

model.add(LSTM(100, input_shape=(nsamples, nbatch, ndimension)))

model.add(Dropout(0.2))

model.add(LSTM(100))

model.add(Dropout(0.2))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam')

Can you please point me in the right direction?

Hi Maha,

Can you tell me why you are applying an “Activation Function” only to the output layer? I mean, why is there no “Activation Function” for the hidden layer?

We are using the default activation functions for the LSTM hidden layers.

train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))

I’m having a lot of troubles with these two lines.

I don’t understand why it isn’t like so

train_X = train_X.reshape((1, train_X.shape[0], train_X.shape[1]))

test_X = test_X.reshape((1, test_X.shape[0], test_X.shape[1]))

I thought (and obviously I’m wrong, but I want to know why) that we had 1 sample because we have one city, but have multiple timesteps one for each set of measurements.

If we had 3 cities would we then have 3 instead of 1?

In this example, we are only using a single time step per sample.

It is unrelated to the number of cities.

See this post for more on how to reshape data for LSTMs:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/
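The difference between the two layouts discussed above can be seen on a small stand-in array (illustrative values, 6 observations of 8 features). The tutorial's model produces one output per sample, so the second layout would leave a single training example.

```python
import numpy as np

# 6 hourly observations of 8 features (illustrative values).
train_X = np.arange(6 * 8, dtype='float32').reshape(6, 8)

# One sample per observation, each with a single time step:
# [samples, timesteps, features].
one_step = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))

# The alternative from the question: a single sample holding all 6 time steps.
one_sample = train_X.reshape((1, train_X.shape[0], train_X.shape[1]))

print(one_step.shape)    # (6, 1, 8)
print(one_sample.shape)  # (1, 6, 8)
```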

Hi Jason,

If I have data for every city, how can I build one LSTM model? Here the data is for only one city and we have to forecast pollution. Suppose I append data for other cities; can we then predict pollution using a single LSTM?

Yes, we can build a model for each city separately, but can we build a single model?

There is no one best way. I would encourage you to explore different ways to frame this problem, perhaps one model per city, perhaps one model for regions or all cities, perhaps ensembles of models. See what works best for your data.

Hi Jason,

If instead of a single time series we have multiple time series, how should we normalize the data?

i.e. if we have pollution data for 100 cities, should normalization be done city-wise or across all cities?

It really depends on the model that you are constructing.

Your goal is to ensure input data to the model is consistent.

Hello Jason, one question: why didn’t you use the scikit-learn train_test_split function instead of

# split into train and test sets

values = reframed.values

n_train_hours = 365 * 24

train = values[:n_train_hours, :]

test = values[n_train_hours:, :]

By all means, try it. Note that you cannot shuffle the series.
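A chronological split keeps every training row earlier in time than every test row. The sketch below shows this with plain slicing on a stand-in array and a hypothetical `n_train`; scikit-learn's train_test_split should give an equivalent ordered split when called with shuffle=False.

```python
import numpy as np

# Stand-in for reframed.values: 100 time-ordered rows, 3 columns.
values = np.arange(100 * 3, dtype='float32').reshape(100, 3)
n_train = 80  # hypothetical; the tutorial uses 365 * 24 hours

# Earlier rows go to train, later rows to test; no shuffling.
train, test = values[:n_train, :], values[n_train:, :]
print(train.shape, test.shape)
```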

Oh, Jason,

On my computer, every epoch takes 191s! That is too long.

I want to ask: did you use a GPU to speed things up, or is there some other problem?

Thank you!!

GPU can speed up LSTMs somewhat, but not as much as MLPs.

Hi Jason,

Thank you so much for your brilliant website helping us all get good at machine learning!

Please could you clarify the line of code that outputs the next hour’s pollution reading? I’ve run the model and it returns the RMSE, but I’m interested to see the t+1 prediction.

What code would I add at the end so that when the model has finished running it prints the next hour’s predicted pollution reading?

Many thanks!

Thanks Mark!

See this post on how to make predictions with a finalized LSTM model:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

Thank you, Jason.

I’m almost ready to apply what you’ve taught me here to my use case. The only other thing that isn’t 100% clear to me is the dropping columns number references 9,10,11,12,13,14,15 (below):

# drop columns we don’t want to predict

reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)

I get that you’re dropping the columns after ‘pollution’ because you only want to predict the pollution readings but why are they referenced 9-15?

Thank you in advance!

We are dropping variables that we do not want to predict at the next time step. We only want to predict pollution.

I understand that. My question was around the numbering. If we’re dropping columns ‘dew’ through to ‘rain’ i.e. columns number 3 to 9 in the prepared “pollution.csv” dataset above then why isn’t the code written:

reframed.drop(reframed.columns[[3,4,5,6,7,8,9]], axis=1, inplace=True)

It’s the 9 – 15 that I just need an explanation for please.

Many thanks

We are dropping them from the new dataset that has lag variables.

Try printing the version of the dataset that we are modifying to get an idea of its shape.
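To make the numbering concrete, the sketch below rebuilds the column names of the reframed dataset for 8 variables with n_in=1 and n_out=1, using the var%d(t-%d) naming that series_to_supervised produces, and lists which names sit at indexes 9 to 15. It is a plain-Python illustration and does not call the tutorial's function.

```python
n_vars = 8
n_in, n_out = 1, 1

# Lagged input columns come first, then the columns for the forecast step(s).
columns = []
for lag in range(n_in, 0, -1):
    columns += ['var%d(t-%d)' % (j + 1, lag) for j in range(n_vars)]
for step in range(n_out):
    suffix = '(t)' if step == 0 else '(t+%d)' % step
    columns += ['var%d%s' % (j + 1, suffix) for j in range(n_vars)]

for index, name in enumerate(columns):
    print(index, name)

# Indexes 0-7 are the inputs at t-1, index 8 is pollution at t (the target),
# and indexes 9-15 are the other variables at t, which are dropped.
dropped = [columns[i] for i in [9, 10, 11, 12, 13, 14, 15]]
kept = [name for name in columns if name not in dropped]
print(kept)
```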

Hello Jason,

again a very successful contribution.

What I would like to do is something like a early warning system that predicts as early as possible, as safely as possible for example in the case of natural disasters, financial forecast or driving data from the prediction output of a Multivariate Time Series LSTM Forecast.

Suppose I get the prediction, e.g. x, y and z and each area labeled with x or z must be K-units long, each time they occur. X and z make up 10 percent of the data.

The ground truth and Prediction would then look like e.g.

GT:y y y y y y y y x x x x x x y y y y y y z z z z z z y y y y y y y y y y y y y y y y

PR:y y y x x y y y x x x x x x y y y x y y y z z z y y y y y y y y y z z y y y x x y y

Now I would like to determine an overall probability for an event, based on the PR sequence.

Op:y – – – – – – – – X – – – – – – Y – – – – – – Z- – – – – -Y – – – – – – – – – – – – – – – – –

I had the idea of a window with a threshold or a sequence classification task.

Since I am fairly new to machine learning and co, but I’m thinking that this problem has probably been discussed and solved very often, I would be very happy about your advice.

There is not one best way to solve a problem like this, but many. I’d encourage you to brainstorm different ways of framing this as a prediction problem and see what works best.

Hi Jason,

These days LSTMs are also popular for sentiment analysis. Have you written any tutorial on sentiment analysis using LSTMs or something like that?

Yes, see here:

https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/

Hi, Jason

Can I save my model? I don’t want to train it every time.

Also, do you have any article about how to predict the next n steps in Multivariate Time Series Forecasting with LSTMs in Keras?

thank you!!!

Yes you can save your model, here’s how:

https://machinelearningmastery.com/save-load-keras-deep-learning-models/

Here’s how to make predictions:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

Hi, jason

I read your article and ran the code, but I have some questions. Can you give me some suggestions?

1. In this article, you prepare the pollution dataset for the LSTM. All features are normalized, and the dataset is transformed into a supervised learning problem. Why is the code ‘MinMaxScaler(feature_range=(0, 1))’ rather than ‘MinMaxScaler(feature_range=(-1, 1))’? I remember the default activation function for LSTMs is the hyperbolic tangent (tanh), which outputs values between -1 and 1. Why do we set (0, 1) here?

2. In this code, we don’t transform the time series to be stationary. Why? I think we must make the time series stationary. It’s necessary, right?

3. The important arguments are batch_size, n_neuron and epochs. How should I adjust them?

4. Can I use a CNN to predict multivariate time series? Many people think LSTM is the best way. Is that really so?

Thank you very much!

Results are better if you normalize the data.

Making the data stationary may improve the skill of the model. I was trying to keep the example simple.

Use experiments to see what values give the best results. Be systematic.

I think MLP is better at time series, here’s why:

https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Thank you Jason,

Your reply is very useful. But I still don’t understand why the code is MinMaxScaler(feature_range=(0, 1)). In your other article, you use feature_range=(0, 1),

so I’m wondering: what is the reason? Is the activation function for LSTMs changeable?

Sorry, I don’t follow?

I was foolish and wrote it wrongly, I am sorry.

My question is:

I still don’t understand why the code is MinMaxScaler(feature_range=(0, 1)). In your other article, you use feature_range=(-1, 1). The activation function for LSTMs is tanh, and I think tanh is in (-1, 1), so why do we use (0, 1) here?

thank you so much….

LSTMs generally perform better with normalized data (in the range 0-1).

Hi Jason, great article.

Can you please explain why it is OK to use feature_range [0, 1] as opposed to [-1, 1]?

In another article (https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/) you said that the feature_range should be [-1, 1] in order to be the same range as the hyperbolic tan (tanh) function, which default LSTM uses. In fact, you said “This is the preferred range for the time series data.”.

I am not sure why it is OK to now use [0, 1]. Are you taking absolute value of tanh somewhere in your LSTM layer?

The range [0,1] results in better skill.

Hi,Jason,

The work you have done is wonderful. i’m interested in time series forecasting with lstm.

i have two questions.

1. In some cases in time series forecasting, especially with a single series, the features are the data from previous times (t-1, t-2, …). For example, with only the pm2.5 series, I want to predict the value at t+1 from the data at t-k, …, t-1, t. How should I set the “time-steps” and “features”: [samples, k+1, 1] or [samples, 1, k+1] (treating the previous data as features)?

2. You have mentioned that “LSTM does not appear to be suitable for autoregression type problems”. Did you mean that LSTMs do not perform well in cases like the example in my first question (a single series, predicting t+1 from data up to t)?

This post may help you with preparing the data:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

And this post has an example:

https://machinelearningmastery.com/prepare-univariate-time-series-data-long-short-term-memory-networks/

Correct.

Hello Jason,

I hope you are doing fine.

I am getting this error and I don’t know why. I used my own dataset for Amarillo, Texas.

Traceback (most recent call last):

File “/Users/Ahmed/Desktop/Coding/P.prediction.py”, line 118, in

inv_yhat = scaler.inverse_transform(inv_yhat)

File “/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/sklearn/preprocessing/data.py”, line 385, in inverse_transform

X -= self.min_

ValueError: operands could not be broadcast together with shapes (3567,13) (10,) (3567,13)

The size of your data may not match the expectations of your model?

Hi Jason,

Currently I am working on a project and I am following your tutorials. They are great, but I have some questions regarding LSTMs. First, can you briefly explain what a timestep is exactly and how it affects the performance of the model?

In the above example, we used model.add(LSTM(50)). If we increase the number of LSTM cells, how will that affect the performance of the model?

In the above example, why did you set shuffle=False? If we set it to True, don’t you think that would improve performance?

How can I check for underfitting and overfitting of my model, and the accuracy of its results?

Best Regards,

You can learn more about LSTM inputs here:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

I recommend testing different numbers of cells on your problem to see what works best.

We do not want to shuffle inputs because all samples are sequential, learn more here:

https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/

More about model diagnostics here:

https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/

Hi Jason, I want to ask why you normalize (scale) the data before the “series to supervised” operation. In other cases, this may cause denormalization errors, for example when using n_in=2, n_out=1.

So, is it better to normalize after the “series to supervised” operation?

I recommend normalizing before splitting the series into multiple features.

Hi Jason,

Again appreciation for your blogs and thanks for the quick response but still have some queries.

I am working on a time series dataset with approximately 2.5 million rows and more than 10 features, sampled at 5-minute intervals. In my case, should I use Truncated Backpropagation Through Time, or should I just increase the number of timesteps to 250-500 as mentioned in one of your blog posts?

I have followed many of your tutorials but I did not see “dropout” anywhere, although I have read in some places that it decreases the learning time?

Does the number of timesteps determine how many times we backpropagate? Please correct me if I am wrong.

One big confusion is when to use an LSTM and when to use a Bidirectional LSTM. For the dataset I described above, which would be useful?

Best Regards,

Here are some ideas on strategies for dealing with long sequences:

https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/

Here is an example of dropout with LSTMs:

https://machinelearningmastery.com/use-dropout-lstm-networks-time-series-forecasting/

Yes, time steps define BPTT, here’s more on BPTT:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

Try bidirectional and see if it lifts model skill, here is an example:

https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/

hello, nice example.

If you want to “compress” time before entering the LSTM, using convNet1D, how would you do it?

thanks in advance,

Rui

Depends on the problem.

Perhaps you can compress all obs from an hour, day or week into a CNN output vector to feed into an LSTM.

Hi Jason,

I do not understand why you swap “samples” and “timesteps” meaning. From the Keras’ FAQ, a sample is an element of the dataset. In the case of timeseries prediction, an element of the dataset is a timeseries. In this case, you have just one timeseries. Instead you have N timeseries with just 1 timestep. A timeseries with 1 timestep is not really a timeseries. Anyway, you are not even setting the stateful property and the internal state is going to be reset at each step (sample in your case). So, how does the network remember?

Best regards

When we frame our time series problem as a supervised learning problem, we can choose what constitutes a sample or a time step.

Indeed, we need multiple timesteps in order to achieve true BPTT:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

LSTMs can remember across samples if internal state is not reset.
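
To make the choice of framing concrete, here is a minimal sketch with made-up data showing the same series arranged two ways: one time step per sample, and a window of several time steps per sample. The array values and the window length of 3 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical series: 6 hourly rows, 2 features each.
series = np.arange(12, dtype=float).reshape(6, 2)

# Framing A: each row is its own sample with a single time step.
one_step = series.reshape(series.shape[0], 1, series.shape[1])

# Framing B: each sample is a window of 3 consecutive rows.
n_steps = 3
windows = np.stack([series[i:i + n_steps]
                    for i in range(len(series) - n_steps + 1)])

print(one_step.shape)  # (6, 1, 2)
print(windows.shape)   # (4, 3, 2)
```

Both arrays follow the [samples, timesteps, features] layout; only the second gives BPTT more than one step to propagate through.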

Hi Jason,

Really great blogs. I have never seen such nice blogs. But again I am disturbing you.

If I have a time series dataset at 5-minute intervals which contains 250000 rows and 10 features, and I want to predict one feature by applying Backpropagation Through Time (BPTT) using 200 timesteps:

1-> I have to reshape into [samples, timesteps, features] = [ 250000, 200, 10] ?

or

2-> I will have to split the 250000 time steps into 1250 sub-sequences of 200 time steps each and I have to reshape into [samples, timesteps, features] = [ 1250, 200, 10] ?

Which approach is right for BPTT? Both of them are mentioned in your blog posts and now I am totally confused between the two.

Also, could you please give the reshape [samples, timesteps, features] for the above example in the case of Truncated Backpropagation Through Time (TBPTT)?

Regards,

Good question, here are some ideas that may help:

https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/
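
For the numbers in the question, the two layouts can be sketched as follows. This is only an illustration of the shapes, using placeholder zeros for the data; it does not say which option suits a given problem. Note that overlapping windows of 200 steps give 249801 samples rather than 250000, because the last 199 rows cannot start a full window.

```python
import numpy as np

n_rows, n_steps, n_features = 250_000, 200, 10
data = np.zeros((n_rows, n_features), dtype=np.float32)  # placeholder values

# Option 2 from the question: non-overlapping sub-sequences.
n_subsequences = n_rows // n_steps
split = data[:n_subsequences * n_steps].reshape(
    n_subsequences, n_steps, n_features)
print(split.shape)  # (1250, 200, 10)

# Option 1 implies overlapping (sliding) windows: one window starts at
# every row that still has a full 200 rows available from that point.
n_windows = n_rows - n_steps + 1
print(n_windows)  # 249801
```

In either layout, backpropagation runs over the 200 time steps inside each sample. Materialising all overlapping windows as float32 would take about 2 GB, so a generator that builds windows per batch may be preferable.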

Dear Jason,

I am trying to solve a problem using an RNN. I would like to explain it using this example and learn how to apply an RNN.

If the test data had every variable other than PM2.5 (pollution) for a few days, how would I predict pollution using the training data and test data with an RNN?

thanks

Sorry, I’m not sure I follow. Can you perhaps rephrase your question?

Dear Jason,

Let me rephrase my question.

We have a problem to solve that is similar to the example you have explained above.

Instead of explaining my problem, I would like to pose a question on this problem, hoping that it will provide some clues to solve mine.

You stated:

Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.

Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.

The first one is clear, but the second line is not clear to me.

Are you predicting the pollution for the next hour using a model built on past data AND the weather conditions, like temperature and pressure, for the next hour?

If yes, then I will go ahead and read more on the solution you have posted.

If no, I am wondering how an RNN can be used to solve a problem like this:

Predict the pollution, not just for the next hour but, say, for the next 15 hours, based on past data and with weather conditions also provided for those 15 hours.

Thanks

Yes, I use the weather conditions for the next hour with the conceit that we pretend they are forecast weather conditions rather than observations.
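
One possible reading of that framing is sketched below with made-up data: the inputs are every variable at the previous hour plus the weather variables at the current hour, and the target is pollution at the current hour. The column order (pollution first, three weather columns after it) is an assumption for illustration and is not taken from the tutorial code.

```python
import numpy as np

# Hypothetical scaled data: 5 hourly rows, column 0 is pollution,
# columns 1-3 are weather variables.
data = np.arange(20, dtype=float).reshape(5, 4)

past = data[:-1, :]         # everything observed at t-1
weather_now = data[1:, 1:]  # weather at t, treated as if it were a forecast
X = np.hstack([past, weather_now])
y = data[1:, 0]             # pollution at t

# Keras LSTM layout: [samples, timesteps, features] with one time step.
X = X.reshape(X.shape[0], 1, X.shape[1])
print(X.shape, y.shape)  # (4, 1, 7) (4,)
```

In real use, the weather columns at time t would have to come from an actual forecast, since observations for that hour are not yet available.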

Hi, Jason

If I want to do multivariate time series classification with LSTMs in Keras, what should I do?

My dataset is Y: class variable (0/1), X1: numerical variable, X2: numerical variable, X3: numerical variable, and all of these variables are time series. I want to predict Y’s class.

thank you very much!

Perhaps you can use the above tutorial as a guide?

Hi Jason,

You are not using “stateful=True” in this blog post, so how will your network remember the previous history?

When do we use the property “return_sequences=True”?

Please give a brief description.

This model is not a rolling forecast, so we don’t need to manually reset the cell memory with the reset_states() method, and therefore the model does not need “stateful=True”.

“return_sequences=True” is necessary for stacking multiple LSTM layers (and probably in other cases too), where each earlier layer should return an output for every time step so that the next LSTM layer receives a sequence, just as the earlier layer did. In this post Jason used only 1 LSTM layer, so it only needs to pass its final output vector to the Dense(1) layer.

Am I right?

The LSTM still maintains internal state while it processes a sequence; without stateful=True, that state is reset at the end of each batch.

Return sequences is appropriate when stacking LSTMs or when outputting a sequence.
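
The difference in output shape can be seen with a toy recurrent layer written in NumPy. This is a plain tanh RNN, not an LSTM and not Keras's implementation; the function name `simple_rnn` and the weight shapes are hypothetical, and the `return_sequences` argument merely mirrors the name of the Keras option.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec, return_sequences=False):
    """Toy tanh RNN over x shaped [samples, timesteps, features]."""
    n_samples, n_steps, _ = x.shape
    # State starts at zero for every sample.
    h = np.zeros((n_samples, w_rec.shape[0]))
    outputs = []
    for t in range(n_steps):
        # State carries across the time steps within a sample.
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # [samples, timesteps, units]
    return h                              # [samples, units]

rng = np.random.default_rng(1)
x = rng.random((4, 3, 8))  # 4 samples, 3 time steps, 8 features
w_in = rng.normal(size=(8, 5))
w_rec = rng.normal(size=(5, 5))

last = simple_rnn(x, w_in, w_rec)
full = simple_rnn(x, w_in, w_rec, return_sequences=True)
print(last.shape)  # (4, 5)
print(full.shape)  # (4, 3, 5)
```

The 3D output is what a second recurrent layer needs as input, while the 2D output is what a Dense layer typically receives.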

Hi Jason!

Is it important (or even necessary) to include the pollution of the previous timestep as the feature of observation to predict next?

| | var1(t-1) | var2(t-1) | var3(t-1) | var4(t-1) | var5(t-1) | var6(t-1) | var7(t-1) | var8(t-1) | var1(t) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.129779 | 0.352941 | 0.245902 | 0.527273 | 0.666667 | 0.002290 | 0.000000 | 0.0 | 0.148893 |

I’m asking about var1(t-1).

Because if the pollution value is a result of all the other variables in the past, why should we feed it to the LSTM?

Thanks for your great work!

Test and see.
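
A minimal sketch of how such a comparison could be set up is shown below, assuming a supervised frame with eight lagged inputs followed by the target column. The random data stands in for the real frame; the idea is to fit the same model on each input set and compare the test RMSE.

```python
import numpy as np

# Hypothetical supervised frame: 8 lagged inputs, then var1(t) as the target.
rng = np.random.default_rng(2)
frame = rng.random((100, 9))

y = frame[:, -1]
X_with = frame[:, :8]      # var1(t-1) ... var8(t-1)
X_without = frame[:, 1:8]  # var2(t-1) ... var8(t-1), lagged pollution removed

print(X_with.shape, X_without.shape)  # (100, 8) (100, 7)
```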

Hello Jason,

Thank you very much for your tutorial. I am wondering if it is possible to adapt your code to a multi-step forecasting problem.

Can I predict multiple time steps of the pollution value, taking the other variables into account?

Thank you for your great work!

Yes, use this post as a template: https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/
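
As a rough illustration of the data framing for a multi-step forecast, the sketch below builds input windows from all variables and targets made of the next few values of one column. The helper name `make_multistep`, its parameters, and the toy data are hypothetical and are not taken from the linked post.

```python
import numpy as np

def make_multistep(data, n_in, n_out, target_col=0):
    """Build (X, y): n_in past rows of all features -> next n_out target values."""
    X, y = [], []
    for i in range(len(data) - n_in - n_out + 1):
        X.append(data[i:i + n_in])
        y.append(data[i + n_in:i + n_in + n_out, target_col])
    return np.array(X), np.array(y)

# Toy data: 10 time steps, 4 variables, target in column 0.
data = np.arange(40, dtype=float).reshape(10, 4)
X, y = make_multistep(data, n_in=3, n_out=2)
print(X.shape, y.shape)  # (6, 3, 4) (6, 2)
```

With targets shaped like this, the model's output layer would need one unit per forecast step, for example two units here.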

Hi Jason!

Thanks for your tutorial, and for the time you have dedicated to making it and answering all of us. Also, sorry for my bad English!

I’m making a prediction model for water consumption. My inputs are the real aggregated consumption of a pool of people for the previous day, the previous-day forecast of consumption for the day, whether the day is a working day or not, the day of the week, and the average annual consumption and standard deviation for 10 subtypes of persons.

For the last inputs, I have 20 columns: 10 for average consumption and 10 for standard deviation.

With this, my question is: can I link the average consumption and standard deviation in some way, as something similar to a tuple, as input? I’m afraid that the model will misunderstand the relations between them.

Thank you in advance!! Best regards.

I would recommend brainstorming many different ways of framing the problem and test each to see what works best for your data, even ensemble a few of them together.

Thanks for this blog post on using RNNs and LSTMs for forecasting.

It is very enlightening.

I have been working on an energy dataset with dimensions (87647, 7) (approximately five years of data). The data is collected every half hour.

I have trained my model using a single LSTM layer and a Dense layer with a test batch size of 4 years, and predicted and validated over 1 year of data.

The test RMSE is about 0.458 and the train RMSE is 0.058. Does this mean my model badly overfits the data? I have scaled the data using the MinMax scaler, just like in your post.

i have read your other blog of diagnosis of