Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables.

This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In this tutorial, you will discover how you can develop an LSTM model for multivariate time series forecasting in the Keras deep learning library.

After completing this tutorial, you will know:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.

Let’s get started.

**Updated Aug/2017**: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.

## Tutorial Overview

This tutorial is divided into 3 parts; they are:

- Air Pollution Forecasting
- Basic Data Preparation
- Multivariate LSTM Forecast Model

### Python Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

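If you need help with your environment, see this post: http://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/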


## 1. Air Pollution Forecasting

In this tutorial, we are going to use the Air Quality dataset.

This is a dataset that reports on the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.

The data includes the date-time, the pollution level (measured as PM2.5 concentration), and weather information including dew point, temperature, pressure, wind direction, wind speed and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:

- **No**: row number
- **year**: year of data in this row
- **month**: month of data in this row
- **day**: day of data in this row
- **hour**: hour of data in this row
- **pm2.5**: PM2.5 concentration
- **DEWP**: Dew Point
- **TEMP**: Temperature
- **PRES**: Pressure
- **cbwd**: Combined wind direction
- **Iws**: Cumulated wind speed
- **Is**: Cumulated hours of snow
- **Ir**: Cumulated hours of rain

We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.

This dataset can be used to frame other forecasting problems.

Do you have good ideas? Let me know in the comments below.

You can download the dataset from the UCI Machine Learning Repository.

Download the dataset and place it in your current working directory with the filename “*raw.csv*“.

## 2. Basic Data Preparation

The data is not ready to use. We must prepare it first.

Below are the first few rows of the raw dataset.

```
No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,NA,-21,-11,1019,NW,6.71,0,0
4,2010,1,1,3,NA,-21,-14,1019,NW,9.84,0,0
5,2010,1,1,4,NA,-20,-12,1018,NW,12.97,0,0
```

The first step is to consolidate the date-time information into a single date-time so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.


```python
from pandas import read_csv
from datetime import datetime
# load data
def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')
dataset = read_csv('raw.csv', parse_dates = [['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)
dataset.drop('No', axis=1, inplace=True)
# manually specify column names
dataset.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
dataset.index.name = 'date'
# mark all NA values with 0 (assign back rather than filling a column in place)
dataset['pollution'] = dataset['pollution'].fillna(0)
# drop the first 24 hours
dataset = dataset[24:]
# summarize first 5 rows
print(dataset.head(5))
# save to file
dataset.to_csv('pollution.csv')
```

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “*pollution.csv*“.

```
                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     0
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2     0
```
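Recent pandas releases deprecate both the `date_parser` argument and the nested-list form of `parse_dates`, so the loading script may emit warnings or fail depending on your version. The following is a minimal sketch of an alternative that builds the index with `to_datetime()` instead. The `load_raw()` helper and the three-row `RAW_SAMPLE` string are hypothetical stand-ins used for illustration only; with the real file you would pass `'raw.csv'` and still drop the first 24 rows afterwards.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for raw.csv (same header, a few rows only).
RAW_SAMPLE = """No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
3,2010,1,1,2,129,-16,-4,1020,SE,1.79,0,0
"""

def load_raw(source):
    """Load the raw data and index it by a single date-time (illustrative helper)."""
    df = pd.read_csv(source)
    # to_datetime() can assemble timestamps from year/month/day/hour columns
    df.index = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
    df.index.name = 'date'
    # drop the row number and the now-redundant date parts
    df = df.drop(columns=['No', 'year', 'month', 'day', 'hour'])
    df.columns = ['pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain']
    # mark all NA pollution values with 0
    df['pollution'] = df['pollution'].fillna(0)
    return df

sample = load_raw(io.StringIO(RAW_SAMPLE))
print(sample)
```

The result has the same columns and index name as the listing above, so the rest of the tutorial should work unchanged with either approach.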

Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.

The code below loads the new “*pollution.csv*” file and plots each series as a separate subplot, except wind direction, which is categorical.

```python
from pandas import read_csv
from matplotlib import pyplot
# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# specify columns to plot
groups = [0, 1, 2, 3, 5, 6, 7]
i = 1
# plot each column
pyplot.figure()
for group in groups:
    pyplot.subplot(len(groups), 1, i)
    pyplot.plot(values[:, group])
    pyplot.title(dataset.columns[group], y=0.5, loc='right')
    i += 1
pyplot.show()
```

Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

## 3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

### LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.
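To make this framing concrete, here is a small sketch on a made-up two-variable series. The `toy` DataFrame and its values are invented for illustration; only the idea of pairing inputs at t-1 with the output at t comes from the tutorial.

```python
import pandas as pd

# Hypothetical toy series: four hourly readings of two variables.
toy = pd.DataFrame({'pollution': [10.0, 20.0, 30.0, 40.0],
                    'temp': [1.0, 2.0, 3.0, 4.0]})

# Inputs: every variable at the prior time step (t-1).
inputs = toy.shift(1).add_suffix('(t-1)')
# Output: pollution at the current time step (t).
target = toy['pollution'].rename('pollution(t)')

# Place inputs and output side by side; the first row has no prior hour, so drop it.
framed = pd.concat([inputs, target], axis=1).dropna()
print(framed)
```

Each remaining row is one training sample: the previous hour's observations as inputs and the current hour's pollution as the value to predict.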

This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:

- Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
- Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.
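The first alternative above, using the last 24 hours as input, could be sketched roughly as follows. The `make_windows()` helper and the random-free stand-in array are assumptions made for illustration and are not part of the tutorial's code.

```python
import numpy as np

def make_windows(data, n_lag, target_col=0):
    """Return X with shape (samples, n_lag, features) and y with shape (samples,)."""
    X, y = [], []
    for end in range(n_lag, len(data)):
        X.append(data[end - n_lag:end, :])  # the n_lag hours before hour `end`
        y.append(data[end, target_col])     # the target variable at hour `end`
    return np.array(X), np.array(y)

# Hypothetical stand-in for the scaled dataset: 30 hours, 8 features.
data = np.arange(30 * 8, dtype='float32').reshape(30, 8)
X, y = make_windows(data, n_lag=24)
print(X.shape, y.shape)
```

With 30 hours of data and a 24-hour window there are only 6 usable samples; on the full dataset the same function would yield one sample per hour after the first day.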

We can transform the dataset using the *series_to_supervised()* function, which is defined at the top of the code listing below.

First, the “*pollution.csv*” dataset is loaded. The wind direction feature is label encoded (integer encoded). This could further be one-hot encoded in the future if you are interested in exploring it.

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.

The complete code listing is provided below.

```python
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())
```

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).

```
   var1(t-1)  var2(t-1)  var3(t-1)  var4(t-1)  var5(t-1)  var6(t-1)  \
1   0.129779   0.352941   0.245902   0.527273   0.666667   0.002290
2   0.148893   0.367647   0.245902   0.527273   0.666667   0.003811
3   0.159960   0.426471   0.229508   0.545454   0.666667   0.005332
4   0.182093   0.485294   0.229508   0.563637   0.666667   0.008391
5   0.138833   0.485294   0.229508   0.563637   0.666667   0.009912

   var7(t-1)  var8(t-1)   var1(t)
1   0.000000        0.0  0.148893
2   0.000000        0.0  0.159960
3   0.000000        0.0  0.182093
4   0.037037        0.0  0.138833
5   0.074074        0.0  0.109658
```

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

- One-hot encoding wind direction.
- Making all series stationary with differencing and seasonal adjustment.
- Providing more than 1 hour of input time steps.

This last point is perhaps the most important given the use of Backpropagation through time by LSTMs when learning sequence prediction problems.
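The first two ideas in the list above can be tried with a few lines of pandas. This is only a sketch on a made-up four-row frame that borrows the tutorial's column names; the `pollution_diff` column name is an assumption, and seasonal adjustment is not shown.

```python
import pandas as pd

# Hypothetical toy frame using the tutorial's column names.
toy = pd.DataFrame({
    'pollution': [129.0, 148.0, 159.0, 181.0],
    'wnd_dir': ['SE', 'SE', 'NW', 'cv'],
})

# One-hot encode the categorical wind direction: one 0/1 column per category.
encoded = pd.get_dummies(toy, columns=['wnd_dir'], dtype=float)

# First-order differencing: value(t) - value(t-1); the first row becomes NaN.
encoded['pollution_diff'] = encoded['pollution'].diff()
print(encoded)
```

If you difference the target, remember that forecasts are then changes in pollution and must be added back to the previous observation to recover the original units.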

### Define and Fit Model

In this section, we will fit an LSTM on the multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up the training of the model for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits the train and test sets into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, namely [samples, timesteps, features].

```python
# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
```

Running this example prints the shape of the train and test input and output sets with about 9K hours of data for training and about 35K hours for testing.

```
(8760, 1, 8) (8760,) (35039, 1, 8) (35039,)
```
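The inverted test harness mentioned above, training on the first four years and testing on the last, might look like the sketch below. The zero-filled `values` array is a hypothetical stand-in for `reframed.values`, and `n_test_hours` is an assumed name.

```python
import numpy as np

# Hypothetical stand-in for reframed.values: 5 years of hourly rows, 8 inputs + 1 output.
n_rows = 5 * 365 * 24
values = np.zeros((n_rows, 9), dtype='float32')

# Inverted harness: hold out the final year for testing, train on everything before it.
n_test_hours = 365 * 24
train, test = values[:-n_test_hours, :], values[-n_test_hours:, :]

# split into inputs and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]

# reshape inputs to [samples, timesteps, features], here with a single time step
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, test_X.shape)
```

Training on four times as much data will take correspondingly longer per epoch, which is the reason the tutorial uses the smaller split for demonstration.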

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 72. Remember that the internal state of the LSTM in Keras is reset at the end of each batch, so an internal state that is a function of a number of days may be helpful (try testing this).

Finally, we keep track of both the training and test loss during training by setting the *validation_data* argument in the fit() function. At the end of the run both the training and test loss are plotted.

```python
from matplotlib import pyplot
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```

### Evaluate Model

After the model is fit, we can forecast for the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert scaling on the test dataset with the expected pollution numbers.

With forecasts and actual values in their original scale, we can then calculate an error score for the model. In this case, we calculate the Root Mean Squared Error (RMSE) that gives error in the same units as the variable itself.

```python
from math import sqrt
from numpy import concatenate
from sklearn.metrics import mean_squared_error

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```
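The forecast is concatenated with the other input columns because the scaler was fit on all 8 columns, so `inverse_transform()` expects an array of the same width. Since min-max scaling works column by column, the padding values do not affect the inverted pollution column. The sketch below demonstrates this on made-up data with 3 columns; the arrays and the `inv_a`/`inv_b` names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: 5 rows, 3 features; column 0 plays the role of pollution.
values = np.array([[10.0, 1.0, 100.0],
                   [20.0, 2.0, 200.0],
                   [30.0, 3.0, 300.0],
                   [40.0, 4.0, 400.0],
                   [50.0, 5.0, 500.0]])
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)

# Pretend these are scaled forecasts for column 0, with shape (n, 1).
yhat = np.array([[0.0], [0.25], [1.0]])

# Approach used in the tutorial: pad to the original width, invert, keep column 0.
padding = scaled[:3, 1:]
inv_a = scaler.inverse_transform(np.concatenate((yhat, padding), axis=1))[:, 0]

# Equivalent shortcut: invert column 0 alone from its fitted minimum and range.
inv_b = yhat[:, 0] * scaler.data_range_[0] + scaler.data_min_[0]
print(inv_a, inv_b)
```

The shortcut relies on `feature_range=(0, 1)`; for another range the formula would need to account for the range's own minimum and width.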

### Complete Example

The complete example is listed below.

**NOTE**: This example assumes you have prepared the data correctly, e.g. converted the downloaded “*raw.csv*” to the prepared “*pollution.csv*“. See the first part of this tutorial.

```python
from math import sqrt
from numpy import concatenate
from matplotlib import pyplot
from pandas import read_csv
from pandas import DataFrame
from pandas import concat
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# convert series to supervised learning
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
dataset = read_csv('pollution.csv', header=0, index_col=0)
values = dataset.values
# integer encode direction
encoder = LabelEncoder()
values[:,4] = encoder.fit_transform(values[:,4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9,10,11,12,13,14,15]], axis=1, inplace=True)
print(reframed.head())

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24
train = values[:n_train_hours, :]
test = values[n_train_hours:, :]
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))
model.compile(loss='mae', optimizer='adam')
# fit network
history = model.fit(train_X, train_y, epochs=50, batch_size=72, validation_data=(test_X, test_y), verbose=2, shuffle=False)
# plot history
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()

# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
# calculate RMSE
rmse = sqrt(mean_squared_error(inv_y, inv_yhat))
print('Test RMSE: %.3f' % rmse)
```

Running the example first creates a plot showing the train and test loss during training.

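Interestingly, we can see that test loss drops below training loss. This is the reverse of the usual overfitting signature, where test loss ends up above training loss, so the model may instead be underfitting the training data, or the train and test periods may differ in difficulty. Measuring and plotting RMSE during training may shed more light on this.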

The train and test loss are printed at the end of each training epoch. At the end of the run, the final RMSE of the model on the test dataset is printed.

We can see that the model achieves a respectable RMSE of 26.496, which is lower than an RMSE of 30 found with a persistence model.

```
...
Epoch 46/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 47/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 48/50
0s - loss: 0.0144 - val_loss: 0.0133
Epoch 49/50
0s - loss: 0.0143 - val_loss: 0.0133
Epoch 50/50
0s - loss: 0.0144 - val_loss: 0.0133
Test RMSE: 26.496
```
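A persistence model simply forecasts each hour with the value observed one hour earlier. A minimal sketch of how such a baseline RMSE could be computed is shown below; the `rmse()` and `persistence_rmse()` helpers and the five toy readings are assumptions for illustration, not the code used to produce the figure quoted above.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error, in the same units as the inputs."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def persistence_rmse(series):
    """RMSE of forecasting each value with the value one step earlier."""
    series = np.asarray(series, dtype=float)
    return rmse(series[1:], series[:-1])

# Hypothetical pollution readings in original (unscaled) units.
toy = [129.0, 148.0, 159.0, 181.0, 138.0]
print('Persistence RMSE: %.3f' % persistence_rmse(toy))
```

To compare fairly with the LSTM, the baseline should be computed on the same test period and in the original units, for example by passing the inverted `inv_y` array to `persistence_rmse()`.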

This model is not tuned. Can you do better?

Let me know your problem framing, model configuration, and RMSE in the comments below.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- Beijing PM2.5 Data Set on the UCI Machine Learning Repository
- The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras
- Time Series Forecasting with the Long Short-Term Memory Network in Python
- Multi-step Time Series Forecasting with Long Short-Term Memory Networks in Python

## Summary

In this tutorial, you discovered how to fit an LSTM to a multivariate time series forecasting problem.

Specifically, you learned:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

except wind *dir*, which is categorical.

Thanks, fixed!

Great post Jason. Thank you so much for making this material available for the community..

Thanks Francois, I’m glad it helped!

Hi Jason. I ran into a problem in my environment, which uses Keras 2.0.4 and tensorflow-GPU 0.12.0rc0.

The error was: “TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.”

It was raised at the line “model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))”.

Could you please help me with that?

Regards,

yao

I would recommend this tutorial for setting up your environment:

http://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Thx a lot, doctor, it works! fabulous! 🙂

I’m glad to hear that.

Dr.Jason, I update TensorFlow then it works!

Sorry to bother you.

Thank you very much !

Best wishes !

I’m glad to hear that!

I met the same problem .

Did you uninstall all the programs previously installed or just set up the environment again?

Thx a lot!

Hi Jason, I set up my environment as described in your tutorial.

scipy: 0.19.0

numpy: 1.12.1

matplotlib: 2.0.2

pandas: 0.20.1

statsmodels: 0.8.0

sklearn: 0.18.1

theano: 0.9.0.dev-c697eeab84e5b8a74908da654b66ec9eca4f1291

tensorflow: 0.12.1

Using TensorFlow backend.

keras: 2.0.5

But the bug still exists. Is my version of TensorFlow too old? What should I do?

Thanks!

It might be, I am running v1.2.1.

Perhaps try running Keras off Theano instead (e.g. change the backend in the ~/.keras/keras.json config)

It seems that inv_y = scaler.inverse_transform(test_X)[:,0] is not the actual, should inv_yhat be compared with test_y but not pollution(t-1)? Because I think this inv_y here means pollution(t-1). Is this prediction equals to only making a time shifting from the current known pollution value (which means the models just take pollution(t) as the prediction of pollution(t+1))?

Sorry, I’m not sure I follow. Can you please restate your question, perhaps with an example?

Sorry for the confusing expression. In fact, the series_to_supervised() function would create a DataFrame whose columns are: [ var1(t-1), var2(t-1), …, var1(t) ] where ‘var1’ represents ‘pollution’, therefore, the first dimension in test_X (that is, test_X[:,0]) would be ‘pollution(t-1)’. However, in the code you calculate the rmse between inv_yhat and test_X[:,0], even though the rmse is low, it could only shows that the model’s prediction for t+1 is close to what it has known at t.

I am asking this question because I’ve ran through the codes and saw the models prediction pollution(t+1) looks just like pollution(t). I’ve also tried to use t-1, t-2 and so on for training, but still changed nothing.

Do you think the model tends to learn to just take the pollution value at current moment as the prediction for the next moment?

thanks 🙂

If we predict t for t+1 that is called persistence, and we show in the tutorial that the LSTM does a lot better than persistence.

Perhaps I don’t understand your question? Can you give me an example of what you are asking?

Hmm, it’s difficult to explain without a graph.

In a word, and also it’s an example, I want to ask two questions:

1. In the “make a prediction” part of your codes, why it computes rmse between predicted t+1 and real t, but not between predicted t+1 and real t+1?

2. After the “make a prediction” part of your codes run, it turns out that rmse between predicted t+1 and real t is small, is it an evidence that LSTM is making persistence?

RMSE is calculated for y and yhat for the same time periods (well, that was the intent), why do you think they are not?

Is there a bug?

I think Songbin Xu is right. By executing the statement at line 90: inv_y = inv_y[:,0], you compare the inv_yhat with inv_y. inv_y is the pollution(t-1) and inv_yhat is the predicted pollution(t).

On line 50 the second parameter the function series_to_supervised can be changed to 3 or 5, so more days of history are used. If you do so, an error occurs in the scaler.inverse_transform (line 89).

No worries, great tutorial and I learned a lot so far!

I see now, you guys are 100% correct. Thank you!

I have updated the calculation of RMSE and the final score reported in the post.

Note, I ran a ton of experiments on AWS with many different lag values > 1 and none achieved better results than a simple lag=1 model (e.g. an LSTM model with no BPTT). I see this as a bad sign for the use of LSTMs for autoregression problems.

Hi Jason, great post!

Is it necessary to remove seasonality (by seasonal differencing) when we are using an LSTM?

No, but results are often better.

Good article, thank.

Two questions:

What changes will be required if your data is sporadic? Meaning sometimes it could be 5 hours without the report.

And how do you add more timesteps into your model? Obviously you have to reshape it properly but you also have to calculate it properly.

You could fill in the missing data by imputing or ignore the gaps using masking.

What do you mean by “add more timesteps”?

But what should I do if all data is stochastic time sequence?

For example predicting time till the next event – when events frequency is stochastically distributed on the timeline.

Good question, this sounds like survival analysis to me, perhaps see if it applies:

https://en.wikipedia.org/wiki/Survival_analysis

Dr.Jason,

Thank you for an awesome post.

(I was practicing on load forecast using MLP and SVR (You also suggested on a comment in your other LSTM tutorials). I also tried with LSTM and it did almost perform like SVR. However, in LSTM, I did not consider time lags because I have predicted future predictor variables that I was feeding as test set. I will try this method with time lags to cross validate the models)

Nice Jack, let me know how you go.

Hi Jason,

Can I use ‘look back'(Using t-2 , t-1 steps data to predict t step air pollution) in this case?

If it’s available,that my input data shape will be [samples , look back , features] isn’t it?

You can Adam, see the series_to_supervised() function and its usage in the tutorial.

Hi Jason,

If I used n_in=5 in series_to_supervised() function,in your tutorial the input shape will be [samples, 1 , features*5].Can I reshape it to [samples, 5 , features]?If I can, what is the difference between these two shape?

The second dimension is time steps (e.g. BPTT) and the third dimension are the features (e.g. observations at each time step). You can use features as time steps, but it would not really make sense and I expect performance to be poor.

Here’s how to build a model with multiple time steps for multiple features:
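A minimal sketch of what such a setup might look like is given here, using a compact version of *series_to_supervised()* and random stand-in data. The name `n_hours` and the value 3 are assumptions; `n_obs` and `n_features` follow the names readers use in this thread.

```python
import numpy as np
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1):
    """Compact version of the tutorial's function (always drops NaN rows)."""
    df = DataFrame(data)
    cols = [df.shift(i) for i in range(n_in, 0, -1)]
    cols += [df.shift(-i) for i in range(n_out)]
    return concat(cols, axis=1).dropna()

n_hours, n_features = 3, 8    # assumed lag length; 8 features as in the tutorial
n_obs = n_hours * n_features  # number of lagged input columns

# Hypothetical stand-in for the scaled dataset: 100 hours x 8 features.
scaled = np.random.rand(100, n_features).astype('float32')
values = series_to_supervised(scaled, n_hours, 1).values

# Inputs are the first n_obs columns; the output is var1(t), which is the
# first column of the final block of n_features columns.
X, y = values[:, :n_obs], values[:, -n_features]

# [samples, timesteps, features]: each time step keeps its own 8 features.
X = X.reshape((X.shape[0], n_hours, n_features))
print(X.shape, y.shape)
```

The LSTM's `input_shape` would then be `(n_hours, n_features)`. When inverting the scaling afterwards, only `n_features` columns should be passed to the scaler, not all `n_obs` lagged columns.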

And that’s it. I just tested and it looks good. The RMSE calculation will blow up, but you guys can fix that up I figure.

Jason, great post, very clear, and very useful!! I’m about 90% with you and think a few folks may be stuck on this final point if they try to implement multi-feature, multi-hour-lookback LSTM.

Seems like by making adjustments above, I’m able to make a prediction, but the scaling inversion doesn’t want to cooperate. The reshape step now that we have multiple features and multiple timesteps has a mismatch in the shape, and even if I make the shape work, the concatenation and inversion still don’t work. Could you share what else you changed in this section to make it work? I’m not so concerned about the RMSE as much as that I can extract useful predictions. Thank you for any insight since you’ve been able to do it successfully.

```python
# make a prediction
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], test_X.shape[2]))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
```

…

Hi Jason,

Great and useful article.

I am somewhat puzzled by the number of features you specify to forecast the pollution rate based on data from the previous 24 hours.

Do not we have 8 features for each time-step and not 7?

After generating data to supervise with the function series_to_supervised(scaled,24, 1), the resulting array has a shape of (43800, 200) which is 25 * 8.

To invert the scaling for forecast I made few modifications. I used scaled.shape[1] below but in my opinion it could be n_features. Moreover, I don’t know if the values concatenated to yhat and test_y really matter, as long as they have been scaled with fit_transform and the array has the right shape.

```python
yhat = model.predict(test_X)
test_X = test_X.reshape((test_X.shape[0], n_obs))
# invert scaling for forecast
inv_yhat = concatenate((yhat, test_X[:, 1:scaled.shape[1]]), axis=1)
inv_yhat = scaler.inverse_transform(inv_yhat)
inv_yhat = inv_yhat[:,0]
# invert scaling for actual
test_y = test_y.reshape((len(test_y), 1))
inv_y = concatenate((test_y, test_X[:, 1:scaled.shape[1]]), axis=1)
inv_y = scaler.inverse_transform(inv_y)
inv_y = inv_y[:,0]
```

The model has 4 layers with dropout.

After 200 epochs I have got

loss: 0.0169 – val_loss: 0.0162

And a rmse = 29.173

Regards.

We have 7 features because we drop one in section “2. Basic Data Preparation”.

Hi Jason,

It’s really weird to me :(, as I used your code to prepare the data (pollution.csv) and I have 9 fields in the resulting file.

[date, pollution, dew, temp, press, wnd_dir, wnd_spd, snow, rain]

😯

Date and wind direction are dropped during data preparation, perhaps you accidentally skipped a step or are reviewing a different file from the output file?

Hi Jason,

So that’s fine, in my case I have 8 features.

When reading the file, the field ‘date’ becomes the index of the dataframe and the field ‘wnd_dir’ is later label encoded, as you do above in “The complete example” lines 42-43.

It is now much clearer for me. I am not puzzled anymore. 😉

Thanks a lot for all the information contained in your articles and your e-books.

They are really very informative.

🙂

I’m glad to hear that!

Hi Jason,

I think the output is column var1(t), that means:

train_X, train_y = train[:, 0:n_obs], train[:, -(n_features+1)]

am I right?

In case the “pollution” is in the last column, it is easy to get train[:, -1]

am i right?

I just want to verify that I understand your post.

Thank you, Jason

Hi Jason, I get the following error from line # 82 of your ‘Complete Example’ code.

ValueError: Error when checking : expected lstm_1_input to have 3 dimensions, but got array with shape (34895, 8)

I think LSTM() is looking for (sequences, timesteps, dimensions). In your code, line # 70, I believe 50 is timesteps while input_shape (1,8) represents the dimensions. May be it’s missing ‘sequences’ ?

Appreciate your response.

Ensure that you first prepare the data (e.g. convert “raw.csv” to “pollution.csv”).

Hi Jason, I am wondering what the issue that I’m getting is caused by, maybe a different type of dataset then the example one. basically when I run the history into the model, When i check the History.history.keys() I only get back ‘loss’ as my only key.

You must specify the metrics to collect when you compile the model.

For example, in classification:

Hello Jason,

Thank you for such a nice tutorial.

Since you have published a similar topic and few other related topics in one of your paid books (LSTM networks), should the reader also expect some different topics covered in it?

I’m an ardent fan of your blogs since it covers most of the learning material and therefore, it makes me wonder that will be different in your book?

Thanks Arman.

The book does not cover time series, instead it focuses on teaching you how to implement a suite of different LSTM architectures, as well as prepare data for your problems.

Some ideas were tested on the blog first, most are only in the book.

You can see the full table of contents here:

http://machinelearningmastery.com/lstms-with-python/

The book provides all the content in one place, code as well, more access to me, updates as I fix bugs and adapt to new APIs, and it is a great way to support my site so I can keep doing this.

Thank you for accepting my opinions, such a pleasure!

Running the codes u modified, still something puzzles me here,

1. Have u drawn the waveforms of inv_y and inv_yhat in the same plot? I think they looks quite like persistence.

2. Curiously, I computed the rmse between pollution(t) and pollution(t-1) in test_X, it’s 4.629, much lower than your final score 26.496, does it mean LSTM performs even worse than persistence?

3. I’ve tried to remove var1 at t-1, t-2, … , and I’ve also tried to use lag values>1, and also assign different weights to the inputs at different timesteps, but none of them improved, they performed even worse.

Do you have any other ideas to keep the model from simply learning persistence?

Looking forward to your advice 🙂

Thank you for pointing out the fault!

The final line plot shows loss on the transformed train and test sets.

Yes, LSTMs are no good at autoregression, yet I keep getting asked to develop examples (tens of emails per day)… See here:

http://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Consider developing a baseline with an MLP, you’ll find it tough to beat it with an LSTM!
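One way to check whether a model has merely learned persistence is to score a naive persistence forecast on the same series and compare the RMSE values. The following is a minimal sketch, not the tutorial's code: `persistence_rmse` is a hypothetical helper and the readings are made up for illustration.

```python
import numpy as np

def persistence_rmse(observed):
    """RMSE of the naive forecast value(t) = value(t-1)."""
    observed = np.asarray(observed, dtype=float)
    forecast = observed[:-1]   # value at t-1, used as the prediction for t
    actual = observed[1:]      # value actually observed at t
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

# Made-up hourly readings, for illustration only
observed = [10.0, 12.0, 11.0, 15.0, 14.0]
print(persistence_rmse(observed))
```

If a trained model cannot beat this number on the same test period and in the same units, it is probably not adding skill over persistence.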

Why are you only training with a single timestep (or sequence length)? Shouldn’t you use more timesteps for better training/prediction? For instance in https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py they use 40 (maxlen) timesteps

Yes, it is just an example to help you get started. I do recommend using multiple time steps in order to get the full BPTT.

Hi Jason and Varuna,

When the timesteps = 1 as you mentioned, does it mean the value of t-1 time was used to predict the value of t time? Is moving window a method to use multiple time steps? Is there any other way? Has Keras any functions of moving window?

Thank you very much.

Keras treats the “time steps” of a sequence as the window, kind of. It is the closest match I can think of.
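A moving window is one common way to give each sample more than one time step. The sketch below builds the 3D array that an LSTM layer expects, (samples, timesteps, features), using plain NumPy. The helper name `make_windows` and the choice of column 0 as the target are assumptions made for illustration, not part of the tutorial.

```python
import numpy as np

def make_windows(data, n_steps):
    """Split a (rows, features) array into overlapping windows.

    Returns X with shape (samples, n_steps, features) and y holding
    column 0 of the row that follows each window.
    """
    data = np.asarray(data, dtype=float)
    X, y = [], []
    for start in range(len(data) - n_steps):
        end = start + n_steps
        X.append(data[start:end])   # n_steps consecutive rows as input
        y.append(data[end, 0])      # next value of column 0 as target
    return np.array(X), np.array(y)

data = np.arange(20).reshape(10, 2)   # 10 time steps, 2 features
X, y = make_windows(data, n_steps=3)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

With `n_steps=1` this reduces to the single time step setup discussed above, where only the values at t-1 are used to predict t.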

Hi Jason,

I met some problem when learning your codes.

dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)

Traceback (most recent call last):

File “”, line 1, in

dataset = read_csv(‘D:\Geany\scriptslym\raw.csv’, parse_dates = [[‘year’, ‘month’, ‘day’, ‘hour’]],index_col=0, data_parser=parse)

NameError: name ‘parse’ is not defined

>>>

It looks like you have specified a function “parse” but not defined it.
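For reference, a parser of this kind has to be defined before the call that uses it. A minimal sketch, assuming the four date columns arrive joined by single spaces (for example "2010 1 2 0"):

```python
from datetime import datetime

def parse(x):
    # x is the year, month, day and hour joined by spaces
    return datetime.strptime(x, '%Y %m %d %H')

print(parse('2010 1 2 0'))  # 2010-01-02 00:00:00
```

Note also that the pandas keyword is `date_parser`; the `data_parser` spelling in the snippet above would be rejected as an unexpected keyword once the NameError is resolved.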

Hi Jason,

Can I use “keras.layers.normalization.BatchNormalization” as a substitute for “sklearn.preprocessing.MinMaxScaler”?

No, they do very different things.

Hi Jason, it’s a very informative article. Thanks. I have a question regarding forecasting in time series. You used the training data with all the columns (after the variable transformations) while learning, and the same was done for the test data. The test data, along with all of its variables, was used during prediction. For instance, if I want to predict the pollution for a future date, do I need to know the other inputs like dew, pressure, wind dir etc. for that future date, which I won’t have?

Another question: suppose we have the same data for multiple regions (assume that the pollution shared among these regions is not negligible). How can we build a model so that the inputs at prediction time are the region name and the time to forecast, for just that one region?

It depends on how you define your model.

The model defined above uses the variables from the prior time step as inputs to predict the next pollution value.

In your case, maybe you want to build a separate model per region, perhaps a model that improves performance by combining models across regions. You must experiment to see what works best for your data.

Thanks! I missed the trick of converting the time-series to supervised learning problem. That alone is sufficient even for multiple regions I guess. We just have to submit the input parameters of the previous time stamp for the specific region during prediction. We may also try one-hot encoding on the region variable too during data preprocessing.

Thank you for your excellent blog, Jason. I’ve really learnt a lot from your nice work recently. After this post, I now know how to transform data into the format an LSTM expects and how to construct an LSTM model.

Like Naveen Koneti, I have the same puzzle.

Recently I’ve worked on some clinical data. The data is not like the one used in this demo. It consists of hundreds of patients, and each patient has several vital sign records. If it were one individual’s records over many years, I could process the data as you showed us. I wonder how I can handle this kind of data. Could you give me some advice, or tell me where I can find solutions for it?

If I didn’t state my question clearly and you’re interested in it, please let me know.

Thanks in advance.

PS. the data set in my situation is like this

[ID date feature1 feature2 feature3 ]

[patient1 date1 value11 value12 value13 ]

[patient1 date2 value21 value22 value23 ]

[patient2 date1 value31 value32 value33 ]

[patient2 date2……………………………………..]

[patient3 ……………………………………………..]

You could model one patient at a time, or groups or all of them. Try different approaches and see what works best.

I cannot tell you what would work best – I have no idea – you must discover it.

See this post:

http://machinelearningmastery.com/a-data-driven-approach-to-machine-learning/

Hi,

again a nice post for the use of lstm’s!

I had the following idea when reading.

I would like to build a network, in which each feature has its own LSTM neuron/layer, so that the input is not fully connected.

My idea is adding a lstm layer for each feature and merge it with the merge layer and feed these results to the output neurons.

Is there a better way to do this? Or would you recommend to avoid this because the features are poorly abstracted? On the other hand, this might also be interesting.

Thank you!

Try it and see if it can out-perform a model that learns all features together.

Also, contrast to an MLP with a window – that often does better than LSTMs on autoregression problems.

Hi Jason,

I have two questions:

1) I have a question/observation regarding the scaling of the Y variable (pollution). The way you implement the rescaling to [0, 1], you consider the entire length of the array (all of the 43799 observations after the dropna).

Is it right to rescale it that way? By doing so we are incorporating information from the future (test set) into the past (train set), because the scaler is “exposed” to both of them, and therefore we introduce bias.

If you agree with my point what could be a fix?

2) Also the activation function of the output (Y variable) is sigmoid, that’s why we rescale it within the [0,1] range. Am I correct?

Thanks for sharing the article!

No, ideally you would develop a scaling procedure on the training data and use it on test and when making predictions on new data.

I tried to keep the tutorial simple by scaling all data together.

The activation on the output layer is ‘linear’, the default. This must be the case because we are predicting a real-value.
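To make the first point concrete, here is a minimal sketch of learning the scaling from the training rows only and then applying it unchanged to the test rows. It uses hand-written min-max scaling in NumPy on a made-up array; `fit_min_max` and `apply_min_max` are hypothetical helpers. With scikit-learn's MinMaxScaler the same idea would be to call `fit` on the training data and `transform` on both sets.

```python
import numpy as np

def fit_min_max(train):
    """Learn per-column min and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0   # guard against constant columns
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale any rows with statistics learned elsewhere."""
    return (data - col_min) / col_range

# Made-up data: 4 time steps, 2 columns; last row is the "future"
values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 50.0]])
train, test = values[:3], values[3:]

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(test_scaled)  # test values may fall outside [0, 1]
```

Test values landing outside [0, 1] is expected with this approach, and it is not a problem for a linear output layer.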

Thank you very much for your tutorial.

I have one question: I failed to read the NW values in pollution.csv (cbwd column).

values = values.astype(‘float32’)

ValueError: could not convert string to float: NW

How do you fix it?

sorry, I saw the text above and solved it.

Glad to hear it!
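For anyone hitting the same ValueError, the wind direction column holds strings and has to be converted to numbers before the cast to float32. A minimal sketch of integer (label) encoding with plain NumPy follows; the sample values are made up, and scikit-learn's LabelEncoder would give an equivalent mapping.

```python
import numpy as np

# Made-up sample of the wind direction column
wind_dir = np.array(['NW', 'SE', 'cv', 'NE', 'NW'])

# np.unique returns the sorted categories and, with return_inverse,
# the integer index of each original value within those categories
classes, codes = np.unique(wind_dir, return_inverse=True)
print(classes)
print(codes)

# The integer codes can now be cast alongside the numeric columns
codes_float = codes.astype('float32')
```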

Hi Jason!

I assume there is a little mistake when you calculate RMSE on test data.

You must write this code before calculating RMSE:

inv_y = inv_y[:-1]

inv_yhat = inv_yhat[1:]

Thus, RMSE equals 10.6 (on the same data, in my case), that is much less than 26.5 in your case.

Sorry, I don’t understand your comment and snippet of code, can you spell out the bug you see?

Hi Jason,

great post! I was waiting for meteo problems to infiltrate the machinelearningmastery world.

Could you write something about the changed scenario where, given the weather conditions and pollution for some time, we can predict the pollution for another time or place with given weather conditions?

For example: We have the weather conditions and pollution given for Beijing in 2016, and we have the weather conditions given for Chengde (a city close to Beijing) also in 2016. Now we want to know what the pollution was in Chengde in 2016.

Would be great to learn about that!

Great suggestion, I like it. An approach would be to train the model to generalize across geographical domains based only on weather conditions.

I have tried not to use too many weather examples – I came from 6 years of work in severe weather, it’s too close to home 🙂

Hi Jason,

I have read many of your posts about LSTMs. I am not completely clear on the difference between the parameters batch_size and time_steps. Batch_size determines when the memory is reset (right?), but shouldn’t this have the same value as time_steps, which, if I have understood correctly, means how often the system makes a prediction?

Great question!

Batch size is the number of samples (e.g. sequences) that are used to estimate the gradient before the weights are updated. The internal state is reset at the end of each batch after the weights are updated.

One sample is comprised of 1 or more time steps that are stepped over during backpropagation through time. Each time step may have one or more features (e.g. observations recorded at that time).

Time steps and batch size are generally not related.

You can split up a sequence to have one time step per sequence. In that case you will not get the benefit of learning across time (e.g. BPTT), but you can reset state at the end of the time steps for one sequence. This is an odd config though, and really only good for showing off the LSTM’s memory capability.

Does that help?

Thanks, now it’s more clear!

Hi, I get this error at this step, could you help me please?

model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

—————————————————————————

TypeError Traceback (most recent call last)

in ()

—-> 1 model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))

C:\Anaconda3\lib\site-packages\keras\models.py in add(self, layer)

431 # and create the node connecting the current layer

432 # to the input layer we just created.

–> 433 layer(x)

434

435 if len(layer.inbound_nodes) != 1:

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in __call__(self, inputs, initial_state, **kwargs)

241 # modify the input spec to include the state.

242 if initial_state is None:

–> 243 return super(Recurrent, self).__call__(inputs, **kwargs)

244

245 if not isinstance(initial_state, (list, tuple)):

C:\Anaconda3\lib\site-packages\keras\engine\topology.py in __call__(self, inputs, **kwargs)

556 '`layer.build(batch_input_shape)`')

557 if len(input_shapes) == 1:

–> 558 self.build(input_shapes[0])

559 else:

560 self.build(input_shapes)

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in build(self, input_shape)

1010 initializer=bias_initializer,

1011 regularizer=self.bias_regularizer,

-> 1012 constraint=self.bias_constraint)

1013 else:

1014 self.bias = None

C:\Anaconda3\lib\site-packages\keras\legacy\interfaces.py in wrapper(*args, **kwargs)

86 warnings.warn('Update your `' + object_name +

87 '` call to the Keras 2 API: ' + signature, stacklevel=2)

—> 88 return func(*args, **kwargs)

89 wrapper._legacy_support_signature = inspect.getargspec(func)

90 return wrapper

C:\Anaconda3\lib\site-packages\keras\engine\topology.py in add_weight(self, name, shape, dtype, initializer, regularizer, trainable, constraint)

389 if dtype is None:

390 dtype = K.floatx()

–> 391 weight = K.variable(initializer(shape), dtype=dtype, name=name)

392 if regularizer is not None:

393 self.add_loss(regularizer(weight))

C:\Anaconda3\lib\site-packages\keras\layers\recurrent.py in bias_initializer(shape, *args, **kwargs)

1002 self.bias_initializer((self.units,), *args, **kwargs),

1003 initializers.Ones()((self.units,), *args, **kwargs),

-> 1004 self.bias_initializer((self.units * 2,), *args, **kwargs),

1005 ])

1006 else:

C:\Anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py in concatenate(tensors, axis)

1679 return tf.sparse_concat(axis, tensors)

1680 else:

-> 1681 return tf.concat([to_dense(x) for x in tensors], axis)

1682

1683

C:\Anaconda3\lib\site-packages\tensorflow\python\ops\array_ops.py in concat(concat_dim, values, name)

998 ops.convert_to_tensor(concat_dim,

999 name=”concat_dim”,

-> 1000 dtype=dtypes.int32).get_shape(

1001 ).assert_is_compatible_with(tensor_shape.scalar())

1002 return identity(values[0], name=scope)

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py in convert_to_tensor(value, dtype, name, as_ref, preferred_dtype)

667

668 if ret is None:

–> 669 ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)

670

671 if ret is NotImplemented:

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in _constant_tensor_conversion_function(v, dtype, name, as_ref)

174 as_ref=False):

175 _ = as_ref

–> 176 return constant(v, dtype=dtype, name=name)

177

178

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\constant_op.py in constant(value, dtype, shape, name, verify_shape)

163 tensor_value = attr_value_pb2.AttrValue()

164 tensor_value.tensor.CopyFrom(

–> 165 tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))

166 dtype_value = attr_value_pb2.AttrValue(type=tensor_value.tensor.dtype)

167 const_tensor = g.create_op(

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape)

365 nparray = np.empty(shape, dtype=np_dt)

366 else:

–> 367 _AssertCompatible(values, dtype)

368 nparray = np.array(values, dtype=np_dt)

369 # check to them.

C:\Anaconda3\lib\site-packages\tensorflow\python\framework\tensor_util.py in _AssertCompatible(values, dtype)

300 else:

301 raise TypeError(“Expected %s, got %s of type ‘%s’ instead.” %

–> 302 (dtype.name, repr(mismatch), type(mismatch).__name__))

303

304

TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.

Perhaps check that your environment is set up correctly:

http://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Also, ensure that you have copied all of the code.

Hi Jason,

I was curious if you can point me in the right direction for converting data back to the actual values instead of scaled.

Yes, you can invert the scaling.

This tutorial demonstrates how to do that Neal.

Hi Jason, I did have an issue converting back to actual values, but was able to get past it using the drop columns on the reframed data which got me past it.

When looking at my predicted values vs actual values, I’m noticing that my first column has a prediction and a true value, but for every other variable, I only see what I can assume is a prediction. Does this make a prediction on every column, or just one particular one?

I’m sorry for asking a question such as this, I just think I’m confusing myself looking at my results.

The code in the tutorial only predicts pollution.

Dr. Jason,

I have been trying with my own dataset and I am getting the error “ValueError: operands could not be broadcast together with shapes (168,39) (41,) (168,39)” when I try to do `inv_yhat = scaler.inverse_transform(inv_yhat)` as you have in line 86 in your script. I still cannot figure out where my issue is. I have `yhat.shape` as (168,1) and `test_X.shape` as (168,38). When I do `inv_yhat = np.concatenate((yhat, test_X[:, 1:]), axis=1)`, my `inv_yhat.shape` is (168,39). I still cannot figure out why `inverse_transform` gives that error.

The shape of the data must be the same when inverting the scale as when it was originally scaled.

This means, if you scaled with the entire test dataset (all columns), then you need to tack the yhat onto the test dataset for the inverse. We jump through these exact hoops at the end of the example when calculating RMSE.
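The error quoted above, with shapes (168,39) and (41,), suggests the scaler was fit on 41 columns while the array passed to the inverse has 39. A minimal sketch of the padding idea follows, using hand-written min-max scaling on a toy 3-column array so that it stays self-contained. All values are made up, the target is assumed to be column 0, and `inverse_min_max` is a hypothetical stand-in for `scaler.inverse_transform`.

```python
import numpy as np

# Toy stand-in for the original dataset: 3 columns, target in column 0
raw = np.array([[10.0, 1.0, 100.0],
                [20.0, 2.0, 200.0],
                [30.0, 3.0, 400.0]])
col_min = raw.min(axis=0)
col_range = raw.max(axis=0) - col_min
scaled = (raw - col_min) / col_range

def inverse_min_max(data):
    # Only valid for arrays with the same number of columns as `raw`
    return data * col_range + col_min

yhat = np.array([[0.25], [0.75]])   # scaled predictions, shape (2, 1)
test_X = scaled[:2]                 # scaled inputs, shape (2, 3)

# Put yhat in the target column and borrow the other columns from test_X
padded = np.concatenate((yhat, test_X[:, 1:]), axis=1)
assert padded.shape[1] == raw.shape[1]     # must match the fitted width
inv_yhat = inverse_min_max(padded)[:, 0]   # keep only the target column
print(inv_yhat)  # [15. 25.]
```

Printing the number of columns the scaler was fit on next to `padded.shape` is usually the quickest way to find the mismatch.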

This seems to be the same issue I am having at the moment. I concatenate my inv_yhat with my test_X like you said, but the shape of inv_yhat afterwards still does not match the second shape in the error (in the post’s case, (41,)).

Ask a question in stackoverflow and post the link, I should be able to help. I spent lots of time on this and have a decent idea now.

Yes, you’re right! I did that and it worked, nice! Thank you for your comment!

Glad to hear that Jack.

I am having the same problem, but cannot solve the issue. Every time I try to concatenate them together, there is no change to my inv_yhat variable. I am still unable to understand this issue; if you can expand a bit more, that would be amazing.

@John Regilina,

Check the shape of the data after you scale it, and then check the shape again after you do the concatenation. Remember, your `yhat` shape will be (rowlength,1), and after concatenation `inv_yhat` should have the same shape as the data had when you scaled it. Look at Dr. Jason’s answer to my comment/question. Hope that will help. (Thanks to Dr. Jason, who saved a lot of my time.)

I am also stuck with the same thing. How did you fix it?

Hi Jason, In dataset.drop(‘No’, axis =1, inplace = True), what is the purpose of ‘axis’ and ‘inplace’?

Great question.

We specify to remove the column with axis=1 and to do it on the array in memory with inplace rather than return a copy of the array with the column removed.
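The difference is easy to see on a tiny frame. In this sketch the two rows of data are made up and only the column names echo the tutorial's dataset:

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2], 'pm2.5': [129.0, 148.0]})

# Without inplace, drop returns a new frame and leaves df unchanged
without_no = df.drop('No', axis=1)
print(list(df.columns))          # ['No', 'pm2.5']
print(list(without_no.columns))  # ['pm2.5']

# With inplace=True, df itself is modified and nothing is returned
df.drop('No', axis=1, inplace=True)
print(list(df.columns))          # ['pm2.5']
```

Using `axis=0` instead would look for a row labelled 'No' in the index rather than a column.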

Fabulous tutorials Jason!

Thanks Lizzie.

Can you show what the multivariate forecast looks like?

Looks like you missed it in the article.

Sure,

You can plot all of the predictions (inv_yhat) against the observations (inv_y) on one line plot.

It’s a mess, so you can instead plot only the last 100 time steps.

The predictions look like persistence.

Jason, what am I missing, looking at your plot of the most recent 100 time steps, it looks like the predicted value is always 1 time period after the actual? If on step 90 the actual is 17, but the predicted value shows 17 for step 91, we are one time period off, that is if we shifted the predicted values back a day, it would overlap with the actual which doesn’t really buy us much since the next hour prediction seems to really align with the prior actual. Am I missing something looking at this chart?

This is what a persistence forecast looks like, that value(t) = value(t-1).

So how would you get the true predicted value(t)? I am thinking of the last record in the time series where we are trying to predict the value for the next hour.

Sorry, I don’t follow. Perhaps you can restate your question?

Wind dir is label encoded not wind speed!!!

Yes.

First of all, thanks. All of this material on the blog is super interesting, and helpful and making me learn a lot.

Of course… I have a question.

I’m surprised by the use of LSTMs here. The property of them being “stateful” I guess is being used. But is there “sequence” information flowing?

So when I used LSTMs in Keras for text classification tasks (sentence, outcome), each “sentence” is a sequence. Each observation is a sequence. It’s an ordered array of the words in the sentence (and it’s outcome).

In this example, I could not see a sense in which var1(t-1) is linked to var1(t-2). Aren’t they being treated as independent Xs in a regression problem? (predicting var8(t))

Correct, we are not providing a sequence of observations and therefore not getting good BPTT.

Based on my tests, I have found LSTMs to be poor at autoregression, and in this case, as I added more history to the model (longer sequences), performance degraded.

I would strongly encourage you to use an MLP baseline that any LSTM would have to out-perform.

See this post for more on the limitations of LSTM for time series:

https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

Awesome article, as always.

Btw, what is your view on using an autoencoder/restricted Boltzmann layer to compress features before feeding an LSTM network? For example, if one has a financial time series to forecast, e.g. a classifier trying to predict an increase or decrease in a look-ahead time window, via numerous technical indicators and/or other candidate exogenous leading indicators.

Could you write an article based on that idea?

I have seen better results from large MLPs, nevertheless, try it and see how you go.

Autoencoder/restricted Boltzmann layers also deal with multicollinearity issues. Do MLPs also cope if you have multicollinearity in the features?

MLPs are more robust to multicollinearity than linear models.

Hi, I am always amazed at your article. Thank you.

I have a question.

Is this LSTM code now weighted for each feature?

These days I’m predicting precipitation. The trend is correct, but the amount is not right.

What’s wrong with that?:(

Thanks!

Sorry, I’m not sure I understand the question, perhaps you could rephrase it?

I can say that I would expect better skill if the data was further prepared – e.g. made stationary.

Hi Jason,

Thanks for wonderful explanation!

Could you please help me to understand dimensionality reduction concept. Should PCA or statistical approach be used before feeding the data to LSTM OR LSTM will learn correlation with the inputs provided on its own? how to approach regression problem in LSTM when we have large set of features?

Your reply is greatly appreciated!

Generally, if you make the problem simpler using data preparation, the LSTM or any model will perform better.

How can I predict a single input ?

for example :

[0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001]

how do i reshape and do a model.predict () ?

Thank you

Perhaps this post will make it clearer:

https://machinelearningmastery.com/make-predictions-long-short-term-memory-models-keras/

Thank you, Jason.

I applied:

my_x = np.array([0.036, 0.338, 0.197, 0.836, 0.333, 0.128, 0.00000001, 0.0000001])

print(my_x.shape) # (8,)

my_x = my_x.reshape((1, 1, 8))

my_pred = model.predict(my_x)

print(my_pred)

The answer is the “scaled” answer which is 0.03436

I tried applying the scaler.inverse_transform(my_pred) to GET the actual number

But I get the following error:

non-broadcastable output operand with shape (1,1) doesn’t match the broadcast shape (1,8)

Thank you

Yes, the transform requires data in the same form as when you “fit” it.
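When only a single predicted column needs to go back to original units, an alternative to padding is to invert that one column by hand with its own minimum and maximum. The sketch below assumes a (0, 1) min-max scaling and uses made-up statistics; `invert_target` is a hypothetical helper. A fitted scikit-learn MinMaxScaler exposes the real per-column values as `data_min_` and `data_max_`.

```python
import numpy as np

# Hypothetical per-column statistics recorded when the scaler was fit;
# column 0 is assumed to be the target (pollution)
data_min = np.array([0.0, -40.0, -19.0])
data_max = np.array([994.0, 28.0, 42.0])

def invert_target(scaled_pred, target_col=0):
    """Map scaled predictions for one column back to original units."""
    lo, hi = data_min[target_col], data_max[target_col]
    return np.asarray(scaled_pred) * (hi - lo) + lo

my_pred = np.array([[0.5]])   # made-up scaled model output, shape (1, 1)
print(invert_target(my_pred))
```

This sidesteps the shape mismatch because no 8-column array is needed, at the cost of handling the scaling formula yourself.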

Then what if I use multi-time step prediction? (use several lags for prediction)

The y_hat and X_test can not have the same dimension.

If the size of X or y must vary, you can use padding.

Hi Jason,

Thanks for the tutorial!

Maybe I missed something, but it seems that you provided the model with all of remaining data as ‘testdata’ and then tried predicting it? Isn’t that kind of pointless, since we should be interested in predicting unknown data in the future, instead of data that the model has already seen? Wouldn’t it make more sense to try the model to predict a first timestep into the future that neither the training nor the test data knew anything about? (Perhaps only give the model training data, but no test data, and afterwards ask it to predict first time step after training data?) How would I have to change the code to achieve that?

The model is fit on the training data, then makes a prediction for each step in the test data. The model did not “know” the answer to the test data prior to making each prediction.

Normally we would use walk-forward validation:

https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

I did use walk forward validation on other LSTM examples (use the blog search) but it confuses readers more than helps it seems.
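For readers who want to see the idea without the full LSTM code, here is a minimal sketch of walk-forward validation. Each forecast uses only values observed so far, and the true value is added to the history once the step has been scored. The persistence model and the series are made up for illustration, and `walk_forward_rmse` is a hypothetical helper.

```python
import numpy as np

def walk_forward_rmse(series, n_train, predict):
    """Score one-step forecasts, growing the history after each step."""
    history = list(series[:n_train])
    errors = []
    for actual in series[n_train:]:
        yhat = predict(history)    # forecast uses past values only
        errors.append(actual - yhat)
        history.append(actual)     # the true value becomes available
    return float(np.sqrt(np.mean(np.square(errors))))

series = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0]
rmse = walk_forward_rmse(series, n_train=3, predict=lambda h: h[-1])
print(rmse)
```

Any model can be plugged in through `predict`; refitting it inside the loop is optional and depends on how expensive training is.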

Can I use part of trainX to predict testY ? (lags needed to predict testY is in trainX) Not sure if it is a logical way to do it.

Yes.

Dear Jason Brownlee,

I have a little different question, Actually I have a sequence of characters as input and I want to project it into a multidimensional space.

I mean I want to project each sequence of chars (let’s say a word) to a vector of 100 real numbers across my corpus. So my input is a sequence of chars (any char-embedding is welcome) and my output is a vector for each sequence (which is a word), and I’m really confused about how to define the model.

I would appreciate if you give any clue help or sample code to define my model.

Thanks a lot in advance.

Keras provides an Embedding layer that you can use directly:

https://keras.io/layers/embeddings/

Hi Jason,

Thanks for the wonderful tutorial!

Could you please explain how to deal with the problem when the situation is “Predict the pollution for the complete month (assume the month has 30 days, t+1…t+30) given the “expected” weather features for that month, assuming we have been provided historic pollution and weather data on a daily basis”?

How should the data be prepared and how it should be feed into LSTM?

As I am new to LSTM models, I have problems understanding the data preparation and how to feed the data to the LSTM.

Thanks in advance for your response

Predicting for a month is called multi-step forecasting.

Here is a post on the general approach:

https://machinelearningmastery.com/multi-step-time-series-forecasting/

Here is an example of doing multi-step forecasting with an LSTM:

https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/

Hi Jason,

Thanks for sharing. I added accuracy info to model while training using ‘ metrics=[‘accuracy’] ‘.

So model.compile(loss=’mae’, optimizer=’adam’) becomes :

model.compile(loss=’mae’, optimizer=’adam’, metrics=[‘accuracy’])

This adds acc & val_acc to output. After 100 epochs the acc value appears quite low : (0.0761) :

Epoch 100/100

1s – loss: 0.0143 – acc: 0.0761 – val_loss: 0.0132 – val_acc: 0.0393

The accuracy of the model appears very low ? Is this expected ?

Further info on acc & val_acc values : https://github.com/tflearn/tflearn/issues/357 “acc is the accuracy of a batch of training data and val_acc is the accuracy of a batch of testing data.”

This is a regression problem. Accuracy does not make sense.
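A small illustration of why an accuracy-style score is uninformative for real-valued targets: predictions can be very close to the truth and still almost never match it exactly. The numbers below are made up, and the exact-match score only illustrates the idea; it is not necessarily how Keras computes its accuracy metric.

```python
import numpy as np

# Made-up scaled targets and predictions
y_true = np.array([0.120, 0.135, 0.150, 0.110])
y_pred = np.array([0.118, 0.139, 0.149, 0.113])

exact_match = float(np.mean(y_true == y_pred))   # accuracy-style score
mae = float(np.mean(np.abs(y_true - y_pred)))    # regression error

print(exact_match)  # 0.0
print(mae)
```

An error metric such as MAE or RMSE describes this kind of model far better.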

Hi Jason, I’ve recently discovered your site and have been so pleased with your information – thank you. I’ve been trying to model data which is much like the air quality data described here, but every few time steps there will be a change in the number of features present.

Example: in my data a time step = 1 day and a sequence can be 800 – 1200 days long. Normally the data consists of features

– pm2.5: PM2.5 concentration

– DEWP: Dew Point

– TEMP: Temperature

– PRES: Pressure

– cbwd: Combined wind direction

– Iws: Cumulated wind speed

– Is: Cumulated hours of snow

– Ir: Cumulated hours of rain

But then every (random-ish amount of time) there will be an additional number of features for a day and then back to the baseline number of features.

I’ve no idea on how to handle variable feature length. I’ve seen and played with plenty of variable sequence length examples, but I have both variable sequenceS and features. I’d love your input!

Thanks!

-Eric

You will need to normalize the number of features to be consistent for all time.

Is it possible to use (what in TensorFlow – land is called) SparseFeatures or SparseTensors to represent sparse datasets, or is there a fundamental issue with handling sparse datasets within RNNs?

Good question, I’m not sure off the cuff. Keras may support sparse numpy arrays – try it and see?

Hi Jason,

Thanks for the amazing articles. They are really helpful.

Lets say I want to forecast with lead 2. I mean by that forecasting values at time t using t-2 values, without using t-1 elements. I have to remove columns from reframed after running function series_to_supervised right ? To remove all columns with values t-1?

reframed.drop(reframed.columns[…])

Thanks

Yep, looks good.
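A minimal sketch of that column removal, assuming the reframed columns follow the `var1(t-1)` naming used in this thread. The frame here is a made-up stand-in with two lags of two variables plus the target.

```python
import pandas as pd

# Hypothetical reframed frame: two lags of two variables plus the target
reframed = pd.DataFrame({
    'var1(t-2)': [1, 2], 'var2(t-2)': [3, 4],
    'var1(t-1)': [5, 6], 'var2(t-1)': [7, 8],
    'var1(t)': [9, 10],
})

# Select every column that belongs to lag t-1 and drop it
lag1_cols = [c for c in reframed.columns if c.endswith('(t-1)')]
lead2 = reframed.drop(columns=lag1_cols)
print(list(lead2.columns))  # ['var1(t-2)', 'var2(t-2)', 'var1(t)']
```

Selecting by name rather than by position keeps the code correct if the number of variables changes.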

Hello!

Thanks for articles.

I have a question related with time series. Is it possible to forecast all variables? For example, I have ‘pollution’, ‘dew’, ‘temp’, ‘press’, ‘wnd_dir’, ‘wnd_spd’, ‘snow’, ‘rain’ and want to predict all of them for the next hour. We know about trends and common rules (because of data amount: few years), so we can do forecasting. Where can I find more info about it?

Yes, this example can be modified to predict each variable.

Thank you Jason for the great tutorial! I’m adapting it for different data, and I’m trying to use more than one time step. However, I noticed something strange in series_to_supervised: since the first loop ends at 0 and the last loop starts at 0, won’t there be two columns that are the same?

No, try it with the data and see.
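The reason is that Python's `range` excludes its stop value. Assuming the two loops have the form `range(n_in, 0, -1)` for the lagged inputs and `range(0, n_out)` for the forecast steps, the offsets they cover can be listed directly:

```python
n_in, n_out = 3, 2

# Offsets visited by the input (lag) loop and the output (forecast) loop
input_lags = list(range(n_in, 0, -1))
output_steps = list(range(0, n_out))

print(input_lags)    # [3, 2, 1] -> columns t-3, t-2, t-1
print(output_steps)  # [0, 1]    -> columns t, t+1
```

The first loop stops before reaching 0, so only the second loop produces the column for time t.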

Hi Jason,

Thanks for the tutorial. I had just one question though.

I’ve seen tutorials using multivariate time series to train on a lot of datasets at the same time (all correlated with each other), and they were able to predict for each dataset used.

For the sake of argument, let’s say that one of the datasets is broken: the sensor that feeds it is out of service (say, at some point one of the columns of data only has 0 instead of the real value). Do you think that we could use the other spots to continue to predict the broken one? (There is correlation between them, and there would be a lot of non-broken data from before the bug.)

Best regards,

Yes, you could try it and see. Or impute the missing data and see if that is better.

Thank you Jason,

I shall try that as soon as possible. I guess that the overall accuracy will be lower for every prediction (since my goal is to use multivariate input, feed it every spot’s dataset and predict each of them, with the possibility to predict a broken one), so one spot being fed “wrong” data should lower each spot’s accuracy, no?

Best regards,

It will.

Is there any time parser like date parser? I am working with data which is in milliseconds.

It can handle parsing dates and times I believe.

i got this error when i tried to run the program

pyplot.plot(history.history[‘val_loss’], label=’test’)

KeyError: ‘val_loss’

Ensure you copy all of the code.

Hi Jason,

Wouldn’t it be better to scale the data after you run the series_to_supervised function? As it stands now, the inverse scaling doesn’t work if n_in > 1 since the dimensions don’t line up anymore.

It would, but the scaling would be column-wise and incorrect.

Could you expand more on this and how the code might be modified to incorporate multi-step? I’m also playing around with turning this into a classification problem, would it still work if the feature we are trying to predict is a classifier?

I give the code to do this in another comment.

For classification, you will need to change the number of neurons in the output layer, the activation function in the output layer and the loss function.

I have a little question. I’ve successfully built my own LSTM multivariate NN using your code as a basis (thanks!). It forecasts export growth for the UK using past export growth and GDP. It performs decently, but the financial crisis kinda messes things up.

Now I want to add data to this model, but I can’t go further back than 1980 for the time-series (not for now at least). So what I want to do is add the GDP growth rate of all the UK’s major trading partners. Should I be worried about adding another 20 input neurons (e.g. countries)? Do you have a post talking about the risks of using data that is low in rows (e.g. years) but high in columns (e.g. inputs).

I hope my question makes sense.

Cheers

I don’t have posts on the topic of more columns than rows. It does require careful handling.

As a start, I would recommend developing a strong test harness, then try adding data and see how it impacts the model skill. Experiment.

Jason

Thanks a lot for your tutorial!

Is there a feature importance plot for cases like this?

sometimes is very important to know it

Good question. I’m not sure about feature importance plots for LSTMs. I would expect that if feature importance can be calculated for MLPs, then it could be calculated for LSTMs, but this is not something I have looked into sorry.

Thanks a lot, Jason!

No problem.

Hi Jason,

Great post as always!

I have a question regarding scaling. My problem is quite different as I have to apply series to supervised function first on the data coming from different source and then combine the data… my question is, can I apply scaling at the end? Should scaling be applied column wise or on complete matrix/array?

The key is being able to scale the data consistently. The place in the pipeline is less important.
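
To make "consistently" concrete, here is a small column-wise sketch in NumPy: the scaling parameters are learned once from the training rows and then reused, unchanged, on any later data and for inverting forecasts. The helper names are hypothetical; scaling each column separately mirrors what scikit-learn's MinMaxScaler does per feature.

```python
import numpy as np

def fit_minmax(train):
    """Learn per-column min and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return col_min, col_range

def scale(data, col_min, col_range):
    """Map each column to [0, 1] relative to the training min and range."""
    return (data - col_min) / col_range

def unscale(data, col_min, col_range):
    """Invert scale() to recover the original units."""
    return data * col_range + col_min

train = np.array([[10.0, 1000.0], [20.0, 1010.0], [30.0, 1020.0]])
test = np.array([[25.0, 1005.0]])

col_min, col_range = fit_minmax(train)
test_scaled = scale(test, col_min, col_range)
print(test_scaled)                               # [[0.75 0.25]]
print(unscale(test_scaled, col_min, col_range))  # back to the original units
```

Whether this happens before or after framing the series as a supervised problem matters less than always using the same parameters for the same column.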

Hi Jason thank you very much for your tutorials!

I’m trying to develop an LSTM for time prediction having as input 3 features (2 measurements and a third one is a sort of control of the system) and the output (value to predict) is not a single value but a vector of 6 values. So, at every time step my network should be able to predict this entire vector. Two questions:

1. Since my inputs are not correlated with each other, will their order in the input array influence my predictions?

2. How can I shape my output in order to estimate all the 6 values of the vector for each time step?

Thanks for any kind of help!

This post will help you understand how to prepare data for multi-step forecasting:

https://machinelearningmastery.com/multi-step-time-series-forecasting-long-short-term-memory-networks-python/
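
For the second question, a minimal sketch of the array shapes involved, assuming 3 input features, a 6-value target vector and a window of 4 past time steps. The window length, the helper name `make_windows` and the random placeholder data are all assumptions for illustration.

```python
import numpy as np

def make_windows(inputs, targets, n_steps):
    """Pair each window of n_steps input rows with the target vector that follows it."""
    X, y = [], []
    for end in range(n_steps, len(inputs)):
        X.append(inputs[end - n_steps:end])
        y.append(targets[end])
    return np.array(X), np.array(y)

# placeholder data: 100 time steps, 3 input features, 6 target values
inputs = np.random.rand(100, 3)
targets = np.random.rand(100, 6)

X, y = make_windows(inputs, targets, n_steps=4)
print(X.shape, y.shape)  # (96, 4, 3) (96, 6)

# A network for these shapes would end in Dense(6) rather than Dense(1),
# so that each prediction is the whole 6-value vector.
```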

I replicated the example described on this page, and saved my test_y and yhat vectors to csv so that I could manually check how my prediction compared with the true values. However, when I did this, I discovered that every yhat value in my array is the exact same value (~34). I was expecting a unique yhat value for each input vector. Do you have any suggestions to help fix this?

Follow up on this — when this error arose, I was using my own data set that I want to perform time series forecasting on. When I duplicated the guide exactly as described above, the issue goes away. Do you have any idea why this issue comes up (where every predicted yhat value is the exact same) when I use a different data set?

Perhaps the model needs to be tuned to your specific dataset?

Hi Jason, thank you very much for your tutorials! I tried deleting the columns ['dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain'] from the train_X data, and I still get almost the same test RMSE: 26.461. This seems to show that the weather variables have no effect on the prediction result. The code is below.

    from keras.models import Sequential
    from keras.layers import Dense, LSTM

    # fit an LSTM network to training data
    def fit_lstm(train, test, batch_size, neurons):
        # split into input and outputs, keeping only the first column as input
        train_X, train_y = train[:, 0:1], train[:, -1]
        test_X, test_y = test[:, 0:1], test[:, -1]
        # reshape input to be 3D [samples, timesteps, features]
        train_X = train_X.reshape((train_X.shape[0], 1, train_X.shape[1]))
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
        # design network
        model = Sequential()
        model.add(LSTM(neurons, input_shape=(train_X.shape[1], train_X.shape[2])))
        model.add(Dense(1))
        model.compile(loss='mae', optimizer='adam')
        # fit network
        history = model.fit(train_X, train_y, epochs=50, batch_size=batch_size, validation_data=(test_X, test_y), verbose=2, shuffle=False)
        #history = model.fit(train_X, train_y, epochs=50, batch_size=72, verbose=2, shuffle=False)
        return model

    # make a prediction
    def make_forecasts(model, test_X):
        # keep only the first column, matching the input used for training
        test_X = test_X[:, 0:1]
        test_X = test_X.reshape((test_X.shape[0], 1, test_X.shape[1]))
        forecasts = model.predict(test_X)
        return forecasts

Nice one!

The real motivation for me writing this post was to help the 100s of people asking how to develop a multivariate LSTM.