### Gentle introduction to the Stacked LSTM with example code in Python

The original LSTM model is comprised of a single hidden LSTM layer followed by a standard feedforward output layer.

The Stacked LSTM is an extension to this model that has multiple hidden LSTM layers where each layer contains multiple memory cells.

In this post, you will discover the Stacked LSTM model architecture.

After completing this tutorial, you will know:

- The benefit of deep neural network architectures.
- The Stacked LSTM recurrent neural network architecture.
- How to implement stacked LSTMs in Python with Keras.

Let’s get started.

## Overview

This post is divided into 3 parts:

- Why Increase Depth?
- Stacked LSTM Architecture
- Implement Stacked LSTMs in Keras

## Why Increase Depth?

Stacking LSTM hidden layers makes the model deeper, more accurately earning the description as a deep learning technique.

The success of the approach on a wide range of challenging prediction problems is generally attributed to the depth of neural networks.

[the success of deep neural networks] is commonly attributed to the hierarchy that is introduced due to the several layers. Each layer processes some part of the task we wish to solve, and passes it on to the next. In this sense, the DNN can be seen as a processing pipeline, in which each layer solves a part of the task before passing it on to the next, until finally the last layer provides the output.

— Training and Analyzing Deep Recurrent Neural Networks, 2013

Additional hidden layers can be added to a Multilayer Perceptron neural network to make it deeper. The additional hidden layers are understood to recombine the learned representation from prior layers and create new representations at high levels of abstraction. For example, from lines to shapes to objects.

A sufficiently large single hidden layer Multilayer Perceptron can be used to approximate most functions. Increasing the depth of the network provides an alternate solution that requires fewer neurons and trains faster. Ultimately, adding depth is a type of representational optimization.

Deep learning is built around a hypothesis that a deep, hierarchical model can be exponentially more efficient at representing some functions than a shallow one.

— How to Construct Deep Recurrent Neural Networks, 2013.


## Stacked LSTM Architecture

The same benefits can be harnessed with LSTMs.

Given that LSTMs operate on sequence data, it means that the addition of layers adds levels of abstraction of input observations over time. In effect, chunking observations over time or representing the problem at different time scales.

… building a deep RNN by stacking multiple recurrent hidden states on top of each other. This approach potentially allows the hidden state at each level to operate at different timescale

— How to Construct Deep Recurrent Neural Networks, 2013

Stacked LSTMs or Deep LSTMs were introduced by Graves, et al. in their application of LSTMs to speech recognition, beating a benchmark on a challenging standard problem.

RNNs are inherently deep in time, since their hidden state is a function of all previous hidden states. The question that inspired this paper was whether RNNs could also benefit from depth in space; that is from stacking multiple recurrent hidden layers on top of each other, just as feedforward layers are stacked in conventional deep networks.

— Speech Recognition With Deep Recurrent Neural Networks, 2013

In the same work, they found that the depth of the network was more important than the number of memory cells in a given layer to model skill.

Stacked LSTMs are now a stable technique for challenging sequence prediction problems. A Stacked LSTM architecture can be defined as an LSTM model comprised of multiple LSTM layers. Each LSTM layer passes a sequence output, rather than a single value output, to the LSTM layer that follows it. Specifically, it provides one output per input time step, rather than one output for all input time steps.
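To make this data flow concrete, below is a minimal NumPy sketch of two stacked LSTM layers. It is not how Keras implements the layer. The helper names (`init_lstm_params`, `lstm_forward`), the random untrained weights, and the layer sizes (4 and 2 units) are assumptions chosen only for illustration. The first layer returns one hidden state per time step, and the second layer consumes that sequence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm_params(input_dim, units, rng):
    """Create small random weights for one LSTM layer (illustrative only)."""
    return {
        "W": rng.normal(scale=0.1, size=(input_dim, 4 * units)),  # input weights
        "U": rng.normal(scale=0.1, size=(units, 4 * units)),      # recurrent weights
        "b": np.zeros(4 * units),                                 # biases
    }

def lstm_forward(x, params, return_sequences=False):
    """Run one LSTM layer over x with shape (samples, time_steps, features)."""
    samples, time_steps, _ = x.shape
    units = params["U"].shape[0]
    h = np.zeros((samples, units))  # hidden state
    c = np.zeros((samples, units))  # cell state
    outputs = []
    for t in range(time_steps):
        # pre-activations for the input, forget, candidate and output parts
        z = x[:, t, :] @ params["W"] + h @ params["U"] + params["b"]
        i, f, g, o = np.split(z, 4, axis=1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # 3D: (samples, time_steps, units)
    return h                              # 2D: (samples, units)

rng = np.random.default_rng(0)
x = np.array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# the second layer's input dimension equals the first layer's number of units
layer1 = init_lstm_params(input_dim=1, units=4, rng=rng)
layer2 = init_lstm_params(input_dim=4, units=2, rng=rng)

seq = lstm_forward(x, layer1, return_sequences=True)     # one output per time step
out = lstm_forward(seq, layer2, return_sequences=False)  # one output for the sequence

print(seq.shape)  # (1, 3, 4)
print(out.shape)  # (1, 2)
```

Note that the single-output form is simply the last step of the sequence form, which is why only the final layer in a stack can drop the time dimension.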

## Implement Stacked LSTMs in Keras

We can easily create Stacked LSTM models in the Keras Python deep learning library.

Each LSTM memory cell requires a 3D input. When an LSTM processes one input sequence of time steps, each memory cell will output a single value for the whole sequence as a 2D array.

We can demonstrate this below with a model that has a single hidden LSTM layer that is also the output layer.

```python
# Example of one output for whole sequence
from keras.models import Sequential
from keras.layers import LSTM
from numpy import array
# define model where LSTM is also output layer
model = Sequential()
model.add(LSTM(1, input_shape=(3,1)))
model.compile(optimizer='adam', loss='mse')
# input time steps
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

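The input sequence has 3 values. Running the example outputs a single value for the input sequence as a 2D array. Your specific value will differ because the model is untrained and its weights are initialized randomly.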

```
[[ 0.00031043]]
```

To stack LSTM layers, we need to change the configuration of the prior LSTM layer to output a 3D array as input for the subsequent layer.

We can do this by setting the return_sequences argument on the layer to True (defaults to False). This will return one output for each input time step and provide a 3D array.

Below is the same example as above with return_sequences=True.

```python
# Example of one output for each input time step
from keras.models import Sequential
from keras.layers import LSTM
from numpy import array
# define model where LSTM is also output layer
model = Sequential()
model.add(LSTM(1, return_sequences=True, input_shape=(3,1)))
model.compile(optimizer='adam', loss='mse')
# input time steps
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example outputs a single value for each time step in the input sequence.

```
[[[-0.02115841]
  [-0.05322712]
  [-0.08976141]]]
```
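The shape rules from the two examples can be summarized in a small helper. This is a sketch in plain Python, not part of Keras; the function name `lstm_output_shape` is hypothetical and exists only to spell out the rule that `return_sequences` controls whether the time step dimension is kept.

```python
def lstm_output_shape(input_shape, units, return_sequences=False):
    """Return the output shape of an LSTM layer.

    input_shape is (samples, time_steps, features).
    """
    if len(input_shape) != 3:
        raise ValueError("an LSTM layer expects 3D input: (samples, time_steps, features)")
    samples, time_steps, _ = input_shape
    if return_sequences:
        return (samples, time_steps, units)  # one output per time step
    return (samples, units)                  # one output for the whole sequence

# the two examples above: 1 sample, 3 time steps, 1 feature, 1 unit
print(lstm_output_shape((1, 3, 1), units=1))                         # (1, 1)
print(lstm_output_shape((1, 3, 1), units=1, return_sequences=True))  # (1, 3, 1)

# feeding a 2D output into another LSTM layer is rejected
try:
    lstm_output_shape(lstm_output_shape((1, 3, 1), units=1), units=1)
except ValueError as err:
    print("cannot stack:", err)
```

The last call fails because a 2D output no longer has a time step dimension, which is why every LSTM layer except the last in a stack needs `return_sequences=True`.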

Below is an example of defining a two hidden layer Stacked LSTM:

```python
model = Sequential()
model.add(LSTM(..., return_sequences=True, input_shape=(...)))
model.add(LSTM(...))
model.add(Dense(...))
```

We can continue to add hidden LSTM layers as long as the prior LSTM layer provides a 3D output as input for the subsequent layer; for example, below is a Stacked LSTM with 4 hidden layers.

```python
model = Sequential()
model.add(LSTM(..., return_sequences=True, input_shape=(...)))
model.add(LSTM(..., return_sequences=True))
model.add(LSTM(..., return_sequences=True))
model.add(LSTM(...))
model.add(Dense(...))
```
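It can help to see how the layers connect in terms of model size. The sketch below counts trainable weights for a stack of standard LSTM layers with biases, where each layer after the first takes the previous layer's units as its input features. The function names and the layer sizes (1 input feature, four layers of 32 units) are assumptions made for illustration, not values from the examples above.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters in a standard LSTM layer with biases.

    There are four weight sets (input, forget, candidate, output), each with
    input weights, recurrent weights and a bias vector.
    """
    return 4 * (units * (input_dim + units) + units)

def stacked_lstm_param_counts(n_features, layer_units):
    """Parameter count per layer for a stack of LSTM layers."""
    counts = []
    input_dim = n_features
    for units in layer_units:
        counts.append(lstm_param_count(input_dim, units))
        input_dim = units  # the next layer sees this layer's units as its features
    return counts

# hypothetical sizes: 1 input feature, four hidden LSTM layers of 32 units
counts = stacked_lstm_param_counts(n_features=1, layer_units=[32, 32, 32, 32])
print(counts)       # [4352, 8320, 8320, 8320]
print(sum(counts))  # 29312
```

For the single-unit model used earlier, `lstm_param_count(1, 1)` gives 12, which you can compare against the parameter count reported by `model.summary()`.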

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- How to Construct Deep Recurrent Neural Networks, 2013.
- Training and Analyzing Deep Recurrent Neural Networks, 2013.
- Speech Recognition With Deep Recurrent Neural Networks, 2013.
- Generating Sequences With Recurrent Neural Networks, 2014.

## Summary

In this post, you discovered the Stacked Long Short-Term Memory network architecture.

Specifically, you learned:

- The benefit of deep neural network architectures.
- The Stacked LSTM recurrent neural network architecture.
- How to implement stacked LSTMs in Python with Keras.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Thanks a lot, Jason!

Your blog is wonderful

Please keep up the great work

Best regards/ Thabet

Thanks Thabet.

Hi Jason,

After the first LSTM layer in the stack, don’t we need ‘input_shape’ or ‘batch_input_shape’? I would appreciate your expert comment.

No, the input specification is only needed on the first hidden layer.

Thanks for your response

Can you specify when this approach is needed?

Wonderful work, thanks!

Hard question, nice.

Perhaps generally when you think there may be a hierarchical structure in your sequence data. You can try some stacked LSTMs and see how it impacts model skill.

Stacked LSTMs will likely need more epochs to complete training. Use normal model diagnostics.

Hi Jason, thanks for your work!

I have a large amount of data and want to train a Stacked LSTM with about 30 layers.

Can you tell me what I need to pay attention to if I train a 30-layer Stacked LSTM?

Why are Stacked LSTMs with 3 or 4 layers common?

Would a 30-layer Stacked LSTM work?

That is a lot of layers, I have not developed LSTMs that deep myself. I cannot give you good advice.

Generally, there are diminishing returns beyond 4 layers.

Thank you for your answer. I think that is probably too many layers. To try to summarize my question:

Is a Dropout and/or Dense layer necessary after every LSTM layer in a model (1 LSTM, 1 Dropout, 1 Dense), or is that almost the same as, for example, (2 LSTM, 1 Dropout, 1 Dense)?

Thank you in advance

Hello Jason,

Like always, very useful article,

I have a question for you

I am used to adding LSTM layers in this way:

```python
layers = shape = [4, seq_len, 1]  # feature, window, output
neuros = neurons = [128, 128, 32, 1]

# LSTM, Dropout and Dense
model.add(LSTM(250, input_shape=(layers[1], layers[0]), return_sequences=True))
model.add(Dropout(d))
model.add(LSTM(neurons[1], input_shape=(layers[1], layers[0]), return_sequences=True))
model.add(Dropout(d))
model.add(LSTM(neurons[2], input_shape=(layers[1], layers[0]), return_sequences=False))
model.add(Dropout(d))
model.add(Dense(neurons[2], kernel_initializer="uniform", activation='relu'))
model.add(Dense(neurons[2], kernel_initializer="uniform", activation='relu'))
model.add(Dense(layers[0], kernel_initializer="uniform", activation='linear'))
model.compile(loss='mse', optimizer=optimizador, metrics=['accuracy'])
```

Is there any difference if I add only LSTM layers? I mean something like this:

```python
model.add(LSTM(250, input_shape=(layers[1], layers[0]), return_sequences=True))
model.add(LSTM(neurons[1], input_shape=(layers[1], layers[0]), return_sequences=True))
model.add(LSTM(neurons[2], input_shape=(layers[1], layers[0]), return_sequences=False))
model.add(Dense(neurons[2], kernel_initializer="uniform", activation='relu'))
model.add(Dropout(d))
model.compile(loss='mse', optimizer=optimizador, metrics=['accuracy'])
```

Thank you in advance

Perhaps you could summarize your question?

Hello Jason,

First of all thank you for all your work on your website it is very useful.

I am implementing in keras a stacked lstm network for solving a many to many sequence problem, I would like to know if for that kind of problem you would still put the parameter return_sequences at the value False for the last lstm layer ?

Thank you in advance.

No need to return the sequence on the last layer.

Perhaps this tutorial will help:

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

I’m currently working on a LSTM model using just one hidden layer. I was wondering if you know how to best approach whether or not to expand and add more layers.

Is there a general rule of thumb as to how and why to add more hidden layers and build more deeply when working with time series forecasting?

Also, how well does adding layers scale with the size of training data. I’m trying to compare training data from 1000 (lower end) to over 100,000 points of training.

More layers offer more abstraction of the input sequence.

No good theories or rules of thumb around this that I have seen.

Thanks!

Then I’ll continue my research and see if I can find any correlation between adding more layers and the accuracy of the predictions.

Hi Jason, good post. Would you say that the use of a stacked LSTM is equivalent (or better / worse) in terms of predictive capabilities versus a feedforward network of a few hidden layers which then feeds into a single layer LSTM?

My line of thinking is if I had a dataset with interesting relationships between the input features themselves, but also as those features change over time, I would expect to get interesting results both from a feedforward network, and an LSTM … does a stacked LSTM get the best of both worlds?

It can, but it really depends on the specifics of the problem and the relationships being modelled.

For example, deep encoder-decoder LSTMs do prove very effective at NLP problems with long sequences of text.

Hey there.

What happens to the computation time needed to train a stacked LSTM? If I have an LSTM with one layer, does a stacked LSTM with m layers need m times as much computation time?

It depends on your hardware. It is slower, but perhaps not 200% slower.

Dear Jason,

Very interesting approach, however I wonder if this is trained using regular backpropagation. I ask this due to the problems backpropagation has when dealing with deep-NN, particularly:

– Diffusion of gradient problem where gradients that are backpropagated rapidly decrease in magnitude as the depth of the network increases.

– Appearance of many bad local minima.

I know there are some approaches for greedy-training layer by layer, is this performed by Keras automatically? Or perhaps is there a maximum network-depth that can be dealt with regular backpropagation?

Thank you for your help.

Yes, but with backpropagation through time. You can learn more here:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

Yes, Keras does this for us automatically.

Thank you for your reply! However, BPTT unrolls the network by time steps, which makes it look like an even deeper network, as each time step becomes a kind of new layer. Doesn’t this make the diffusion of gradient problem worse?

Additionally, I have a small code question, i.e.:

In the part where you add an LSTM layer, say “model.add(LSTM(1, return_sequences=True, input_shape=(3,1)))”, the first parameter, which you set to 1, defines the number of “units” in the layer. This usually also defines the number of outputs it would have. I wonder if “return_sequences=True” supersedes this and outputs as many values as you have inputs?

Thank you for your help.

The input_shape defines the shape of inputs to the network.

The 1 refers to the number of units in the first hidden layer which is unrelated to the number of input time steps.

Return sequences returns a vector for each input time step instead of one vector at the end of the input sequence.

Hello Jason,

Thanks a lot for your work. Your blog really did spark a huge interest in me towards neural network.

I have read one of your suggested articles, “How to Construct Deep Recurrent Neural Networks”. I wonder whether the novel type of RNN mentioned in that article can be constructed using Keras with the TensorFlow backend.

Again, good job and please keep up the good work!

Best regards,

Kamarul

I don’t know, sorry.

Hello Jason!,

Just like the other people before me, first of all thanks for these amazingly helpful blog posts and tutorials.

One question regarding this stacked LSTMs NN.

I see you seem to always need a Dense layer to give the final output of the stacked network.

Is that really so? and why is it necessary? does the absence of that Dense layer affect for good or worse the performance of the LSTMs network?

Thanks again,

We need something at the output end of the network to make an output that we can interpret as a prediction.

You could try an LSTM layer, I have never done so.

Can stacked LSTM’s learn feed sequence order? For example let’s say I had a random list of a billion numbers that I wanted returned in order. If numbers that are close together in the sort appear at opposite ends of the sequence the LSTM memory may lose track. However if a stack of LSTM’s could learn to rearrange the sequence as it moves up the stack I imagine that could help.

Currently I’m taking a stack of LSTMs with N outputs and sorting the input sequence between the stacks by one of the output values at each time step. As far as I can tell there is no way to associate a gradient with a sorted index, so it can only learn to sort through reinforcement (I think).

Interesting question. I think (gut intuition) it may be too challenging for the LSTM.