The Keras deep learning library provides an implementation of the Long Short-Term Memory, or LSTM, recurrent neural network.

As part of this implementation, the Keras API provides access to both return sequences and return state. The difference between these two outputs, and when to use each, can be confusing when designing sophisticated recurrent neural network models, such as the encoder-decoder model.

In this tutorial, you will discover what return sequences and return states are for LSTM layers in the Keras deep learning library, and how they differ.

After completing this tutorial, you will know:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- Long Short-Term Memory
- Return Sequences
- Return States
- Return States and Sequences

## Long Short-Term Memory

The Long Short-Term Memory, or LSTM, is a recurrent neural network whose units are built around internal gates.

Unlike simpler recurrent neural networks, the network’s internal gates allow the model to be trained successfully using backpropagation through time, or BPTT, and largely avoid the vanishing gradients problem.

In the Keras deep learning library, LSTM layers can be created using the LSTM() class.

Creating a layer of LSTM memory units allows you to specify the number of memory units within the layer.

Each unit or cell within the layer has an internal cell state, often abbreviated as “*c*”, and outputs a hidden state, often abbreviated as “*h*”.

The Keras API allows you to access these data, which can be useful or even required when developing sophisticated recurrent neural network architectures, such as the encoder-decoder model.

For the rest of this tutorial, we will look at the API for accessing these data.
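To make the two kinds of state concrete, here is a minimal sketch of a single LSTM unit processing a three-step sequence in plain Python. It assumes the standard LSTM formulation (sigmoid gates, tanh activation, zero initial states). The weights are arbitrary illustrative values, not anything taken from Keras, and the helper names `sigmoid` and `lstm_step` are hypothetical.

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with scalar weights.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("i"))   # input gate
    f = sigmoid(pre("f"))   # forget gate
    o = sigmoid(pre("o"))   # output gate
    g = tanh(pre("g"))      # candidate cell value
    c = f * c_prev + i * g  # new cell state (stays internal to the unit)
    h = o * tanh(c)         # new hidden state (what the unit outputs)
    return h, c

# arbitrary illustrative weights (assumption, not learned values)
weights = {
    "i": (0.5, 0.1, 0.0),
    "f": (0.3, -0.2, 1.0),
    "o": (-0.4, 0.2, 0.0),
    "g": (0.8, 0.3, 0.0),
}

h, c = 0.0, 0.0  # both states start at zero
hidden_states = []
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, weights)
    hidden_states.append(h)

print(hidden_states)  # one hidden state per time step
print(h, c)           # final hidden state and final cell state
```

The list holds one *h* per time step, while the last line holds only the final *h* and *c*. Note that *h* is the gated, squashed version of *c*, so the two values differ.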

## Return Sequences

Each LSTM cell will output one hidden state *h* for each input.

```
h = LSTM(X)
```

We can demonstrate this in Keras with a very small model with a single LSTM layer that itself contains a single LSTM cell.

In this example, we will have one input sample with 3 time steps and one feature observed at each time step:

```
t1 = 0.1
t2 = 0.2
t3 = 0.3
```

The complete example is listed below.

Note: all examples in this post use the Keras functional API.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example outputs a single hidden state for the input sequence with 3 time steps.

Your specific output value will differ given the random initialization of the LSTM weights.

```
[[-0.0953151]]
```

It is possible to access the hidden state output for each input time step.

This can be done by setting the *return_sequences* argument to *True* when defining the LSTM layer, as follows:

```python
LSTM(1, return_sequences=True)
```

We can update the previous example with this change.

The full code listing is provided below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1, return_sequences=True)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer.

```
[[[-0.02243521]
  [-0.06210149]
  [-0.11457888]]]
```

You must set *return_sequences=True* when stacking LSTM layers so that the second LSTM layer has a three-dimensional sequence input.

You may also need to access the sequence of hidden state outputs when predicting a sequence of outputs with a *Dense* output layer wrapped in a TimeDistributed layer.
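As a rough illustration of both cases, the sketch below stacks two LSTM layers and applies a *Dense* layer to every time step via TimeDistributed. The layer sizes (4 and 2 units) are arbitrary assumptions chosen only to keep the example small.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

inputs1 = Input(shape=(3, 1))
# the first LSTM returns the full sequence so the next LSTM receives 3D input
hidden1 = LSTM(4, return_sequences=True)(inputs1)
# the second LSTM also returns the sequence so Dense can be applied per time step
hidden2 = LSTM(2, return_sequences=True)(hidden1)
outputs = TimeDistributed(Dense(1))(hidden2)
model = Model(inputs=inputs1, outputs=outputs)
# one output value per input time step
print(model.output_shape)
```

Removing *return_sequences=True* from the first LSTM layer would hand a two-dimensional tensor to the second layer, which expects a sequence.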

## Return States

The output of an LSTM cell or layer of cells is called the hidden state.

This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or *c*.

Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.

Keras provides the return_state argument to the LSTM layer that will provide access to the hidden state output (*state_h*) and the cell state (*state_c*). For example:

```python
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
```

This may look confusing because both lstm1 and *state_h* refer to the same hidden state output. The reason for these two tensors being separate will become clear in the next section.

We can demonstrate access to the hidden and cell states of the cells in the LSTM layer with a worked example listed below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example returns 3 arrays:

- The LSTM hidden state output for the last time step.
- The LSTM hidden state output for the last time step (again).
- The LSTM cell state for the last time step.

```
[array([[ 0.10951342]], dtype=float32),
 array([[ 0.10951342]], dtype=float32),
 array([[ 0.24143776]], dtype=float32)]
```

The hidden state and the cell state could in turn be used to initialize the states of another LSTM layer with the same number of cells.
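A minimal sketch of that idea, using the `initial_state` argument when calling the second layer. The second input, its length of 4 time steps, and the choice of 2 units per layer are assumptions made for illustration only.

```python
from keras.models import Model
from keras.layers import Input, LSTM
from numpy import array

# first LSTM: keep only its final hidden and cell states
inputs1 = Input(shape=(3, 1))
_, state_h, state_c = LSTM(2, return_state=True)(inputs1)
# second LSTM: same number of units, starts from the first layer's final states
inputs2 = Input(shape=(4, 1))
lstm2 = LSTM(2)(inputs2, initial_state=[state_h, state_c])
model = Model(inputs=[inputs1, inputs2], outputs=lstm2)
# define input data for both inputs
data1 = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
data2 = array([0.4, 0.5, 0.6, 0.7]).reshape((1, 4, 1))
# make prediction and show its shape
result = model.predict([data1, data2])
print(result.shape)
```

The number of units must match because each state tensor has one value per unit.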

## Return States and Sequences

We can access both the sequence of hidden states and the final cell state at the same time.

This can be done by configuring the LSTM layer to both return sequences and return states.

```python
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
```

The complete example is listed below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example, we can see now why the LSTM output tensor and hidden state output tensor are declared separately.

The layer returns the hidden state for each input time step, then separately, the hidden state output for the last time step and the cell state for the last input time step.

This can be confirmed by seeing that the last value in the returned sequences (first array) matches the value in the hidden state (second array).

```
[array([[[-0.02145359],
        [-0.0540871 ],
        [-0.09228823]]], dtype=float32),
 array([[-0.09228823]], dtype=float32),
 array([[-0.19803026]], dtype=float32)]
```
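The match between the last element of the sequence and the returned hidden state can also be checked in code rather than by eye. This sketch rebuilds the same model and compares the two arrays with NumPy; the variable names `sequence`, `last_h` and `last_c` are just illustrative.

```python
from keras.models import Model
from keras.layers import Input, LSTM
from numpy import array, allclose

# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
# unpack the three returned arrays
sequence, last_h, last_c = model.predict(data)
print(sequence.shape, last_h.shape, last_c.shape)
# the final time step of the sequence should equal the returned hidden state
print(allclose(sequence[:, -1, :], last_h))
```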

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- Keras Functional API
- LSTM API in Keras
- Long Short-Term Memory, 1997.
- Understanding LSTM Networks, 2015.
- A ten-minute introduction to sequence-to-sequence learning in Keras

## Summary

In this tutorial, you discovered what return sequences and return states are for LSTM layers in the Keras deep learning library, and how they differ.

Specifically, you learned:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Thanks for this!

To help people understand some applications of the output sequence and state visually, a picture like the one in the following Stack Exchange answer is great!

https://stats.stackexchange.com/a/181544/37863

Thanks.

Hi Jason,

Do you have plans to use more of the functional API in your blog series?

If so, why?

Best regards

Thabet

Yes, it is needed for more advanced model development.

I will have a “how to…” post on the functional API soon. It is scheduled.

Hi Jason, is it possible to access the internal states through return_state = True and return_sequences = True with the Sequential API? Moreover, is it possible to set the hidden state through a function like set_state()?

Thanks!

Perhaps, but not as far as I know or have tested.

Hey Jason, I wanted to show you this cool new RNN cell I’ve been trying out called “Recurrent Weighted Average” – it implements attention into the recurrent neural network – the keras implementation is available at https://github.com/keisuke-nakata/rwa and the whitepaper is at https://arxiv.org/pdf/1703.01253.pdf

I’ve also seen that GRU is often a better choice unless the LSTM’s bias is initialized to ones, and it’s baked into Keras now (whitepaper for that at http://proceedings.mlr.press/v37/jozefowicz15.pdf )

Very cool!

Just a note to say that return_state seems to be a recent addition to keras (since tensorflow 1.3 – if you are using keras in tensorflow contrib).

Shame it’s not available in earlier versions – I was looking forward to playing around with it 🙂

Alex

Thanks Alex.

Hi Jason, in these examples you don’t call fit().

When you define the model like this: model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c]) and then fit, the fit() function expects three values for the output instead of 1. How can I correctly print the states (to see how they change during training and/or prediction)?

For a model that takes 2 inputs, they must be provided to fit() as an array.

Hi Jason, the question was about the outputs, not the inputs. The problem is that if I set outputs=[lstm1, state_h, state_c] in the Model(), then the fit() function will expect three arrays as target arrays.

Hi Alex, did you find out how to handle the fit in this case?

Suppose i have

model = Model(inputs=[input_x, h_one_in , h_two_in], outputs=[y1,y2,state_h,state_c])

How would I write my model.fit() call, for both the inputs and the outputs?

Thanks,

+1
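One possible way around the fit() problem raised in this thread, offered as a sketch rather than a confirmed recommendation: define one model that outputs only the prediction and is used for training, and a second model that shares the same LSTM layer and outputs the states for inspection. The *Dense* output layer, the target value and the names `train_model` and `state_model` are assumptions for illustration.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense
from numpy import array

inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
prediction = Dense(1)(lstm1)

# model used for training: a single output, so fit() needs a single target array
train_model = Model(inputs=inputs1, outputs=prediction)
# model used for inspection: shares the same layers and weights
state_model = Model(inputs=inputs1, outputs=[state_h, state_c])

train_model.compile(loss='mse', optimizer='adam')
X = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
y = array([[0.4]])  # made-up target value

h_before, c_before = state_model.predict(X)
train_model.fit(X, y, epochs=5, verbose=0)
h_after, c_after = state_model.predict(X)
print(h_before, h_after)
print(c_before, c_after)
```

Because both models are built from the same layer objects, training one updates the weights seen by the other.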

I use random initialization but the results are disappointing.

Any other ideas?

Jason,

Brilliant post as usual. I am also going to buy your LSTM book.

However, I had a question about a Keras LSTM error I have been getting and was hoping you could help with it.

Getting an error like this

“You must feed a value for placeholder tensor 'embedding_layer_input'”

```
/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
    465 compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466 pywrap_tensorflow.TF_GetCode(status))
    467 finally:

InvalidArgumentError: You must feed a value for placeholder tensor 'embedding_layer_input' with dtype float
[[Node: embedding_layer_input = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: output_layer_2/bias/read/_237 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1546_output_layer_2/bias/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:
```

Here is the code I wrote for this:

```python
def model_param(self):
    # Method to do deep learning
    from keras.models import Sequential
    from keras.layers import Dense, Flatten, Dropout, Activation
    from keras.layers import LSTM
    from keras.layers.embeddings import Embedding
    from keras.initializers import TruncatedNormal
    tn=TruncatedNormal(mean=0.0, stddev=1/sqrt(self.x_train.shape[1]*self.x_train.shape[1]), seed=2)
    self.model = Sequential()
    self.model.add(Embedding(self.len_vocab,300,input_length=self.x_train.shape[1]))
    # Adding LSTM cell
    self.model.add(LSTM(self.num_units,dropout=0.30,kernel_initializer=tn,name="lstm_1"))
    # Adding the dense output layer for Output
    self.model.add(Dense(1,activation="sigmoid",name="output_layer"))
    #sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    self.model.compile(loss='binary_crossentropy',
                       optimizer="adam",
                       metrics=['accuracy'])
    self.model.summary()

def fit(self):
    # Training the deep learning network on the training data
    # Adding the callbacks for Logging
    import keras
    logger_tb=keras.callbacks.TensorBoard(
        log_dir="logs_sentiment_lstm",
        write_graph=True,
        histogram_freq=5
    )
    self.model.fit(self.x_train, self.y_train,validation_split=0.20,
                   epochs=10,
                   batch_size=128,callbacks=[logger_tb]
                   )
```

Ouch, I have not seen this fault before.

Perhaps try simplifying the example to flush out the cause?

The histogram_freq=5 setting is causing this error. This is a bug in Keras; set histogram_freq=0 and it should work fine.

Thanks for sharing.

This is another great post, Jason! I am a fan of all your RNN posts. 😉 Thanks!

In case anyone was wondering about the difference between c (internal state) and h (hidden state) in an LSTM, this answer was very helpful for me:

https://www.quora.com/What-is-the-difference-between-states-and-outputs-in-LSTM

Would it be correct to say that in a GRU and SimpleRNN, c = h?

Thanks in advance!

Thanks Julian.

Hi Jason,

In the implementation of the encoder-decoder in Keras, we use return state in the encoder network, which means that we get state_h and state_c, after which [state_h, state_c] is set as the initial state of the decoder network. What does setting the initial state mean for an LSTM network? Is it that the state_h of the decoder = [state_h, state_c]?

Thanks in advance.

Great question, here is an example:

https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

Great. I think I get it now. Basically, when we give return_state=True for an LSTM, then the LSTM will accept three inputs, that is, decoder_actual_input, state_h and state_c. And during inference we again reset state_c and state_h with the state_h and state_c of the previous prediction. Am I correct in my assumption?

I am still a little bit confused about why we use three Keras models (model, encoder_model and decoder_model).

Thank You.

The return_state argument only controls whether the state is returned. A different argument is used to initialize state for an LSTM (e.g. during the definition of the model with the functional API).

Got it. It’s initial_state. Thank you, Jason.

Glad to hear it.

Hi Jason,

I do enjoy reading your blog. I have 2 short questions for this post and hope you could kindly address them briefly:

1. Can we return the sequence of cell states (a sort of variable similar to *lstm1*)?

2. No matter the dimension (I mean #features) of the input sequence, as we place 1 LSTMcell in the layer, both the hidden and cell states are always a scalar, right? As such, the kernel_ and recurrent_kernel_ properties in Keras (at each gate) are not in the matrix form. However, I believe your standpoint on viewing each LSTM cell having 1Dim hidden state/cell makes sense in the case of dropout in deep learning.

Please correct me if I misunderstood your post. Thank you.

Not directly, perhaps by calling the model recursively.

I think you’re right.

Hi, very good explanation.

One question: I thought h = activation(o), is that correct? (h: hidden state output, o: hidden cell)

But tanh(-0.19803026) does not equal -0.09228823. (The default activation for LSTM should be tanh.)
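A likely explanation for the mismatch, assuming the standard LSTM formulation: the hidden state is not the activation of the cell state alone but the activated cell state scaled by the output gate, h = o * tanh(c), where o is a sigmoid value between 0 and 1. The short sketch below uses the two numbers printed in the tutorial to recover the output gate value they imply; it is arithmetic on the published output, not something read from Keras internals.

```python
from math import tanh

c = -0.19803026  # final cell state printed in the tutorial
h = -0.09228823  # final hidden state printed in the tutorial

# if h = o * tanh(c), the implied output gate activation is:
o = h / tanh(c)
print(o)
```

The result is roughly 0.47, which lies between 0 and 1 as expected for a sigmoid gate.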

Thank you so much, Jason. This cleared my doubt.

I’m glad to hear that.

Thank you so much for writing this. This is really a big help.

Thanks, I’m glad it helped.

Hey Jason, how could I get both the final hidden state and the sequence when using a bidirectional wrapper?

Does the above post not help?

Hi Jason, can I connect the output of a dense layer to the c state of an LSTM in such a way that it initializes the state c with this value before each batch? Thanks

I’m sure you can (it’s all just code), but it might require some careful programming. I don’t have good advice other than lots of trial and error.

Let me know how you go.

What is the hidden state and cell state of the first input if it does not have a previous hidden or cell state to reference?

It will be zero.

When I use the following code based on a bidirectional LSTM, it returns this error:

```
' is not connected, no input to return.')
AttributeError: Layer sequential_1 is not connected, no input to return.
```

But when the ordinary LSTM (commented code) is run, it works correctly.

```python
self.model = Sequential()
# self.model.add(LSTM(input_shape=(None,self.num_encoder_tokens), units=self.n_hidden,
#                     return_sequences=True,name='hidden'))
# self.model.add(LSTM(units=self.num_encoder_tokens, return_sequences=True))
# self.intermediate_layer = Model(inputs=self.model.input, outputs=self.model.get_layer('hidden').output)
self.model.add(Bidirectional(LSTM(input_shape=(None,self.num_encoder_tokens), units=self.n_hidden,
                                  return_sequences=True,name='hidden'),merge_mode='concat'))
self.model.add(Bidirectional(LSTM(units=self.num_encoder_tokens, return_sequences=True),merge_mode='concat'))
self.intermediate_layer = Model(input=self.model.input,output=self.model.get_layer('hidden').output)
```

why?

I have some suggestions here:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

https://stackoverflow.com/questions/49313650/how-could-i-get-both-the-final-hidden-state-and-sequence-in-a-lstm-layer-when-us

This can help

Is there no reply for this?

Thank you so much for your explanation!

I’m happy it helped Sam!

Awesome work, Jason. I always thought the last hidden state was equal to the cell state. So I was wrong, and the hidden state and the cell state are never the same?

Thank you

Yes, they are different things.

Hi,

I just wanna thank you for the entire site.

Whenever I am stuck in code or concepts I visit your site and things get cleared up.

Thanks, I’m glad it’s helpful!

Hi, so in the above example, does our network consist of only one LSTM node or cell, with the 3 time steps fed to it one at a time?

Or does it have 3 cells, one for each time step?

The number of nodes in the LSTM is unrelated to the number of time steps in the data sample.

Perhaps this will make things clearer for you:

https://machinelearningmastery.com/prepare-univariate-time-series-data-long-short-term-memory-networks/

“Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.”

In another of your posts, the encoder-decoder LSTM model code is as follows:

```python
model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
model.add(RepeatVector(n_outputs))
model.add(LSTM(200, activation='relu', return_sequences=True))
```

But there return_state is False?

Correct.

Hello Jason,

Thanks for the great post. I have a quick question about the bottleneck in the LSTM encoder above. As I understand it, if the encoder has say 50 cells, then does this mean that the hidden state of the encoder LSTM layer contains 50 values, one per cell?

If this is correct, then would it be accurate to say that if the original data had 50 timesteps and a dimensionality/feature count of 3, then having an encoder LSTM with 20 cells (which would give a hidden state of 20 values) could be considered to be a sort of compression/dimensionality reduction (a la autoencoders and compressed representations) ?

Finally, does it make sense to apply a fully-connected layer with some nonlinearity to the hidden state for purposes of dimensionality reduction, i.e. hidden state with 50 values -> FF layer with 10 neurons, ‘compressing’ the 50 values to 10?

Thanks again

Yes, correct.

If you want to use the hidden state as a learned feature, you could feed it into a new fully connected model.

Brilliant, thanks!

Excellent post, how would one save the state when prediction samples arrives from multiple sources, like the question posted here https://stackoverflow.com/questions/54850854/keras-restore-lstm-hidden-state-for-a-specific-time-stamp ?

You can save state by retrieving it from the model and saving it to a file.

Thank you so much 🙂

Dear Jason

Thank you very much for the great post. Currently I am working on a two-stream LSTM network (sequences of images), and I am trying to extract the cell state of both LSTMs at each time step and calculate the average value. Afterwards, I want to update the next time step with the previous time step’s average value plus the existing cell state value, and continue this process through all time steps.

I would greatly appreciate it if you could explain how to update the LSTM cell states at each time step by adding an additional value. Thank you very much.

Why are you trying to average the cell state exactly?

Thank you very much for your response. I am doing it the following way

Please note that the 2nd LSTM is for the optical flow stream; I mistakenly commented both LSTMs as RGB. Thank you.

Why? Why do you want to do this?

Currently I am working on two-stream networks with image sequences. As part of my research, I want to study whether there is any advantage in communicating cell states between both streams at each time step, rather than not communicating (as in a normal two-stream network). Thank you for your concern.

Interesting, let me know how you go.

Really great article, Thanks a lot:)

Thanks.

It really solved my confusion. Thank you 🙂

I’m happy to hear that.

Thanks Jason,

Can you please tell me how to use return states with the Bidirectional wrapper on an LSTM? Unpacking the outputs throws an error.

code:

```python
encoder = Bidirectional(LSTM(n_a, return_state=True))
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
```

error:

```
ValueError: too many values to unpack (expected 3)
```

Perhaps assign the result to one variable and inspect it to see what you have?
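For reference, a sketch of what that inspection typically shows: with *return_state=True*, the Bidirectional wrapper returns five tensors (the output, then the hidden and cell states of the forward LSTM, then those of the backward LSTM). The input shape, the value of `n_a`, and the choice to concatenate the two directions' states are assumptions for illustration.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional, Concatenate
from numpy import array

n_a = 4  # hypothetical number of units per direction
encoder_inputs = Input(shape=(3, 1))
encoder = Bidirectional(LSTM(n_a, return_state=True))
# five values: output, forward h, forward c, backward h, backward c
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder(encoder_inputs)
# one common choice: join the states of both directions
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])

data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
results = model.predict(data)
print([r.shape for r in results])
```

Each concatenated state has twice as many values as a single direction, so a layer initialized from them would need 2 * n_a units.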

Question: Is only the hidden state forwarded to upper layers in LSTM, or is also the memory cell state forwarded to upper layers?

Or is the memory cell state only forwarded along the time sequence?

Thank you!

Only the hidden state is output; the memory state remains internal to the node.

Hi Jason,

Thanks for sharing. I am not sure if I understand Keras.LSTM correctly. Could you please help me clarify / correct the following statements?

1. Keras LSTM is output-to-hidden recurrent by default, e.g. it sends the previous output to the current hidden layers;

2. To create a hidden-to-hidden LSTM, can we do:

```python
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=(inputs1, state_h), outputs=lstm1)
```

3. Does Keras train the LSTM using teacher forcing or BPTT?

Not sure I follow. The LSTM has outputs and hidden state.