# Difference Between Return Sequences and Return States for LSTMs in Keras

Last Updated on August 14, 2019

The Keras deep learning library provides an implementation of the Long Short-Term Memory, or LSTM, recurrent neural network.

As part of this implementation, the Keras API provides access to both return sequences and return state. How these two outputs are used, and how they differ, can be confusing when designing sophisticated recurrent neural network models, such as the encoder-decoder model.

In this tutorial, you will discover the difference between return sequences and return states for LSTM layers in the Keras deep learning library, and what each one returns.

After completing this tutorial, you will know:

• That return sequences return the hidden state output for each input time step.
• That return state returns the hidden state output and cell state for the last input time step.
• That return sequences and return state can be used at the same time.


Let’s get started.


## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. Long Short-Term Memory
2. Return Sequences
3. Return States
4. Return States and Sequences

## Long Short-Term Memory

The Long Short-Term Memory, or LSTM, is a recurrent neural network whose units are built around internal gates.

Unlike simpler recurrent neural networks, these gates allow the model to be trained successfully using backpropagation through time, or BPTT, and largely avoid the vanishing gradients problem.

In the Keras deep learning library, LSTM layers can be created using the LSTM() class.

Creating a layer of LSTM memory units allows you to specify the number of memory units within the layer.

Each unit or cell within the layer has an internal cell state, often abbreviated as “c”, and outputs a hidden state, often abbreviated as “h”.

The Keras API allows you to access these data, which can be useful or even required when developing sophisticated recurrent neural network architectures, such as the encoder-decoder model.

For the rest of this tutorial, we will look at the API for accessing these data.
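Before turning to the API, it can help to see how c and h relate within a single unit. The following is a from-scratch sketch of one LSTM unit using the standard gate equations with only the Python standard library. It is not the Keras implementation. The function name lstm_step and the weight values are assumptions chosen purely for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single LSTM unit with a scalar input.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate cell value
    c = f * c_prev + i * g   # new cell state (stays internal to the unit)
    h = o * math.tanh(c)     # new hidden state (the unit's output)
    return h, c


# Arbitrary illustrative weights; a trained layer would learn these.
weights = {"i": (0.5, 0.1, 0.0), "f": (0.4, 0.2, 1.0),
           "o": (0.3, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}

h, c = 0.0, 0.0  # both states start at zero
for x in [0.1, 0.2, 0.3]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:.1f}  h={h:+.4f}  c={c:+.4f}")
```

The hidden state is the cell state passed through tanh and scaled by the output gate, which is why h and c are related but not equal.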

## Return Sequences

By default, each LSTM cell outputs a single hidden state h for each input sequence: the hidden state from the last time step.

We can demonstrate this in Keras with a very small model with a single LSTM layer that itself contains a single LSTM cell.

In this example, we will have one input sample with 3 time steps and one feature observed at each time step: 0.1, 0.2, and 0.3.

The complete example is listed below.

Note: all examples in this post use the Keras functional API.
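A minimal sketch of such a model might look as follows, assuming Keras is installed with a working backend. With the copy bundled in TensorFlow, the imports would start with tensorflow.keras instead. The variable names inputs1 and lstm1 follow those used elsewhere in this post.

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM

# define model: a single LSTM layer containing a single LSTM cell
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)

# define input data: 1 sample, 3 time steps, 1 feature
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction (no training, so the weights are random)
yhat = model.predict(data, verbose=0)
print(yhat)
```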

Running the example outputs a single hidden state for the input sequence with 3 time steps.

Your specific output value will differ given the random initialization of the LSTM weights.

It is possible to access the hidden state output for each input time step.

This can be done by setting the return_sequences argument to True when defining the LSTM layer, for example LSTM(1, return_sequences=True).

We can update the previous example with this change.

The full code listing is provided below.
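A sketch of the updated model, under the same assumptions as before (Keras installed, untrained random weights), could be:

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM

# define model: return the hidden state for every input time step
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1, return_sequences=True)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)

# define input data: 1 sample, 3 time steps, 1 feature
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction: one value per time step
yhat = model.predict(data, verbose=0)
print(yhat.shape)
print(yhat)
```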

Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer.

You must set return_sequences=True when stacking LSTM layers so that the second LSTM layer receives a three-dimensional sequence input.

You may also need to access the sequence of hidden state outputs when predicting a sequence of outputs with a Dense output layer wrapped in a TimeDistributed layer.
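Both uses can be sketched in one small model. The layer sizes (4 and 2 units) are arbitrary choices for illustration, and the model is untrained, so only the output shape is meaningful here.

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM, Dense, TimeDistributed

inputs1 = Input(shape=(3, 1))
# return_sequences=True keeps the output 3D: (samples, time steps, units),
# which is the input shape the next LSTM layer requires
lstm1 = LSTM(4, return_sequences=True)(inputs1)
# the second LSTM also returns a sequence so each time step can be predicted
lstm2 = LSTM(2, return_sequences=True)(lstm1)
# apply the same Dense layer to every time step
outputs = TimeDistributed(Dense(1))(lstm2)
model = Model(inputs=inputs1, outputs=outputs)

data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
yhat = model.predict(data, verbose=0)
print(yhat.shape)
```

If return_sequences were left at its default in the first layer, that layer would emit a two-dimensional output and the second LSTM could not consume it.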

## Return States

The output of an LSTM cell or layer of cells is called the hidden state.

This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or c.

Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.

Keras provides the return_state argument to the LSTM layer that will provide access to the hidden state output (state_h) and the cell state (state_c). For example, lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1).

This may look confusing because both lstm1 and state_h refer to the same hidden state output. The reason for these two tensors being separate will become clear in the next section.

We can demonstrate access to the hidden and cell states of the cells in the LSTM layer with a worked example listed below.
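A sketch of such an example, again assuming Keras is installed and using untrained weights, might be:

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM

# define model: one LSTM cell that also returns its final states
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

# define input data: 1 sample, 3 time steps, 1 feature
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction: a list of three arrays
outputs = model.predict(data, verbose=0)
for name, value in zip(["lstm1", "state_h", "state_c"], outputs):
    print(name, value)
```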

Running the example returns 3 arrays:

1. The LSTM hidden state output for the last time step.
2. The LSTM hidden state output for the last time step (again).
3. The LSTM cell state for the last time step.

The hidden state and the cell state could in turn be used to initialize the states of another LSTM layer with the same number of cells.
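As a rough sketch of that idea, the states of one LSTM layer can be passed to a second layer through the initial_state argument when the second layer is called. The two-input layout, the 4-unit size, and the second input sequence are assumptions made only for this illustration.

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM

# first layer: keep only its final hidden and cell states
inputs1 = Input(shape=(3, 1))
_, state_h, state_c = LSTM(4, return_state=True)(inputs1)

# second layer: same number of units (4), so the state shapes match
inputs2 = Input(shape=(2, 1))
lstm2 = LSTM(4)(inputs2, initial_state=[state_h, state_c])

model = Model(inputs=[inputs1, inputs2], outputs=lstm2)

data1 = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
data2 = array([0.4, 0.5]).reshape((1, 2, 1))
yhat = model.predict([data1, data2], verbose=0)
print(yhat.shape)
```

This is the same wiring an encoder-decoder model uses: the encoder's final states become the decoder's starting states.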

## Return States and Sequences

We can access both the sequence of hidden states and the final cell state at the same time.

This can be done by configuring the LSTM layer to both return sequences and return states.

The complete example is listed below.
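A sketch combining both arguments, under the same assumptions as the earlier examples, could look like this:

```python
from numpy import array
from keras.models import Model
from keras.layers import Input, LSTM

# define model: return the full sequence and the final states
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])

# define input data: 1 sample, 3 time steps, 1 feature
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))

# make and show prediction
sequence, last_h, last_c = model.predict(data, verbose=0)
print(sequence)  # hidden state at each of the 3 time steps
print(last_h)    # hidden state at the last time step
print(last_c)    # cell state at the last time step
```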

Running the example, we can now see why the LSTM output tensor and hidden state output tensor are declared separately.

The layer returns the hidden state for each input time step, then separately, the hidden state output for the last time step and the cell state for the last input time step.

This can be confirmed by seeing that the last value in the returned sequences (first array) matches the value in the hidden state (second array).
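The match follows directly from how the recurrence works. The sketch below is a from-scratch, single-unit LSTM in plain Python, not Keras code. The function run_lstm and its two flags are hypothetical stand-ins that mirror the Keras arguments, and the weights are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Arbitrary illustrative weights per gate: (input weight, recurrent weight, bias).
WEIGHTS = {"i": (0.5, 0.1, 0.0), "f": (0.4, 0.2, 1.0),
           "o": (0.3, 0.1, 0.0), "g": (0.6, 0.2, 0.0)}


def run_lstm(xs, return_sequences=False, return_state=False):
    """Run one LSTM unit over scalar inputs, mimicking the two Keras flags."""
    h, c = 0.0, 0.0
    hidden_states = []
    for x in xs:
        # all pre-activations use the hidden state from the previous step
        pre = {k: wx * x + wh * h + b for k, (wx, wh, b) in WEIGHTS.items()}
        i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        c = f * c + i * math.tanh(pre["g"])  # update the cell state
        h = o * math.tanh(c)                 # update the hidden state
        hidden_states.append(h)
    output = hidden_states if return_sequences else h
    return (output, h, c) if return_state else output


sequence, state_h, state_c = run_lstm(
    [0.1, 0.2, 0.3], return_sequences=True, return_state=True)
print(sequence)
print(sequence[-1] == state_h)  # prints True
```

The sequence is simply every intermediate h collected along the way, so its final entry is necessarily the same value that is returned as the last hidden state.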


## Summary

In this tutorial, you discovered the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

Specifically, you learned:

• That return sequences return the hidden state output for each input time step.
• That return state returns the hidden state output and cell state for the last input time step.
• That return sequences and return state can be used at the same time.

Do you have any questions?


### 145 Responses to Difference Between Return Sequences and Return States for LSTMs in Keras

1. Nikeita October 24, 2017 at 5:22 pm #

Thanks for this!

To help people understand some applications of the output sequence and state visually, a picture like in the following stats overflow answer is great!

https://stats.stackexchange.com/a/181544/37863

2. Thabet Ali October 25, 2017 at 12:44 am #

Hi Jason,

Do you have plans to use more of the functional API in your blog series?
if so, why?

Best regards
Thabet

• Jason Brownlee October 25, 2017 at 6:48 am #

Yes, it is needed for more advanced model development.

I will have a “how to…” post on the functional API soon. It is scheduled.

3. Alex October 27, 2017 at 1:33 am #

Hi Jason, is it possible to access the internal states through return_state = True and return_sequences = True with the Sequential API? Moreover, is it possible to set the hidden state through a function like set_state()?

Thanks!

• Jason Brownlee October 27, 2017 at 5:24 am #

Perhaps, but not as far as I know or have tested.

4. Eldar October 27, 2017 at 6:43 am #

Hey Jason, I wanted to show you this cool new RNN cell I’ve been trying out called “Recurrent Weighted Average” – it implements attention into the recurrent neural network – the keras implementation is available at https://github.com/keisuke-nakata/rwa and the whitepaper is at https://arxiv.org/pdf/1703.01253.pdf

I’ve also seen that GRU is often a better choice unless the LSTM’s bias is initialized to ones, and it’s baked into Keras now (whitepaper for that at http://proceedings.mlr.press/v37/jozefowicz15.pdf )

5. Alex October 28, 2017 at 5:13 am #

Just a note to say that return_state seems to be a recent addition to keras (since tensorflow 1.3 – if you are using keras in tensorflow contrib).

Shame it’s not available in earlier versions – I was looking forward to playing around with it 🙂

Alex

6. Alex October 30, 2017 at 9:54 pm #

Hi Jason, in these examples you don’t call fit().
When you define the model like this: model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c]) and then fit, the fit() function expects three values for the output instead of 1. How can the states be printed correctly (to see how they change during training and/or prediction)?

• Jason Brownlee October 31, 2017 at 5:33 am #

For a model that takes 2 inputs, they must be provided to fit() as an array.

• Alex April 27, 2018 at 6:29 pm #

Hi Jason, the question was about the outputs, not the inputs.. The problem is that if i set outputs=[lstm1, state_h, state_c] in the Model(), then the fit() function will expect three arrays as target arrays.

• Pradyumna Majumder November 9, 2018 at 11:53 pm #

Hi Alex, did u find how to handle the fit in this case?

Suppose i have

model = Model(inputs=[input_x, h_one_in , h_two_in], outputs=[y1,y2,state_h,state_c])

how would I write my model.fit? in the input and outputs?

Thanks,

• Fatih June 21, 2019 at 7:30 pm #

+1

I use random initialization but the results are disappointing.

Any other ideas?

7. MT November 8, 2017 at 9:47 am #

Jason,

Brilliant post as usual. I am also going to buy your LSTM book.

I however had a question on Keras LSTM error I have been getting and was hoping if you could help that?

Getting an error like this

“You must feed a value for placeholder tensor ’embedding_layer_input'”

/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
465 compat.as_text(pywrap_tensorflow.TF_Message(status)),
–> 466 pywrap_tensorflow.TF_GetCode(status))
467 finally:

InvalidArgumentError: You must feed a value for placeholder tensor ’embedding_layer_input’ with dtype float
[[Node: embedding_layer_input = Placeholder[dtype=DT_FLOAT, shape=[], _device=”/job:localhost/replica:0/task:0/gpu:0″]()]]

During handling of the above exception, another exception occurred:

Here is the code I wrote for this:

def model_param(self):

# Method to do deep learning

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout, Activation
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras.initializers import TruncatedNormal

tn=TruncatedNormal(mean=0.0, stddev=1/sqrt(self.x_train.shape[1]*self.x_train.shape[1]), seed=2)

self.model = Sequential()

# Adding the dense output layer for Output

#sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
self.model.compile(loss=’binary_crossentropy’,
metrics=[‘accuracy’])

self.model.summary()

def fit(self):
# Training the deep learning network on the training data

# Adding the callbacks for Logging

import keras
logger_tb=keras.callbacks.TensorBoard(
log_dir=”logs_sentiment_lstm”,
write_graph=True,
histogram_freq=5
)

self.model.fit(self.x_train, self.y_train,validation_split=0.20,
epochs=10,
batch_size=128,callbacks=[logger_tb]
)

• Jason Brownlee November 9, 2017 at 9:51 am #

Ouch, I have not seen this fault before.

Perhaps try simplifying the example to flush out the cause?

• Kumar Nilay June 21, 2019 at 11:07 am #

histogram_freq=5 is causing this error, this is a bug in keras, set histogram_freq=0 and it should work fine

8. Julian November 12, 2017 at 8:27 am #

This is another great Post Jason! I am a fan of all your RNNs posts. 😉 Thanks!

In case anyone was wondering the difference between c (Internal state) and h (Hidden state) in a LSTM, this answer was very helpful for me:

https://www.quora.com/What-is-the-difference-between-states-and-outputs-in-LSTM

Would it be correct to say that in a GRU and SimpleRNN, c=h?

9. Kaushal Shetty November 24, 2017 at 12:21 am #

Hi Jason,
In the implementation of encoder-decoder in keras we do a return state in the encoder network which means that we get state_h and state_c after which the [state_h,state_c] is set as initial state of the decoder network. What does setting the initial state mean for a LSTM network? Is it that the state_h of decoder = [state_h,state_c].

• Kaushal Shetty November 24, 2017 at 9:37 pm #

Great. I think i get it now. Basically when we give return_state=True for a LSTM then the LSTM will accept three inputs that is decoder_actual_input,state_h,state_c. And during inference we again reset the state_c and state_h with state_h and state_c of previous prediction. Am i correct on my assumption ?

I am still a little bit confused why we use three keras models (model,encoder_model and decoder_model).

Thank You.

• Jason Brownlee November 25, 2017 at 10:19 am #

The return_state argument only controls whether the state is returned. A different argument is used to initialize state for an LSTM (e.g. during the definition of the model with the functional API).

• Kaushal Shetty November 27, 2017 at 4:57 pm #

Got it. Its initial_state. Thank You Jason.

• Jason Brownlee November 28, 2017 at 8:36 am #

10. Nathan D. January 9, 2018 at 1:17 am #

Hi Jason,

I do enjoy reading your blog. I have 2 short questions for this post and hope you could kindly address them briefly:

1. Can we return the sequence of cell states (a sort of variable similar to *lstm1*)?

2. No matter the dimension (I mean #features) of the input sequence, as we place 1 LSTMcell in the layer, both the hidden and cell states are always a scalar, right? As such, the kernel_ and recurrent_kernel_ properties in Keras (at each gate) are not in the matrix form. However, I believe your standpoint on viewing each LSTM cell having 1Dim hidden state/cell makes sense in the case of dropout in deep learning.

• Jason Brownlee January 9, 2018 at 5:32 am #

Not directly, perhaps by calling the model recursively.

I think you’re right.

• Tianyu October 2, 2019 at 9:57 pm #

I have the same questions like Q1, so how do you output the sequence of cell states? Thank you.

• Jason Brownlee October 3, 2019 at 6:47 am #

You don’t, generally. You output a sequence of activations, referred to in the papers as h.

11. Zebo Li January 31, 2018 at 2:06 pm #

Hi, very good explanation.
One question, I thought h = activation (o), is that correct? (h: hidden state output, o: hidden cell)
But tanh(-0.19803026) does not equals -0.09228823. (The default activation for LSTM should be tanh)

12. Vinayaka February 27, 2018 at 1:29 am #

Thank you so much, Jason. This cleared my doubt.

13. Jason March 6, 2018 at 2:55 pm #

Thank you so much for writing this. This is really a big help.

14. tiopon March 16, 2018 at 3:53 pm #

hey Jason， how could i get the final hidden state and sequence both when using a bidirectional wrapper？

15. Alex April 12, 2018 at 4:28 pm #

Hi Jason, can I connect the output of a dense layer to the c state of a LSTM in such a way that it initialize the state c with this value before each batch? Thanks

• Jason Brownlee April 13, 2018 at 6:34 am #

I’m sure you can (it’s all just code), but it might require some careful programming. I don’t have good advice other than lots of trial and error.

Let me know how you go.

16. Andrew April 13, 2018 at 2:18 am #

What is the hidden state and cell state of the first input if it does not have a previous hidden or cell state to reference?

17. Ali July 17, 2018 at 4:12 am #

When I use the following code based on a bidirectional LSTM, it returns this error:
‘ is not connected, no input to return.’)
AttributeError: Layer sequential_1 is not connected, no input to return.

But when ordinary LSTM (commented code) is ran, it returns correctly.

self.model = Sequential()
# return_sequences=True,name=’hidden’))
# self.intermediate_layer = Model(inputs=self.model.input, outputs=self.model.get_layer(‘hidden’).output)
return_sequences=True,name=’hidden’),merge_mode=’concat’))
self.intermediate_layer = Model(input=self.model.input,output=self.model.get_layer(‘hidden’).output)

why?

18. Ali July 17, 2018 at 11:01 pm #

Is there no reply for this?

19. Sam September 6, 2018 at 7:25 am #

Thank you so much for your explanation!

20. Klaas Brau November 14, 2018 at 4:23 am #

Awesome Work Jason. I always thought the last hidden state is equal to the cell state. So I was wrong and the hidden state and the cell state is never the same?
Thank you

• Jason Brownlee November 14, 2018 at 7:36 am #

Yes, they are different things.

21. Hussain November 23, 2018 at 8:20 pm #

Hi,
I just wanna thank you for the entire site.
Whenever I am stuck in code or concepts I visit your site and things get cleared up.

• Jason Brownlee November 24, 2018 at 6:31 am #

22. Clive December 23, 2018 at 4:55 am #

Hi so in the above example our network consist of only one lstm node or cell
And the output is feed to it of 3 timestamps one at a time ?

Or does it have 3 cells for each timestemp

23. mk January 17, 2019 at 6:41 pm #

“Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.”
In another of your posts, the encoder-decoder LSTM model code is as follows:
but return_state = false?

24. M February 9, 2019 at 12:16 am #

Hello Jason,
Thanks for the great post. I have a quick question about the bottleneck in the LSTM encoder above. As I understand it, if the encoder has say 50 cells, then does this mean that the hidden state of the encoder LSTM layer contains 50 values, one per cell?

If this is correct, then would it be accurate to say that if the original data had 50 timesteps and a dimensionality/feature count of 3, then having an encoder LSTM with 20 cells (which would give a hidden state of 20 values) could be considered to be a sort of compression/dimensionality reduction (a la autoencoders and compressed representations) ?

Finally, does it make sense to apply have a fully-connected layer with some nonlinearity operating on the hidden state for purposes of dimensionality reduction i.e hidden state with 50 values -> FFlayer with 10 neurons, ‘compressing’ the 50 values to 10…?

Thanks again

• Jason Brownlee February 9, 2019 at 5:58 am #

Yes, correct.

If you want to use the hidden state as a learned feature, you could feed it into a new fully connected model.

• M February 9, 2019 at 11:14 pm #

Brilliant, thanks!

25. Shlomi Schwartz February 24, 2019 at 11:24 pm #

Excellent post, how would one save the state when prediction samples arrives from multiple sources, like the question posted here https://stackoverflow.com/questions/54850854/keras-restore-lstm-hidden-state-for-a-specific-time-stamp ?

• Jason Brownlee February 25, 2019 at 6:43 am #

You can save state by retrieving it from the model and saving it to a file.

26. Sandeli March 3, 2019 at 8:50 pm #

Dear Jason

Thank you very much for the great post. Currently I am working on two-stream LSTM network(sequence of images) and I am trying to extract both LSTMs each time step’s cell state and calculate the average value. Afterwards update next time step with this previous time step’s average value + existing cell state value. And continue this process thru all time steps.

Greatly appreciate if you could explain me how do we update LSTM cell states(as each time steps) by giving additional value. Thank you very much.

• Jason Brownlee March 4, 2019 at 6:58 am #

Why are you trying to average the cell state exactly?

• Sandeli March 4, 2019 at 1:47 pm #

Thank you very much for your response. I am doing it the following way




• Sandeli March 4, 2019 at 1:50 pm #

Please be noted 2nd LSTM is for Optical flow stream mistakenly comment both LSTMs for RGB. Thank you.

• Jason Brownlee March 4, 2019 at 2:18 pm #

Why? Why do you want to do this?

• Sandeli March 4, 2019 at 6:04 pm #

Currently I working on two-steam networks with image sequence. I want to study that is there any advantage of communicating cells states in each time steps of both streams rather than without communicate (just as normal 2-stream network) as part of my research. Thank you for your concern.

27. Sandeli March 4, 2019 at 4:56 pm #

Currently I working on two-steam networks with image sequence. I want to study that is there any advantage of communicating cells states in each time steps of both streams rather than without communicate (just as normal 2-stream network) as part of my research. Thank you for your concern.

• Jason Brownlee March 5, 2019 at 6:31 am #

Interesting, let me know how you go.

28. saria March 6, 2019 at 4:34 am #

Really great article, Thanks a lot:)

29. Jay April 27, 2019 at 6:10 pm #

It really solved my confusion. Thank you 🙂

30. Harish May 14, 2019 at 2:23 pm #

Thanks Jason,

Can you pls tell me how to use return states with Bidirectional wrapper on LSTM? The unpacking of outputs throws error

code:
encoder = Bidirectional(LSTM(n_a, return_state=True))
encoder_outputs, state_h, state_c = encoder(encoder_inputs)

error:
ValueError: too many values to unpack (expected 3)

• Jason Brownlee May 14, 2019 at 2:29 pm #

Perhaps assign the result to one variable and inspect it to see what you have?

31. Niclas H June 19, 2019 at 3:45 pm #

Question: Is only the hidden state forwarded to upper layers in LSTM, or is also the memory cell state forwarded to upper layers?

Or is the memory cell state only forwarded along the time sequence?
Thank you!

• Jason Brownlee June 20, 2019 at 8:23 am #

Only the hidden state is output; the memory state remains internal to the node.

32. QuantCub June 27, 2019 at 4:58 am #

Hi Jason,

Thanks for sharing. I am not sure if I understand Keras.LSTM correctly. Could you please help me clarify / correct the following statements?

1. Keras LSTM is an output-to-hidden recurrent by default, e.g. it sends previous output to current hidden layers;
2. To create a hidden-to-hidden LSTM, can we do:
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=(inputs1, state_h), outputs=lstm1)
3. Does Keras train LSTM using teacher forcing or BPTT?

• Jason Brownlee June 27, 2019 at 8:04 am #

Not sure I follow. The LSTM has outputs and hidden state.

• QuantCub June 28, 2019 at 1:16 am #

Sorry for the confusion. My output-to-hidden refers to the 2nd of the three patterns in Goodfellow’s Deep Learning: Chapter 10

Some examples of important design patterns for recurrent neural networks include the following:
• Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
• Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4
• Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5.

• QuantCub June 28, 2019 at 1:21 am #

Back to my question:
1. Keras LSTM is pattern 2 (previous output to current hidden) by default?
2. Can we use return_state to create a pattern 1 (previous hidden to current hidden) model?
3. Does Keras train LSTM using BPTT? Or can it choose between teacher forcing and BPTT based on patterns?

33. Harshula June 29, 2019 at 7:11 pm #

Hi Jason.
I wanted to stack 2 GRUs. First one has hidden layers 64 and the second one 50 hidden layers
I am unsure how to go about defining that.

encoder_inputs = Input(batch_shape=(32, 103, 1), name='encoder_inputs')

encoder_gru1 = GRU(64, return_sequences=True, return_state=True, name='encoder_gru1')
encoder_out1, encoder_state1 = encoder_gru1(encoder_inputs)

encoder_gru2 = GRU(50, return_sequences=True, name='encoder_gru')
encoder_out, encoder_state = encoder_gru2(encoder_out1)

decoder_gru = GRU(50, return_sequences=True, name='decoder_gru')
decoder_out, decoder_state = decoder_gru(encoder_out)

However i get following error

encoder_out, encoder_state = encoder_gru2(encoder_out1)

File “C:\Users\Harshula\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py”, line 457, in __iter__
“Tensor objects are only iterable when eager execution is ”

TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.

34. sopa July 2, 2019 at 5:56 pm #

Thank you for these understandable article. I have a question, how to plot predictions. I mean when I apply sine wave to code to see three output of LSTM how can I plot outputs in the form of continues signal.

• Jason Brownlee July 3, 2019 at 8:26 am #

You could use matplotlib and the plot() function.

35. sopa July 9, 2019 at 4:37 pm #

I mean I want to plot lstm1, state_h, state_c. but when I write model.fit like that:

model.fit(trainX, trainY=[lstm1, state_h, state_c], epochs=10, batch_size=1, verbose=2)

I got this error:
TypeError: Unrecognized keyword arguments: {‘trainY’: [, array([[]],

dtype=object), array([],

dtype=object)]}

should all of lsrm1, state_h, state_c have three dimension?

• Jason Brownlee July 10, 2019 at 8:04 am #

This looks really wrong, e.g. state variables as target variables in a call to fit.

What are you trying to achieve exactly?

36. Leo July 15, 2019 at 8:37 pm #

Hi Jason

My code has three output of lstm : output, hidden_state, cell_state.

I want to see all of them.

lstm, h, c = LSTM(units=20, batch_input_shape=(1,10, 2), return_sequences=True,

return_state=True)(inp)

dense = Dense(2)(lstm)

model = Model(inputs=inp, outputs=dense )

I want to plot all three of my output. I can plot lstm but I can’t plot h_state and c_state.

How can I do that?

• Jason Brownlee July 16, 2019 at 8:16 am #

That is odd.

Nevertheless, you could use matplotlib to plot anything you wish. e.g. plot(…)

37. Youssef MELLAH August 7, 2019 at 8:00 pm #

Hello Jason,

Thank you for this good post.

I have two hidden states of two Bi-LSTMs, H1 and H2, and I want to use them as inputs to two Dense layers. Should I connect the two dense layers to the two Bi-LSTMs and that’s done? Or connect them directly to the hidden states?

some thing like this :

hidden1 = Dense(100)(H1)
hidden2 = Dense(100)(H2)

thanks again!

• Jason Brownlee August 8, 2019 at 6:33 am #

If by hidden states you mean those states that are internal to the LSTM layers, then I don’t think there is an effective way to pass them to a dense.

If you mean the outputs of the layer (the common meaning), then this looks fine.

• Youssef MELLAH August 8, 2019 at 7:10 pm #

for being mor clear, i have two text inputs and i use embedding for encoding them, and i puted the output of embeddings into two Bi-LSTMs.

so i want to use the hidden states of the two Bi-LSTM to do predictions. The hidden state for the first input is returned as above :
lstm, forward_h, forward_c, backward_h, backward_c= Bidirectional(..)(Embedding)
and H1 is calculated as : H1 = Concatenate()([forward_h, backward_h]).

the same thing i did for the seconde input and i calculated H2.

mathematiccaly, how can i impliment the above formule :

softmax(V tanh(W1*H1 + W2*H2))

which W and V represent all trainable parameter matrices and vectors, respectively.

thank’s Jason for helping me.

• Jason Brownlee August 9, 2019 at 8:09 am #

Nice work!

Looks like you want a weighted sum of the two vectors, perhaps a custom layer?

• Youssef MELLAH August 9, 2019 at 6:59 pm #

perhaps but to decrease complexity, i removed the two Bi-LSTM so i use the embeddings only for encoding.

so in order to do classification by using the 2 embeddings, can i use this mathematique formule: softmax(V tanh(W1*E1 + W2*E2)) ? if so, the code above is correct to represente it?

input1 = Input(shape=(25,))
E1 = Embedding(vocab_size, 100, input_length=25,
weights=[embedding_matrix], trainable=False)(input1 )
input1_hidden1 = Dense(100)(E1)

input2 = Input(shape=(25,))
E2 = Embedding(vocab_size, 100, input_length=25,
weights=[embedding_matrix], trainable=False)(input2 )
input1_hidden2 = Dense(100)(E2 )

model = Model(inputs=[input1 , input2],outputs=output1)

• Jason Brownlee August 10, 2019 at 7:14 am #

I’m eager to help, but I don’t have the capacity to review/debug your code.

Perhaps try posting to the keras user group:
https://machinelearningmastery.com/get-help-with-keras/

• Youssef MELLAH August 15, 2019 at 12:44 am #

that’s interesting, thanks a lot Jason

• Jason Brownlee August 15, 2019 at 8:12 am #

No problem.

38. Dawjidda August 16, 2019 at 1:57 am #

TypeError: GRU can accept only 1 positional arguments (‘units’,), but you passed the following positional arguments: [4, 200]

39. Usman October 2, 2019 at 2:44 am #

Amazing explanation!
Just have one confusion.

In the very first example, where LSTM is defined as LSTM(1)(inputs1) and Input as Input(shape=(3,1)).

So, the way you have reshaped data (1,3,1) means the timesteps value is 3(middle value) and the no of cells in LSTM is 1 i.e., LSTM(1).

I am confused about how 1-LSTM is going to process 3 timestep value. I mean shouldn’t there be 3 neurons/LSTM(3) to process the (1,3,1) shape data? Or is the LSTM going to process each input one after the other in sequence?

Thanks

40. Hugo April 9, 2020 at 4:25 am #

Perfectly clear. It was very helpful to me.

• Jason Brownlee April 9, 2020 at 8:08 am #

Thanks, I’m happy to hear that!

41. Nima May 12, 2020 at 1:41 am #

Hi,

Thanks for the clear discuss. I have a question about a little different implementation. I have a model with an lstm layer, where the hidden layer of the last time step will be passed to a softmax to create a sentiment. Is there any way that I could access the hidden states of this model when passing a new sequence to it?

• Jason Brownlee May 12, 2020 at 6:47 am #

Yes, you can define the model using the functional api to output the hidden state as a separate output of the model.

42. Nishchay Chawla May 18, 2020 at 8:28 pm #

Hi,
Thanks for the clear explanation. I am trying to make an encoder-decoder model, but this model will have two decoders(d1 and d2) and one encoder. One decoder(d1) gets input only from encoder while another one(d2) will get input from encoder and other decoder(d1). d2 must get hidden states from d1 only when d1 makes a particular type of prediction. Both decoders have a different set of vocabulary. Say d1 has “a,b,c,d” and d2 has “P,Q,R,S”. I want to pass a hidden state from d1 to d2 only when d1 predicts “b”.
I hope this statement gives some sense of what I am trying to do. Thanks!

• Jason Brownlee May 19, 2020 at 6:02 am #

That sounds complex. Not sure what I can do for you, sorry.

Perhaps experiment with different prototypes until you achieve what you need?

43. Giri September 19, 2020 at 7:03 pm #

Dear Jason,

I usually visit your website lot of times for if i have any question. All your articles are so crisp and so is this return sequences and return state. No complex coding and point to point. Thanks for the good work you are doing.

I have a question. If in the above examples instead of LSTM(1), if we give LSTM(5) lets say.

inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(5, return_sequences=True)(inputs1)

Then my output will be a 3 D array. But i wonder how 5 hidden states at each time step are
generated in LSTM. each LSTM has 1 hidden and 1 cell state right.

Can you please clarify my question?

44. Pratik Sen November 12, 2020 at 4:37 am #

[[[0.1]
[0.2]
[0.3]]] is the input given to the LSTM. But instead, Can the input to the LSTM be [[0.1 0.2 0.3]]

45. Pratik Sen November 12, 2020 at 4:52 am #

OK, I have found the answer. The LSTM layer requires input in 3D format only.

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/
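A minimal sketch of the difference between the two layouts, using NumPy only. The variable names are illustrative.

```python
import numpy as np

# [[0.1 0.2 0.3]] is a 2D array: 1 row, 3 columns.
flat = np.array([[0.1, 0.2, 0.3]])
print(flat.shape)  # (1, 3)

# Reshape to (samples, time_steps, features) = (1, 3, 1).
data = flat.reshape((1, 3, 1))
print(data.shape)  # (1, 3, 1)
```

The same three numbers could also be reshaped to (1, 1, 3), but that would mean one time step with three features rather than three time steps with one feature.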

46. Adam December 10, 2020 at 12:32 pm #

Jason,

A quick question.

When you produce a single hidden state output, does that mean the prediction for t4 based on the input [t1, t2, t3], or the prediction for t3?

Along the same lines, when producing a three-step hidden state output, does that mean the prediction for [t1, t2, t3] or for [t2, t3, t4]?

Also, if we wanted a single hidden state output n steps ahead (t+n), how would we specify that in your example?

Thanks and hope to hear back from you soon!

• Adam December 10, 2020 at 12:51 pm #

This was a dumb question.

There’s no time-step-based prediction set up here, which would need data preparation and training to match.

I’d interpret hidden state outputs literally as outputs that carry over information up to t3 from t1.

Thanks

• Jason Brownlee December 10, 2020 at 1:27 pm #

When you use return state, you are only getting the state for the last time step.

47. Takhir January 6, 2021 at 1:06 am #

Thank you Jason!
Your materials help me very much in learning.
Your simple and clear explanations are what newcomers really need.

48. Lawrence January 11, 2021 at 3:02 pm #

My LSTM is like this:

def _get_model(input_shape, latent_dim, num_classes):
    inputs = Input(shape=input_shape)
    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)
    fc_lyr = Dense(num_classes)(lstm_lyr)
    soft_lyr = Activation('relu')(fc_lyr)
    model = Model(inputs, [soft_lyr, state_h, state_c])
    return model

model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)
history = model.fit(X_train, Y_train)

print(history.history.keys())
dict_keys(['loss', 'activation_26_loss', 'lstm_151_loss', 'activation_26_accuracy', 'lstm_151_accuracy', 'val_loss', 'val_activation_26_loss', 'val_lstm_151_loss', 'val_activation_26_accuracy', 'val_lstm_151_accuracy'])

I get 3 losses and 2 accuracies like this:

Epoch 1/2000
1/1 [==============================] - 1s 698ms/step - loss: 0.2338 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1185 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2341 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1181 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00

How to read the losses and accuracies?
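One reading of the log, offered as a sketch rather than a definitive answer: for a Keras model with several outputs, the reported loss is the sum of the per-output losses (weighted if loss_weights is set). The figures above are consistent with that, as the arithmetic below shows. The variable names are taken from the log keys.

```python
# Values copied from the first epoch of the training log above.
activation_26_loss = 0.1153
lstm_151_loss = 0.1185
total_loss = activation_26_loss + lstm_151_loss
print(round(total_loss, 4))  # 0.2338

# The validation figures add up the same way.
val_total_loss = 0.1160 + 0.1181
print(round(val_total_loss, 4))  # 0.2341
```

Because state_h and state_c are listed among the model outputs, the LSTM states are scored against the targets as well. If only the prediction should be trained, a model defined as Model(inputs, soft_lyr) would likely report a single loss and accuracy.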

49. tanunchai September 19, 2021 at 8:26 am #

On pages 102 and 104 of the book Long Short-Term Memory Networks With Python, when I run the program in Listing 8.25, I find that the Sequential object has no predict_classes method. See:
https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

The only prediction methods listed there are predict, predict_on_batch, and predict_step; predict_classes is no longer available.

The problem is in this part of the listing:

# prediction on new data
X, y = generate_examples(size, 1)
yhat = model.predict_classes(X, verbose=0)  # problem at this line
expected = "Right" if y[0]==1 else "Left"
predicted = "Right" if yhat[0]==1 else "Left"
print('Expected: %s, Predicted: %s' % (expected, predicted))

The error message says that the Sequential object has no attribute “predict_classes”.

I found on the internet that model.predict_classes is deprecated and was removed after 2021-01-01, so it can no longer be used.

My solution was to replace the “yhat” line with the two lines below:

import numpy as np
yhat = np.argmax(model.predict(X), axis=-1)

With this change, everything runs successfully.

Did I do it right?

• Adrian Tam September 20, 2021 at 2:27 pm #

That’s correct. You did it perfectly.
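The replacement can be checked without a trained model. The sketch below uses made-up probability arrays in place of model.predict(X); the values are hypothetical. For a softmax output, argmax over the last axis picks the class. If a model instead ends in a single sigmoid unit, argmax over the last axis always returns 0, so thresholding at 0.5 is the usual substitute.

```python
import numpy as np

# Hypothetical stand-in for model.predict(X) with a 2-class softmax output.
softmax_probs = np.array([[0.9, 0.1],
                          [0.2, 0.8]])
softmax_classes = np.argmax(softmax_probs, axis=-1)
print(softmax_classes)  # [0 1]

# Hypothetical stand-in for model.predict(X) with a single sigmoid output.
sigmoid_probs = np.array([[0.3], [0.7]])
sigmoid_classes = (sigmoid_probs > 0.5).astype("int32")
print(sigmoid_classes.ravel())  # [0 1]

# argmax on a one-column output is always 0, whatever the probability.
print(np.argmax(sigmoid_probs, axis=-1))  # [0 0]
```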

50. Anthony The Koala November 17, 2021 at 2:56 am #

Dear Dr Jason,

In the section “Return Sequences”, you had a sequence

* I would have thought that the next value in the sequence would be around 0.4, but in the example and trying it out myself as in the above example it is

YET, at https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/ where the sequence is

The output was close to 100:

In other words, for the first example with input = [0.1,0.2,0.3], why did we get -0.09 instead of something near 0.4, in the same way that the sequence [70,80,90] gave something close to 100, at 101?

Thank you
Anthony of Sydney

• Adrian Tam November 17, 2021 at 7:16 am #

As said “Your specific output value will differ given the random initialization of the LSTM weights and cell state.”
The prediction is from a model that is NOT TRAINED, hence it is random and depends on the initialization of the LSTM layer. If you call model.fit() first, you should see a better answer (given enough data to train on, which this particular example doesn’t have).

• Anthony The Koala November 17, 2021 at 2:52 pm #

The key concept is that the model was “NOT TRAINED”: the key statement in model building, “model.fit()”, was never called to train it.
Thank you for the clarification.
Anthony from Sydney.

• Adrian Tam November 18, 2021 at 5:31 am #

You’re welcome!

51. Anthony The Koala November 17, 2021 at 3:05 am #

Dear Dr Jason,
When you were using return_states = True at

I need clarification, please, on the last two numbers.
In the text of the tutorial it says “..Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer….”
If the first value is the predicted value and you then have “one hidden state output”, what is the 3rd output of -0.11 in the following output?

Thank you,
Anthony of Sydney

• Adrian Tam November 17, 2021 at 7:20 am #

An LSTM has a cell state, a hidden state, and an output. The three numbers you quoted are all outputs, one for each element of your input [0.1, 0.2, 0.3].
See the figure in this Stack Overflow answer: https://stackoverflow.com/a/50235563

• Anthony The Koala November 17, 2021 at 4:17 pm #

In the LSTM model you mentioned that it had a cell state, a hidden state and an output.
I need clarification please on the order of presentation/print of the following which came from the example model.

Question: is the order of the numbers in the 3D array cell state, hidden state, and output?

Then I tried to model.fit() using the following code:

I wanted to get a one step ahead prediction for x.

The “split_sequence” function was based on the code under the subheading “Vanilla LSTM” at https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

Thank you,
Anthony of Sydney

• Adrian Tam November 18, 2021 at 5:35 am #

Your model expects an input shape of (3,1), but you called “x, y = split_sequence(data,1)”, which produces an input shape of (1,1). Changing it to “x, y = split_sequence(data,3)” should fix it.

• Anthony The Koala November 18, 2021 at 8:24 pm #

Unfortunately the modification did not work. I still had runtime errors.

So what I did was to replicate the code under the subheading “Return Sequences” and add compile and fit.

The result: there were no errors, BUT none of the predicted values was close to the expected next value of 0.4.

In sum:
* I made x, in the shape (1,3,1) = shape (samples, time_steps, features) as per https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/
x = array([0.1, 0.2, 0.3]).reshape((1,3,1))

* I made y, in the shape (1,3) = shape (samples, time_steps)
y = array([0.2,0.3,0.4]).reshape((1,3))

* Added compile and fit which was not in the original model
* The aim was to predict the next value of y to be 0.5

Result: I could not find anything from the predictions resembling y being close to 0.5

Thank you,
Anthony of Sydney

• Adrian Tam November 19, 2021 at 10:30 am #

300 epochs with batch_size=1 and a dataset size of 1 means you allow only 300 chances to update the network weights. I tried bumping this up to 10000 and it seems to produce a better result. Also try setting return_state=False, as you’re not using the state and not training on it anyway.

52. Anthony The Koala November 19, 2021 at 2:55 pm #

By setting return_state=False, I get the following runtime error:

However keeping return_state=True, I don’t have problems. Nevertheless, I don’t have anything matching the predicted value of 0.4.

When epochs = 1000, last value is 0.377, close to 0.4
When epochs = 2000, last value is 0.345, larger departure from 0.4
When epochs = 4000, last value is 0.303 larger departure from 0.4
When epochs = 8000, last value is 0.322 closer to 0.4
When epochs = 16000, last value is 0.308 larger departure from 0.4

Summary:
* By setting return_state=False, a runtime error occurs.
* Increasing the number of epochs does not necessarily give a more accurate result. In this experiment, increasing the epochs to 16000 produced a worse prediction than with 2000 epochs.
* I would like to understand why an LSTM for a really short array did not accurately predict the next value in the sequence.

Thank you,
Anthony of Sydney

• Adrian Tam November 20, 2021 at 1:50 am #

If you set return_state=False, your LSTM will return only output but not the states. So you should write:

• Anthony The Koala November 20, 2021 at 4:54 pm #

It appears that the results have much improved with this modification:

epochs = 300, final value = 0.32

epochs = 1000, final result = 0.417

epochs = 2000, final result = 0.4

epochs = 4000, final result = 0.391

Here is the final program, followed by my summary and questions:

Summary:
* The ‘simplified’ model produced no runtime errors. I am only interested in the LSTM output.
* More accurate results with return_state=False.
* Increasing the number of epochs may increase accuracy, as evidenced by the magnitude of the error at each epoch. However, increasing the number of epochs may also increase the error. In this case, epochs > 2000 resulted in an increased error and a less accurate result.
* The number of epochs that gives the minimum error may well lie between 1000 and 2000. The optimum number of epochs can be determined graphically or with an early stopping mechanism.
* Note on “return_state=True” and “return_state=False”: results are less accurate when “return_state=True”.
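The epoch counts reported above can be compared directly. A minimal sketch, using only the four (epochs, final value) pairs from this comment and the target of 0.4. The variable names are illustrative.

```python
# (epochs, final predicted value) pairs reported in the comment above.
results = {300: 0.32, 1000: 0.417, 2000: 0.4, 4000: 0.391}
target = 0.4

# Absolute error of each run against the target value.
errors = {epochs: abs(value - target) for epochs, value in results.items()}
best_epochs = min(errors, key=errors.get)
print(best_epochs)  # 2000
```

This compares separate training runs after the fact. The Keras EarlyStopping callback works differently: it monitors a metric such as the loss during a single run and stops when the metric no longer improves.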

* To get an optimum result, the number of epochs was over 1000. This is such a simple AR(1) process, so why are so many epochs required to produce an accurate result?
* I wanted to predict with a new x:

(i) Why did I need a length-3 array of [0.2, 0.3, 0.4]? Why couldn’t I predict with a size-1 array of [0.4]?

(ii) To get close to a prediction of 0.5, I had to increase the number of epochs to 10000, but the result was only 0.45805833, not 0.5 as expected.

(iii) When I used return_state=True, the result was less accurate. I need clarification of return_state=False versus return_state=True. There is documentation, BUT not on the consequences for the final result in a simple model.

Thank you again for your assistance as it gets me to understand the concept a little more clearly.

Anthony of Sydney

• Adrian Tam November 21, 2021 at 7:50 am #

(i) Because you set return_sequences=True.
(ii) Because you have too little data for the LSTM to learn the rule.
(iii) Empirically you see the effect. I didn’t dig into the code to tell why, but I guess the internal design of TensorFlow makes it ignore the unused variable and hence train the network better, somewhat like increasing the signal and reducing the noise.

• Anthony The Koala November 21, 2021 at 4:55 pm #

I understood the answers to (ii) and (iii), especially that the sequence needs to be longer and that the TensorFlow backend may behave differently depending on whether return_state is True or False.

Nevertheless for (i), even setting return_sequences=False, I don’t understand why I still need to have len(x) = 3 instead of 1 when I don’t have.

I had to have x to be of length 3 to predict.

I could not use x = array([0.4]).reshape((1,1,1)) to predict.

I get a runtime error

In other words, if I only want to predict one value, why do I need three values as input?

Anyway, I appreciate your replies because it helps me better understand,

Thank you,
Anthony of Sydney

• Adrian Tam November 23, 2021 at 1:08 pm #

Notice that you passed an input layer to the LSTM layer. In the input layer, you specified the shape you are expecting. Hence it is 3, not 1.

53. Anthony The Koala November 23, 2021 at 3:14 pm #

The conclusion is that if you want to predict with an LSTM model, the prediction data must have the same 3D shape as the LSTM’s input layer.

Anthony of Sydney

• Adrian Tam November 24, 2021 at 1:02 pm #

That’s correct. I think that’s partially a limitation imposed by Keras’ design.

54. Anthony The Koala November 24, 2021 at 7:16 pm #

Thank you, it is appreciated
Anthony of Sydney

55. Steffen January 26, 2022 at 3:04 am #

If it is possible to get the outputs like that, is it then also possible to change the RNN and LSTM layers in some way, so that several hidden states can be used as input and internally in the LSTM cell?

If yes, what needs to be changed in the RNN layer, or what can be used? I think concatenating the hidden states beforehand doesn’t work on its own, since I want to separate them again afterwards in the cell.