Last Updated on August 14, 2019

The Keras deep learning library provides an implementation of the Long Short-Term Memory, or LSTM, recurrent neural network.

As part of this implementation, the Keras API provides access to both return sequences and return state. The use and difference between these data can be confusing when designing sophisticated recurrent neural network models, such as the encoder-decoder model.

In this tutorial, you will discover the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

After completing this tutorial, you will know:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.


Let's get started.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

- Long Short-Term Memory
- Return Sequences
- Return States
- Return States and Sequences

## Long Short-Term Memory

The Long Short-Term Memory, or LSTM, is a recurrent neural network whose units contain internal gates.

Unlike simple recurrent neural networks, the network's internal gates allow the model to be trained successfully using backpropagation through time, or BPTT, and largely avoid the vanishing gradients problem.

In the Keras deep learning library, LSTM layers can be created using the LSTM() class.

Creating a layer of LSTM memory units allows you to specify the number of memory units within the layer.

Each unit or cell within the layer has an internal cell state, often abbreviated as "*c*", and outputs a hidden state, often abbreviated as "*h*".
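To make the difference between *c* and *h* concrete, the following is a minimal sketch of one time step of a single LSTM unit in plain Python, using the standard LSTM equations. It is not Keras's internal implementation; the function name `lstm_step` and all weight values are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM with a scalar input (illustrative)."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate cell value
    c = f * c_prev + i * g   # new cell state: stays inside the unit
    h = o * math.tanh(c)     # new hidden state: what the unit outputs
    return h, c

# hypothetical weights, chosen only so the example runs
weights = {
    "wi": 0.5, "ui": 0.1, "bi": 0.0,
    "wf": 0.4, "uf": 0.2, "bf": 1.0,
    "wo": 0.3, "uo": 0.1, "bo": 0.0,
    "wg": 0.6, "ug": 0.2, "bg": 0.0,
}

# first time step: both states start at zero
h, c = lstm_step(0.1, 0.0, 0.0, weights)
print("hidden state h:", h)
print("cell state c:", c)
```

The hidden state is a gated, squashed view of the cell state, which is why the two values differ even though they come from the same unit.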

The Keras API allows you to access these data, which can be useful or even required when developing sophisticated recurrent neural network architectures, such as the encoder-decoder model.

For the rest of this tutorial, we will look at the API for accessing these data.

## Return Sequences

Each LSTM cell will output one hidden state *h* for each input.

```
h = LSTM(X)
```

We can demonstrate this in Keras with a very small model with a single LSTM layer that itself contains a single LSTM cell.

In this example, we will have one input sample with 3 time steps and one feature observed at each time step:

```
t1 = 0.1
t2 = 0.2
t3 = 0.3
```

The complete example is listed below.

Note: all examples in this post use the Keras functional API.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example outputs a single hidden state for the input sequence with 3 time steps.

Your specific output value will differ given the random initialization of the LSTM weights.

```
[[-0.0953151]]
```

It is possible to access the hidden state output for each input time step.

This can be done by setting the *return_sequences* argument to *True* when defining the LSTM layer, as follows:

```python
LSTM(1, return_sequences=True)
```

We can update the previous example with this change.

The full code listing is provided below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1 = LSTM(1, return_sequences=True)(inputs1)
model = Model(inputs=inputs1, outputs=lstm1)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer.

```
[[[-0.02243521]
  [-0.06210149]
  [-0.11457888]]]
```

You must set *return_sequences=True* when stacking LSTM layers so that the second LSTM layer has a three-dimensional sequence input.

You may also need to access the sequence of hidden state outputs when predicting a sequence of outputs with a *Dense* output layer wrapped in a *TimeDistributed* layer.
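As a small illustration of the stacking case, the sketch below feeds the full sequence from one LSTM layer into a second one. The layer sizes (2 units, then 1 unit) are arbitrary choices for the example, not values from this tutorial.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array

# define model
inputs1 = Input(shape=(3, 1))
# first layer returns one hidden state per time step: shape (batch, 3, 2)
hidden1 = LSTM(2, return_sequences=True)(inputs1)
# second layer receives that 3D sequence and returns only its last hidden state
hidden2 = LSTM(1)(hidden1)
model = Model(inputs=inputs1, outputs=hidden2)
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
# make prediction and show its shape
output = model.predict(data, verbose=0)
print(output.shape)
```

Without *return_sequences=True* on the first layer, the second LSTM would receive a two-dimensional tensor and Keras would raise an error when the model is defined.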

## Return States

The output of an LSTM cell or layer of cells is called the hidden state.

This is confusing, because each LSTM cell retains an internal state that is not output, called the cell state, or *c*.

Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.

Keras provides the return_state argument to the LSTM layer that will provide access to the hidden state output (*state_h*) and the cell state (*state_c*). For example:

```python
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
```

This may look confusing because both lstm1 and *state_h* refer to the same hidden state output. The reason for these two tensors being separate will become clear in the next section.

We can demonstrate access to the hidden and cell states of the cells in the LSTM layer with a worked example listed below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example returns 3 arrays:

- The LSTM hidden state output for the last time step.
- The LSTM hidden state output for the last time step (again).
- The LSTM cell state for the last time step.

```
[array([[ 0.10951342]], dtype=float32),
 array([[ 0.10951342]], dtype=float32),
 array([[ 0.24143776]], dtype=float32)]
```

The hidden state and the cell state could in turn be used to initialize the states of another LSTM layer with the same number of cells.
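A minimal sketch of that idea follows, assuming the `initial_state` argument accepted when calling a Keras LSTM layer. The variable names, the 4-unit layer size, and the toy second input are illustrative assumptions, not part of the examples above.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array

# first LSTM: keep only its final hidden and cell states
enc_inputs = Input(shape=(3, 1))
_, state_h, state_c = LSTM(4, return_state=True)(enc_inputs)

# second LSTM with the same number of units, started from those states
dec_inputs = Input(shape=(2, 1))
dec_outputs = LSTM(4, return_sequences=True)(dec_inputs, initial_state=[state_h, state_c])

model = Model(inputs=[enc_inputs, dec_inputs], outputs=dec_outputs)

# toy data: a 3-step sequence for the first LSTM and a 2-step sequence for the second
enc_data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
dec_data = array([0.0, 0.4]).reshape((1, 2, 1))
result = model.predict([enc_data, dec_data], verbose=0)
print(result.shape)
```

Both layers use 4 units because each state tensor has one value per unit, so the shapes must match for the hand-off to work.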

## Return States and Sequences

We can access both the sequence of hidden states and the final cell state at the same time.

This can be done by configuring the LSTM layer to both return sequences and return states.

```python
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
```

The complete example is listed below.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from numpy import array
# define model
inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)
model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c])
# define input data
data = array([0.1, 0.2, 0.3]).reshape((1,3,1))
# make and show prediction
print(model.predict(data))
```

Running the example, we can now see why the LSTM output tensor and hidden state output tensor are declared separately.

The layer returns the hidden state for each input time step, then separately, the hidden state output for the last time step and the cell state for the last input time step.

This can be confirmed by seeing that the last value in the returned sequences (first array) matches the value in the hidden state (second array).

```
[array([[[-0.02145359],
        [-0.0540871 ],
        [-0.09228823]]], dtype=float32),
 array([[-0.09228823]], dtype=float32),
 array([[-0.19803026]], dtype=float32)]
```
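The relationship between the three returned values can also be sketched without Keras. The plain-Python loop below unrolls a single LSTM unit over the three time steps and mimics the two flags; the function `run_lstm` and its weights are hypothetical and only follow the standard LSTM equations, not the Keras source.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_lstm(inputs, w, return_sequences=False, return_state=False):
    """Unroll a single-unit LSTM over a list of scalar inputs (illustrative)."""
    h, c = 0.0, 0.0   # both states start at zero
    sequence = []
    for x in inputs:
        i = sigmoid(w["wi"] * x + w["ui"] * h)    # input gate
        f = sigmoid(w["wf"] * x + w["uf"] * h)    # forget gate
        o = sigmoid(w["wo"] * x + w["uo"] * h)    # output gate
        g = math.tanh(w["wg"] * x + w["ug"] * h)  # candidate cell value
        c = f * c + i * g
        h = o * math.tanh(c)
        sequence.append(h)                        # hidden state for this time step
    output = sequence if return_sequences else h
    if return_state:
        return output, h, c
    return output

# hypothetical weights, chosen only so the example runs
weights = {"wi": 0.5, "ui": 0.1, "wf": 0.4, "uf": 0.2,
           "wo": 0.3, "uo": 0.1, "wg": 0.6, "ug": 0.2}

seq, state_h, state_c = run_lstm([0.1, 0.2, 0.3], weights,
                                 return_sequences=True, return_state=True)
last_only = run_lstm([0.1, 0.2, 0.3], weights)
print(seq)
print(state_h, state_c)
```

Here `seq[-1]`, `state_h`, and `last_only` are the same number, which mirrors the match between the first and second arrays in the Keras output.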

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- Keras Functional API
- LSTM API in Keras
- Long Short-Term Memory, 1997.
- Understanding LSTM Networks, 2015.
- A ten-minute introduction to sequence-to-sequence learning in Keras

## Summary

In this tutorial, you discovered the difference and result of return sequences and return states for LSTM layers in the Keras deep learning library.

Specifically, you learned:

- That return sequences return the hidden state output for each input time step.
- That return state returns the hidden state output and cell state for the last input time step.
- That return sequences and return state can be used at the same time.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Thanks for this!

To help people understand some applications of the output sequence and state visually, a picture like in the following stats overflow answer is great!

https://stats.stackexchange.com/a/181544/37863

Thanks.

Hi Jason,

Do you have plans to use more of the function API in your blog series?

if so, why?

Best regards

Thabet

Yes, it is needed for more advanced model development.

I will have a “how to…” post on the functional API soon. It is scheduled.

Hi Jason, is it possible to access the internal states through return_state = True and return_sequences = True with the Sequential API? Moreover, is it possible to set the hidden state through a function like set_state()?

Thanks!

Perhaps, but not as far as I know or have tested.

Hey Jason, I wanted to show you this cool new RNN cell I’ve been trying out called “Recurrent Weighted Average” – it implements attention into the recurrent neural network – the keras implementation is available at https://github.com/keisuke-nakata/rwa and the whitepaper is at https://arxiv.org/pdf/1703.01253.pdf

I’ve also seen that GRU is often a better choice unless the LSTM’s bias is initialized to ones, and it’s baked into Keras now (whitepaper for that at http://proceedings.mlr.press/v37/jozefowicz15.pdf )

Very cool!

Just a note to say that return_state seems to be a recent addition to keras (since tensorflow 1.3 – if you are using keras in tensorflow contrib).

Shame it's not available in earlier versions – I was looking forward to playing around with it 🙂

Alex

Thanks Alex.

Hi Jason, in these example you don’t fit.

When you define the model like this: model = Model(inputs=inputs1, outputs=[lstm1, state_h, state_c]) and then fit, the fit() function expects three values for the output instead of 1. How do I correctly print the states (to see how they change during training and/or prediction)?

For a model that takes 2 inputs, they must be provided to fit() as an array.

Hi Jason, the question was about the outputs, not the inputs.. The problem is that if i set outputs=[lstm1, state_h, state_c] in the Model(), then the fit() function will expect three arrays as target arrays.
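One way to handle this, sketched here under assumptions: train a model whose only output is the prediction, and build a second model over the same layers that exposes the states. The `Dense` head, the 2-unit layer size, the names `train_model` and `state_model`, and the toy data are all illustrative.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Dense
from numpy import array

inputs1 = Input(shape=(3, 1))
lstm1, state_h, state_c = LSTM(2, return_state=True)(inputs1)
prediction = Dense(1)(lstm1)

# model used for training: a single output, so fit() needs a single target
train_model = Model(inputs=inputs1, outputs=prediction)
train_model.compile(optimizer='adam', loss='mse')

# second model sharing the same layers and weights, used only to read the states
state_model = Model(inputs=inputs1, outputs=[state_h, state_c])

# toy data: one sample, three time steps, one target value
X = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
y = array([[0.4]])

train_model.fit(X, y, epochs=2, verbose=0)
h_value, c_value = state_model.predict(X, verbose=0)
print(h_value, c_value)
```

Because both models are built from the same layer objects, calling `state_model.predict()` after each round of training shows how the states change as the weights are updated.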

Hi Alex, did u find how to handle the fit in this case?

Suppose i have

model = Model(inputs=[input_x, h_one_in , h_two_in], outputs=[y1,y2,state_h,state_c])

How would I write my model.fit() call, for both the inputs and the outputs?

Thanks,

+1

I use random initialization but the results are disappointing.

Any other ideas?

Jason,

Brilliant post as usual. I am also going to buy your LSTM book.

I however had a question on Keras LSTM error I have been getting and was hoping if you could help that?

Getting an error like this

"You must feed a value for placeholder tensor 'embedding_layer_input'"

```
/usr/local/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py in raise_exception_on_not_ok_status()
    465 compat.as_text(pywrap_tensorflow.TF_Message(status)),
--> 466 pywrap_tensorflow.TF_GetCode(status))
    467 finally:

InvalidArgumentError: You must feed a value for placeholder tensor 'embedding_layer_input' with dtype float
[[Node: embedding_layer_input = Placeholder[dtype=DT_FLOAT, shape=[], _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
[[Node: output_layer_2/bias/read/_237 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_1546_output_layer_2/bias/read", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

During handling of the above exception, another exception occurred:
```

Here is the code I wrote for this:

```python
def model_param(self):
    # Method to do deep learning
    from keras.models import Sequential
    from keras.layers import Dense, Flatten, Dropout, Activation
    from keras.layers import LSTM
    from keras.layers.embeddings import Embedding
    from keras.initializers import TruncatedNormal
    tn=TruncatedNormal(mean=0.0, stddev=1/sqrt(self.x_train.shape[1]*self.x_train.shape[1]), seed=2)
    self.model = Sequential()
    self.model.add(Embedding(self.len_vocab,300,input_length=self.x_train.shape[1]))
    # Adding LSTM cell
    self.model.add(LSTM(self.num_units,dropout=0.30,kernel_initializer=tn,name="lstm_1"))
    # Adding the dense output layer for Output
    self.model.add(Dense(1,activation="sigmoid",name="output_layer"))
    #sgd = SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    self.model.compile(loss='binary_crossentropy',
                       optimizer="adam",
                       metrics=['accuracy'])
    self.model.summary()

def fit(self):
    # Training the deep learning network on the training data
    # Adding the callbacks for Logging
    import keras
    logger_tb=keras.callbacks.TensorBoard(
        log_dir="logs_sentiment_lstm",
        write_graph=True,
        histogram_freq=5
    )
    self.model.fit(self.x_train, self.y_train,validation_split=0.20,
                   epochs=10,
                   batch_size=128,callbacks=[logger_tb]
                   )
```

Ouch, I have not seen this fault before.

Perhaps try simplifying the example to flush out the cause?

histogram_freq=5 is causing this error, this is a bug in keras, set histogram_freq=0 and it should work fine

Thanks for sharing.

This is another great post, Jason! I am a fan of all your RNN posts. 😉 Thanks!

In case anyone was wondering the difference between c (Internal state) and h (Hidden state) in a LSTM, this answer was very helpful for me:

https://www.quora.com/What-is-the-difference-between-states-and-outputs-in-LSTM

Would be correct to say that in a GRU and SimpleRNN, the c=h?

Thanks in advance!

Thanks Julian.

Hi Jason,

In the implementation of encoder-decoder in keras we do a return state in the encoder network which means that we get state_h and state_c after which the [state_h,state_c] is set as initial state of the decoder network. What does setting the initial state mean for a LSTM network? Is it that the state_h of decoder = [state_h,state_c].

Thanks in advance.

Great question, here is an example:

https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

Great. I think i get it now. Basically when we give return_state=True for a LSTM then the LSTM will accept three inputs that is decoder_actual_input,state_h,state_c. And during inference we again reset the state_c and state_h with state_h and state_c of previous prediction. Am i correct on my assumption ?

I am still a little bit confused why we use three keras models (model,encoder_model and decoder_model).

Thank You.

The return_state argument only controls whether the state is returned. A different argument is used to initialize state for an LSTM (e.g. during the definition of the model with the functional API).

Got it. Its initial_state. Thank You Jason.

Glad to hear it.

Hi Jason,

I do enjoy reading your blog. I have 2 short questions for this post and hope you could kindly address them briefly:

1. Can we return the sequence of cell states (a sort of variable similar to *lstm1*)?

2. No matter the dimension (I mean #features) of the input sequence, as we place 1 LSTMcell in the layer, both the hidden and cell states are always a scalar, right? As such, the kernel_ and recurrent_kernel_ properties in Keras (at each gate) are not in the matrix form. However, I believe your standpoint on viewing each LSTM cell having 1Dim hidden state/cell makes sense in the case of dropout in deep learning.

Please correct me if I misunderstood your post. Thank you.

Not directly, perhaps by calling the model recursively.

I think you’re right.

I have the same questions like Q1, so how do you output the sequence of cell states? Thank you.

You don’t, generally. You output a sequence of activations, referred to in the papers as h.

Hi, very good explanation.

One question, I thought h = activation (o), is that correct? (h: hidden state output, o: hidden cell)

But tanh(-0.19803026) does not equals -0.09228823. (The default activation for LSTM should be tanh)
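A note on this question: in the standard LSTM formulation, the hidden state is not simply the activation of the cell state. It is the tanh of the cell state scaled by the output gate, h = o * tanh(c), where o is a sigmoid value between 0 and 1. The short check below uses the two values printed in the post to recover the output gate value they imply; it is a sketch based on the textbook equations, not on an inspection of Keras internals.

```python
import math

# values printed by the final example in the post
c = -0.19803026   # cell state at the last time step
h = -0.09228823   # hidden state at the last time step

tanh_c = math.tanh(c)
# h = o * tanh(c), so the output gate value implied by these numbers is:
o = h / tanh_c
print("tanh(c):", tanh_c)
print("implied output gate o:", o)
```

The implied gate value falls between 0 and 1, as a sigmoid output must, so the printed states are consistent with h = o * tanh(c).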

Thank you so much, Jason. This cleared my doubt.

I’m glad to hear that.

Thank you so much for writing this. This is really a big help.

Thanks, I’m glad it helped.

Hey Jason, how could I get both the final hidden state and the sequence when using a Bidirectional wrapper?

Does the above post not help?

Hi Jason, can I connect the output of a dense layer to the c state of a LSTM in such a way that it initialize the state c with this value before each batch? Thanks

I’m sure you can (it’s all just code), but it might require some careful programming. I don’t have good advice other than lots of trial and error.

Let me know how you go.

What is the hidden state and cell state of the first input if it does not have a previous hidden or cell state to reference?

It will be zero.

When I use the following code based on a bidirectional LSTM, it returns this error:

```
' is not connected, no input to return.')
AttributeError: Layer sequential_1 is not connected, no input to return.
```

But when the ordinary LSTM (commented code) is run, it returns correctly.

```python
self.model = Sequential()
# self.model.add(LSTM(input_shape=(None,self.num_encoder_tokens), units=self.n_hidden,
#                     return_sequences=True,name='hidden'))
# self.model.add(LSTM(units=self.num_encoder_tokens, return_sequences=True))
# self.intermediate_layer = Model(inputs=self.model.input, outputs=self.model.get_layer('hidden').output)
self.model.add(Bidirectional(LSTM(input_shape=(None,self.num_encoder_tokens), units=self.n_hidden,
                                  return_sequences=True,name='hidden'),merge_mode='concat'))
self.model.add(Bidirectional(LSTM(units=self.num_encoder_tokens, return_sequences=True),merge_mode='concat'))
self.intermediate_layer = Model(input=self.model.input,output=self.model.get_layer('hidden').output)
```

why?

I have some suggestions here:

https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

https://stackoverflow.com/questions/49313650/how-could-i-get-both-the-final-hidden-state-and-sequence-in-a-lstm-layer-when-us

This can help

Is there no reply for this?

Thank you so much for your explanation!

I’m happy it helped Sam!

Awesome Work Jason. I always thought the last hidden state is equal to the cell state. So I was wrong and the hidden state and the cell state is never the same?

Thank you

Yes, they are different things.

Hi,

I just wanna thank you for the entire site.

Whenever I am stuck in code or concepts I visit your site and things get cleared up.

Thanks, I’m glad its helpful!

Hi so in the above example our network consist of only one lstm node or cell

And the output is feed to it of 3 timestamps one at a time ?

Or does it have 3 cells for each timestemp

The number of nodes in the LSTM is unrelated to the number of time steps in the data sample.

Perhaps this will make things clearer for you:

https://machinelearningmastery.com/prepare-univariate-time-series-data-long-short-term-memory-networks/

“Generally, we do not need to access the cell state unless we are developing sophisticated models where subsequent layers may need to have their cell state initialized with the final cell state of another layer, such as in an encoder-decoder model.”

In another of your posts, the encoder-decoder LSTM model code is as follows:

```python
model.add(LSTM(200, activation='relu', input_shape=(n_timesteps, n_features)))
model.add(RepeatVector(n_outputs))
model.add(LSTM(200, activation='relu', return_sequences=True))
```

but return_state = False?

Correct.

Hello Jason,

Thanks for the great post. I have a quick question about the bottleneck in the LSTM encoder above. As I understand it, if the encoder has say 50 cells, then does this mean that the hidden state of the encoder LSTM layer contains 50 values, one per cell?

If this is correct, then would it be accurate to say that if the original data had 50 timesteps and a dimensionality/feature count of 3, then having an encoder LSTM with 20 cells (which would give a hidden state of 20 values) could be considered to be a sort of compression/dimensionality reduction (a la autoencoders and compressed representations) ?

Finally, does it make sense to apply have a fully-connected layer with some nonlinearity operating on the hidden state for purposes of dimensionality reduction i.e hidden state with 50 values -> FFlayer with 10 neurons, ‘compressing’ the 50 values to 10…?

Thanks again

Yes, correct.

If you want to use the hidden state as a learned feature, you could feed it into a new fully connected model.

Brilliant, thanks!

Excellent post, how would one save the state when prediction samples arrives from multiple sources, like the question posted here https://stackoverflow.com/questions/54850854/keras-restore-lstm-hidden-state-for-a-specific-time-stamp ?

You can save state by retrieving it from the model and saving it to a file.

Thank you so much 🙂

Dear Jason

Thank you very much for the great post. Currently I am working on two-stream LSTM network(sequence of images) and I am trying to extract both LSTMs each time step’s cell state and calculate the average value. Afterwards update next time step with this previous time step’s average value + existing cell state value. And continue this process thru all time steps.

Greatly appreciate if you could explain me how do we update LSTM cell states(as each time steps) by giving additional value. Thank you very much.

Why are you trying to average the cell state exactly?

Thank you very much for your response. I am doing it the following way

Please be noted 2nd LSTM is for Optical flow stream mistakenly comment both LSTMs for RGB. Thank you.

Why? Why do you want to do this?

Currently I working on two-steam networks with image sequence. I want to study that is there any advantage of communicating cells states in each time steps of both streams rather than without communicate (just as normal 2-stream network) as part of my research. Thank you for your concern.


Interesting, let me know how you go.

Really great article, Thanks a lot:)

Thanks.

It really solved my confusion. Thank you 🙂

I’m happy to hear that.

Thanks Jason,

Can you pls tell me how to use return states with Bidirectional wrapper on LSTM? The unpacking of outputs throws error

code:

encoder = Bidirectional(LSTM(n_a, return_state=True))

encoder_outputs, state_h, state_c = encoder(encoder_inputs)

error:

ValueError: too many values to unpack (expected 3)

Perhaps assign the result to one variable and inspect it to see what you have?
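For reference, a Bidirectional-wrapped LSTM with *return_state=True* returns five tensors rather than three: the output, then the hidden and cell states of the forward layer, then those of the backward layer. The sketch below unpacks all five and concatenates the matching states; the 2-unit size and the use of `Concatenate` to merge them are illustrative choices.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Bidirectional
from keras.layers import Concatenate
from numpy import array

encoder_inputs = Input(shape=(3, 1))
encoder = Bidirectional(LSTM(2, return_state=True))
# five values: output, forward h, forward c, backward h, backward c
encoder_outputs, forward_h, forward_c, backward_h, backward_c = encoder(encoder_inputs)
# merge the states of the two directions
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])

model = Model(inputs=encoder_inputs, outputs=[encoder_outputs, state_h, state_c])
data = array([0.1, 0.2, 0.3]).reshape((1, 3, 1))
outputs = model.predict(data, verbose=0)
shapes = [o.shape for o in outputs]
print(shapes)
```

Each concatenated state has twice as many values as the wrapped LSTM has units, so a layer initialized from them would need a matching size.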

Question: Is only the hidden state forwarded to upper layers in LSTM, or is also the memory cell state forwarded to upper layers?

Or is the memory cell state only forwarded along the time sequence?

Thank you!

Only the hidden state is output, memory state remains internal the node.

Hi Jason,

Thanks for sharing. I am not sure if I understand Keras.LSTM correctly. Could you please help me clarify / correct the following statements?

1. Keras LSTM is an output-to-hidden recurrent by default, e.g. it sends previous output to current hidden layers;

2. To create a hidden-to-hidden LSTM, can we do:

lstm1, state_h, state_c = LSTM(1, return_sequences=True, return_state=True)(inputs1)

model = Model(inputs=(inputs1, state_h), outputs=lstm1)

3. Does Keras train LSTM using teaching force or BPTT?

Not sure I follow. The LSTM has outputs and hidden state.

Sorry for the confusion. My output-to-hidden refers to the 2nd of the three patterns in Goodfellow’s Deep Learning: Chapter 10

"Some examples of important design patterns for recurrent neural networks include the following:

- Recurrent networks that produce an output at each time step and have recurrent connections between hidden units, illustrated in figure 10.3.
- Recurrent networks that produce an output at each time step and have recurrent connections only from the output at one time step to the hidden units at the next time step, illustrated in figure 10.4
- Recurrent networks with recurrent connections between hidden units, that read an entire sequence and then produce a single output, illustrated in figure 10.5."

Back to my question:

1. Keras LSTM is pattern 2 (previous output to current hidden) by default?

2. Can we use return_state to create a pattern 1 (previous hidden to current hidden) model?

3. Does Keras train LSTM using BPTT? or it can choose between teaching force and BPTT based on patterns?

If you mean laterally within a layer, then no. If you mean across layers, then yes.

Yes, Keras supports a version of BPTT, more details here in general:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

And here for Keras:

https://machinelearningmastery.com/truncated-backpropagation-through-time-in-keras/

Thank you, Jason!

Hi Jason.

I wanted to stack 2 GRUs. First one has hidden layers 64 and the second one 50 hidden layers

I am unsure how to go about defining that.

can you please help

```python
encoder_inputs = Input(batch_shape=(32, 103, 1), name='encoder_inputs')
encoder_gru1 = GRU(64, return_sequences=True, return_state=True,name='encoder_gru1')
encoder_out1, encoder_state1 = encoder_gru1(encoder_inputs)
encoder_gru2 = GRU(50, return_sequences=True, name='encoder_gru')
encoder_out, encoder_state = encoder_gru2(encoder_out1)
decoder_gru = GRU(50, return_sequences=True, name='decoder_gru')
decoder_out, decoder_state = decoder_gru(encoder_out)
```

However i get following error

```
    encoder_out, encoder_state = encoder_gru2(encoder_out1)
  File "C:\Users\Harshula\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 457, in __iter__
    "Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
```

You can use this tutorial as a starting point and change the LSTMs to GRUs:

https://machinelearningmastery.com/stacked-long-short-term-memory-networks/
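A likely cause of the error above is that a GRU layer only returns a second value when *return_state=True* is set; without it, the layer returns a single tensor that cannot be unpacked into two names. The sketch below sets the flag on every layer whose state is unpacked. It uses `shape` instead of `batch_shape` for brevity, and the variable names are illustrative.

```python
from keras.models import Model
from keras.layers import Input
from keras.layers import GRU
from numpy import zeros

encoder_inputs = Input(shape=(103, 1))
# a GRU has a single state, so return_state=True yields two values: output and state
encoder_out1, encoder_state1 = GRU(64, return_sequences=True, return_state=True)(encoder_inputs)
encoder_out, encoder_state = GRU(50, return_sequences=True, return_state=True)(encoder_out1)

model = Model(inputs=encoder_inputs, outputs=[encoder_out, encoder_state])
# dummy input: one sample, 103 time steps, one feature
sequence_out, final_state = model.predict(zeros((1, 103, 1)), verbose=0)
print(sequence_out.shape, final_state.shape)
```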

Thank you for these understandable article. I have a question, how to plot predictions. I mean when I apply sine wave to code to see three output of LSTM how can I plot outputs in the form of continues signal.

You could use matplotlib and the plot() function.

I mean I want to plot lstm1, state_h, state_c. but when I write model.fit like that:

model.fit(trainX, trainY=[lstm1, state_h, state_c], epochs=10, batch_size=1, verbose=2)

I got this error:

TypeError: Unrecognized keyword arguments: {‘trainY’: [, array([[]],

dtype=object), array([],

dtype=object)]}

should all of lsrm1, state_h, state_c have three dimension?

This looks really wrong, e.g. state variables as target variables in a call to fit.

What are you trying to achieve exactly?

Hi Jason

My code has three output of lstm : output, hidden_state, cell_state.

I want to see all of them.

```python
lstm, h, c = LSTM(units=20, batch_input_shape=(1,10, 2), return_sequences=True,
                  return_state=True)(inp)
dense = Dense(2)(lstm)
model = Model(inputs=inp, outputs=dense )
```

I want to plot all three of my output. I can plot lstm but I can’t plot h_state and c_state.

How can I do that?

That is odd.

Nevertheless, you could use matplotlib to plot anything you wish. e.g. plot(…)

Hello Jason,

Thank you for this good post.

I have two hidden states from two Bi-LSTMs, H1 and H2, and I want to use them as inputs to two Dense layers. Should I connect the two Dense layers to the two Bi-LSTMs and that's it, or connect them directly to the hidden states?

some thing like this :

hidden1 = Dense(100)(H1)

hidden2 = Dense(100)(H2)

thanks again!

If by hidden states you mean those states that are internal to the LSTM layers, then I don’t think there is an effective way to pass them to a dense.

If you mean the outputs of the layer (the common meaning), then this looks fine.

To be more clear, I have two text inputs and I use embeddings to encode them, and I feed the output of the embeddings into two Bi-LSTMs.

so i want to use the hidden states of the two Bi-LSTM to do predictions. The hidden state for the first input is returned as above :

lstm, forward_h, forward_c, backward_h, backward_c= Bidirectional(..)(Embedding)

and H1 is calculated as : H1 = Concatenate()([forward_h, backward_h]).

I did the same thing for the second input and calculated H2.

Mathematically, how can I implement the following formula:

softmax(V tanh(W1*H1 + W2*H2))

which W and V represent all trainable parameter matrices and vectors, respectively.

thank’s Jason for helping me.

Nice work!

Looks like you want a weighted sum of the two vectors, perhaps a custom layer?

perhaps but to decrease complexity, i removed the two Bi-LSTM so i use the embeddings only for encoding.

so in order to do classification by using the 2 embeddings, can i use this mathematique formule: softmax(V tanh(W1*E1 + W2*E2)) ? if so, the code above is correct to represente it?

```python
input1 = Input(shape=(25,))
E1 = Embedding(vocab_size, 100, input_length=25,
               weights=[embedding_matrix], trainable=False)(input1 )
input1_hidden1 = Dense(100)(E1)
input2 = Input(shape=(25,))
E2 = Embedding(vocab_size, 100, input_length=25,
               weights=[embedding_matrix], trainable=False)(input2 )
input1_hidden2 = Dense(100)(E2 )
added = add([userQuestion_hidden1, tableShema_hidden1])
added = Activation('tanh')(added)
output1 = Dense(1, activation='softmax')(added)
model = Model(inputs=[input1 , input2],outputs=output1)
```

I’m eager to help, but I don’t have the capacity to review/debug your code.

Perhaps try posting to the keras user group:

https://machinelearningmastery.com/get-help-with-keras/

that’s interesting, thanks a lot Jason

No problem.

Please, I have an error:

TypeError: `GRU` can accept only 1 positional arguments ('units',), but you passed the following positional arguments: [4, 200]

Perhaps this will help you to better understand the input shape:

https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input

Amazing explanation!

Just have one confusion.

In the very first example, where LSTM is defined as LSTM(1)(inputs1) and Input as Input(shape=(3,1)).

So, the way you have reshaped data (1,3,1) means the timesteps value is 3(middle value) and the no of cells in LSTM is 1 i.e., LSTM(1).

I am confused about how 1-LSTM is going to process 3 timestep value. I mean shouldn’t there be 3 neurons/LSTM(3) to process the (1,3,1) shape data? Or is the LSTM going to process each input one after the other in sequence?

Thanks

Thanks.

More on time steps vs samples vs features here:

https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input

Number of LSTM units is unrelated to timesteps/features/samples.

Perfectly clear. It was very helpful to me.

Thanks, I’m happy to hear that!

Hi,

Thanks for the clear discussion. I have a question about a slightly different implementation. I have a model with an LSTM layer, where the hidden state of the last time step is passed to a softmax to predict a sentiment. Is there any way I could access the hidden states of this model when passing a new sequence to it?

Yes, you can define the model using the functional API to output the hidden state as a separate output of the model.

Hi,

Thanks for the clear explanation. I am trying to make an encoder-decoder model with one encoder and two decoders (d1 and d2). One decoder (d1) gets input only from the encoder, while the other (d2) gets input from the encoder and from d1. d2 must get hidden states from d1 only when d1 makes a particular type of prediction. The decoders have different vocabularies. Say d1 has “a,b,c,d” and d2 has “P,Q,R,S”. I want to pass a hidden state from d1 to d2 only when d1 predicts “b”.

I hope this statement gives some sense of what I am trying to do. Thanks!

That sounds complex. Not sure what I can do for you, sorry.

Perhaps experiment with different prototypes until you achieve what you need?

Dear Jason,

I visit your website often whenever I have a question. All your articles are crisp, and so is this one on return sequences and return state: no complex coding, and straight to the point. Thanks for the good work you are doing.

I have a question. Suppose that in the above examples we use LSTM(5) instead of LSTM(1):

inputs1 = Input(shape=(3, 1))

lstm1 = LSTM(5, return_sequences=True)(inputs1)

Then my output will be a 3D array. But I wonder how 5 hidden states are generated at each time step in the LSTM. Each LSTM has 1 hidden state and 1 cell state, right?

Can you please clarify?

Good question, see this:

https://machinelearningmastery.com/faq/single-faq/how-is-data-processed-by-an-lstm
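As a rough guide to the shapes involved: a layer created with LSTM(5) holds 5 units, each with its own hidden state and cell state, so the layer emits 5 values per time step. The sketch below encodes that rule in a hypothetical helper, `lstm_output_shape`, which is not part of Keras and only mirrors how the main output shape depends on return_sequences.

```python
def lstm_output_shape(samples, timesteps, units, return_sequences=False):
    """Shape of an LSTM layer's main output (hypothetical helper, not a Keras call)."""
    if return_sequences:
        # one hidden state per unit for every time step
        return (samples, timesteps, units)
    # one hidden state per unit for the last time step only
    return (samples, units)


print(lstm_output_shape(1, 3, 5, return_sequences=True))  # (1, 3, 5)
print(lstm_output_shape(1, 3, 5))                         # (1, 5)
```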

[[[0.1] [0.2] [0.3]]] is the input given to the LSTM. Can the input to the LSTM instead be [[0.1 0.2 0.3]]?

OK, I have found the answer. The LSTM layer requires input in 3D format only.

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/
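The difference between the two layouts can be seen in a small dependency-free sketch. The helpers `to_lstm_input` and `nested_shape` are hypothetical names used only for illustration; with NumPy the same result comes from reshaping the array to (1, 3, 1).

```python
def to_lstm_input(sequence):
    """Wrap a flat univariate sequence as [samples][timesteps][features]."""
    return [[[value] for value in sequence]]


def nested_shape(data):
    """Return the shape of a regularly nested list as a tuple."""
    shape = []
    while isinstance(data, list):
        shape.append(len(data))
        data = data[0]
    return tuple(shape)


flat = [0.1, 0.2, 0.3]
print(nested_shape([flat]))               # (1, 3): 2D, missing the features axis
print(nested_shape(to_lstm_input(flat)))  # (1, 3, 1): samples, time steps, features
```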

Jason,

A quick question.

When you produce a single hidden state output, is that the prediction for t4 based on the input [t1, t2, t3], or the prediction for t3?

Along the same lines, when producing a hidden state output for three steps, is that the prediction for [t1, t2, t3] or for [t2, t3, t4]?

Also, if we wanted to get a single hidden state output n steps ahead (t+n), how would we specify that in your example?

Thanks and hope to hear back from you soon!

This was a dumb question.

There’s no time-step-based prediction set up here; that would need data preparation and training designed for it.

I’d interpret the hidden state outputs literally, as outputs that carry information over from t1 up to t3.

Thanks

When you use return state, you are only getting the state for the last time step.

Thank you Jason!

Your materials help me very much in learning.

Your simple and clear explanations are what newcomers really need.

You’re very welcome!

My LSTM is like this:

def _get_model(input_shape, latent_dim, num_classes):

    inputs = Input(shape=input_shape)

    lstm_lyr, state_h, state_c = LSTM(latent_dim, dropout=0.1, return_state=True)(inputs)

    fc_lyr = Dense(num_classes)(lstm_lyr)

    soft_lyr = Activation('relu')(fc_lyr)

    model = Model(inputs, [soft_lyr, state_h, state_c])

    model.compile(optimizer='adam', loss='mse', metrics=['accuracy'])

    return model

model = _get_model((n_steps_in, n_features), latent_dim, n_steps_out)

history = model.fit(X_train, Y_train)

print(history.history.keys())

`dict_keys(['loss', 'activation_26_loss', 'lstm_151_loss', 'activation_26_accuracy', 'lstm_151_accuracy', 'val_loss', 'val_activation_26_loss', 'val_lstm_151_loss', 'val_activation_26_accuracy', 'val_lstm_151_accuracy'])`

I get 2 losses and 3 accuracies like this:

Epoch 1/2000

1/1 [==============================] - 1s 698ms/step - loss: 0.2338 - activation_26_loss: 0.1153 - lstm_151_loss: 0.1185 - activation_26_accuracy: 0.0000e+00 - lstm_151_accuracy: 0.0000e+00 - val_loss: 0.2341 - val_activation_26_loss: 0.1160 - val_lstm_151_loss: 0.1181 - val_activation_26_accuracy: 0.0000e+00 - val_lstm_151_accuracy: 0.0000e+00

How should I read the losses and accuracies?

If you are using MSE loss, then calculating accuracy is invalid. You can learn more here:

https://machinelearningmastery.com/faq/single-faq/how-do-i-calculate-accuracy-for-regression
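Two things in that log can be checked with simple arithmetic. First, the reported total loss (0.2338) is consistent with being the sum of the two per-output losses, which is how a multi-output model combines losses when no loss weights are given. Second, an accuracy-style metric is not meaningful for real-valued targets. The sketch below uses made-up target and prediction values, and `exact_match_accuracy` is a simplified stand-in, not the metric Keras actually computes.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions exactly equal to the target (illustrative only)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical regression targets and near-miss predictions
y_true = [0.2, 0.3, 0.4]
y_pred = [0.21, 0.29, 0.41]
print(mse(y_true, y_pred))                   # small error
print(exact_match_accuracy(y_true, y_pred))  # 0.0 despite good predictions

# Per-output losses as reported in the training log above
output_losses = [0.1153, 0.1185]
total_loss = round(sum(output_losses), 4)
print(total_loss)                            # matches the reported loss of 0.2338
```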

On pages 102 and 104 of the book Long Short-Term Memory Networks With Python, when I run the program in listing 8.25, I find that the Sequential object has no predict_classes method.

See:

https://www.tensorflow.org/api_docs/python/tf/keras/Sequential

Note that the only prediction methods listed there are predict, predict_on_batch, and predict_step; predict_classes is no longer available.

The failing code is:

# prediction on new data

X, y = generate_examples(size, 1)

yhat = model.predict_classes(X, verbose=0)  # problem at this line

expected = "Right" if y[0]==1 else "Left"

predicted = "Right" if yhat[0]==1 else "Left"

print('Expected: %s, Predicted: %s' % (expected, predicted))

The error message says that the Sequential object has no attribute predict_classes.

I found some information online saying that the model.predict_classes method is deprecated and was removed after 2021-01-01.

This means we can no longer use predict_classes.

How I solved it:

I replaced the “yhat” line with the 2 lines below:

import numpy as np

yhat = np.argmax(model.predict(X), axis=-1)

After this replacement, everything runs successfully.

Did I do it right?

Please answer me.

That’s correct. You did it perfectly.
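One caveat worth checking when making this swap: an argmax over the last axis suits models whose output layer has one probability per class (softmax). If the model has a single sigmoid output, the argmax of a one-column row is always 0, and thresholding at 0.5 is the usual replacement instead. Which case applies to listing 8.25 depends on that model's output layer, which is not shown here. The plain-Python helper below, `probabilities_to_classes`, is a hypothetical illustration of both cases, operating on nested lists rather than on real model output.

```python
def probabilities_to_classes(probabilities, threshold=0.5):
    """Convert rows of predicted probabilities to class labels.

    Rows with a single value are treated as a sigmoid output and thresholded;
    rows with several values are treated as a softmax output and argmax'ed.
    """
    classes = []
    for row in probabilities:
        if len(row) == 1:
            classes.append(1 if row[0] > threshold else 0)
        else:
            classes.append(max(range(len(row)), key=lambda k: row[k]))
    return classes


def argmax_last_axis(probabilities):
    """Plain argmax per row, ignoring the single-column special case."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in probabilities]


sigmoid_output = [[0.9], [0.2]]
softmax_output = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]

print(probabilities_to_classes(sigmoid_output))  # [1, 0]
print(probabilities_to_classes(softmax_output))  # [1, 0]
print(argmax_last_axis(sigmoid_output))          # [0, 0]: argmax alone loses the sigmoid case
```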

Dear Dr Jason,

Thank you for your tutorial.

In the section “Return Sequences”, you had a sequence

Questions please:

* I would have thought that the next value in the sequence would be around 0.4, but in the example and trying it out myself as in the above example it is

YET, at https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/ where the sequence is

The output was close to 100:

In other words, in the first example with input = [0.1, 0.2, 0.3], why did we get -0.09 instead of something near 0.4, in the same way that the sequence [70, 80, 90] gives something close to 100, at 101?

Thank you

Anthony of Sydney

As the tutorial says, “Your specific output value will differ given the random initialization of the LSTM weights and cell state.”

The prediction is from a NOT TRAINED model, hence it is random and depends on the initialization of the LSTM layer. If you called model.fit(), you should see a better answer (given enough data to train on, which this particular example does not have).

Dear Dr Adrian,

The key concept was that the model was “NOT TRAINED”, and the key statement in model building is “model.fit()”, which was not called to train the model.

Thank you for the clarification.

Anthony from Sydney.

You’re welcome!

Dear Dr Jason,

Under the subheading “Return States”,

When you were using return_state=True at

I need please clarification on the last two numbers.

In the text of the tutorial it says “..Running the example returns a sequence of 3 values, one hidden state output for each input time step for the single LSTM cell in the layer….”

If the first value is the predicted value, you then have “one hidden state output”, what is the 3rd output of -0.11 as in the following output.

Thank you,

Anthony of Sydney

An LSTM has a cell state, a hidden state, and an output. The three numbers you quoted are all outputs, one for each of your inputs [0.1, 0.2, 0.3].

See the figure at this stackoverflow question: https://stackoverflow.com/a/50235563

Dear Dr Adrian,

Thank you for your reply.

In the LSTM model you mentioned that it had a cell state, a hidden state and an output.

I need clarification please on the order of presentation/print of the following which came from the example model.

My question: is the order of the numbers in the 3D array cell state, hidden state, and output?

Then I tried to model.fit() using the following code:

I wanted to get a one step ahead prediction for x.

However I received an error.

The “split_sequence” function was taken from the code under the subheading “Vanilla LSTM” at https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

Thank you,

Anthony of Sydney

Your model expects an input shape of (3,1), but you called “x, y = split_sequence(data,1)”, which produces an input shape of (1,1); changing it to “x, y = split_sequence(data,3)” should fix it.
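The effect of the second argument can be seen in a small sketch. The function below is a plain-Python version written in the spirit of the tutorial's helper (it returns lists rather than NumPy arrays), and the `data` values are made up for illustration because the original listing is not shown in this thread.

```python
def split_sequence(sequence, n_steps):
    """Split a univariate sequence into (input window, next value) pairs."""
    X, y = [], []
    for i in range(len(sequence) - n_steps):
        X.append(sequence[i:i + n_steps])   # n_steps consecutive inputs
        y.append(sequence[i + n_steps])     # the value that follows them
    return X, y


data = [0.1, 0.2, 0.3, 0.4, 0.5]  # hypothetical series

X1, y1 = split_sequence(data, 1)
X3, y3 = split_sequence(data, 3)

print(len(X1[0]))  # 1 time step per sample: does not match Input(shape=(3, 1))
print(len(X3[0]))  # 3 time steps per sample: matches once a features axis is added
print(X3, y3)
```

Each window still needs a trailing features axis, i.e. a final shape of (samples, 3, 1), before being passed to the model.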

Dear Dr Adrian,

Unfortunately the modification did not work. I still had runtime errors.

So what I did was to replicate the code under the subheading “Return Sequences” and add compile and fit.

The result: there were no errors, but none of the outputs was close to the 0.4 I expected.

So I could not predict the next value, which is expected to be 0.4.

In sum:

* I made x, in the shape (1,3,1) = shape (samples, time_steps, features) as per https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

x = array([0.1, 0.2, 0.3]).reshape((1,3,1))

* I made y, in the shape (1,3) = shape (samples, time_steps)

y = array([0.2,0.3,0.4]).reshape((1,3))

* Added compile and fit which was not in the original model

* The aim was to predict the next value of y to be 0.5

Result: I could not find anything from the predictions resembling y being close to 0.5

Thank you,

Anthony of Sydney

300 epochs with batch_size=1 and a dataset size of 1 means you allow only 300 chances to update the network weights. I tried bumping this up to 10000 and it seems to produce a better result. Also try setting return_state=False, as you’re not using the states and not training on them anyway.
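The arithmetic behind that count is simple, assuming the usual scheme of one weight update per batch. The helper name `weight_updates` is hypothetical.

```python
import math


def weight_updates(n_samples, batch_size, epochs):
    """Number of weight updates, assuming one update per batch."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs


print(weight_updates(n_samples=1, batch_size=1, epochs=300))    # 300
print(weight_updates(n_samples=1, batch_size=1, epochs=10000))  # 10000
```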

Dear Dr Adrian,

Thank you for your reply.

By setting return_state=False, I get the following runtime error:

However keeping return_state=True, I don’t have problems. Nevertheless, I don’t have anything matching the predicted value of 0.4.

When epochs = 1000, last value is close to 0.4 at 0.377

When epochs = 2000, last value is 0.345 closer to 0.4

When epochs = 4000, last value is 0.303 larger departure from 0.4

When epochs = 8000, last value is 0.322 closer to 0.4

When epochs = 16000, last value is 0.308 larger departure from 0.4

Summary:

* By setting return_state=False, a runtime error occurs

* Increasing the number of epochs does not necessarily mean more accurate. In this experiment, increasing the epochs to 16000 produced a prediction worse than when no. of epochs was 2000.

* I would like to understand why an LSTM for a really short array did not accurately predict the next value in the sequence.

Thank you,

Anthony of Sydney

If you set return_state=False, your LSTM will return only output but not the states. So you should write:
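The code that followed this reply is missing from the page. As a plain-Python illustration of the underlying issue (not the author's code, and not Keras itself), the hypothetical stand-in `fake_lstm` below returns three values when return_state=True and a single value otherwise, so unpacking into three names only works in the first case.

```python
def fake_lstm(return_state):
    """Hypothetical stand-in for calling an LSTM layer; values are placeholders."""
    output, state_h, state_c = 0.3, 0.3, 0.5
    if return_state:
        return output, state_h, state_c   # output plus both states
    return output                         # output only


# With return_state=True, three names are needed on the left-hand side.
lstm1, state_h, state_c = fake_lstm(return_state=True)

# With return_state=False, assign the single output to one name.
lstm1 = fake_lstm(return_state=False)

# Keeping the three-name unpacking after switching the flag fails.
unpack_failed = False
try:
    lstm1, state_h, state_c = fake_lstm(return_state=False)
except TypeError:
    unpack_failed = True

print(lstm1, unpack_failed)
```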

Dear Dr Adrian,

Thank you for your reply.

It appears that the results have much improved with this modification:

epochs = 300, final value = 0.32

epochs = 1000, final result = 0.417

epochs = 2000, final result = 0.4

epochs = 4000, final result = 0.391

Here is the final program, followed by my summary and questions:

Summary:

* The ‘simplified’ model produced no runtime errors. I am only interested in the LSTM output.

* More accurate results with return_state=False.

* Increasing the number of epochs may also increase accuracy as evidenced by the magnitude of the error at each epoch. However, increasing the number of epochs may also result in the error increasing. In this case epochs > 2000 resulted in an increased error and less-accurate result.

* The number of epochs that gives the minimum error may well lie between 1000 and 2000. The optimum number of epochs can be determined graphically or with an early stopping mechanism.

* Note on “return_state=True” and “return_state=False”: results were less accurate with “return_state=True”.
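Using only the four results reported above, the comparison can be written out as a short calculation. This is a sketch over single runs, so the ranking may well change between runs given the random weight initialization; the variable names are illustrative.

```python
TARGET = 0.4  # expected final value in the experiment above

# Final values reported above, keyed by number of epochs
reported = {300: 0.32, 1000: 0.417, 2000: 0.4, 4000: 0.391}

# Absolute error of each run against the target
errors = {epochs: abs(value - TARGET) for epochs, value in reported.items()}
best_epochs = min(errors, key=errors.get)

for epochs in sorted(errors):
    print(epochs, round(errors[epochs], 3))
print(best_epochs)  # the setting with the smallest error among those tried
```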

Further questions please:

* To get an optimum result, the number of epochs was over 1000. It is such a simple AR(1) process, so why are so many epochs required to produce an accurate result?

* I wanted to predict with a new x:

(i) Why did I need a length-3 array of [0.2, 0.3, 0.4]? Why couldn’t I predict with an array of size 1, [0.4]?

(ii) To get as close as possible to a prediction of 0.5, I had to increase the number of epochs to 10000, but the result was only 0.45805833, not 0.5 as expected.

(iii) When I used return_state=True, the result was less accurate. I need clarification of return_state=False versus return_state=True. There is documentation, BUT not on the consequences for the final result in a simple model.

Thank you again for your assistance as it gets me to understand the concept a little more clearly.

Anthony of Sydney

(i) Because you set return_sequences=True.

(ii) Because you have too little data for the LSTM to learn the rule.

(iii) Empirically, you see the effect. I didn’t dig into the code to find out why, but I guess the internal design of TensorFlow makes it ignore the unused variables and hence train the network better, somewhat like increasing the signal and reducing the noise.

Dear Dr Adrian,

Thank you again for your kind reply, it is appreciated.

I understood the answers to (ii) and (iii), especially the need for a longer sequence and accepting the possible quirkiness of the TensorFlow backend depending on whether return_state is True or False.

Nevertheless, for (i), even after setting return_sequences=False, I don’t understand why I still need len(x) = 3 instead of 1.

I had to have x of length 3 to predict.

I could not use x = array([0.4]).reshape((1,1,1)) to predict.

I get a runtime error

In other words, if I only want to predict one value, why do I need three values as input?

Anyway, I appreciate your replies because it helps me better understand,

Thank you,

Anthony of Sydney

Notice that you passed an input layer to the LSTM layer. In the input layer, you specified the shape you are expecting. Hence it is 3, not 1.

Dear Dr Adrian,

The conclusion is that to predict with an LSTM model, the prediction data must have the same 3D shape as the LSTM’s input layer.

Again thank you very much for your kind reply,

Anthony of Sydney

That’s correct. I think that’s partially a limitation imposed by Keras’ design.
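The shape rule can be sketched as a small check, where None stands for a dimension that may take any size (the number of samples is normally left open this way). The helper `shape_is_compatible` is hypothetical and is not how Keras performs the check internally. The last line assumes an input layer declared with an open time-step dimension, e.g. shape=(None, 1), which Keras permits for recurrent layers; that is a different model from the one discussed above.

```python
def shape_is_compatible(expected, actual):
    """True if `actual` matches `expected`, treating None as a wildcard."""
    if len(expected) != len(actual):
        return False
    return all(e is None or e == a for e, a in zip(expected, actual))


# Input(shape=(3, 1)) corresponds to (samples, 3 time steps, 1 feature)
model_input = (None, 3, 1)

print(shape_is_compatible(model_input, (1, 3, 1)))        # True
print(shape_is_compatible(model_input, (1, 1, 1)))        # False: only 1 time step
print(shape_is_compatible((None, None, 1), (1, 1, 1)))    # True: time steps left open
```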

Dear Dr Adrian,

Thank you, it is appreciated

Anthony of Sydney

Hi Adrian,

If it is possible to get the outputs like that, is it also possible to change the RNN and LSTM layers in some way, so that several hidden states can be used as input and internally in the LSTM cell?

If yes, what needs to be changed in the RNN layer, or what can be used? I think concatenating the hidden states beforehand doesn’t work on its own, since I want to separate them again afterwards in the cell.

Thank you for your answer

Hi Steffen…Please kindly reduce your post to a specific question regarding the tutorial content, code listing or ebook so that I may better assist you.

I am confused about input, hidden, and output layers. Is there 1 input layer and 1 output layer by default, and when we add layers to the model, are we just changing the hidden layers?