### Gentle introduction to the Encoder-Decoder LSTMs for sequence-to-sequence prediction with example Python code.

The Encoder-Decoder LSTM is a recurrent neural network designed to address sequence-to-sequence problems, sometimes called seq2seq.

Sequence-to-sequence prediction problems are challenging because the number of items in the input and output sequences can vary. For example, text translation and learning to execute programs are examples of seq2seq problems.

In this post, you will discover the Encoder-Decoder LSTM architecture for sequence-to-sequence prediction.

After completing this post, you will know:

- The challenge of sequence-to-sequence prediction.
- The Encoder-Decoder architecture and the limitation in LSTMs that it was designed to address.
- How to implement the Encoder-Decoder LSTM model architecture in Python with Keras.

Let’s get started.

## Sequence-to-Sequence Prediction Problems

Sequence prediction often involves forecasting the next value in a real-valued sequence or outputting a class label for an input sequence.

This is often framed as a one-to-one problem (one input time step to one output time step) or a many-to-one problem (multiple input time steps to one output time step).

There is a more challenging type of sequence prediction problem that takes a sequence as input and requires a sequence prediction as output. These are called sequence-to-sequence prediction problems, or seq2seq for short.

One modeling concern that makes these problems challenging is that the length of the input and output sequences may vary. Given that there are multiple input time steps and multiple output time steps, this form of problem is referred to as a many-to-many type sequence prediction problem.
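As a rough illustration of the many-to-many framing, the sketch below splits a single series into input/output pairs whose lengths differ. The helper name `split_sequence`, the toy series, and the window sizes are assumptions made for this example; they do not come from any library.

```python
def split_sequence(sequence, n_steps_in, n_steps_out):
    """Split a sequence into (input, output) pairs of differing lengths."""
    pairs = []
    for start in range(len(sequence) - n_steps_in - n_steps_out + 1):
        mid = start + n_steps_in
        end = mid + n_steps_out
        pairs.append((sequence[start:mid], sequence[mid:end]))
    return pairs


# Toy series: three input time steps map to two output time steps.
series = [10, 20, 30, 40, 50, 60]
pairs = split_sequence(series, n_steps_in=3, n_steps_out=2)
for x, y in pairs:
    print(x, "->", y)
```

Each pair has several input steps and several output steps, which is what distinguishes this framing from the one-to-one and many-to-one cases.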


## Encoder-Decoder LSTM Architecture

One approach to seq2seq prediction problems that has proven very effective is called the Encoder-Decoder LSTM.

This architecture is comprised of two models: one for reading the input sequence and encoding it into a fixed-length vector, and a second for decoding the fixed-length vector and outputting the predicted sequence. The use of the two models in concert gives the architecture its name: the Encoder-Decoder LSTM, designed specifically for seq2seq problems.

… RNN Encoder-Decoder, consists of two recurrent neural networks (RNN) that act as an encoder and a decoder pair. The encoder maps a variable-length source sequence to a fixed-length vector, and the decoder maps the vector representation back to a variable-length target sequence.

— Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.

The Encoder-Decoder LSTM was developed for natural language processing problems where it demonstrated state-of-the-art performance, specifically in the area of text translation called statistical machine translation.

The innovation of this architecture is the use of a fixed-sized internal representation in the heart of the model that input sequences are read to and output sequences are read from. For this reason, the method may be referred to as sequence embedding.

In one of the first applications of the architecture to English-to-French translation, the internal representation of the encoded English phrases was visualized. The plots revealed a qualitatively meaningful learned structure of the phrases harnessed for the translation task.

The proposed RNN Encoder-Decoder naturally generates a continuous-space representation of a phrase. […] From the visualization, it is clear that the RNN Encoder-Decoder captures both semantic and syntactic structures of the phrases

— Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.

On the task of translation, the model was found to be more effective when the input sequence was reversed. Further, the model was shown to be effective even on very long input sequences.

We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler. … The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work

— Sequence to Sequence Learning with Neural Networks, 2014.
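The source-reversal trick described in the quote can be sketched in a few lines of plain Python. The function name `reverse_sources` and the toy token pairs are hypothetical and only meant to show that the source is reversed while the target is left alone.

```python
def reverse_sources(pairs):
    """Reverse each source sequence; leave each target sequence unchanged."""
    return [(source[::-1], target) for source, target in pairs]


# Toy (source, target) token sequences.
pairs = [(["a", "b", "c"], ["alpha", "beta", "gamma"])]
reversed_pairs = reverse_sources(pairs)
print(reversed_pairs[0])
```

After reversal, the first source token sits right next to the first target token, which is the kind of short-term dependency the authors refer to.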

This approach has also been used with image inputs, where a Convolutional Neural Network is used as a feature extractor on input images and its output is then read by a decoder LSTM.

… we propose to follow this elegant recipe, replacing the encoder RNN by a deep convolution neural network (CNN). […] it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences

— Show and Tell: A Neural Image Caption Generator, 2014.

## Applications of Encoder-Decoder LSTMs

The list below highlights some interesting applications of the Encoder-Decoder LSTM architecture.

- Machine Translation, e.g. English to French translation of phrases.
- Learning to Execute, e.g. calculate the outcome of small programs.
- Image Captioning, e.g. generating a text description for images.
- Conversational Modeling, e.g. generating answers to textual questions.
- Movement Classification, e.g. generating a sequence of commands from a sequence of gestures.

## Implement Encoder-Decoder LSTMs in Keras

The Encoder-Decoder LSTM can be implemented directly in the Keras deep learning library.

We can think of the model as being comprised of two key parts: the encoder and the decoder.

First, the input sequence is shown to the network one encoded character at a time. We need an encoding layer to learn the relationship between the steps in the input sequence and develop an internal representation of these relationships.

One or more LSTM layers can be used to implement the encoder model. The output of this model is a fixed-size vector that represents the internal representation of the input sequence. The number of memory cells in this layer defines the length of this fixed-sized vector.

```python
model = Sequential()
model.add(LSTM(..., input_shape=(...)))
```

The decoder must transform the learned internal representation of the input sequence into the correct output sequence.

One or more LSTM layers can also be used to implement the decoder model. This model reads from the fixed sized output from the encoder model.

As with the Vanilla LSTM, a Dense layer is used as the output for the network. The same weights can be used to output each time step in the output sequence by wrapping the Dense layer in a TimeDistributed wrapper.

```python
model.add(LSTM(..., return_sequences=True))
model.add(TimeDistributed(Dense(...)))
```
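To see what "the same weights at each time step" means, here is a small NumPy sketch that mimics the effect of a time-distributed dense layer. It is not the Keras implementation; the array sizes and random values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
samples, time_steps, units, n_outputs = 2, 3, 4, 1

decoded = rng.normal(size=(samples, time_steps, units))  # stand-in for decoder LSTM output
weights = rng.normal(size=(units, n_outputs))            # one shared dense kernel
bias = np.zeros(n_outputs)

# Apply the same weights independently at every time step.
outputs = np.stack(
    [decoded[:, t, :] @ weights + bias for t in range(time_steps)], axis=1
)
print(outputs.shape)  # (2, 3, 1)
```

One output vector is produced per time step, and no extra weights are added as the output sequence gets longer.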

There’s a problem though.

We must connect the encoder to the decoder, and they do not fit.

That is, the encoder will produce a 2-dimensional matrix of outputs, where the length is defined by the number of memory cells in the layer. The decoder is an LSTM layer that expects a 3D input of [samples, time steps, features] in order to produce a decoded sequence of some different length defined by the problem.

If you try to force these pieces together, you get an error indicating that the output of the encoder is 2D and that the decoder requires 3D input.

We can solve this using a RepeatVector layer. This layer simply repeats the provided 2D input multiple times to create a 3D output.

The RepeatVector layer can be used like an adapter to fit the encoder and decoder parts of the network together. We can configure the RepeatVector to repeat the fixed length vector one time for each time step in the output sequence.

```python
model.add(RepeatVector(...))
```
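The effect of repeating a 2D encoding to get a 3D array can be mimicked with NumPy. This is only a sketch of the shape change, with made-up values and sizes, not the layer's actual code.

```python
import numpy as np

# Stand-in for the encoder output: (samples=2, units=4).
encoded = np.array([[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.6, 0.7, 0.8]])
n_steps_out = 3  # assumed number of output time steps

# Insert a time axis, then copy the vector once per output time step.
repeated = np.repeat(encoded[:, np.newaxis, :], n_steps_out, axis=1)
print(encoded.shape, "->", repeated.shape)  # (2, 4) -> (2, 3, 4)
```

Every time step of `repeated` holds the same encoded vector, so the decoder sees the full encoding at each output step.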

Putting this together, we have:

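```python
from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential()
model.add(LSTM(..., input_shape=(...)))      # encoder
model.add(RepeatVector(...))                 # adapter: 2D -> 3D
model.add(LSTM(..., return_sequences=True))  # decoder
model.add(TimeDistributed(Dense(...)))       # shared output layer
```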

To summarize, the RepeatVector is used as an adapter to fit the fixed-sized 2D output of the encoder to the differing length and 3D input expected by the decoder. The TimeDistributed wrapper allows the same output layer to be reused for each element in the output sequence.
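For a version that can be run end to end, the sketch below fills in the elided arguments with illustrative sizes. All of the numbers (6 input steps, 3 output steps, 1 feature, 32 units) and the loss and optimizer are assumptions for this example, not recommendations; an explicit `Input` layer is used in place of `input_shape`.

```python
from keras.models import Sequential
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense

# Illustrative sizes only (assumptions, not from the post).
n_steps_in, n_steps_out, n_features, n_units = 6, 3, 1, 32

model = Sequential()
model.add(Input(shape=(n_steps_in, n_features)))  # [time steps, features]
model.add(LSTM(n_units))                          # encoder: 2D output
model.add(RepeatVector(n_steps_out))              # repeat once per output step
model.add(LSTM(n_units, return_sequences=True))   # decoder: 3D output
model.add(TimeDistributed(Dense(n_features)))     # same Dense at each step
model.compile(loss="mse", optimizer="adam")

print(model.output_shape)  # (None, 3, 1)
```

The input has 6 time steps and the output has 3, so the input and output lengths are decoupled by the value passed to `RepeatVector`.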

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers

- Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.
- Sequence to Sequence Learning with Neural Networks, 2014.
- Show and Tell: A Neural Image Caption Generator, 2014.
- Learning to Execute, 2015.
- A Neural Conversational Model, 2015.

### Posts

- How to use an Encoder-Decoder LSTM to Echo Sequences of Random Integers
- Learn to Add Numbers with an Encoder-Decoder LSTM Recurrent Neural Network
- Attention in Long Short-Term Memory Recurrent Neural Networks

## Summary

In this post, you discovered the Encoder-Decoder LSTM architecture for sequence-to-sequence prediction.

Specifically, you learned:

- The challenge of sequence-to-sequence prediction.
- The Encoder-Decoder architecture and the limitation in LSTMs that it was designed to address.
- How to implement the Encoder-Decoder LSTM model architecture in Python with Keras.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Hi Jason

After applying model.add(TimeDistributed(Dense(…))) what kind of output we will receive?

Regards

Great question!

You will get the output from the Dense at each time step that the wrapper receives as input.

Many thanks

That means we will receive a fixed length output as well?

Given an input sequence of the same size but different meaning?

For example

I like this – input_1

and

I bought car – input_2

Will end up as 2 sequences of the same length in any case?

No, the input and output sequences can be different lengths.

The RepeatVector allows you to specify the fixed-length of the output vector, e.g. the number of times to repeat the fixed-length encoding of the input sequence.

Hi Jason, I had a great time reading your post.

One question: a common method to implement seq2seq is to use the encoder’s output as the initial value for the decoder inner state. Each token the decoder outputs is then fed back as input to the decoder.

In your method you do the other way around, where the encoder’s output is fed as input at each time step.

I wonder if you tried both methods, and if you did – which one worked better for you?

Thanks

Thanks Yoel. Great question!

I have tried the other method, and I had to contrive the implementation in Keras (e.g. it was not natural). This approach is also required when using attention. I hope to cover this in more detail in the future.

As for better, I’m not sure. The simple approach above can get you a long way for seq2seq applications in my experience.

Excellent description Jason. I wonder if you have some examples on graph analysis using Keras.

Regards

M.B.

Sorry, I do not.

Hi thanks for your great blog. I have a question.

I wonder if this example is a simplified version of the encoder-decoder, because I didn’t find a shifted output vector for the decoder in your code.

thank you

It is a simplified version. See here for a more sophisticated version:

https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

Hi Jason, thank you for this great post.

I was wondering why you haven’t used “return_sequences=true” for the encoder, instead of repeating the last value multiple times?

Thanks.

It would not give fine-grained control over the length of the output sequence.

Nevertheless, if you have some ideas, please try them and see, let me know how you go.

To elaborate – if we use return_sequences = true, then we get the output from each encoder time step, which is only a partial encoding of whatever the encoder has seen up to that point in time. The advantage of using the final encoded state across all output time steps is to have a fully encoded state over the entire input sequence.

Sounds good!

Is there a simple github example illustrating this code in practice?

I have code on my blog, try the search feature.

Hi Jason, thanks for such an intuitive post. One small question: what would y_train be if we are using an autoencoder for feature extraction? Will it be the sequence {x1, x2, x3..xT} or {xT+1, xT+2,…xT+f} or just {xT+1}?

The y variable would not be affected if you used feature extraction on the input variables.

My bad, I did not frame the question properly. My question was that if a separate autoencoder block is used for feature extraction and the output from the encoder is then fed to another neural network for prediction of another output variable y, what would be the training output for the autoencoder block: will it be {x1,x2,x3…xT} or {xT+1, xT+2,…xT+f} or just {xT+1}, where xT is the input feature vector? I hope I am clear now.

It is really up to you.

I would have the front end model perhaps summarize each time step if there are a lot of features, or summarize the sample or a subset if there are few features.

Hi Jason,

Suppose I have to train a network that summarizes text data. I’ve collected a dataset of text-summary pairs, but I’m a bit confused about the training. The source text contains ‘M’ words while the summary text contains ‘N’ words (M > N). How do I do the training?

I have a suite of posts on the topic, for example:

https://machinelearningmastery.com/?s=text+summarization&submit=Search

Yeah.. I read almost all of them. Great work there. <3.

But I'm still doubtful about the training.

As mentioned in this post (https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/), do the source-text length and summary-text length have to be fixed before creating the model? The training datasets don't have fixed-length source-summary pairs.

Any push towards the right direction would be really helpful.

Yes, you must choose their lengths then pad all data to those lengths.
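A minimal sketch of that padding step in plain Python is shown below. The helper `pad_sequence`, the chosen lengths, the padding value of 0, and padding on the right are all assumptions for illustration; the token ids are made up.

```python
def pad_sequence(tokens, length, pad_value=0):
    """Truncate or right-pad a list of token ids to exactly `length` items."""
    clipped = tokens[:length]
    return clipped + [pad_value] * (length - len(clipped))


# Toy token ids for two source texts and their shorter summaries.
sources = [[4, 8, 15, 16, 23], [42, 7]]
summaries = [[4, 23], [42]]

max_source_len, max_summary_len = 4, 2  # lengths chosen up front
padded_sources = [pad_sequence(s, max_source_len) for s in sources]
padded_summaries = [pad_sequence(s, max_summary_len) for s in summaries]

print(padded_sources)
print(padded_summaries)
```

After this step, every source has the same length and every summary has the same length, even though the two lengths differ from each other.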

Hi Jason really nice posts on your blog. I am trying to predict time series but want also a dimensionality reduction to learn the most important features from my signal. the code looks like:

```python
from keras.layers import Input, LSTM, RepeatVector
from keras.models import Model

input_dim = 1
timesteps = 256
samples = 17248
batch_size = 539
n_dimensions = 64

inputs = Input(shape=(timesteps, input_dim))
encoded = LSTM(n_dimensions, activation='relu', return_sequences=False, name="encoder")(inputs)
decoded = RepeatVector(timesteps)(encoded)
decoded = LSTM(input_dim, activation='linear', return_sequences=True, name='decoder')(decoded)

autoencoder = Model(inputs, decoded)
encoder = Model(inputs, encoded)
```

With which do I have to predict when I want a dimensionality reduction of my data: autoencoder.predict or encoder.predict?

Sorry, I cannot debug your code, perhaps post your code and error to stackoverflow?

Hi Jason:

Gave a go to your encoder code. It looks like below:

```python
model = Sequential()
model.add(LSTM(200, input_shape=(n_lag, numberOfBrands)))
model.add(RepeatVector(n_lag))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(numberOfBrands)))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mae'])
```

After that when I execute the following line:

history = model.fit(train_X, train_y, epochs=200, batch_size=5, verbose=2), I get the following error:

Error when checking target: expected time_distributed_5 to have 3 dimensions, but got array with shape (207, 30)

I know why. It complains about train_y which has a shape of 207,30! What’s the trick to re-shape Y here to cater for 3D output?

Thanks

It looks like a mismatch between your data and the model’s expectations. You can either reshape your data or change the input expectations of the model.

For more help with preparing data for the LSTM see this:

https://machinelearningmastery.com/faq/single-faq/how-do-i-prepare-my-data-for-an-lstm

Indeed, the problem is with the shape. Specifically with the RepeatVector layer. What’s the rule to determine the argument for RepeatVector layer? In my model I have passed the time lag, which is clearly not right.

The number of output time steps.

OK so clearly that’s 1 for me, as I am trying to output ‘numberOfBrands’ number of values for one time step i.e. t. Shaping the data is the real challenging part :).

Thanks a lot. You are really helpful
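For readers who hit the same error, the target reshape discussed in this thread might look like the sketch below. It assumes one output time step, matching `RepeatVector(1)`, and uses a placeholder array with the (207, 30) shape reported in the error message; the variable names are hypothetical.

```python
import numpy as np

# Placeholder target with the 2D shape from the error message.
train_y = np.zeros((207, 30))

# Add a time-step axis: [samples, features] -> [samples, 1, features].
train_y_3d = train_y.reshape((train_y.shape[0], 1, train_y.shape[1]))
print(train_y_3d.shape)  # (207, 1, 30)
```

The middle dimension has to equal the number of output time steps, which is the same value passed to `RepeatVector`.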

Hey Jason, thank you for a great post! I just have a question, I am trying to build an encoder-decoder LSTM that takes a window of dimension (4800, 15, 39) as input, gives me an encoded vector to which I apply RepeatVector(15) and finally use the repeated encoded vector to output a prediction of the input which is similar to what you are doing in this post. However, I read your other post (https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/) and I can’t figure out the difference between the 2 techniques and which one would be valid in my case.

Thanks a lot, I’m a big fan of your blog.

Well done Christian.

Good question.

The RepeatVector approach is not a “true” encoder decoder, but emulates the behaviour and gives similar skill in my experience. It is more of an auto-encoder type model.

The tutorial you link to is a “true” autoencoder as described in the 2014/2015/etc. papers, but it is a total pain to implement in Keras.

The main difference is the use of the internal state from the encoder seeding the state of the decoder.

If you’re concerned, perhaps try both approaches and use the one that gives better skill.
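As a sketch of the state-seeding idea mentioned in this reply, the Keras functional API can pass the encoder's final states to the decoder through `initial_state`. The layer sizes and feature counts below are assumptions, and this shows only the training-time wiring, not a full inference loop.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# Illustrative sizes only (assumptions).
n_in_features, n_out_features, n_units = 5, 7, 16

# Encoder: keep only the final hidden and cell states.
encoder_inputs = Input(shape=(None, n_in_features))
_, state_h, state_c = LSTM(n_units, return_state=True)(encoder_inputs)

# Decoder: starts from the encoder states and reads the shifted target sequence.
decoder_inputs = Input(shape=(None, n_out_features))
decoder_seq = LSTM(n_units, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c]
)
decoder_outputs = Dense(n_out_features, activation="softmax")(decoder_seq)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print(model.output_shape)  # (None, None, 7)
```

In contrast to the RepeatVector model, the encoding reaches the decoder as its starting state rather than as a repeated input at every time step.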

Hello Jason, thank you for the prompt reply.

I assume you mean the other tutorial I linked to is a “true” encoder decoder* since you said that using the RepeatVector is more of an autoencoder model. Am I correct?

As for the difference between the models, the encoder-decoder LSTM model uses the internal states and the encoded vector to predict the first output vector and then uses that predicted vector to predict the following one and so on. But the autoencoder technique uses solely the predicted vector.

If that is indeed the case, how is the second method (RepeatVector) able to predict the time sequence only by looking at the encoded vector?

I already implemented this technique and it’s giving very good results so I don’t think I’m going to go through the hassle of the encoder-decoder implementation because it has been giving me a lot of trouble

Thanks again!

Correct.

It develops a compressed representation of the input sequence that is interpreted by the decoder.

Hi Jason, I used this model as well for the prediction, and I am also getting really good results. But what confuses me is whether I should use the other one, the true encoder-decoder, because: I train this one with a part, e.g. AAA, of one big dataset, and I just want the model to predict this little part, therefore I used subsequences. Then when I predict on all the data, it should only predict the same part as before and not the rest of the data. Let’s say it should only predict AAA and not the whole data AAABBB; the result should look like AAA AAA.

Sorry, I’m not sure I follow. Can you summarize the inputs/outputs in a sentence?

I try to explain it with an example: train input: AAA, prediction input: AAA-BBB, prediction output: AAA-BBB ( but I was expecting only AAA-AAA). So the model should not predict right. does it have to do something with stateful and reset states ? Or using one submodel for training and one submodel for prediction ?

Perhaps it is the framing of your problem?

Perhaps the chosen model?

Perhaps the chosen config?

…

I know there is many try and fix, many perhaps points… I tested the model the other way around with random data and achieved my expected input/output. I chose already different framings, but did not try yet stateful and reset states , do you think it is worth to try it ?

Yes, if testing modes is cheap/fast.

I also read your post https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/#comment-438058 maybe this would solve my problem ?

Hi Christian, can you share the code which you used to implement that techniques, email (gwkibirige@gmail.com)

Do you have a similar example in R Keras? Because I don’t understanding TimeDistributed. Thank you

Sorry, I don’t have examples of Keras in R.

I have more on TimeDistributed here:

https://machinelearningmastery.com/timedistributed-layer-for-long-short-term-memory-networks-in-python/

Dear Jason,

Thanks for your nice posts. Have you one on generative (variational) LSTM auto-encoder to able one to capture latent vectors and then generate new timeseries based on training set samples?

Best

I believe I have a great post on LSTM autoencoders scheduled. Not variational autoencoders though.

What do you need help with exactly?

thank you very much

You’re welcome.

A small question, which may be a very obvious one 😀

You said that 1 means unit size. Does it mean the number of LSTM cells in that layer?

As far as I know, the number of LSTM cells in the input layer depends on the number of time steps in the input sequence. Am I correct?

If so, what does 1 actually mean?

The input defines the number of time steps and features to expect for each sample.

The number of units in first hidden layer of the LSTM is unrelated to the input.

Keras defines the input using an argument called input_shape on the first hidden layer, perhaps this is why you are confused?

Jason,

Can you please give an example of how to shape multivariate time series data for a LSTM autoencoder. I have been trying to do this, and failed badly. Thank you.

Perhaps this tutorial will help:

https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

Hi Jason, love your blogs.

I was trying to grasp the Show and Tell paper on image captioning, (https://arxiv.org/pdf/1411.4555.pdf)

(The input image is encoded to a vector and sent as input to a LSTM model )

There they mention:

“We empirically verified that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfits more easily”

under the LSTM training section.

Can you provide some explanation of what this means? Do they remove the TimeDistributed layer, or is there some other explanation?

This could mean that they tried providing the image as they generated each output word and it did not help.