How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras

The encoder-decoder architecture for recurrent neural networks is proving to be powerful on a host of sequence-to-sequence prediction problems in the field of natural language processing such as machine translation and caption generation.

Attention is a mechanism that addresses a limitation of the encoder-decoder architecture on long sequences, and that in general speeds up learning and lifts the skill of the model on sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

After completing this tutorial, you will know:

  • How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
  • How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
  • How to robustly compare the performance of encoder-decoder networks with and without attention.

Let’s get started.


Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. Encoder-Decoder with Attention
  2. Test Problem for Attention
  3. Encoder-Decoder without Attention
  4. Custom Keras Attention Layer
  5. Encoder-Decoder with Attention
  6. Comparison of Models

Python Environment

This tutorial assumes you have a Python 3 SciPy environment installed.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

Encoder-Decoder with Attention

The encoder-decoder model for recurrent neural networks is an architecture for sequence-to-sequence prediction problems.

It is comprised of two sub-models, as its name suggests:

  • Encoder: The encoder is responsible for stepping through the input time steps and encoding the entire sequence into a fixed length vector called a context vector.
  • Decoder: The decoder is responsible for stepping through the output time steps while reading from the context vector.

A problem with the architecture is that performance is poor on long input or output sequences. This is believed to be due to the fixed-length internal representation used by the encoder.

Attention is an extension to the architecture that addresses this limitation. It works by providing a richer context from the encoder to the decoder, along with a learning mechanism by which the decoder can learn where to pay attention in that richer encoding when predicting each time step of the output sequence.

For more on attention in the encoder-decoder architecture, see the posts:

Test Problem for Attention

Before we develop models with attention, we will first define a contrived scalable test problem that we can use to determine whether attention is providing any benefit.

In this problem, we will generate sequences of random integers as input and matching output sequences comprised of a subset of the integers in the input sequence.

For example, an input sequence might be [1, 6, 2, 7, 3] and the expected output sequence might be the first two integers of the input sequence, [1, 6].

We will define the problem such that the input and output sequences are the same length and pad the output sequences with “0” values as needed.

First, we need a function to generate sequences of random integers. We will use the Python randint() function to generate random integers between 0 and a maximum value and use this range as the cardinality for the problem (e.g. the number of features or an axis of difficulty).

The function generate_sequence() below will generate a random sequence of integers to a fixed length and with the specified cardinality.
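The original listing is easy to sketch; a minimal version consistent with the description, using randint() from Python's random module, might look like the following:

from random import randint

# generate a sequence of random integers with values in [0, n_unique-1]
def generate_sequence(length, n_unique):
    return [randint(0, n_unique-1) for _ in range(length)]

# generate and print a sequence of 5 time steps with a cardinality of 50
sequence = generate_sequence(5, 50)
print(sequence)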

Running this example generates a sequence of 5 time steps where each value in the sequence is a random integer between 0 and 49.

Next, we need a function to one hot encode the discrete integer values into binary vectors.

If a cardinality of 50 is used, then each integer will be represented by a 50-element vector of 0 values and 1 in the index of the specified integer value.

The one_hot_encode() function below will one hot encode a given sequence of integers.
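A minimal sketch of this function, building each binary vector with plain Python lists and returning a NumPy array:

from numpy import array

# one hot encode a sequence of integers
def one_hot_encode(sequence, n_unique):
    encoding = list()
    for value in sequence:
        vector = [0 for _ in range(n_unique)]
        vector[value] = 1
        encoding.append(vector)
    return array(encoding)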

We also need to be able to decode an encoded sequence. This will be needed to turn a prediction from the model or an encoded expected sequence back into a sequence of integers we can read and evaluate.

The one_hot_decode() function below will decode a one hot encoded sequence back into a sequence of integers.
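A minimal sketch of this function, using argmax() from NumPy to recover the integer index of each vector:

from numpy import argmax

# decode a one hot encoded sequence back to a list of integers
def one_hot_decode(encoded_seq):
    return [argmax(vector) for vector in encoded_seq]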

We can test out these operations in the example below.
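Assuming the three functions sketched above, the test might look like this:

# generate a random sequence of 5 integers with a cardinality of 50
sequence = generate_sequence(5, 50)
print(sequence)
# one hot encode the sequence
encoded = one_hot_encode(sequence, 50)
print(encoded)
# decode the encoded sequence back to integers
decoded = one_hot_decode(encoded)
print(decoded)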

Running the example first prints a randomly generated sequence, then the one hot encoded version, then finally the decoded sequence again.

Finally, we need a function that can create input and output pairs of sequences to train and evaluate a model.

The function below named get_pair() will return one input-output sequence pair given a specified input length, output length, and cardinality. Both sequences have the same length (the length of the input sequence), but the output sequence is taken as the first n elements of the input sequence and padded with zero values to the required length.

The sequences of integers are then one hot encoded and reshaped into the 3D format required by the recurrent neural network, with the dimensions: samples, time steps, and features. In this case, samples is always 1 because we generate only one input-output pair at a time, time steps is the input sequence length, and features is the cardinality of each time step.
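A sketch of get_pair(), assuming the generate_sequence() and one_hot_encode() functions defined above:

# prepare a single input-output pair for the LSTM
def get_pair(n_in, n_out, cardinality):
    # generate a random input sequence
    sequence_in = generate_sequence(n_in, cardinality)
    # the output is the first n_out elements of the input, zero-padded to the input length
    sequence_out = sequence_in[:n_out] + [0 for _ in range(n_in - n_out)]
    # one hot encode both sequences
    X = one_hot_encode(sequence_in, cardinality)
    y = one_hot_encode(sequence_out, cardinality)
    # reshape as 3D: [samples, time steps, features]
    X = X.reshape((1, X.shape[0], X.shape[1]))
    y = y.reshape((1, y.shape[0], y.shape[1]))
    return X, y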

We can put this all together and demonstrate the data preparation code.
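A short demonstration, assuming the functions sketched above:

# generate a single pair: input length 5, output length 2, cardinality 50
X, y = get_pair(5, 2, 50)
print(X.shape, y.shape)
print('X=%s, y=%s' % (one_hot_decode(X[0]), one_hot_decode(y[0])))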

Running the example generates a single input-output pair and prints the shape of both arrays.

The generated pair is then printed in a decoded form where we can see that the first two integers of the sequence are reproduced in the output sequence followed by a padding of zero values.

Encoder-Decoder Without Attention

In this section, we will develop a baseline in performance on the problem with an encoder-decoder model without attention.

We will fix the problem definition at input and output sequences of 5 time steps, the first 2 elements of the input sequence in the output sequence and a cardinality of 50.

We can develop a simple encoder-decoder model in Keras by taking the output from an encoder LSTM model, repeating it n times for the number of timesteps in the output sequence, then using a decoder to predict the output sequence.

For more detail on how to define an encoder-decoder architecture in Keras, see the post:

We will configure the encoder and decoder with the same number of units, in this case 150. We will use the efficient Adam implementation of gradient descent and optimize the categorical cross entropy loss function, given that the problem is technically a multi-class classification problem.

The configuration for the model was found after a little trial and error and is by no means optimized.

The code for an encoder-decoder architecture in Keras is listed below.
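A sketch of the model definition follows; the problem configuration variables are included so the snippet is self-contained, matching the problem definition above (5 time steps in and out, a cardinality of 50).

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, RepeatVector

# problem configuration
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2

# define the encoder-decoder model without attention
model = Sequential()
# encoder: reads the input sequence and encodes it into a fixed-length vector
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
# repeat the encoding once for each output time step
model.add(RepeatVector(n_timesteps_in))
# decoder: reads the repeated encoding and outputs a prediction for each time step
model.add(LSTM(150, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])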

We will train the model on 5,000 random input-output pairs of integer sequences.

Once trained, we will evaluate the model on 100 new randomly generated integer sequences and only mark a prediction correct when the entire output sequence matches the expected value.

Finally, we will print 10 examples of expected output sequences and sequences predicted by the model.

Putting all of this together, the complete example is listed below.
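The complete example combines the data preparation functions, the problem configuration, and the model defined above with the training and evaluation code sketched below (array_equal is from NumPy):

from numpy import array_equal

# train the model: one randomly generated pair per epoch, 5,000 epochs
for epoch in range(5000):
    X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    model.fit(X, y, epochs=1, verbose=2)

# evaluate the model on 100 new random sequences, counting only exact matches
total, correct = 100, 0
for _ in range(total):
    X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    yhat = model.predict(X, verbose=0)
    if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
        correct += 1
print('Accuracy: %.2f%%' % (float(correct) / float(total) * 100.0))

# spot check 10 examples
for _ in range(10):
    X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
    yhat = model.predict(X, verbose=0)
    print('Expected:', one_hot_decode(y[0]), 'Predicted:', one_hot_decode(yhat[0]))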

Running this example will not take long, perhaps a few minutes on the CPU; no GPU is required.

The accuracy of the model was reported at just under 20%. Your results will vary given the stochastic nature of neural networks; consider running the example a few times and taking the average.

We can see from the sample outputs that the model does get one number in the output sequence correct for most or all cases, and only struggles with the second number. All zero padding values are predicted correctly.

Custom Keras Attention Layer

Now we need to add attention to the encoder-decoder model.

At the time of writing, Keras does not have the capability of attention built into the library, but it is coming soon.

Until attention is officially available in Keras, we can either develop our own implementation or use an existing third-party implementation.

To speed things up, let’s use an existing third-party implementation.

Zafarali Ahmed, an intern at Datalogue, developed a custom layer for Keras that provides support for attention, presented in a 2017 post titled “How to Visualize Your Recurrent Neural Network with Attention in Keras” and a GitHub project called “keras-attention”.

The custom attention layer is called AttentionDecoder and is available in the custom_recurrents.py file in the GitHub project. We can reuse this code under the project's GNU Affero General Public License v3.0.

Copy the custom layer from the custom_recurrents.py file in the GitHub project and paste it into a new and separate file in your current working directory called ‘attention_decoder.py‘.

We can make use of this custom layer in our projects by importing it as follows:
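from attention_decoder import AttentionDecoder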

The layer implements attention as described by Bahdanau, et al. in their paper “Neural Machine Translation by Jointly Learning to Align and Translate.”

The code is explained well in the original post and linked to both the LSTM and attention equations.

A limitation of this implementation is that it must output sequences of the same length as the input sequences, which is the specific limitation that the encoder-decoder architecture was designed to overcome.

Importantly, the new layer manages both the repeating of the decoding as performed by the second LSTM, as well as the softmax output for the model as was performed by the Dense output layer in the encoder-decoder model without attention. This greatly simplifies the code for the model.

It is important to note that the custom layer is built upon the Recurrent layer in Keras, which, at the time of writing, is marked as legacy code, and presumably will be removed from the project at some point.

Encoder-Decoder With Attention

Now that we have an implementation of attention that we can use, we can develop an encoder-decoder model with attention for our contrived sequence prediction problem.

The model with the attention layer is defined below. We can see that the layer handles some of the machinery of the encoder-decoder model itself, making defining the model simpler.
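A sketch of this model, assuming the same problem configuration as before and the AttentionDecoder import shown above:

from keras.models import Sequential
from keras.layers import LSTM
from attention_decoder import AttentionDecoder

# define the encoder-decoder model with attention
model = Sequential()
# encoder: return the full sequence of hidden states so the attention decoder can attend over them
model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
# attention decoder: handles the decoding and the softmax output in one layer
model.add(AttentionDecoder(150, n_features))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])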

That’s it. The rest of the example is the same.

The complete example is listed below.

Running the example prints the skill of the model on 100 randomly generated input-output pairs. With the same resources and same amount of training, the model with attention performs much better.

Your results may vary given the stochastic nature of neural networks. Try running the example a few times.

Spot-checking some sample outputs and predicted sequences, we can see very few errors, even in cases when there is a zero value in the first two elements.

Comparison of Models

Although we are getting better results from the model with attention, the results were reported from a single run of each model.

In this case, we seek a more robust finding by repeating the evaluation of each model multiple times and reporting the average performance over those runs. For more information on this robust approach to evaluating neural network models, see the post:

We can define a function to create each type of model, as follows.
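A sketch of these two functions, wrapping the model definitions from the previous sections:

# define the encoder-decoder model without attention
def baseline_model(n_timesteps_in, n_features):
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
    model.add(RepeatVector(n_timesteps_in))
    model.add(LSTM(150, return_sequences=True))
    model.add(TimeDistributed(Dense(n_features, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model

# define the encoder-decoder model with attention
def attention_model(n_timesteps_in, n_features):
    model = Sequential()
    model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
    model.add(AttentionDecoder(150, n_features))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
    return model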

We can then define a function to fit and evaluate the accuracy of a fit model and return the accuracy score.
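A sketch of this function, reusing get_pair() and one_hot_decode() from the data preparation code, with array_equal from NumPy:

# train a model on 5,000 random pairs and return its accuracy on 100 new sequences
def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
    # train LSTM, one randomly generated pair per epoch
    for epoch in range(5000):
        X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        model.fit(X, y, epochs=1, verbose=0)
    # evaluate LSTM on 100 new random sequences
    total, correct = 100, 0
    for _ in range(total):
        X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
        yhat = model.predict(X, verbose=0)
        if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
            correct += 1
    return float(correct) / float(total) * 100.0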

Putting this together, we can repeat the process of creating, training, and evaluating each type of model multiple times and reporting the mean accuracy over the repeats. To keep running times down, we will repeat each model evaluation 10 times, although if you have the resources, you could increase this to 30 or 100 times.

The complete example is listed below.
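The complete example combines the data preparation functions from earlier with the two helper functions above and a driver along the lines of the following sketch:

# configure problem
n_features = 50
n_timesteps_in = 5
n_timesteps_out = 2
n_repeats = 10

# evaluate the encoder-decoder model without attention
print('Encoder-Decoder Model')
results = list()
for _ in range(n_repeats):
    model = baseline_model(n_timesteps_in, n_features)
    accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
    results.append(accuracy)
    print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results) / float(n_repeats)))

# evaluate the encoder-decoder model with attention
print('Encoder-Decoder With Attention Model')
results = list()
for _ in range(n_repeats):
    model = attention_model(n_timesteps_in, n_features)
    accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
    results.append(accuracy)
    print(accuracy)
print('Mean Accuracy: %.2f%%' % (sum(results) / float(n_repeats)))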

Running this example prints the accuracy for each model repeat to give you an idea of the progress of the run.

We can see that, even averaged over 10 runs, the attention model still shows much better performance than the encoder-decoder model without attention: 95.70% versus 23.10%.

A good extension to this evaluation would be to capture the model loss each epoch for each model, take the average, and compare how the loss changes over time for the architecture with and without attention. I expect that this trace would show attention achieving better skill much faster than the non-attentional model, further highlighting the benefit of the approach.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network with attention in Python with Keras.

Specifically, you learned:

  • How to design a small and configurable problem to evaluate encoder-decoder recurrent neural networks with and without attention.
  • How to design and evaluate an encoder-decoder network with and without attention for the sequence prediction problem.
  • How to robustly compare the performance of encoder-decoder networks with and without attention.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



35 Responses to How to Develop an Encoder-Decoder Model with Attention for Sequence-to-Sequence Prediction in Keras

  1. Chetan October 17, 2017 at 6:11 am #

    The timing of this post couldn’t have been more accurate. I’ve spent hours and days on google looking for a reliable Keras implementation of attention. Can’t wait to test this on my specific problem definition. Thanks a ton Jason!

    • Jason Brownlee October 17, 2017 at 4:03 pm #

      I’m glad to hear that Chetan!

      Let me know how you go.

  2. ChrisJew October 17, 2017 at 10:35 pm #

    test soft

  3. Mateo October 18, 2017 at 11:30 pm #

    Thank you for this post!

    Unfortunately the kernel crashes on my laptop! I don’t know why (no RAM issues)
    I use Keras==2.0.8 and TF==1.3.0

  4. Ravi Annaswamy October 20, 2017 at 7:09 pm #

    Jason, very nice tutorial on probably the most important and most powerful neural application architecture (seq2seq with attention – since it is equivalent to a self programming turing machine – it sees an input stream of symbols, then can move back and forth using attention and write out a stream of symbols).

    In fact theoretically it is super-turing, because it works with continuous (real) representation instead of Turing symbolic notation. google ‘recurrent networks super turing’ for proofs.

    I am looking forward to attention being integrated into Keras and your revised code later, but no one can match your ability to setup the problem, generate data, explain step by step.. Keep up the great work.

    Ravi Annaswamy

    • Jason Brownlee October 21, 2017 at 5:29 am #

      Thanks Ravi, I really appreciate your support! You made my day 🙂

  5. Ravi Annaswamy October 20, 2017 at 8:12 pm #

    Jason, I think in order to show the power of sequence mapping, we need to try two things:
    1. The input sequence should be of variable length (not always 5). For example you can make it a length of 10 max, but it should generate sequences of any length between say 4 to 10 (remaining zeros).
    2. The output should not be just zeroing of values, but more complex output for example, the first and last non zero value of the sequence…

    • Ravi Annaswamy October 20, 2017 at 8:15 pm #

      something like the example built here:
      https://talbaumel.github.io/attention/

      • Ravi Annaswamy October 20, 2017 at 9:49 pm #

        I am working on a modification of your excellent code, to illustrate this extended task, will post shortly.

    • Jason Brownlee October 21, 2017 at 5:34 am #

      Yes, you could easily modify the above example to achieve these requirements.

  6. Ravi Annaswamy October 20, 2017 at 10:25 pm #

    Dr.Jason,

    You have done an excellent application and framework code.

    I wanted to expose the great value of this architecture and modularity of
    this code by attempting a harder problem. Harder in two ways:

    First we want to make the input sequence variable length from example to example.

    Second, we want the output to be one that requires attention and long term memory,
    across the length!

    So we come up with this task:

    Given a input sequence which is variable length with zero padding…
    [6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
    I wanted the network to pick out and output the first and last non-zero of the series
    [6, 4, 0, 0, 0, 0, 0, 0, 0, 0]

    To make it even more interesting task for memory, we want it to output
    the two numbers in reverse order:

    input:
    [6, 8, 7, 2, 2, 6, 6, 4, 0, 0]
    output
    [4, 6, 0, 0, 0, 0, 0, 0, 0, 0]

    This would require that the algorithm figure out that we are selecting the first and last of the sequence,
    and then writing out them in reverse order! It really needs some kind of a turing machine that can
    go back and forth on the sequence and decide when to write what! Can the seq2seq with attention LSTM do this?
    Let us try out.

    Here are few more training cases created:
    [5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
    [2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
    [9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]

    I made the following changes to your excellent code to make this possible:

    1. In order to use 0 as the padding character, we make the unique letters from 1 to n_unique.

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-2)+1 for _ in range(length)]

    I think in your original code also you should adopt the above mechanism so that 0 is reserved as padding
    symbol and generated sequence only contains 1 to n_unique. I think this will increase accuracy to 100% in your tests too.

    2. In order to simplify the domain, for faster training, I restricted the range of values:

    n_features = 8
    n_timesteps_in = 10
    n_timesteps_out = 2

    That is the input has a max of 10 positions but anywhere between 4 to 9 of these could be nonzero sequence, as shown below.
    The input only uses an alphabet of 8 numbers instead of the 50 you used.

    3. Correspondingly the get_pair was modified to generate the series above:

    # prepare data for the LSTM
    def get_pair(n_in, n_out, cardinality, verbose=False):  # edited this to add verbose flag
        # generate random sequence
        sequence_in = generate_sequence(n_in, cardinality)
        real_length = randint(4, n_in-1)  # i added this
        sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)]  # i added this
        sequence_out = [sequence_in[real_length-1]] + [sequence_in[0]] + [0 for _ in range(n_in-2)]  # i edited this
        if verbose:  # added this for testing
            print(sequence_in, sequence_out)  # added this
        # one hot encode
        X = one_hot_encode(sequence_in, cardinality)
        y = one_hot_encode(sequence_out, cardinality)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X, y

    4. With these changes:
    for _ in range(5):
        a = get_pair(10, 2, 10, verbose=True)

    generates:

    [6, 8, 7, 2, 2, 6, 6, 4, 0, 0] [4, 6, 0, 0, 0, 0, 0, 0, 0, 0]
    [5, 5, 3, 3, 2, 0, 0, 0, 0, 0] [2, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [4, 7, 7, 4, 3, 9, 0, 0, 0, 0] [9, 4, 0, 0, 0, 0, 0, 0, 0, 0]
    [2, 6, 7, 6, 5, 0, 0, 0, 0, 0] [5, 2, 0, 0, 0, 0, 0, 0, 0, 0]
    [9, 8, 2, 8, 8, 7, 9, 1, 5, 0] [5, 9, 0, 0, 0, 0, 0, 0, 0, 0]

    5. Result of training on this dataset:
    Encoder-Decoder Model
    20.0
    12.0
    18.0
    19.0
    9.0
    10.0
    16.0
    12.0
    12.0
    11.0

    Encoder-Decoder With Attention Model
    100.0
    100.0
    100.0
    100.0
    100.0
    100.0
    100.0
    100.0
    100.0
    100.0

    Yes!

    This shows the capacity of recurrent neural models to learn arbitrary programs from example input and output pairs!
    Of course, one can increase length of the sequence and also the n_unique to make the task harder, but I do not expect
    dramatic failure as we gradually increase to reasonable values.

    I am really very happy that you put together this excellent example. Please feel free to add this extension application to your excellent article/books if it will add value. Also please review the changes to make sure I have not made any errors.

    The only complaint I have is that the keras implementation of attention is very slow. (I think the pytorch implementation will be
    far faster because of avoiding a few layers of abstraction..but I may be wrong, will try it..)

    Ravi

    Attached the complete code for reproducibility:

    from random import randint
    from numpy import array
    from numpy import argmax
    from numpy import array_equal
    from keras.models import Sequential
    from keras.layers import LSTM
    from keras.layers import Dense
    from keras.layers import TimeDistributed
    from keras.layers import RepeatVector
    from attention_decoder import AttentionDecoder

    # generate a sequence of random integers
    def generate_sequence(length, n_unique):
        return [randint(0, n_unique-2)+1 for _ in range(length)]

    # one hot encode sequence
    def one_hot_encode(sequence, n_unique):
        encoding = list()
        for value in sequence:
            vector = [0 for _ in range(n_unique)]
            vector[value] = 1
            encoding.append(vector)
        return array(encoding)

    # decode a one hot encoded string
    def one_hot_decode(encoded_seq):
        return [argmax(vector) for vector in encoded_seq]

    # prepare data for the LSTM
    def get_pair(n_in, n_out, cardinality, verbose=False):
        # generate random sequence
        sequence_in = generate_sequence(n_in, cardinality)
        real_length = randint(4, n_in-1)
        sequence_in = sequence_in[:real_length] + [0 for _ in range(n_in-real_length)]
        sequence_out = [sequence_in[real_length-1]] + [sequence_in[0]] + [0 for _ in range(n_in-2)]
        if verbose:
            print(sequence_in, sequence_out)
        # one hot encode
        X = one_hot_encode(sequence_in, cardinality)
        y = one_hot_encode(sequence_out, cardinality)
        # reshape as 3D
        X = X.reshape((1, X.shape[0], X.shape[1]))
        y = y.reshape((1, y.shape[0], y.shape[1]))
        return X, y

    # define the encoder-decoder model
    def baseline_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features)))
        model.add(RepeatVector(n_timesteps_in))
        model.add(LSTM(150, return_sequences=True))
        model.add(TimeDistributed(Dense(n_features, activation='softmax')))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
        return model

    # define the encoder-decoder with attention model
    def attention_model(n_timesteps_in, n_features):
        model = Sequential()
        model.add(LSTM(150, input_shape=(n_timesteps_in, n_features), return_sequences=True))
        model.add(AttentionDecoder(150, n_features))
        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
        return model

    # train and evaluate a model, return accuracy
    def train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features):
        # train LSTM
        for epoch in range(5000):
            # generate new random sequence
            X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            # fit model for one epoch on this sequence
            model.fit(X, y, epochs=1, verbose=0)
        # evaluate LSTM
        total, correct = 100, 0
        for _ in range(total):
            X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features)
            yhat = model.predict(X, verbose=0)
            if array_equal(one_hot_decode(y[0]), one_hot_decode(yhat[0])):
                correct += 1
        return float(correct)/float(total)*100.0

    # configure problem
    n_features = 8
    n_timesteps_in = 10
    n_timesteps_out = 2
    n_repeats = 10

    # evaluate encoder-decoder model
    print('Encoder-Decoder Model')
    results = list()
    for _ in range(n_repeats):
        model = baseline_model(n_timesteps_in, n_features)
        accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
        results.append(accuracy)
        print(accuracy)
    print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

    # evaluate encoder-decoder with attention model
    print('Encoder-Decoder With Attention Model')
    results = list()
    for _ in range(n_repeats):
        model = attention_model(n_timesteps_in, n_features)
        accuracy = train_evaluate_model(model, n_timesteps_in, n_timesteps_out, n_features)
        results.append(accuracy)
        print(accuracy)
    print('Mean Accuracy: %.2f%%' % (sum(results)/float(n_repeats)))

  7. ravi annaswamy October 20, 2017 at 10:44 pm #

    here is verbose evaluation

    for _ in range(5):
        X, y = get_pair(n_timesteps_in, n_timesteps_out, n_features, verbose=True)
        yhat = model.predict(X, verbose=0)
        print(one_hot_decode(yhat[0]))

    [5, 5, 1, 6, 1, 4, 5, 0, 0, 0] [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [5, 5, 4, 7, 2, 1, 3, 0, 0, 0] [3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [3, 5, 0, 0, 0, 0, 0, 0, 0, 0]
    [3, 4, 7, 6, 3, 1, 3, 1, 1, 0] [1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 3, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 4, 1, 4, 7, 2, 2, 3, 4, 0] [4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [4, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [1, 5, 1, 4, 7, 6, 3, 7, 7, 0] [7, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    [7, 1, 0, 0, 0, 0, 0, 0, 0, 0]

  8. Meeklai October 21, 2017 at 2:57 am #

    First of all, thank you so much for this article; it was well worth reading. It clarified a lot for me about how to implement an autoencoder model in Keras.

    I just have one point of confusion that I wish you would explain. Why do you need to transform the original vector of integers into a 2D matrix containing a one hot vector for each integer? Can't you just send the original vector of integers into the encoder as input?

    Thank you again for this worthful article, Dr. Brownlee

    • Jason Brownlee October 21, 2017 at 5:43 am #

      You can, but the one hot encoding is richer and often results in better model skill.

      • Meeklai October 23, 2017 at 2:21 am #

        Thank you Dr. Brownlee. Would one hot encoding be better in a situation where the cardinality is much greater than in this example? Like fitting an encoder with lots of text documents, which would result in a huge number of encoder keys.

  9. Hendrik October 24, 2017 at 7:00 pm #

    In the case of multiple LSTM layers, is the AttentionDecoder layer supposed to appear only once after all the LSTMs, or must it be inserted after each LSTM layer?

    • Jason Brownlee October 25, 2017 at 6:44 am #

      The attention is only used directly after the encoder.

  10. Trialcritic October 25, 2017 at 8:21 am #

    Usually, when people have 5 input and 2 output steps, we use

    model.add(LSTM(size, input_shape=(n_timesteps_in, n_features)))
    model.add(RepeatVector(n_timesteps_out)) # this is different from input steps
    model.add(LSTM(size, return_sequences=True))

    This makes sense, as suggested

    “we need to repeat the single vector outputted from the encoder network to obtain a sequence which has the same length with the output sequences”.

    Wonder if this must be changed.

    • Jason Brownlee October 25, 2017 at 3:57 pm #

      Yes, the RepeatVector approach is not a pure encoder-decoder as defined in the first papers, but often performs as well or better in my experience.

  11. Aayushee November 3, 2017 at 8:17 pm #

    Hi Jason,

    Thanks for such a well explained post on this topic. You mention the limitation that output sequences are the same length as the input sequences in case of the attention encoder decoder model used.
    Could you please give an idea what should be done in an attention based model when output and input lengths are not same? I was wondering if we can use a RepeatVector(output_timesteps) in the current attention model on the encoder output and then feed it to the AttentionDecoder?

    • Jason Brownlee November 4, 2017 at 5:29 am #

      This implementation of attention cannot handle input and output sequences with different lengths, sorry.

  12. caichao November 4, 2017 at 11:43 pm #

    By running your example (the “with attention part”, I’ve gotten the following error:
    ValueError: Dimensions must be equal, but are 150 and 50 for 'AttentionDecoder/MatMul_4' (op: 'MatMul') with input shapes: [?,150], [50,150].

    • Jason Brownlee November 5, 2017 at 5:16 am #

      Ensure you have the latest version of Keras.

      • caichao November 5, 2017 at 11:51 am #

        My keras version is 2.0.2

        • Jason Brownlee November 6, 2017 at 4:48 am #

          Perhaps try 2.0.8 or higher?

          • caichao November 6, 2017 at 11:47 pm #

            also when I upgrade keras to 2.0.9
            I got the following problem

            from keras.layers.recurrent import Recurrent, _time_distributed_dense
            “unresolved reference _time_distributed_dense”

          • Jason Brownlee November 7, 2017 at 9:50 am #

            Interesting, perhaps the example requires Keras 2.0.8. This was the version I used when developing the example.

      • caichao November 5, 2017 at 12:08 pm #

        also when I upgrade keras to 2.0.9
        I got the following problem

        from keras.layers.recurrent import Recurrent, _time_distributed_dense
        “unresolved reference _time_distributed_dense”

  13. kamal November 6, 2017 at 12:54 am #

    Hi Jason. thank you for your great tutorials. I have 2 questions:

    1) is there any Dense layer after Decoder in Attention code?
    2)should features input be equal to features output or not ( their length should be equal as you mentioned)?

    thank you, again

    • Jason Brownlee November 6, 2017 at 4:53 am #

      Yes, there is normally a dense output after the decoder (or a part of the decoder).

      Features can vary. Normally/often you would have more input features than output features.
