How to Define an Encoder-Decoder Sequence-to-Sequence Model for Neural Machine Translation in Keras

The encoder-decoder model provides a pattern for using recurrent neural networks to address challenging sequence-to-sequence prediction problems, such as machine translation.

Encoder-decoder models can be developed in the Keras Python deep learning library. An example of a neural machine translation system developed with this model is described on the Keras blog, and sample code is distributed with the Keras project.

In this post, you will discover how to define an encoder-decoder sequence-to-sequence prediction model for machine translation, as described by the author of the Keras deep learning library.

After reading this post, you will know:

  • The neural machine translation example provided with Keras and described on the Keras blog.
  • How to correctly define an encoder-decoder LSTM for training a neural machine translation model.
  • How to correctly define an inference model for using a trained encoder-decoder model to translate new sequences.

Let’s get started.

Photo by Tom Lee, some rights reserved.

Sequence-to-Sequence Prediction in Keras

Francois Chollet, the author of the Keras deep learning library, recently released a blog post that steps through a code example for developing an encoder-decoder LSTM for sequence-to-sequence prediction, titled “A ten-minute introduction to sequence-to-sequence learning in Keras”.

The code developed in the blog post has also been added to Keras as an example in the file lstm_seq2seq.py.

The post develops a sophisticated implementation of the encoder-decoder LSTM as described in the canonical papers on the topic:

  • Sequence to Sequence Learning with Neural Networks, 2014.
  • Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, 2014.

The model is applied to the problem of machine translation, the same as the source papers in which the approach was first described. Technically, the model is a neural machine translation model.

Francois’ implementation provides a template for how sequence-to-sequence prediction can be implemented (correctly) in the Keras deep learning library at the time of writing.

In this post, we will take a closer look at exactly how the training and inference models were designed and how they work.

You will be able to use this understanding to develop similar models for your own sequence-to-sequence prediction problems.


Machine Translation Data

The dataset used in the example involves short French and English sentence pairs used in the flash card software Anki.

The dataset is called “Tab-delimited Bilingual Sentence Pairs” and is part of the Tatoeba Project and listed on the ManyThings.org site for helping English as a Second Language students.

The dataset used in the tutorial can be downloaded from here:

Below is a sample of the first 10 rows from the fra.txt data file you will see after you unzip the downloaded archive.

The problem is framed as a sequence prediction problem where input sequences of characters are in English and output sequences of characters are in French.

A total of 10,000 of the nearly 150,000 examples in the data file are used in the dataset. Some technical details of the prepared data are as follows:

  • Input Sequences: Padded to a maximum length of 16 characters with a vocabulary of 71 different characters (10000, 16, 71).
  • Output Sequences: Padded to a maximum length of 59 characters with a vocabulary of 93 different characters (10000, 59, 93).

The training data is framed such that the input to the model comprises one whole input sequence of English characters and the whole output sequence of French characters. The output of the model is the whole sequence of French characters, offset forward by one time step.

For example (with minimal padding and without one-hot encoding):

  • Input1: ['G', 'o', '.', '']
  • Input2: ['\t', 'V', 'a', ' ']
  • Output: ['V', 'a', ' ', '!']
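The offset framing can be sketched in plain NumPy. This is an illustrative sketch, not code from the tutorial: the helper name one_hot_frames and the tiny vocabulary are made up for the example, while the tab and newline start/stop characters follow lstm_seq2seq.py.

```python
import numpy as np

def one_hot_frames(target_text, token_index):
    # Frame one target sequence for training: the decoder input is the
    # sequence with a leading start character, and the decoder output is
    # the same sequence offset forward by one time step.
    target_text = '\t' + target_text + '\n'  # '\t' starts, '\n' stops
    n_tokens = len(token_index)
    decoder_input = np.zeros((len(target_text), n_tokens))
    decoder_output = np.zeros((len(target_text), n_tokens))
    for t, char in enumerate(target_text):
        decoder_input[t, token_index[char]] = 1.
        if t > 0:
            # the output sequence is one step ahead of the input sequence
            decoder_output[t - 1, token_index[char]] = 1.
    return decoder_input, decoder_output

# tiny illustrative vocabulary covering 'Va !' plus the start/stop characters
token_index = {c: i for i, c in enumerate('\t\n !Va')}
decoder_input, decoder_output = one_hot_frames('Va !', token_index)
print(decoder_input.shape)  # (6, 6): 6 time steps, 6-character vocabulary
```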

Machine Translation Model

The neural translation model is an encoder-decoder recurrent neural network.

It is comprised of an encoder that reads a variable length input sequence and a decoder that predicts a variable length output sequence.

In this section, we will step through each element of the model’s definition, with code taken directly from the post and the code example in the Keras project (at the time of writing).

The model is divided into two sub-models: the encoder responsible for outputting a fixed-length encoding of the input English sequence, and the decoder responsible for predicting the output sequence, one character per output time step.

The first step is to define the encoder.

The input to the encoder is a sequence of characters, each encoded as a one-hot vector of length num_encoder_tokens.

The LSTM layer in the encoder is defined with the return_state argument set to True. This returns the hidden state output returned by LSTM layers generally, as well as the final hidden state and cell state of the layer. These are used when defining the decoder.
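Following the listing in the blog post and lstm_seq2seq.py, the encoder can be defined as below; the vocabulary size of 71 characters and the 256 units match the tutorial's configuration.

```python
from keras.layers import Input, LSTM

num_encoder_tokens = 71  # English character vocabulary size
latent_dim = 256         # number of units in the LSTM layer

# input: a variable-length sequence of one-hot encoded characters
encoder_inputs = Input(shape=(None, num_encoder_tokens))
# return_state=True makes the layer also return its final hidden and cell states
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# keep only the states; they will initialize the decoder
encoder_states = [state_h, state_c]
```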

Next, we define the decoder.

The decoder input is defined as a sequence of French characters, one-hot encoded as binary vectors of length num_decoder_tokens.

The LSTM layer is defined to both return sequences and state. The final hidden and cell states are ignored and only the output sequence of hidden states is referenced.

Importantly, the final hidden and cell state from the encoder is used to initialize the state of the decoder. This means every time that the encoder model encodes an input sequence, the final internal states of the encoder model are used as the starting point for outputting the first character in the output sequence. This also means that the encoder and decoder LSTM layers must have the same number of cells, in this case, 256.

A Dense output layer is used to predict each character. This Dense is used to produce each character in the output sequence in a one-shot manner, rather than recursively, at least during training. This is because the entire target sequence required for input to the model is known during training.

The Dense does not need to be wrapped in a TimeDistributed layer.
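Again following the post's listing, the decoder and output layer look as follows; a minimal encoder is repeated here only so the snippet stands on its own.

```python
from keras.layers import Input, LSTM, Dense

num_encoder_tokens = 71
num_decoder_tokens = 93
latent_dim = 256

# minimal encoder, repeated only to provide the initial states
encoder_inputs = Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# decoder: returns the full sequence of hidden states (its own final
# states are ignored during training) and starts from the encoder states
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)

# a plain Dense predicts each output character; no TimeDistributed wrapper needed
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
```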

Finally, the model is defined with inputs for the encoder and the decoder and the output target sequence.

We can tie all of this together in a standalone example and fix the configuration and print a graph of the model. The complete code example for defining the model is listed below.
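For reference, here is a standalone sketch of the training model in the shape of the lstm_seq2seq.py listing; the plot_model call is commented out because it requires the pydot and graphviz packages.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# configuration matching the prepared data described above
num_encoder_tokens = 71
num_decoder_tokens = 93
latent_dim = 256

# encoder: keep only the final hidden and cell states
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# decoder: initialized with the encoder states, outputs one softmax
# distribution over characters per output time step
decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# the model maps [source sequence, shifted target sequence] -> target sequence
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
# from keras.utils import plot_model
# plot_model(model, to_file='model.png', show_shapes=True)
```

Uncommenting the plot_model lines produces the plot discussed next.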

Running the example creates a plot of the defined model that may help you better understand how everything hangs together.

Note that the encoder LSTM does not pass its output directly as input to the decoder LSTM; as noted above, the decoder is initialized with the encoder's final hidden and cell states.

Also note that the decoder LSTM only passes the sequence of hidden states to the Dense for output, not the final hidden and cell states as suggested by the output shape information.

Graph of Encoder-Decoder Model For Training

Neural Machine Translation Inference

Once the defined model is fit, it can be used to make predictions: specifically, to output a French translation for an English source text.

The model defined for training has learned weights for this operation, but the structure of the model is not designed to be called recursively to generate one character at a time.

Instead, new models are required for the prediction step, specifically a model for encoding English input sequences of characters and a model that takes the sequence of French characters generated so far and the encoding as input and predicts the next character in the sequence.

Defining the inference models requires reference to elements of the model used for training in the example. Alternately, one could define a new model with the same shapes and load the weights from file.

The encoder model is defined as taking the input layer from the encoder in the trained model (encoder_inputs) and outputting the hidden and cell state tensors (encoder_states).

The decoder is more elaborate.

The decoder requires the hidden and cell states from the encoder as the initial state of the newly defined decoder model. Because the decoder is a separate standalone model, these states must be provided as input to the model, and therefore must first be defined as inputs.

They can then be specified for use as the initial state of the decoder LSTM layer.

Both the encoder and decoder will be called recursively for each character that is to be generated in the translated sequence.

On the first call, the hidden and cell states from the encoder will be used to initialize the decoder LSTM layer, provided as input to the model directly.

On subsequent recursive calls to the decoder, the last hidden and cell state must be provided to the model. These state values are already within the decoder; nevertheless, we must re-initialize the state on each call given the way that the model was defined in order to take the final states from the encoder on the first call.

Therefore, the decoder must output the hidden and cell states along with the predicted character on each call, so that these states can be assigned to a variable and used on each subsequent recursive call for a given input sequence of English text to be translated.

We can tie all of this together into a standalone code example, combined with the definition of the training model from the previous section, given that some elements are reused. The complete code listing is provided below.
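As a sketch of that complete listing, the training model and both inference models can be tied together as follows; the shapes match the tutorial, and the reuse of decoder_lstm and decoder_dense between training and inference is the key detail.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

num_encoder_tokens = 71
num_decoder_tokens = 93
latent_dim = 256

# training model, as in the previous section
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, num_decoder_tokens))
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# inference encoder: source sequence -> state vectors
encoder_model = Model(encoder_inputs, encoder_states)

# inference decoder: (last character, last states) -> (next character, new states)
decoder_state_input_h = Input(shape=(latent_dim,))
decoder_state_input_c = Input(shape=(latent_dim,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_outputs, state_h, state_c = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs)
decoder_states = [state_h, state_c]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs] + decoder_states)
```

During translation, encoder_model.predict() is called once per source sentence and decoder_model.predict() once per generated character, feeding the returned states back in each time.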

Running the example defines the training model, inference encoder, and inference decoder.

Plots of all three models are then created.

Graph of Encoder Model For Inference

The plot of the encoder is straightforward.

The decoder plot shows the three inputs required to decode a single character in the translated sequence: the encoded translation output so far, and the hidden and cell states, provided first from the encoder and then from the output of the decoder as the model is called recursively for a given translation.

Graph of Decoder Model For Inference

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Update

For an example of how to use this model on a standalone problem, see this post:

Summary

In this post, you discovered how to define an encoder-decoder sequence-to-sequence prediction model for machine translation, as described by the author of the Keras deep learning library.

Specifically, you learned:

  • The neural machine translation example provided with Keras and described on the Keras blog.
  • How to correctly define an encoder-decoder LSTM for training a neural machine translation model.
  • How to correctly define an inference model for using a trained encoder-decoder model to translate new sequences.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




140 Responses to How to Define an Encoder-Decoder Sequence-to-Sequence Model for Neural Machine Translation in Keras

  1. Tom October 27, 2017 at 11:51 am #

    Jason – Which book has this article? I did not see this in LSTM, if I am not mistaken.

  2. nandini October 31, 2017 at 10:13 pm #

I have referenced the lstm_seq2seq.py code for my requirement; I executed it correctly and got the correct results.
    My requirement is that I have to take inputs from the user, encode them to state vectors, give them to the decoder, and generate results for the given inputs. I have written logic for it, but I am not able to generate the correct results.

    userInput = ''
    count = 0
    while userInput != 'quit':
        userInput = input('enter the english sentences or quit to stop: ')
        userInput = str(userInput)
        f = open('testone.txt', 'a')
        if userInput == 'quit':
            f.close()
        else:
            f.write(userInput + '\n')
            count = count + 1
            print("count", count)

    # taking inputs from user
    # saving in testone.txt file

    test_path = 'testone.txt'
    test_texts = []
    #target_texts = []
    test_characters = set()
    #target_characters = set()
    lines = open(test_path).read().split('\n')
    for line in lines[: min(30, len(lines) - 1)]:
        test_text = line
        test_texts.append(test_text)
        #target_texts.append(target_text)
        for char in test_text:
            if char not in test_characters:
                test_characters.add(char)

    test_characters = sorted(list(test_characters))
    num_testencoder_tokens = len(test_characters)
    max_testencoder_seq_length = max([len(txt) for txt in test_texts])

    print('Number of samples:', len(test_texts))
    #print('Number of unique input tokens:', num_encoder_tokens)
    print("max test encoder seq length", max_testencoder_seq_length)
    print("num_testencoder_tokens", num_testencoder_tokens)
    test_token_index = dict(
        [(char, i) for i, char in enumerate(test_characters)])
    print("test_token_index", test_token_index)
    encoder_test_data = np.zeros(
        (len(test_texts), max_testencoder_seq_length, num_testencoder_tokens),
        dtype='float32')
    print("encoder_test_data", encoder_test_data.shape)

    for i, test_text in enumerate(test_texts):
        for t, char in enumerate(test_text):
            encoder_test_data[i, t, test_token_index[char]] = 1.
    print("encoder_test_data", encoder_test_data.shape)

    encoder_test_inputs = Input(shape=(None, num_testencoder_tokens))
    print("encoder_inputs", encoder_test_inputs.shape)

    encoder_test_inputs = Input(shape=(None, num_testencoder_tokens))
    print("----------------", encoder_test_inputs.shape)

    encoder = LSTM(latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_test_inputs)
    print("encoder_outputs", encoder_outputs)

    # We discard encoder_outputs and only keep the states.
    encoder_test_states = [state_h, state_c]

    print("encoder_test_states", encoder_test_states)
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
                                         initial_state=encoder_test_states)
    print('encoder_test_inputs.shape', encoder_test_inputs.shape)
    print('encoder_test_states', encoder_test_states)

    encoder_test_model = Model(encoder_test_inputs, encoder_test_states)

    print(encoder_test_model.summary)

    decoder_state_input_h = Input(shape=(latent_dim,))
    decoder_state_input_c = Input(shape=(latent_dim,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(
        decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)

    decoder_model = Model(
        [decoder_inputs] + decoder_states_inputs,
        [decoder_outputs] + decoder_states)

    # Reverse-lookup token index to decode sequences back to
    # something readable.
    reverse_input_char_index = dict(
        (i, char) for char, i in test_token_index.items())
    reverse_testtarget_char_index = dict(
        (i, char) for char, i in target_token_index.items())

    def decode_sequence(input_seq):
        # Encode the input as state vectors.
        states_value = encoder_test_model.predict(input_seq)

        # Generate empty target sequence of length 1.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        # Populate the first character of target sequence with the start character.
        target_seq[0, 0, target_token_index['\t']] = 1.

        # Sampling loop for a batch of sequences
        # (to simplify, here we assume a batch of size 1).
        stop_condition = False
        decoded_sentence = ''
        while not stop_condition:
            output_tokens, h, c = decoder_model.predict(
                [target_seq] + states_value)
            # Sample a token
            sampled_token_index = np.argmax(output_tokens[0, -1, :])
            print('sampled_token_index', sampled_token_index)
            #sampled_char = reverse_target_char_index[sampled_token_index]
            sampled_char = reverse_testtarget_char_index[sampled_token_index]
            print("sampled_char", sampled_char)
            decoded_sentence += sampled_char

            # Exit condition: either hit max length
            # or find stop character.
            if (sampled_char == '\n' or
                    len(decoded_sentence) > max_decoder_seq_length):
                stop_condition = True

            # Update the target sequence (of length 1).
            target_seq = np.zeros((1, 1, num_decoder_tokens))
            target_seq[0, 0, sampled_token_index] = 1.

            # Update states
            states_value = [h, c]

        return decoded_sentence

    for seq_index in range(count):
        input_seq = encoder_test_data[seq_index: seq_index + 1]
        print("input_seq", input_seq.shape)
        decoded_sentence = decode_sequence(input_seq)
        print('-')
        print('Input sentence:', test_texts[seq_index])
        print('Decoded sentence:', decoded_sentence)
    Above is my code; please suggest where I am going wrong.

    • Jason Brownlee November 1, 2017 at 5:43 am #

      Sorry, I cannot debug your code for you.

      What is the problem exactly?

  3. nandini November 1, 2017 at 3:44 pm #

    Actually, in the original code they are testing on already-trained data. In my code, I am giving inputs from the console and storing them in a file, then giving these inputs to the encoder_model. For console inputs I am not getting proper outputs like we got on the trained data.

    Even if I give the same words from the console to the encoder model, I am getting wrong results.

    Please tell me where I am going wrong.

    • Jason Brownlee November 1, 2017 at 4:15 pm #

      Perhaps you can debug your example with static data?

  4. nandini November 1, 2017 at 5:16 pm #

    For static data it is working, but when I take inputs from users I am not getting proper results.

    • Jason Brownlee November 2, 2017 at 5:07 am #

      It might be related to how you are reading the input and preparing it for the model.

  5. Ambika November 2, 2017 at 6:31 pm #

    How do we make predictions for a sequence-to-sequence model using Keras? Why are we not using the trained model directly, and why are we creating an inference model in this scenario? Please can you explain.

    I would like to know how to predict the model outputs for unknown inputs in a sequence-to-sequence model using Keras.

  6. Nandu November 2, 2017 at 11:01 pm #

    I have unknown inputs in English and I want to translate these inputs to French. So I converted these text inputs into vectors, as in the seq2seq model where they are encoded as fixed-length vectors.
    I have given these inputs to the inference encoder_model, along with the states (hidden, cell), like encoder_model = Model(encoder_test_inputs, encoder_test_states).
    From this I am extracting the state value from the encoder model and passing it to the decoder model, like decoder_model.predict([target_seq] + states_value), but I am getting scrambled outputs if I give unknown inputs to the encoder_model.

    Is there any different procedure available to predict the target sequence for a given unknown input sequence? Please advise.
    I am able to decode the already-trained data; if it is not trained data, I am not able to decode it.

    • Jason Brownlee November 3, 2017 at 5:17 am #

      Once the model is trained, you must encode new data using the same procedure as you used for the training data then call model.predict()

  7. nandu November 3, 2017 at 4:37 pm #

    In that example they are not calling model.predict(); they are calling encoder_model.predict() and decoder_model.predict() separately.

  8. nandini November 6, 2017 at 5:12 pm #

    How to predict an unknown target sequence for unknown inputs using Keras? For unknown inputs, do I need to create the input layer, encoder states, and encoder model again, or can I use the same encoder_model that takes encoder_inputs and encoder_states as parameters?

    • Jason Brownlee November 7, 2017 at 9:47 am #

      The inference models could be saved, loaded and used to make predictions as in the above examples.

  9. Nandu November 14, 2017 at 11:09 pm #

    Please can you explain how the inference decoder_model works in this example? I am OK with the training of the encoder and decoder model and the inference encoder_model, but I didn't understand the inference decoder_model in this example.

    # define decoder inference model
    decoder_state_input_h = Input(shape=(latent_dim,))
    decoder_state_input_c = Input(shape=(latent_dim,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)

    In the training model we pass decoder_inputs with the initial state set to the encoder states, but in inference we pass something different. Why is that?

    • Jason Brownlee November 15, 2017 at 9:52 am #

      The key difference is that we recursively pass in the last state as input, starting with the encoder state for the first pass.

      Does that help?

  10. Jonas November 15, 2017 at 6:52 am #

    Could you please elaborate more on why TimeDistributed is not needed here? The Dense layer in the decoder must output one character for each character in the input sequence, so how is this different from the case where TimeDistributed must be applied?

    • Jason Brownlee November 15, 2017 at 9:57 am #

      Great question. Dense can now support time steps!

      You can add a TimeDistributed wrapper and it will have the same effect.

      I know… Keras is getting a little confusing.

  11. nandini November 15, 2017 at 6:27 pm #

    Hi Jason,

    Currently I am working on a seq2seq model using Keras.

    My requirement: the user will give a test case in English (e.g. adding two numbers).
    Output: source code has to be generated for that particular test case in Python.

    This requirement is quite different from a language translation tool.

    Please make any suggestions on how to proceed further.

  12. nandini November 15, 2017 at 9:14 pm #

    While predicting the target character:

    output_tokens, h, c = decoder_test_model.predict(
        [target_seq] + states_value)

    # Sample a token
    sampled_token_index = np.argmax(output_tokens[0, -1, :])

    They are using np.argmax(output_tokens). Could you explain how it works in order to predict the target character?

  13. Krtk November 21, 2017 at 1:38 pm #

    Hi Jason, thanks for making a detailed blog.

    Did you try saving and restoring the model for inference later?
    I’m following the lstm_seq2seq example, where model.save stored the HDF5 file, but when I try restoring the model just for inference, the output is all garbage, despite the model giving good test responses when training is followed immediately by inference.

  14. nandini November 22, 2017 at 6:23 pm #

    Hi Jason,

    Can we identify grammar using Keras, like what is a noun and what is a pronoun?

    Please suggest whether there is a procedure to identify grammar using Keras.

    • Jason Brownlee November 23, 2017 at 10:28 am #

      Sure. It could be framed as word classification.

      You must prepare a dataset of examples then fit your model.

  15. ambika November 22, 2017 at 9:24 pm #

    Hi jason,

    Do you have any example of word-level encoding for language translation? The above example is for character-level encoding, right?
    If you have an example, please share the link.

  16. nandu November 24, 2017 at 12:02 am #

    In the provided example they do character-level encoding for the encoder and decoder. I would like to make the same model work with word-level encoding for the encoder and decoder.
    Is word-level encoding better than character-level encoding?

    For word-level encoding, which method do I need to follow? Please advise before I start.

    • Jason Brownlee November 24, 2017 at 9:46 am #

      It depends on the problem whether word or char level will be better. Char may be more flexible but slower to train; word may require a larger vocab/memory but train sooner.

  17. Jason Brownlee November 24, 2017 at 9:39 am #

    Sorry, I’m not sure I follow, can you please restate the question?

  18. Baptiste Amato November 26, 2017 at 2:29 am #

    Well explained article! I wonder if these sequence-to-sequence can be applied to images, for example to handle different size images, and apply encoder-decoder for segmentation?

    • Jason Brownlee November 26, 2017 at 7:33 am #

      Perhaps, I’m not sure I follow sorry. Do you have an example?

      • Baptiste Amato November 26, 2017 at 8:59 am #

        Let’s say I want to develop a Neural Network that returns, given an image, a sort of contour map (like the cars in blue, the people in green), but my data set has various images from different size, making them impossible to stack in an array and put them directly in a CNN. Would it be possible to apply the sequence-to-sequence idea to these images, as we want an image as result and we don’t have a uniform size for the input data?

        • Jason Brownlee November 27, 2017 at 5:42 am #

          Interesting challenge.

          I don’t think seq2seq is the right framing, but I could be wrong.

          There are many ways to frame this type of problem and I’d encourage you to explore a few. Perhaps a network that outputs a red and a green image that you combine downstream, or perhaps the network outputs one image with all pixels colored.

          • Baptiste Amato November 27, 2017 at 10:10 am #

            “a network that outputs a red and a green image that you combine downstream”: it’s pretty simple but I did not think about it… Thank you for your comment 🙂

          • Jason Brownlee November 28, 2017 at 8:36 am #

            No problem, let me know how you go.

  19. ambika November 28, 2017 at 5:55 pm #

    Can we store model states and use them later for testing?
    Please suggest a way to store the states in a file and use them later.

    • Jason Brownlee November 29, 2017 at 8:20 am #

      You can, e.g. pickle the state LSTM variable.

      Why do you want to save the state?

      • ambika November 29, 2017 at 5:19 pm #

        I am writing the training model and testing model separately; that's why I want to feed encoder_inputs and encoder states to the inference model, and that's why I would like to store the states.

        • Jason Brownlee November 30, 2017 at 8:06 am #

          I’d recommend using one model to keep things simple.

  20. ambika November 29, 2017 at 5:28 pm #

    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    model.save("seq2seq.h5")

    Here I am loading the model from the h5 file:

    model = model.load("seq2seq.h5")

    After that, I stored encoder_inputs and encoder_states of the trained model using pickling. Next I tried to create the inference model using encoder_inputs and encoder_states, but I am not able to create the inference model correctly; I am getting an issue like "graph disconnected with input layer 1".

    encoder_model = Model(encoder_inputs, encoder_states)

    Please suggest why I am getting this issue.

    • Etienne December 16, 2017 at 1:22 am #

      Hey, I had the same issue (I implemented a seq2seq model in R and had the same problem) and I solved it with a trick.
      After training your model, you can save it using the Keras save model function. Then I defined a function to build the encoder model and another one to build the decoder model from the model I had trained. Here is my code:

      It is R code, but you can adapt it to Python. The trick was mainly to create new layers and set their weights with those trained by your model.

      • Jason Brownlee December 16, 2017 at 5:34 am #

        Thanks for sharing, I added some formatting.

      • ambika December 19, 2017 at 6:27 pm #

        thanks for your suggestion 🙂

    • Benoit December 27, 2017 at 6:59 pm #

      This should do the trick for Python:

      • Jason Brownlee December 28, 2017 at 5:22 am #

        Thanks for sharing!

      • Reihana July 29, 2018 at 1:22 pm #

        Is the 's2s.h5' model the only thing you saved from your encoder-decoder for later prediction?
        I mean, in the main code, could we skip building the inference encoder and inference decoder and just save the model, then later for prediction retrieve and build the inference models based only on 's2s.h5'?

        If I do so I get this error:
        TypeError: Tensor objects are not iterable when eager execution is not enabled. To iterate over this tensor use tf.map_fn.

        My main confusion is whether we have to build the inference encoder/decoder at the time of building the model (training) or not.

        Thanks in advance!

  21. Kevin December 15, 2017 at 8:30 pm #

    Hi Jason,

    Thank you for this great article! I created a Tensorflow implementation. This gives another perspective on the implementation side of the seq2seq model. The blog post about the Tensorflow implementation is found on Data Blogger: https://www.data-blogger.com/2017/12/14/create-a-character-based-seq2seq-using-python-and-tensorflow/.

  22. ambika December 19, 2017 at 6:46 pm #

    Can we add more LSTM layers in the encoder and decoder in order to predict the results more accurately?
    Is it a good approach to add more layers to the encoder and decoder using Keras?

  23. ambika December 19, 2017 at 6:50 pm #

    Is the encoder-decoder model a good approach for a model whose input is an algorithm and whose output is generated source code in a specified language?

    I am trying to solve this problem statement using an encoder-decoder model in Keras. Is this a good idea to implement, or is there another approach for this problem statement?

    Please do make suggestions.

  24. Etienne December 20, 2017 at 9:06 pm #

    Hey, first thank you for your article. I've managed to set up a model with only one layer for the encoder and one layer for the decoder (both LSTM). I would like to train my model with more than one LSTM layer.

    I have added one more LSTM to my decoder and set the states for both of them with the encoder states. However, when I want to predict a new sequence, the output is very bad. Have you tried to add more than one LSTM layer with Keras?

    • nandini December 20, 2017 at 9:24 pm #

      Why are you not adding more LSTM layers to the encoder? Can I know the reason?
      Actually, I am also working on this issue; let me know why you are adding more LSTM layers to the decoder rather than the encoder.

      • Etienne December 21, 2017 at 1:13 am #

        When I try to create a sophisticated model, I start with a basic model and then try to improve it.
        Sure, you can add 8 layers for the encoder and 8 for the decoder, but I prefer to build it step by step. I will try to add layers to the encoder for sure, but I don't think that it will solve my problem. The training part is OK with 2 LSTMs for the decoder, but the problem comes when I try to predict something with my inference model.

        • Nandini December 21, 2017 at 3:08 pm #

          Can you share the code of your 2 LSTM layers and how you implemented it? I have tried the same with 3 LSTM layers in the encoder and 3 in the decoder, but while creating the inference decoder model I am getting a shape issue.

      • Jason Brownlee December 21, 2017 at 5:25 am #

        I use a single layer for the encoder and decoder to keep things simple in the example.

    • Jason Brownlee December 21, 2017 at 5:24 am #

      You will need to tune the model, perhaps longer training, a new batch size and other config changes.

    • Greg April 14, 2018 at 6:44 am #

      How can I add more LSTM layers to the encoder and decoder? I’m having some trouble with the syntax.
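Not from the post, but a minimal sketch of the stacking syntax for the training model (tensorflow.keras assumed; the dimensions are made up): every stacked layer except the last encoder layer keeps return_sequences=True.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 16, 8, 10

# encoder: all but the last stacked layer return full sequences
encoder_inputs = Input(shape=(None, n_enc_tokens))
x = LSTM(latent_dim, return_sequences=True)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

# decoder: every stacked layer returns full sequences
decoder_inputs = Input(shape=(None, n_dec_tokens))
x, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    decoder_inputs, initial_state=encoder_states)
x, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    x, initial_state=encoder_states)
decoder_outputs = Dense(n_dec_tokens, activation='softmax')(x)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```

Keep in mind that the inference models then need to carry the states of every decoder layer, not just one [h, c] pair.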

  25. nandin January 3, 2018 at 4:43 pm #

    Given a test case, I need to generate a script file for it. I am able to generate somewhat OK results, but I would like to recognize which variables appear in my test case, as they have to be the same in the script file. Please give any suggestions.

    How can the model recognize a variable from the input? For example:
    Define an integer variable a : int a

    Define an integer variable add : int add

    How can the model recognize the variables a and add in the test case?
    Please make any suggestions.

  26. Nandini January 9, 2018 at 11:04 pm #

    Is adding more LSTM layers a good approach to increase the accuracy of an encoder-decoder model?

  27. Hardik Khandelwal January 12, 2018 at 8:38 pm #

    Thanks for the post. It's really helpful.
    Why are we using only 10,000 samples out of 150,000? Is it because of huge memory requirements, or is there some other reason?

    • Jason Brownlee January 13, 2018 at 5:32 am #

      Yes, because of the large space and time complexity.

  28. Soumil Mandal January 31, 2018 at 5:11 am #

    Thanks for the tutorial, it was quite helpful. In https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html he does mention in the FAQ section how to change the code to convert it into a GRU seq2seq model, but he does not mention how to change the inference model accordingly. Any help would be really appreciated; I've been stuck here for quite some time. Thanks again.
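Not covered in the post, but here is one way the GRU inference models might look (a sketch with made-up dimensions, tensorflow.keras assumed). The key change is that a GRU carries a single state tensor instead of the LSTM's [h, c] pair.

```python
import numpy as np
from tensorflow.keras.layers import Input, GRU, Dense
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 16, 8, 10

# training model
encoder_inputs = Input(shape=(None, n_enc_tokens))
_, encoder_state = GRU(latent_dim, return_state=True)(encoder_inputs)

decoder_inputs = Input(shape=(None, n_dec_tokens))
decoder_gru = GRU(latent_dim, return_sequences=True, return_state=True)
decoder_seq, _ = decoder_gru(decoder_inputs, initial_state=encoder_state)
decoder_dense = Dense(n_dec_tokens, activation='softmax')
model = Model([encoder_inputs, decoder_inputs], decoder_dense(decoder_seq))

# inference models: one state tensor in, one state tensor out
encoder_model = Model(encoder_inputs, encoder_state)
state_input = Input(shape=(latent_dim,))
seq, state = decoder_gru(decoder_inputs, initial_state=state_input)
decoder_model = Model([decoder_inputs, state_input],
                      [decoder_dense(seq), state])
```

The sampling loop is otherwise the same as the LSTM version, except that only one state array is passed back in on each step.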

  29. huaiyanggongzi February 1, 2018 at 4:55 am #

    Hi, thanks for the tutorial. Can we extend this model to generic sequences instead of just a language model? For instance, both the input and output sequences could be numerical sequences. Thanks for your insights.

  30. Ajay Prasadh Viswanathan February 6, 2018 at 9:20 am #

    Hey Jason, Thanks for an awesome article.
    I was wondering why you were padding the input with a max number of characters of 16. I could not see where this particular number was explicitly referenced in the model again, so I suppose I could feed non-padded input to this model and the sequence-to-sequence model would still work?

    Also, if we are going to pad the input and output to a finite length, could we not have used just a simple sequence classification architecture, something like:


    inputs = Input(shape = (in_max_characters, in_char_vocab_size))

    h1 = LSTM(128)(inputs)

    outputs = TimeDistributed(Dense(output_char_vocab_size), input_shape = (out_max_characters, 128))

    My question is regarding the conceptual and practical necessity of padding in the context of sequence-to-sequence models.

    • Ajay Prasadh Viswanathan February 6, 2018 at 9:24 am #

      I forgot to add the repeat vector in my code earlier. Here is an updated version.

      inputs = Input(shape = (in_max_characters, in_char_vocab_size))
      h1 = LSTM(128)(inputs)
      h2 = RepeatVector(out_max_characters)(h1)
      outputs = TimeDistributed(Dense(output_char_vocab_size), input_shape = (out_max_characters, 128))(h2)

    • Jason Brownlee February 6, 2018 at 9:28 am #

      For this model, I believe padding is not required as it processes one time step at a time.

  31. Daniel February 23, 2018 at 3:50 am #

    Hello Jason, Thank you so much for your explanation.

    I would like to ask you about the model's fit. Do you give as input a matrix of dimension (10000, 16, 71) and as output a matrix of dimension (10000, 59, 93)?

    In the case of word-based sequence-to-sequence (as explained in https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) they introduced an embedding layer. With the embedding layer, how do I set up the model so it understands that a single word is a single time step? Do I have to set the embedding's input_length parameter to 1, which gives an output of (?, 1, vocab_size)? Or can I set the embedding's input_length parameter to the max length of the corpus sentences, which gives an output of (?, maxlen, vocab_size)?

    Thank you for your reply

    • Jason Brownlee February 23, 2018 at 12:02 pm #

      With the embedding, your input will be a sequence of integers. The embedding will map each integer to a high-dimensional vector.

      Therefore, this post will help you with reshaping your sequences of integer inputs:
      https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

      Does that help?

      • Daniel February 25, 2018 at 6:07 am #

        Yes, thank you.
        But if I have the following code:

        i1 = Input(shape=(2,))
        emb = Embedding(input_dim=vocab_size, output_dim=10, input_length=2)(i1) # tensor (num_samples, 2, 10)
        lst = LSTM(16)(emb) # <- is it necessary to declare input_shape?

        In this case, is it necessary to declare the input_shape of the LSTM? emb is a 3D tensor with the number of examples, the time steps, and the output dimension.

        Thank you so much for your time.

        • Jason Brownlee February 25, 2018 at 7:45 am #

          No, the embedding will provide 3d input to the LSTM layer.
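To make the answer concrete, a small sketch (made-up vocabulary size and dimensions, tensorflow.keras assumed): the Embedding layer already emits a 3D tensor of (samples, sequence length, embedding dim), so each word is one time step and the LSTM needs no input_shape of its own.

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM
from tensorflow.keras.models import Model

vocab_size, maxlen, emb_dim = 50, 4, 10

# one integer per word; the Embedding output is (samples, maxlen, emb_dim)
i1 = Input(shape=(maxlen,))
emb = Embedding(input_dim=vocab_size, output_dim=emb_dim)(i1)
lstm_out = LSTM(16)(emb)  # consumes the 3D tensor directly
model = Model(i1, lstm_out)

out = model.predict(np.array([[3, 7, 0, 12]]), verbose=0)
```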

  32. Saket Karve February 24, 2018 at 7:02 pm #

    I am using a sequence-to-sequence model to predict keyphrases from a given text article. I am using the exact same model as described in this blog. However, I am confused about whether the weights that were trained will be used during inference.

    As per the example code on GitHub, during inference each character has to be predicted recursively. For this, encoder_model.predict() is called to encode the input and decoder_model.predict() is called to predict the output. However, after saving the trained model, how do we make sure the trained weights are used in the encoder_model and the decoder_model during inference?

    Actually, I am getting the same output irrespective of the input.

    • Jason Brownlee February 25, 2018 at 7:42 am #

      Perhaps there is something going on with the saved model.

      Perhaps try to get it working in memory with a small example, then try saving/loading and reproducing the result to ensure there are no faults.

  33. Shannon March 11, 2018 at 4:39 pm #

    Great tutorial. I want to create a sequence to sequence model for images.

    Here is an example of my training set :

    x1 x2 x3 y1 y2 y3, where x1 x2 and x3 are the input sequence of images and y1, y2, and y3 are the following sequence of images which I want to forecast.

    My question is how do I represent x1 x2 and x3 as input to a neural network or to say Encoder-Decoder Sequence-to-Sequence Model?

    • Jason Brownlee March 12, 2018 at 6:26 am #

      That sounds like a challenging problem. Perhaps you can find examples of existing models that output an image that you can use for the output part of the model. The input could be a modified VGG or similar network.

  34. Zoe March 24, 2018 at 6:13 am #

    Thank you very much for sharing this great article, Jason.
    I don't understand the inference model. Why do we need the inference model after we have defined our encoder and decoder? Don't we just need to define the inputs/outputs and structure of the encoder/decoder, train this model, and then use it to predict? Very confused now. Thank you for your help in advance.

    • Jason Brownlee March 24, 2018 at 6:33 am #

      Here, we are using two very different models, one for learning and one for making predictions.

      The separation is required given the need to correctly perform the encoding and decoding in Keras.

      I’d recommend writing wrapper functions to train/save/load/predict with models in this form to hide the complexity.

  35. Parul April 5, 2018 at 8:43 am #

    How do I fit the model? In your vanilla lstm you had used

    model.fit(trainX, trainY, epochs=30, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)

    What would be the equivalent for that in this model?
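For what it's worth, a self-contained sketch of the equivalent call (random data and made-up dimensions, tensorflow.keras assumed): the training model takes two input arrays, so fit() receives a list.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 8, 5, 6

encoder_inputs = Input(shape=(None, n_enc_tokens))
_, h, c = LSTM(latent_dim, return_state=True)(encoder_inputs)
decoder_inputs = Input(shape=(None, n_dec_tokens))
seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    decoder_inputs, initial_state=[h, c])
decoder_outputs = Dense(n_dec_tokens, activation='softmax')(seq)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

# the two inputs go in as a list; the target is the decoder output
enc_x = np.random.rand(4, 7, n_enc_tokens)
dec_x = np.random.rand(4, 3, n_dec_tokens)
dec_y = np.random.rand(4, 3, n_dec_tokens)
history = model.fit([enc_x, dec_x], dec_y, batch_size=2, epochs=1, verbose=0)
```

Callbacks such as a checkpoint and validation_data can be added to this fit() call exactly as in the vanilla LSTM example.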

  36. George M. April 10, 2018 at 1:32 am #

    Why is the decoder output ahead by one timestep?

  37. George M. April 10, 2018 at 8:33 am #

    Dr. Brownlee,

    First of all, very great tutorial! I have all you articles bookmarked.

    I went through the https://github.com/keras-team/keras/blob/master/examples/lstm_seq2seq.py
    which has the same code you have here. At line 117 where they create the target values for the output of decoder it says it’s ahead of the decoder input by 1 timestep.
    Why do you think that is?

    Sorry, I wasn’t specific enough before.

  38. midabomb April 11, 2018 at 2:00 pm #

    Hello Jason,

    In inference mode, we define an encoder_model:

    # Define sampling models
    encoder_model = Model(encoder_inputs, encoder_states)

    Then we use this model here:
    states_value = encoder_model.predict(input_seq)

    I am used to defining a model, then compiling and fitting it, before using it to predict. Here, we just define the model and then predict.

    Obviously, I am missing something. Could you elaborate a bit on what is going on here behind the scene?
    Thanks,
    -MDB

    • Jason Brownlee April 11, 2018 at 4:27 pm #

      I believe compilation is only required for fitting the Keras model.

      • midabomb April 12, 2018 at 1:05 am #

        Okay Thanks Jason.

        In my example, there is no fitting. Where are the weights for the predict coming from, Jason?
        -MDB

  39. Sukruthi April 12, 2018 at 5:42 pm #

    Hi Jason. How can I save the model and then use it another time from the saved model?

  40. changito April 14, 2018 at 7:06 am #

    Here is the difference between a good post (GitHub) and an excellent post (this one). Thank you, amigo!

  41. Mayur Jain April 24, 2018 at 11:32 pm #

    Thank you for the wonderful post.
    Can the LSTM model used for machine translation be applied to dialogue generation?

    Please let me know if it can be done. If yes, what major tweaks need to be made?

  42. bratt April 24, 2018 at 11:49 pm #

    Hi Jason, your blogs are really great and taught me much, thank you for your opensource work!

    I made this code for a prediction (regression) task, but I want to understand the math behind it as well. Why do I go with return_sequences=False first in this case, and what are my activation functions? They are not named, so where do I find out which is the default? Thank you in advance.

    inputs = Input(shape=(timesteps, input_dim))
    encoded = LSTM(n_dimensions, return_sequences=False, name='encoder')(inputs)
    decoded = RepeatVector(timesteps)(encoded)
    decoded = LSTM(input_dim, return_sequences=True, name='decoder')(decoded)

    autoencoder = Model(inputs, decoded)
    encoder = Model(inputs, encoded)

    • Jason Brownlee April 25, 2018 at 6:34 am #

      LSTMs and the encoder-decoder are not suited for regression, unless you have a sequence of inputs and outputs.

      You can learn more here:
      https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

      • bratt April 25, 2018 at 5:57 pm #

        My inputs are sequences: 100 measurements, each containing 10,000 data points. The point that I do not understand is which type of activation function is built into the model; in your other link it is also not mentioned what the function is.

        • Jason Brownlee April 26, 2018 at 6:25 am #

          Perhaps use the defaults first in order to evaluate model skill?

          • bratt April 27, 2018 at 1:51 am #

            Hi Jason, sorry for the misunderstanding, but you got to the point. What are the default activation functions of the encoding and decoding layers?

          • Jason Brownlee April 27, 2018 at 6:06 am #

            You can see here:
            https://keras.io/layers/recurrent/#lstm

  43. wende May 11, 2018 at 5:04 pm #

    Hi Jason, how can I adapt it for a bidirectional architecture? I keep getting an error on the decoder part:
    ValueError: Dimensions must be equal, but are 256 and 512 for 'lstm_2_1/MatMul_4' (op: 'MatMul') with input shapes: [?,256], [512,512].
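That dimension mismatch usually means the decoder state size does not match the concatenated forward and backward encoder states. A sketch of one way to wire it (made-up dimensions, tensorflow.keras assumed):

```python
import numpy as np
from tensorflow.keras.layers import (Input, LSTM, Dense, Bidirectional,
                                     Concatenate)
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 16, 8, 10

# a bidirectional encoder returns forward and backward states separately
encoder_inputs = Input(shape=(None, n_enc_tokens))
_, fh, fc, bh, bc = Bidirectional(
    LSTM(latent_dim, return_state=True))(encoder_inputs)
state_h = Concatenate()([fh, bh])
state_c = Concatenate()([fc, bc])

# the decoder must be twice as wide to accept the concatenated states
decoder_inputs = Input(shape=(None, n_dec_tokens))
seq, _, _ = LSTM(latent_dim * 2, return_sequences=True, return_state=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(n_dec_tokens, activation='softmax')(seq)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
```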

  44. zishan May 20, 2018 at 10:57 pm #

    Hi Jason, when I try to use this model I get an error about expecting 2 input arrays. I know it has something to do with the line model = Model([encoder_inputs, decoder_inputs], decoder_outputs) and the decoder part. I have a target array which my model should encode and decode in the end, but how do I define the decoder_inputs in model.fit(train, train, ...)?

    • Jason Brownlee May 21, 2018 at 6:31 am #

      Are you able to confirm that you copied all of the code exactly?

  45. paul June 1, 2018 at 9:01 am #

    Hi Jason, I am really new to Python and machine learning and don't have an IT background, so my question might not be clear enough; I can provide details later as necessary. I know that you commented that it is not good for time series, but I still want to see my results. How do I use this encoder-decoder model with a sliding window approach? Could you give me the steps or blocks for how to proceed with that? Thank you in advance.

  46. didu June 4, 2018 at 9:20 pm #

    I am a bit confused since I am new to ML: is there a big difference between the two models you described here and the one in this post: https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/

    • Jason Brownlee June 5, 2018 at 6:39 am #

      Yes, the linked approach matches the definition of the method as described in the research papers.

      The above approach is a simplification of the approach that is easier to implement and understand and gives similar performance.

  47. duwa June 9, 2018 at 12:26 pm #

    Hi sir,
    I have followed your tutorial and everything worked fine until I trained the model on the full dataset. I have 8 GB of RAM and another 8 GB of swap, and I can see the RAM filling up completely through hardware monitors. Is there any way to avoid memory issues, or to train the model again and again on different datasets? P.S. I'm a newbie to this stuff 🙂
    ===========================
    Using TensorFlow backend.
    English Vocabulary Size: 8773
    English Max Length: 9
    German Vocabulary Size: 15723
    German Max Length: 17
    Traceback (most recent call last):
    File "train_model.py", line 85, in
    trainY = encode_output(trainY, eng_vocab_size)
    File "train_model.py", line 48, in encode_output
    y = array(ylist)
    MemoryError

    • Jason Brownlee June 10, 2018 at 5:58 am #

      Sorry to hear that.

      Perhaps try running the example on EC2 with more RAM?
      Perhaps try using progressive loading instead of loading all data into memory?

  48. Reihana July 24, 2018 at 8:12 am #

    Hi Jason,

    How can we see the score of the model? I mean the precision, recall, etc. on the train set?

    • Jason Brownlee July 24, 2018 at 2:30 pm #

      Generally, we use measures like BLEU or ROUGE to evaluate the performance of an NMT system.

  49. tom July 26, 2018 at 5:12 am #

    Hi Jason,
    I read your blogs and they are really easy to follow. But one thing confused me: I saw somewhere a seq2seq model with a RepeatVector. Could you please tell me the difference? Are there any papers describing the model with the inference step and the model with the RepeatVector? Thanks in advance.

  50. Reihana July 31, 2018 at 3:51 am #

    Hi Jason,
    Thanks a lot for this beautiful tutorial.

    I have a very big problem with the logic behind the inference model.
    The inference model is going to predict the output for data that we don't have labels for. Am I correct?

    So then, in decode_sequence(input_seq):
    we have this:
    # Generate empty target sequence of length 1.
    target_seq = np.zeros((1, 1, num_decoder_tokens))
    # Populate the first character of target sequence with the start character.
    target_seq[0, 0, target_token_index['\t']] = 1.

    The second input (the input of our decoder) is set to [1, 0, 0, ..., 0] here, meaning an almost entirely zero vector. On the other hand, this input is impactful in predicting the target sequence.
    How could we do that? I mean, why are we setting the decoder input like this when we know that it is going to be used in predicting the output data?

    I was expecting that since this is a vector of zeros, no matter what the encoder_states are, the model would collapse everything to zero, which is what it is doing for me now!

    • Jason Brownlee July 31, 2018 at 6:12 am #

      Perhaps try this much simpler approach:
      https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

      • Reihana July 31, 2018 at 8:25 am #

        I just noticed that the decoder is designed to decode one token at a time, meaning it is not taking a sequence of tokens and predicting a sequence of tokens.
        That being said, the decoder input is being used to keep track of each token discovered at each time step, and because of that it just needs the starting point to predict the next token.

        Got it!
        Thanks Jason! your blog is my first recommendation to everyone!

  51. Monika Jain August 2, 2018 at 11:01 pm #

    Hi, I need a single encoder and three decoders branching out of it in parallel. Can somebody suggest how to code that?
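Not something the post covers, but a sketch of one way this could be wired (made-up dimensions, tensorflow.keras assumed): three decoder branches, each initialized from the same encoder states, collected into one multi-output model.

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 16, 8, 10

encoder_inputs = Input(shape=(None, n_enc_tokens))
_, h, c = LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [h, c]

# three parallel decoders, each seeded with the same encoder states
branch_inputs, branch_outputs = [], []
for _ in range(3):
    dec_in = Input(shape=(None, n_dec_tokens))
    seq, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
        dec_in, initial_state=encoder_states)
    branch_inputs.append(dec_in)
    branch_outputs.append(Dense(n_dec_tokens, activation='softmax')(seq))

model = Model([encoder_inputs] + branch_inputs, branch_outputs)
```

At training time, fit() would then receive a list of four input arrays and a list of three target arrays, one per branch.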

  52. betsi August 9, 2018 at 12:54 pm #

    What is the meaning of the latent dim? How can I find a value that fits my problem?

  53. Angel August 12, 2018 at 6:34 pm #

    Hello Jason 🙂
    Really thanks a lot for the tutorial it’s very helpful

    I'm still rather new to deep learning... I have trained the model on the dataset, but I wonder how I can reload my saved model to test it on a new set of data?

  54. Richa Ranjan August 20, 2018 at 8:45 am #

    One of the best tutorials I’ve seen for sequence2sequence!

    Just one doubt: how are the inference models connected to the actual model that we trained? While making predictions, if I try to load a previously trained model, I get different results than at training time. Basically, I'm not clear on how the model, the encoder_model, and the decoder_model are linked.

    This may be a repeated query, but I haven’t had any luck with the answers. Any help would be appreciated. Thanks in advance!

    • Jason Brownlee August 20, 2018 at 2:15 pm #

      They are linked by using the same weights, but different interfaces to the data.

  55. Vishwas August 28, 2018 at 8:00 pm #

    Hi Jason, is there backpropagation taking place in the encoder? If yes, then how is the loss function calculated, as we don't have any target variables?

    • Jason Brownlee August 29, 2018 at 8:09 am #

      Yes, we are backpropagating from the output to the decoder to the encoder.

      • Vishwas August 29, 2018 at 3:01 pm #

        Hello Jason,
        Thanks for your quick response, but I had one more question.

        In the encoder step we are creating a thought vector for the given input sentence (in the case of machine translation) and giving it to the decoder. So during backpropagation, in the encoder we are not performing any prediction, so why do we need to update the weights?

        I hope you can give a brief description of this, as I was not able to find any material which explains backprop in a seq2seq model.

        Thanks!

        • Jason Brownlee August 30, 2018 at 6:25 am #

          Think of it as one large model, errors are propagated back as per any other neural net.

  56. Amel Musić September 23, 2018 at 5:12 am #

    Hi,
    thank you for this great tutorial. Can you please share code for stacked layers with us? From what I can see, a lot of us are struggling to stack the decoder and encoder.

    I’ve stacked encoder like this:
    encoder_inputs = Input(shape=(None, num_encoder_tokens))
    encoder_hidden = LSTM(latent_dim, return_state=True, return_sequences=True)(encoder_inputs)
    encoder = LSTM(latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_hidden)

    And decoder like this:
    decoder_inputs = Input(shape=(None, num_decoder_tokens))
    # We set up our decoder to return full output sequences,
    # and to return internal states as well. We don’t use the
    # return states in the training model, but we will use them in inference.

    decoder_hidden = LSTM(latent_dim, return_sequences=True, return_state=True)(decoder_inputs,
    initial_state=encoder_states)

    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_hidden)

    The problem is inference: the model seems to train, but I'm getting weird results at inference time.

    Can you please post how you would stack an additional layer during training and inference?
    Thanks
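One way the stacking might be done for both training and inference — a sketch with made-up dimensions (tensorflow.keras assumed), not code from the post. The crucial point is that the inference decoder must carry a separate [h, c] state pair for every stacked layer:

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

latent_dim, n_enc_tokens, n_dec_tokens = 16, 8, 10

# training model: two stacked LSTM layers on each side
encoder_inputs = Input(shape=(None, n_enc_tokens))
enc_seq = LSTM(latent_dim, return_sequences=True)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_seq)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None, n_dec_tokens))
dec_lstm1 = LSTM(latent_dim, return_sequences=True, return_state=True)
dec_lstm2 = LSTM(latent_dim, return_sequences=True, return_state=True)
seq, _, _ = dec_lstm1(decoder_inputs, initial_state=encoder_states)
seq, _, _ = dec_lstm2(seq, initial_state=encoder_states)
decoder_dense = Dense(n_dec_tokens, activation='softmax')
model = Model([encoder_inputs, decoder_inputs], decoder_dense(seq))

# inference: each decoder layer needs its own state inputs and outputs
encoder_model = Model(encoder_inputs, encoder_states)
s1h, s1c = Input(shape=(latent_dim,)), Input(shape=(latent_dim,))
s2h, s2c = Input(shape=(latent_dim,)), Input(shape=(latent_dim,))
d, h1, c1 = dec_lstm1(decoder_inputs, initial_state=[s1h, s1c])
d, h2, c2 = dec_lstm2(d, initial_state=[s2h, s2c])
decoder_model = Model([decoder_inputs, s1h, s1c, s2h, s2c],
                      [decoder_dense(d), h1, c1, h2, c2])
```

At prediction time, seed both layers with the encoder states, then feed each layer's returned states back into the corresponding state inputs on every step; reusing one state pair for both layers at inference is a common cause of the "trains fine, predicts garbage" symptom.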
