How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

The encoder-decoder model provides a pattern for using recurrent neural networks to address challenging sequence-to-sequence prediction problems such as machine translation.

Encoder-decoder models can be developed in the Keras Python deep learning library. An example of a neural machine translation system built with this model is described on the Keras blog, with sample code distributed with the Keras project.

This example can provide the basis for developing encoder-decoder LSTM models for your own sequence-to-sequence prediction problems.

In this tutorial, you will discover how to develop a sophisticated encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

After completing this tutorial, you will know:

  • How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
  • How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
  • How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

Let’s get started.


Tutorial Overview

This tutorial is divided into 3 parts; they are:

  • Encoder-Decoder Model in Keras
  • Scalable Sequence-to-Sequence Problem
  • Encoder-Decoder LSTM for Sequence Prediction

Python Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend.

The tutorial also assumes you have scikit-learn, Pandas, NumPy, and Matplotlib installed.

If you need help with your environment, see this post:

Encoder-Decoder Model in Keras

The encoder-decoder model is a way of organizing recurrent neural networks for sequence-to-sequence prediction problems.

It was originally developed for machine translation problems, although it has proven successful at related sequence-to-sequence prediction problems such as text summarization and question answering.

The approach involves two recurrent neural networks, one to encode the source sequence, called the encoder, and a second to decode the encoded source sequence into the target sequence, called the decoder.

The Keras deep learning Python library provides an example of how to implement the encoder-decoder model for machine translation, described by the library's creator in the post “A ten-minute introduction to sequence-to-sequence learning in Keras.”

For a detailed breakdown of this model see the post:

For more information on the use of return_state, which might be new to you, see the post:

For more help getting started with the Keras Functional API, see the post:

Using the code in that example as a starting point, we can develop a generic function to define an encoder-decoder recurrent neural network. Below is this function named define_models().

The function takes 3 arguments, as follows:

  • n_input: The cardinality of the input sequence, e.g. number of features, words, or characters for each time step.
  • n_output: The cardinality of the output sequence, e.g. number of features, words, or characters for each time step.
  • n_units: The number of cells to create in the encoder and decoder models, e.g. 128 or 256.

The function then creates and returns 3 models, as follows:

  • train: Model that can be trained given source, target, and shifted target sequences.
  • inference_encoder: Encoder model used when making a prediction for a new source sequence.
  • inference_decoder: Decoder model used when making a prediction for a new source sequence.
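The function described above can be sketched as follows. This is a minimal version, assuming the standalone keras package and its Functional API; the layer wiring follows the Keras sequence-to-sequence example that this tutorial builds on, and the variable names inside the function are illustrative.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

# returns train, inference_encoder and inference_decoder models
def define_models(n_input, n_output, n_units):
    # training encoder: discard the outputs, keep the final hidden and cell states
    encoder_inputs = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    encoder_states = [state_h, state_c]
    # training decoder: initialised with the encoder states
    decoder_inputs = Input(shape=(None, n_output))
    decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
    # inference encoder: source sequence in, states out
    encoder_model = Model(encoder_inputs, encoder_states)
    # inference decoder: reuses the trained layers, but takes its states as inputs
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    # return all models
    return model, encoder_model, decoder_model
```

Because the inference models are built from the same layer objects as the training model, they share its weights; training the first model is enough to make the other two usable.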

The model is trained given source and target sequences where the model takes both the source and a shifted version of the target sequence as input and predicts the whole target sequence.

For example, one source sequence may be [1,2,3] and the target sequence [4,5,6]. The inputs and outputs to the model during training would be:
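A small sketch of how those three sequences relate, using plain Python lists and the '_' start-of-sequence symbol mentioned in this section. The names Input1, Input2 and Output are labels chosen here for illustration.

```python
source = [1, 2, 3]
target = [4, 5, 6]
START = '_'  # start-of-sequence symbol

# the decoder input is the target shifted forward by one step,
# with the start symbol in the first position
shifted_target = [START] + target[:-1]

print('Input1:', source)          # Input1: [1, 2, 3]
print('Input2:', shifted_target)  # Input2: ['_', 4, 5]
print('Output:', target)          # Output: [4, 5, 6]
```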

The model is intended to be called recursively when generating target sequences for new source sequences.

The source sequence is encoded and the target sequence is generated one element at a time, using a “start of sequence” character such as ‘_’ to start the process. Therefore, in the above case, the following input-output pairs would occur during training:
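The step-by-step pairs for the same example can be listed with a short loop. This sketch only enumerates the pairs; it does not call a model, and the tuple layout (t, Input1, Input2, Output) is an assumption made for display.

```python
source = [1, 2, 3]
target = [4, 5, 6]
START = '_'  # start-of-sequence symbol

steps = []
decoder_in = START
for t, expected in enumerate(target, start=1):
    # at each step the decoder sees the previous target element
    steps.append((t, source, decoder_in, expected))
    decoder_in = expected

print('t, Input1, Input2, Output')
for row in steps:
    print(row)
# (1, [1, 2, 3], '_', 4)
# (2, [1, 2, 3], 4, 5)
# (3, [1, 2, 3], 5, 6)
```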

Here you can see how the recursive use of the model can be used to build up output sequences.

During prediction, the inference_encoder model is used to encode the input sequence once which returns states that are used to initialize the inference_decoder model. From that point, the inference_decoder model is used to generate predictions step by step.

The function below named predict_sequence() can be used after the model is trained to generate a target sequence given a source sequence.

This function takes 5 arguments as follows:

  • infenc: Encoder model used when making a prediction for a new source sequence.
  • infdec: Decoder model used when making a prediction for a new source sequence.
  • source: Encoded source sequence.
  • n_steps: Number of time steps in the target sequence.
  • cardinality: The cardinality of the output sequence, e.g. the number of features, words, or characters for each time step.

The function then returns a list containing the target sequence.
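One way predict_sequence() might look, as a sketch under two assumptions: the start-of-sequence input is the one-hot vector for the value 0 (matching the shifted target used in training), and the decoder's probability vector is fed back in directly as the next input. Feeding back a one-hot of the argmax is a common alternative.

```python
from numpy import array, zeros

# generate a target sequence given a source sequence
def predict_sequence(infenc, infdec, source, n_steps, cardinality):
    # encode the source once; the result is the [h, c] state pair
    state = list(infenc.predict(source))
    # start-of-sequence input: one-hot vector for the reserved value 0
    target_seq = zeros((1, 1, cardinality))
    target_seq[0, 0, 0] = 1.0
    output = []
    for _ in range(n_steps):
        # predict the next element and the updated states
        yhat, h, c = infdec.predict([target_seq] + state)
        output.append(yhat[0, 0, :])
        # carry the states and the prediction into the next step
        state = [h, c]
        target_seq = yhat
    return array(output)
```

Each row of the returned array is a probability vector over the output cardinality; taking the argmax of each row recovers the integer sequence.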

Scalable Sequence-to-Sequence Problem

In this section, we define a contrived and scalable sequence-to-sequence prediction problem.

The source sequence is a series of randomly generated integer values, such as [20, 36, 40, 10, 34, 28], and the target sequence is a reversed pre-defined subset of the input sequence, such as the first 3 elements in reverse order [40, 36, 20].

The length of the source sequence is configurable; so is the cardinality of the input and output sequence and the length of the target sequence.

We will use source sequences of 6 elements, a cardinality of 50, and target sequences of 3 elements.

Below are some more examples to make this concrete.

You are encouraged to explore larger and more complex variations. Post your findings in the comments below.

Let’s start off by defining a function to generate a sequence of random integers.

We will use the value 0 as the padding or start-of-sequence character, so it is reserved and cannot appear in our source sequences. To achieve this, we will add 1 to our configured cardinality to ensure the one-hot encoding is large enough (e.g. a value of 1 maps to a ‘1’ value in index 1).

For example, a configured cardinality of 50 gives one-hot vectors of length 50 + 1 = 51.

We can use the Python randint() function to generate random integers in a range between 1 and the problem’s cardinality minus 1. The generate_sequence() function below generates a sequence of random integers.
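A minimal sketch of generate_sequence(), assuming randint() from the standard library's random module, whose bounds are both inclusive. The variable n_features holds the cardinality plus the reserved 0 value.

```python
from random import randint

# generate a sequence of random integers
def generate_sequence(length, n_unique):
    # values fall in [1, n_unique - 1]; 0 is reserved for padding / start of sequence
    return [randint(1, n_unique - 1) for _ in range(length)]

n_features = 50 + 1
sequence = generate_sequence(6, n_features)
print(sequence)
```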

Next, we need to create the corresponding output sequence given the source sequence.

To keep things simple, we will select the first n elements of the source sequence as the target sequence and reverse them.

We also need a version of the output sequence shifted forward by one time step that we can use as the mock target generated so far, including the start of sequence value in the first time step. We can create this from the target sequence directly.
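Both sequences can be derived from the source with a few list operations. The sketch below uses the example source sequence from earlier in this section; the variable names are illustrative.

```python
source = [20, 36, 40, 10, 34, 28]
n_out = 3

# target: the first n_out elements of the source, reversed
target = source[:n_out]
target.reverse()

# decoder input: the target shifted forward one step, starting with the reserved 0
target_in = [0] + target[:-1]

print(target)     # [40, 36, 20]
print(target_in)  # [0, 40, 36]
```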

Now that all of the sequences have been defined, we can one-hot encode them, i.e. transform them into sequences of binary vectors. We can use the Keras built-in to_categorical() function to achieve this.

We can put all of this into a function named get_dataset() that will generate a specific number of sequences that we can use to train a model.
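A sketch of get_dataset() that combines the steps above. To keep it free of a Keras dependency, it swaps in a small NumPy helper, one_hot_encode(), in place of to_categorical(); that helper is an assumption of this sketch, not part of Keras.

```python
from random import randint
import numpy as np

def generate_sequence(length, n_unique):
    return [randint(1, n_unique - 1) for _ in range(length)]

def one_hot_encode(sequence, cardinality):
    # NumPy stand-in for to_categorical(): row i of the identity
    # matrix is the one-hot vector for the integer i
    return np.eye(cardinality)[sequence]

# prepare n_samples source, shifted-target and target sequences
def get_dataset(n_in, n_out, cardinality, n_samples):
    X1, X2, y = [], [], []
    for _ in range(n_samples):
        source = generate_sequence(n_in, cardinality)
        target = source[:n_out]
        target.reverse()
        target_in = [0] + target[:-1]
        X1.append(one_hot_encode(source, cardinality))
        X2.append(one_hot_encode(target_in, cardinality))
        y.append(one_hot_encode(target, cardinality))
    return np.array(X1), np.array(X2), np.array(y)

X1, X2, y = get_dataset(6, 3, 51, 5)
print(X1.shape, X2.shape, y.shape)  # (5, 6, 51) (5, 3, 51) (5, 3, 51)
```

Encoding each sequence to a 2D array before stacking gives the 3D shape [samples, time steps, features] directly, with no extra axis to squeeze out.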

Finally, we need to be able to decode a one-hot encoded sequence to make it readable again.

This is needed both for printing the generated target sequences and for easily comparing whether the full predicted target sequence matches the expected target sequence. The one_hot_decode() function will decode an encoded sequence.
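Decoding amounts to taking the index of the largest value in each vector. A minimal sketch using NumPy's argmax(), with an identity-matrix lookup standing in for an encoded sequence:

```python
import numpy as np

# decode a one-hot encoded sequence back to integers
def one_hot_decode(encoded_seq):
    return [int(np.argmax(vector)) for vector in encoded_seq]

encoded = np.eye(51)[[40, 36, 20]]  # one-hot rows for 40, 36, 20
print(one_hot_decode(encoded))      # [40, 36, 20]
```

The same function works on predicted probability vectors, since argmax picks the most likely class in each time step.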

We can tie all of this together and test these functions.

A complete worked example is listed below.

Running the example first prints the shape of the generated dataset, ensuring the 3D shape required to train the model matches our expectations.

The generated sequence is then decoded and printed to screen demonstrating both that the preparation of source and target sequences matches our intention and that the decode operation is working.

We are now ready to develop a model for this sequence-to-sequence prediction problem.

Encoder-Decoder LSTM for Sequence Prediction

In this section, we will apply the encoder-decoder LSTM model developed in the first section to the sequence-to-sequence prediction problem developed in the second section.

The first step is to configure the problem.

Next, we must define the models and compile the training model.

Next, we can generate a training dataset of 100,000 examples and train the model.

Once the model is trained, we can evaluate it. We will do this by making predictions for 100 source sequences and counting the number of target sequences that were predicted correctly. We will use the numpy array_equal() function on the decoded sequences to check for equality.
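The counting logic can be sketched independently of the model. Here predict_fn is a hypothetical callable that takes one encoded source sequence with a leading batch axis and returns the encoded predicted target; in practice it would wrap predict_sequence() with the trained inference models.

```python
import numpy as np

def one_hot_decode(encoded_seq):
    return [int(np.argmax(vector)) for vector in encoded_seq]

# percentage of sequences predicted exactly right
def sequence_accuracy(predict_fn, X1, y):
    correct = 0
    for source, expected in zip(X1, y):
        # add the batch axis expected by the inference encoder
        predicted = predict_fn(source[np.newaxis, ...])
        if np.array_equal(one_hot_decode(expected), one_hot_decode(predicted)):
            correct += 1
    return 100.0 * correct / len(X1)
```

A sequence counts as correct only if every element matches, which is a stricter measure than the per-time-step accuracy reported during training.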

Finally, we will generate some predictions and print the decoded source, target, and predicted target sequences to get an idea of whether the model is working as expected.

Putting all of these elements together, the complete code example is listed below.

Running the example first prints the shape of the prepared dataset.

Next, the model is fit. You should see a progress bar and the run should take less than one minute on a modern multi-core CPU.

Next, the model is evaluated and the accuracy printed. We can see that the model achieves 100% accuracy on new randomly generated examples.

Finally, 10 new examples are generated and target sequences are predicted. Again, we can see that the model correctly predicts the output sequence in each case and the expected value matches the reversed first 3 elements of the source sequences.

You now have a template for an encoder-decoder LSTM model that you can apply to your own sequence-to-sequence prediction problems.


Summary

In this tutorial, you discovered how to develop an encoder-decoder recurrent neural network for sequence-to-sequence prediction problems with Keras.

Specifically, you learned:

  • How to correctly define a sophisticated encoder-decoder model in Keras for sequence-to-sequence prediction.
  • How to define a contrived yet scalable sequence-to-sequence prediction problem that you can use to evaluate the encoder-decoder LSTM model.
  • How to apply the encoder-decoder LSTM model in Keras to address the scalable integer sequence-to-sequence prediction problem.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


128 Responses to How to Develop an Encoder-Decoder Model for Sequence-to-Sequence Prediction in Keras

  1. Alex November 2, 2017 at 7:06 pm #

    Is this model suited for sequence regression too? For example the shampoo sales problem

  2. Teimour November 2, 2017 at 9:55 pm #

    Hi. is it possible to have multi layers of LSTM in encoder and decoder in this code? thank you for your great blog

    • Jason Brownlee November 3, 2017 at 5:16 am #

      Yes, but I don’t have an example. For this specific case it would require some careful re-design.

  3. Kyu November 3, 2017 at 12:06 am #

    How can I extract the bottleneck layer to extract the important features with sequence data?

    • Jason Brownlee November 3, 2017 at 5:18 am #

      You could access the returned states to get the context vector, but it does not help you understand which input features are relevant/important.

  4. Thabet November 3, 2017 at 4:09 am #

    Thank you Jason!

  5. Harry Garrison November 18, 2017 at 3:30 am #

    Thanks for the wonderful tutorial, Jason!
    I am facing an issue, though: I tried to execute your code as is (copy-pasted it), but it throws an error:

    Using TensorFlow backend.
    Traceback (most recent call last):
    File “C:\Users\User\Documents\pystuff\”, line 91, in
    train, infenc, infdec = define_models(n_features, n_features, 128)
    File “C:\Users\User\Documents\pystuff\”, line 40, in define_models
    encoder = LSTM(n_units, return_state=True)
    File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\legacy\”, line 88, in wrapper
    return func(*args, **kwargs)
    File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\layers\”, line 949, in __init__
    super(LSTM, self).__init__(**kwargs)
    File C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\layers\”, line 191, in __init__
    super(Recurrent, self).__init__(**kwargs)
    File “C:\Users\User\Anaconda3\envs\py35\lib\site-packages\keras\engine\”, line 281, in __init__
    raise TypeError(‘Keyword argument not understood:’, kwarg)
    TypeError: (‘Keyword argument not understood:’, ‘return_state’)

    I am using an anaconda environment (python 3.5.3). What could have possibly gone wrong?

    • Jason Brownlee November 18, 2017 at 10:23 am #

      Perhaps confirm that you have the most recent version of Keras and TensorFlow installed.

      • Carolyn December 8, 2017 at 12:34 pm #

        I had the same problem, and updated Keras (to version 2.1.2) and TensorFlow (to version 1.4.0). The problem above was solved. However, I now see that the shapes of X1, X2, and y are ((100000, 1, 6, 51), (100000, 1, 3, 51), (100000, 1, 3, 51)) instead of ((100000, 6, 51), (100000, 3, 51), (100000, 3, 51)). Why could this be?

        • Jason Brownlee December 8, 2017 at 2:29 pm #

          I’m not sure, perhaps related to recent API changes in Keras?

          • Carolyn December 9, 2017 at 1:36 am #

            Here’s how I fixed the problem.

            At the top of the code, include this line (before any ‘from numpy import’ statements):

            import numpy as np

            Change the get_dataset() function to the following:

            Notice instead of returning array(X1), array(X2), array(y), we now return arrays that have been squeezed – one axis has been removed. We remove axis 1 because it’s the wrong shape for what we need.

            The output is now as it should be (although I’m getting 98% accuracy instead of 100%).

          • Jason Brownlee December 9, 2017 at 5:42 am #

            Thanks for sharing.

            Perhaps confirm that you have updated Keras to 2.1.2, it fixes bugs with to_categorical()?

  6. Thabet November 26, 2017 at 8:47 am #

    Hi Jason!
    Are the encoder-decoder networks suitable for time series classification?

  7. Python November 30, 2017 at 11:40 pm #

    Hi Jason,
    running the get_dataset the function returns one additional row in the Arrays X1, X2, y:
    (1,1,6,51) (1,1,3,51)(1,1,3,51).
    This results from numpy.array(); before that, X1, X2, y are lists with the correct sizes:
    (1,6,51) (1,3,51)(1,3,51).
    It seems the command array() adds an additional dimension. Could you help on solving this problem?

  8. nandu December 1, 2017 at 6:42 pm #

    how to train the keras ,it has to identify capital letter and small letter has to same

    Please suggest any tactics for it.

  9. Pritish Yuvraj December 7, 2017 at 4:34 am #

    Could you please give me a tutorial for a case where we are input to the seq2seq model is word embeddings and outputs is also word embeddings. I find it frustrating to see a Dense layer at the end. this is what is stopping a fully seq2seq model with input as well as output word embeddings.

    • Jason Brownlee December 7, 2017 at 8:08 am #

      Why would we have a word embedding at the output layer?

      • Uthman Apatira March 2, 2018 at 6:59 pm #

        If embeddings are coming in, we’d want to have embeddings going out (auto-encoder). Imagine input = category (cat, dog, frog, wale, human). These animals are quite different, so we represent w/ embedding. Rather than having a dense output of 5 OHE, if an embedding is used for output, the assumption is that, especially if weights are shared between inputs + outputs, we could train the net with and give it more detail about what exactly each class is… rather than use something like cross entropy.

        • Jason Brownlee March 3, 2018 at 8:08 am #


          • Joe May 25, 2018 at 9:12 am #

            Is there such a tutorial yet? Sounds interesting.

          • Jason Brownlee May 25, 2018 at 9:36 am #

            I’ve not written one, I still don’t get/see benefit in the approach. Happy to be proven wrong.

  10. Dinter December 13, 2017 at 12:00 am #

    Hi ! When adding dropout and recurrent_dropout as LSTM arguments on line 40 of the last complete code example with everything else being the same, the code went wrong. So how can I add dropout in this case? Thanks!

    • Jason Brownlee December 13, 2017 at 5:38 am #

      I give a worked example of dropout with LSTMs here:

      • Dipesh January 11, 2018 at 6:20 am #

        Hi Jason! Thanks for your great effort to put encoder decoder implementations here. As Dinter mentioned, when dropout is added, the code runs well for training phase but gives following error during prediction.

        InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor ‘lstm_1/keras_learning_phase’ with dtype bool
        [[Node: lstm_1/keras_learning_phase = Placeholder[dtype=DT_BOOL, shape=, _device=”/job:localhost/replica:0/task:0/cpu:0″]()]]

        How to fix the problem in this particular implementation?

        [Note: Your worked example of dropout however worked for me, but the difference is you are adding layer by layer in sequential model there which is different than this example of encoder decoder]

        • Jason Brownlee January 12, 2018 at 5:45 am #

          Sorry to hear that, perhaps it’s a bug? See if you can reproduce the fault on a small standalone network?

          • Dipesh Gautam January 17, 2018 at 6:57 am #

            When I tried with exactly the same network in the example you presented and added dropout=0.0 in line 10 and line 15 of define_models function, the program runs but for other values of dropout it gives error. Also changing the size of network, for example, number of units to 5, 10, 20 does give the same error.

            Any idea to add dropout?

            line 5: encoder = LSTM(n_units, return_state=True,dropout=0.0)

            line 10: decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True,dropout=0.0)

  11. Huzefa Calcuttawala January 23, 2018 at 8:40 pm #

    Hi Jason,
    what is the difference between specifying the model input:
    Model( [decoder_inputs] + decoder_states_inputs,….)
    and this

    Does the 1st version add the elements of decoder_states_inputs array to corresponding elements of decoder_inputs

  12. Alfredo January 24, 2018 at 10:26 pm #

    Hi Jason, Thanks for sharing this tutorial. I am only confused when you are defining the model. This is the line:

    train, infenc, infdec = define_models(n_features, n_features, 128)

    It is a silly question but why n_features in this case is used for the n_input and for the n_output instead of n_input equal to 6 and n_output equal to 3 ?

    I look forward to hearing from you soon.


    • Jason Brownlee January 25, 2018 at 5:56 am #

      Good question, because the model only does one time step per call, so we walk the input/output time steps manually.

  13. Dat February 22, 2018 at 8:15 pm #

    I would like to see the hidden states vector. Because there are 96 training samples, there would be 96 of these (each as a vector of length 4).

    I added the “return_sequences=True” in the LSTM

    model = Sequential()
    model.add( LSTM(4, input_shape=(1, look_back), return_sequences=True ) )
    model.compile(loss=’mean_squared_error’, optimizer=’adam’), trainY, epochs=20, batch_size=20, verbose=2)

    But, I get the error

    Traceback (most recent call last):
    File “”, line 1, in
    File “E:\ProgramData\Anaconda3\lib\site-packages\keras\”, line 965, in fit
    File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\”, line 1593, in fit
    File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\”, line 1430, in _standardize_user_data
    File “E:\ProgramData\Anaconda3\lib\site-packages\keras\engine\”, line 110, in _standardize_input_data
    ‘with shape ‘ + str(data_shape))
    ValueError: Error when checking target: expected dense_24 to have 3 dimensions, but got array with shape (94, 1)

    How can I make this model work, and also how can I view the hidden states for each of the input samples (should be 96 hidden states).

    Thank you.

  14. D. Liebman March 3, 2018 at 12:24 am #

    Hi Jason,
    I like your blog posts. I have some code based on this post, but I get this error message all the time. There’s something wrong with my dense layer. Can you point it out? Here units is 300 and tokens_per_sentence is 25.

    an error message:

    ValueError: Error when checking target: expected dense_layer_b to have 2 dimensions, but got array with shape (1, 25, 300)

    this is some code:

    • Jason Brownlee March 3, 2018 at 8:12 am #

      Perhaps the data does not match the model, you could change one or the other.

      I’m eager to help, but I cannot debug the code for you sorry.

      • D. Liebman March 4, 2018 at 6:30 am #

        hi. so I think I needed to set ‘return_sequences’ to True for both lstm_a and lstm_b.

  15. Jia Yee March 16, 2018 at 2:24 pm #

    Dear Jason,

    Do you think that this algorithm works for weather prediction? For example, by having the input integer variables as dew point, humidity, and temperature, to predict rainfall as output

    • Jason Brownlee March 16, 2018 at 2:26 pm #

      Only for demonstration purposes.

      In practice, weather forecasts are performed using simulations of physics models and are more accurate than small machine learning models.

  16. Lukas March 20, 2018 at 10:10 pm #

    Hi Jason.

    I would like to ask you, how could I use this model with float numbers? Your training data looks like this:

    I would need something like this:
    [0.12354,0.9854,5875, 0.0659]
    [0.12354,0.9854,5875, 0.0659]
    [0.12354,0.9854,5875, 0.0659]
    [0.12354,0.9854,5875, 0.0659]

    When I run your model with float numbers, the network doesn’t learn. Should I use a different loss function?

    Thank you

    • Jason Brownlee March 21, 2018 at 6:33 am #

      The loss function is related to the output, if you have a real-valued output, consider mse or mae loss functions.

  17. Jugal March 31, 2018 at 4:03 pm #

    How to add bidirectional layer in encoder decoder architecture?

  18. Luke April 3, 2018 at 9:51 pm #

    I would like to use your model with word embedding. I was inspired by ->features -> word-level model

    I decoded words in sentences as integer numbers. My input is list of sentences with 15 words. [[1,5,7,5,6,4,5, 10,15,12,11,10,8,1,2], […], [….], …]

    My model seems:

    encoder_inputs = Input(shape=(None,))
    x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
    x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
    encoder_states = [state_h, state_c]

    # Set up the decoder, using encoder_states as initial state.
    decoder_inputs = Input(shape=(None,))
    x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
    x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
    decoder_outputs = Dense(num_decoder_tokens, activation=’softmax’)(x)

    # Define the model that will turn
    # encoder_input_data & decoder_input_data into decoder_target_data
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    When I try to train the model, I get this error: expected dense_1 to have 3 dimensions, but got array with shape (10, 15). Could you help me with that please?

    Thank you

    • Fatemeh August 8, 2018 at 4:52 am #

      I also have the same issue. could you solve your problem?

  19. Sunil April 4, 2018 at 6:44 am #

    Hi ,

    I am trying to do seq2seq problem using Keras – LSTM. Predicted output words matches with most frequent words of the vocabulary built using the dataset. Not sure what could be the reason. After training is completed while trying to predict the output for the given input sequence model is predicting same output irrespective of the input seq. Also using separate embedding layer for both encoder and decoder.

    Can you help me what could be the reason ?

    Question elaborated here:


  20. Phetsa Ndlangamandla April 9, 2018 at 9:45 pm #

    Hi Jason,

    In your current setup, how would you add a pre-trained embedding matrix, like glove?

  21. Lukas April 11, 2018 at 2:11 am #


    Could you help me with my problem? I think that a lot of required logic is implemented in your code but I am very beginner in python. I want to predict numbers according to input test sequence (seq2seq) so output of my decoder should by sequence of 6 numbers (1-7). U can imagine it as lottery prediction. I have very long input vector of numbers 1-7 (contains sequenses of 6 numbers) so I don’t need to generate test data. I just need to predict next 6 numbers which should be generated. 1,1,2,2,3 -> 3,4,4,5,5,6

    Thank you

  22. Jameshwart Lopez April 17, 2018 at 11:41 am #

    Thank you Jason for this tutorial it gets me started. I tried to use larger sequence where input is 400 series of numbers and output is 40 numbers. I have problem with categorical because of index error and i don’t know how to set or get the the value for cardinality/n_features. Can you give me an idea on this?

    Im also not clear if this is a classification type of model. Can you please confirm. Thanks

  23. Kadirou April 17, 2018 at 8:15 pm #

    Thank you for this tutorial. I receive this error when I run your example:
    ValueError: Error when checking input: expected input_1 to have 3 dimensions, but got array with shape (100000, 1, 6, 51)

    • Claudiu April 23, 2018 at 10:21 pm #

      Hi! I have the same problem.. Did you solve it ? Thanks!

      • Gary April 30, 2018 at 1:18 am #

        change this:

        src_encoded = to_categorical([source], num_classes=cardinality)[0]
        tar_encoded = to_categorical([target], num_classes=cardinality)[0]
        tar2_encoded = to_categorical([target_in], num_classes=cardinality)[0]

  24. Jonathan K April 25, 2018 at 3:32 am #

    Hi Jason, thank you very much for the tutorial. Is it possible to decode many sequences simultaneously instead of decoding a single sequence one character at a time? The inference for my project takes too long and I thought that doing it in larger batches may help.

    • Jason Brownlee April 25, 2018 at 6:37 am #

      You could use copies of the model to make prediction in parallel for different inputs.

  25. Jameshwart Lopez April 26, 2018 at 4:08 pm #

    Hi im just wondering why do we need to use to_categorical if the sequence is already numbers. On my case i have a series of input features(numbers) and another series of output features(number). Should i still use to_categorical method?

  26. chunky April 28, 2018 at 10:16 pm #

    Hi Jason,

    I am working on word boundary detection problem where dataset containing .wav files in which a sentence is spoken are given, and corresponding to each .wav file a .wrd file is also given which contains the words spoken in a sentence and also its boundaries (starting and end boundaries).
    Our task is to identify word boundaries in test .wav file (words spoken will also be given).
    I want to do this with sequential models .

    What I have tried is:

    I have read .wav files using librosa module in numpy array (made it equal to max size using padding)

    Its output is like 3333302222213333022221333302222133333 (for i/p ex: I am hero)
    where (0:start of word, 1:end of word, 2:middle, 3:space)

    means I want to solve this as supervised learning problem, can I train such model with RNN?

    • Jason Brownlee April 29, 2018 at 6:27 am #

      Sounds like a great project.

      I don’t have examples of working with audio data sorry, I cannot give you good off the cuff advice.

      Perhaps find some code bases on related modeling problems and use them for inspiration?

  27. Lukas April 30, 2018 at 1:03 am #

    I’ve tried this example for word-level chatbot. Everything works great, on a small data (5000 sentences)

    When I use dataset of 50 000 sentences something is wrong. Model accuracy is 95% but when i try to chat with this chatbot, responses are generated randomly. Chatbot is capable of learning sentences from dataset, but it use them randomly to respond to users questions.

    How is this possible when accuracy is so high?

    • Jason Brownlee April 30, 2018 at 5:37 am #

      Perhaps it memorized the training data? E.g. overfitting?

  28. ricky May 3, 2018 at 12:06 am #

    Sir, how to create confusion matrix, evaluated and the accuracy printed for this model :

    # Define an input sequence and process it.
    encoder_inputs = Input(shape=(None, num_encoder_tokens))
    encoder = LSTM(latent_dim, return_state=True)
    encoder_outputs, state_h, state_c = encoder(encoder_inputs)
    # We discard encoder_outputs and only keep the states.
    encoder_states = [state_h, state_c]

    # Set up the decoder, using encoder_states as initial state.
    decoder_inputs = Input(shape=(None, num_decoder_tokens))
    # We set up our decoder to return full output sequences,
    # and to return internal states as well. We don’t use the
    # return states in the training model, but we will use them in inference.
    decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
    decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
    decoder_dense = Dense(num_decoder_tokens, activation=’softmax’)
    decoder_outputs = decoder_dense(decoder_outputs)

    # Define the model that will turn
    # encoder_input_data & decoder_input_data into decoder_target_data
    model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

    model.compile(optimizer=’rmsprop’, loss=’categorical_crossentropy’, metrics=[‘accuracy’])[encoder_input_data, decoder_input_data], decoder_target_data,


  29. Chandra Sutrisno May 8, 2018 at 7:54 pm #

    Hi Jason,

    Thank you for this awesome tutorial, so useful. I have one simple question. Is there any specific
    reason why use 50+1 as n_features?

    Please advise

    • Jason Brownlee May 9, 2018 at 6:19 am #

      The +1 is to leave room for the “0” value, for “no data”.

  30. Kunwar May 15, 2018 at 11:28 pm #

    Hi Jason,

    Thanks for all the wonderful tutorials, great work.

    I have two questions.

    1 – In time series prediction, does multivariate data give better results than univariate?
    E.g., for the “Beijing PM2.5 Data Set” we have multivariate data. Will the multivariate data give better results, or will taking only the pollution series as univariate data give better results?

    2 – Which is better for time series prediction: an encoder-decoder or a normal RNN?

    • Jason Brownlee May 16, 2018 at 6:04 am #

      For both questions, it depends on the specific problem.

  31. matt May 20, 2018 at 3:44 am #

    Hi Jason, here it looks like there is one time step in the model. What do I have to change to add more time steps to the model?

    # returns train, inference_encoder and inference_decoder models
    def define_models(n_input, n_output, n_units):
        # define training encoder
        encoder_inputs = Input(shape=(None, n_input))
        encoder = LSTM(n_units, return_state=True)
        encoder_outputs, state_h, state_c = encoder(encoder_inputs)
        encoder_states = [state_h, state_c]
        # define training decoder
        decoder_inputs = Input(shape=(None, n_output))
        decoder_lstm = LSTM(n_units, return_sequences=True, return_state=True)
        decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
        decoder_dense = Dense(n_output, activation='softmax')
        decoder_outputs = decoder_dense(decoder_outputs)
        model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
        # define inference encoder
        encoder_model = Model(encoder_inputs, encoder_states)
        # define inference decoder
        decoder_state_input_h = Input(shape=(n_units,))
        decoder_state_input_c = Input(shape=(n_units,))
        decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
        decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
        decoder_states = [state_h, state_c]
        decoder_outputs = decoder_dense(decoder_outputs)
        decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
        # return all models
        return model, encoder_model, decoder_model

  32. Skye May 21, 2018 at 6:09 pm #

    Hi Jason,

    Thank you very much! I would like to ask a question which seems silly… When training the model, we store the weights in the training model. Are the inference encoder and inference decoder empty? At the prediction stage, the training model is not used, so how are the training model weights used to predict?

    Looking forward to your reply. Thank you!

    • Jason Brownlee May 22, 2018 at 6:24 am #

      No silly questions here!

      The state of the model is reset at the end of each batch.

  33. Skye May 24, 2018 at 10:36 am #

    Get it. Thank you a lot!

  34. Joe May 25, 2018 at 9:28 am #

    Can the model described in this blog post be used when we have variable-length input? If so, how?

    • Joe May 25, 2018 at 9:29 am #

      And also variable length output.

    • Jason Brownlee May 25, 2018 at 9:37 am #

      Yes, it processes time steps one at a time, therefore supports variable length by default.

  35. michel June 9, 2018 at 3:32 am #

    Hi Jason, how do I deal with and implement this model if I have time series data with 3D shape (samples, timesteps, features) and cannot or do not want to one hot encode them? Thank you in advance.

  36. mlguy June 17, 2018 at 2:41 pm #

    Hi Jason,
    Thank you for sharing your code. I used this for my own problem and it works, but I still get good prediction results on values that are far away from the ones used in training. What could my problem be? I used mse as the loss for regression; do I need a different loss function?

  37. patrick June 18, 2018 at 8:27 am #

    Hi Jason, great contribution! When I use time series data for this model, can I also skip the shifted case in the target data for y, so: [input, output], output), with output=input.reversed()? Would this make sense as well? I want to use sliding windows for input and output, and then shifting the output by one would not make sense in my eyes.

  38. simon June 18, 2018 at 9:56 am #

    Let’s say I trained with shape (1000,20,1) but I want to predict with (20000,20,1). Then this would not work, because the sample size is bigger. How do I have to adjust this?

    for t in range(2):
        output_tokens, h, c = decoder_model.predict([target_seqs] + states_values)
        states_values = [h,c]
        target_seq = output_tokens

    • Jason Brownlee June 18, 2018 at 3:06 pm #

      Why would it not work?

      • simon June 19, 2018 at 3:32 am #

        It says the following: index 1000 is out of bounds for axis 0 with size 1000,
        so the axis of the second one needs to be the same length. But I will not train with the same length of data that I am predicting. How can this be solved?

        • Jason Brownlee June 19, 2018 at 6:37 am #

          Perhaps double check your data.

          • simon June 24, 2018 at 6:54 pm #

            Still not working, Jason. Unfortunately, it seems like variable sequence length is not possible? I trained my model with part of the whole data and wanted to use the whole data at inference.

          • simon June 24, 2018 at 7:20 pm #

            I correct myself: do I have to pad my data with zeros at the beginning of my training data, since I want to use the last state of the encoder as the initial state?

  39. Sarah June 20, 2018 at 1:19 am #

    I have 2 questions:
    1- This example is using teacher forcing, right? I’m trying to re-implement it without teacher forcing but it fails. When I define the train model as (line 49 of your example): model = Model(encoder_inputs, decoder_outputs), it gives me this error: RuntimeError: Graph disconnected: cannot obtain value for tensor Tensor(“decoder_inputs:0”, shape=(?, n_units, n_output_features), dtype=float32) at layer “decoder_inputs”. The following previous layers were accessed without issue: [‘encoder_inputs’, ‘encoder_LSTM’]

    2- Can you please explain a bit more about how you defined the decoder model? I don’t get what is happening in the part between lines 52 and 60 of the code (copied below). Why do you need to re-define decoder_outputs? How is the model defined?

    # define inference decoder
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
    decoder_outputs, state_h, state_c = decoder_lstm(decoder_inputs, initial_state=decoder_states_inputs)
    decoder_states = [state_h, state_c]
    decoder_outputs = decoder_dense(decoder_outputs)
    decoder_model = Model([decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states)
    # return all models

    • Jason Brownlee June 20, 2018 at 6:30 am #

      Correct. If performance is poor without teacher forcing, then don’t train without it. I’m not sure why you would want to remove it?!?

      The decoder is a new model that uses parts of the encoder. Does that help?

  40. dave June 29, 2018 at 5:38 am #

    Hi Jason, I looked at different pages and found that seq2seq is not possible with variable length without manipulating the data with zeros etc. So in your example, is it right that at inference you can only use data with shape (100000,timesteps,features), so without variable length? If not, how can you change that? Thanks a lot for responding!

    • Jason Brownlee June 29, 2018 at 6:16 am #

      Generally, variable length inputs can be truncated or padded and a masking layer can be used to ignore the padding.

      Alternately, the above model can be used time-step wise, allowing truly multivariate inputs.
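
The truncate-or-pad idea can be sketched in plain Python. This is a minimal illustration under assumptions: the helper names pad_or_truncate and padding_mask are hypothetical, 0 is assumed to be reserved as the padding value, and padding is added at the front so the real data sits next to the final encoder state. Keras offers pad_sequences and a Masking layer for the same purpose.

```python
def pad_or_truncate(sequences, max_len, pad_value=0):
    """Return copies of the sequences that all have exactly max_len steps."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    fixed = []
    for seq in sequences:
        kept = list(seq)[-max_len:]                 # keep the last max_len steps
        padding = [pad_value] * (max_len - len(kept))
        fixed.append(padding + kept)                # pad at the front
    return fixed


def padding_mask(padded, pad_value=0):
    """True where a time step holds real data, False where it is padding."""
    return [[value != pad_value for value in seq] for seq in padded]


batch = [[3, 7], [5, 1, 9, 4, 2], [8]]
padded = pad_or_truncate(batch, max_len=4)
mask = padding_mask(padded)

print(padded)
print(mask)
```

The mask plays the role that a masking layer plays inside a model: it marks which time steps should be ignored.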

      • dave June 29, 2018 at 4:11 pm #

        Nice explanation. A last question: my time steps would be the same in all cases, just the sample size would be different. The Keras sequence padding only shows padding for time steps, or am I wrong? In that case I would need to pad my samples.

        • Jason Brownlee June 30, 2018 at 6:03 am #

          Good question. Only the time steps and feature dimensions are fixed, the number of samples can vary.

  41. DrCam June 29, 2018 at 8:09 pm #


    These are really awesome little tutorials and you do a great job explaining them.

    Just as an FYI, to help out other readers, I had to use tensorflow v1.7 to get this to run with Keras 2.2.0. With older versions I got a couple of errors while initializing the models and with the newer v1.8 I got some session errors with TF.

  42. Yasir Hussain July 16, 2018 at 5:53 pm #

    Hello Jason, your work has always been awesome. I have learned a lot from your tutorials.
    I am trying a simple encoder-decoder model following your example but am facing a problem.

    My issue is the vocabulary size, which is around 5600. This makes my one hot encoding really big and freezes my PC. Can you give a simple example in which I don’t need to one hot encode my data? I know about sparse_categorical_crossentropy but I am having trouble implementing it. Maybe my approach is wrong. Your guidance can help me a lot.

    Thanks once again for such great tutorials…..

    • Jason Brownlee July 17, 2018 at 6:12 am #

      Perhaps try a word embedding instead?

      Or, perhaps try training on an EC2 instance?
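
A rough back-of-the-envelope sketch of why a large vocabulary makes one hot encoding so expensive, and why integer-coded inputs (as consumed by an embedding layer) or integer targets (as used by sparse_categorical_crossentropy) are much smaller. The vocabulary size of 5600 comes from the comment above; the sample count, sequence length, and 4 bytes per value are hypothetical assumptions.

```python
def one_hot_bytes(n_samples, n_steps, vocab_size, bytes_per_value=4):
    """Memory for a dense one-hot array of shape (n_samples, n_steps, vocab_size)."""
    return n_samples * n_steps * vocab_size * bytes_per_value


def integer_bytes(n_samples, n_steps, bytes_per_value=4):
    """Memory for an integer-coded array of shape (n_samples, n_steps)."""
    return n_samples * n_steps * bytes_per_value


vocab_size = 5600   # from the comment above
n_samples = 10000   # hypothetical dataset size
n_steps = 20        # hypothetical sequence length

dense = one_hot_bytes(n_samples, n_steps, vocab_size)
sparse = integer_bytes(n_samples, n_steps)

print(dense / 1e9, "GB one-hot")
print(sparse / 1e6, "MB integer-coded")
print(dense // sparse, "times larger")
```

Under these assumptions the one-hot array is larger by a factor equal to the vocabulary size, which is why integer coding usually helps.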

  43. George Kibirige July 24, 2018 at 1:32 am #

    Hi Jason
    Why did I get this shape ((100000, 1, 6, 51), (100000, 1, 3, 51), (100000, 1, 3, 51)) when I ran your code? Do I need to reshape, or did I not copy it correctly?

    Because it shows you get ((100000,6, 51), (100000,3, 51), (100000,3, 51))

    • Jason Brownlee July 24, 2018 at 6:21 am #

      Is your version of Keras up to date?

      Did you copy all of the code exactly?

      • George July 24, 2018 at 4:49 pm #

        Hi Jason, the version is different. I added this:
        X1 = np.squeeze(array(X1), axis=1)
        X2 = np.squeeze(array(X2), axis=1)
        y = np.squeeze(array(y), axis=1)

        It’s OK now. Is what you are doing called many-to-one encoding? And is the decoding case called one-to-many?

        Another question: I want to remove the one hot encoding. I want to encode the exact number and predict the exact number.

        I also tried to remove this line: src_encoded = to_categorical([source], num_classes=cardinality)
        but I got a lot of errors.

  44. broley July 27, 2018 at 12:27 am #

    Hi Jason,

    Good explanations, I appreciate your sharing! I wanted to ask about this line in the inference prediction:
    for t in range(n_steps):

    Could you also loop with one more than n_steps? Or why are you looping with n_steps? You would also get a result with for t in range(1), right? I hope you understand what I am trying to find out.

    • Jason Brownlee July 27, 2018 at 5:55 am #

      Sorry, I don’t follow. Perhaps you can provide more details?

  45. broley July 27, 2018 at 6:10 am #

    So in your code below, you are predicting in a for loop with n_steps. What does n_steps need to be? Can you also go with 1? Or do you need n_steps to match the time steps in the data shape (batch,n_steps,features)? When you call infdec.predict, isn’t the model taking the whole target_seq for prediction at once? Why do you need n_steps?

    for t in range(n_steps):
        # predict next char
        yhat, h, c = infdec.predict([target_seq] + state)
        # store prediction
        # update state
        state = [h, c]
        # update target sequence
        target_seq = yhat
    return array(output)

    • Jason Brownlee July 27, 2018 at 11:02 am #

      Sure, you can set it to 1.

      • broley July 28, 2018 at 7:16 pm #

        and what is the benefit of using 1 or a higher iteration number ?
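
A small sketch of what the loop count controls, under assumptions: each pass through the loop yields one predicted time step, so the number of iterations sets the length of the generated sequence. The function toy_decoder_step is a hypothetical stand-in for the call to infdec.predict and is not a trained model.

```python
def decode_sequence(decoder_step, state, start_token, n_steps):
    """Run the decoder one step at a time, feeding each output back in."""
    output = []
    token = start_token
    for _ in range(n_steps):
        token, state = decoder_step(token, state)  # one predicted time step
        output.append(token)
    return output


def toy_decoder_step(token, state):
    """Hypothetical stand-in for infdec.predict: emit the state, then count up."""
    next_token = state
    return next_token, state + 1


one_step = decode_sequence(toy_decoder_step, state=5, start_token=0, n_steps=1)
three_steps = decode_sequence(toy_decoder_step, state=5, start_token=0, n_steps=3)

print(one_step)
print(three_steps)
```

With a loop count of 1 only the first output step is produced, so the count would normally match the number of output time steps you want to predict.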

  46. guofeng July 30, 2018 at 8:41 pm #

    In “def define_models()”, there are three models: model, encoder_model, and decoder_model. What is the function of encoder_model and decoder_model? Can I use only “model” for prediction? Looking forward to your reply.

  47. Rahul Sattar July 31, 2018 at 6:28 pm #

    How do we use Normalized Discounted Cumulative Rate (NDCR) for evaluating the model?
    Is it necessary to use NDCR, or can we live with accuracy as a performance metric?

  48. Fatemeh August 8, 2018 at 7:53 am #

    Hi, thank you for your great description. Why didn’t you use the “rmsprop” optimizer?

  49. Fatemeh August 10, 2018 at 1:02 am #

    I added an embedding matrix to your code and received an error related to matrix dimensions, the same as what Luke mentioned in the previous comments. Do you have any post on adding an embedding matrix to an encoder-decoder?

  50. star August 15, 2018 at 1:57 am #

    Hi Jason,
    I have a low-level question. Why did you train the model for only “one” epoch? Wouldn’t it be better to choose a higher number?

    • Jason Brownlee August 15, 2018 at 6:08 am #

      Because we have a very large number of random samples.

      Perhaps re-read the tutorial?

  51. mohammad H August 15, 2018 at 11:37 pm #

    Thank you, Jason. How can we save the results of our model to use them later on?
