Encoder-Decoder Models for Text Summarization in Keras

Text summarization is the natural language processing problem of creating a short, accurate, and fluent summary of a source document.

The Encoder-Decoder recurrent neural network architecture developed for machine translation has proven effective when applied to the problem of text summarization.

It can be difficult to apply this architecture in the Keras deep learning library, given some of the flexibility sacrificed to make the library clean, simple, and easy to use.

In this tutorial, you will discover how to implement the Encoder-Decoder architecture for text summarization in Keras.

After completing this tutorial, you will know:

  • How text summarization can be addressed using the Encoder-Decoder recurrent neural network architecture.
  • How different encoders and decoders can be implemented for the problem.
  • Three models that you can use to implement the architecture for text summarization in Keras.

Let’s get started.

Encoder-Decoder Models for Text Summarization in Keras
Photo by Diogo Freire, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Encoder-Decoder Architecture
  2. Text Summarization Encoders
  3. Text Summarization Decoders
  4. Reading Source Text
  5. Implementation Models

Encoder-Decoder Architecture

The Encoder-Decoder architecture is a way of organizing recurrent neural networks for sequence prediction problems that have a variable number of inputs, outputs, or both inputs and outputs.

The architecture involves two components: an encoder and a decoder.

  • Encoder: The encoder reads the entire input sequence and encodes it into an internal representation, often a fixed-length vector called the context vector.
  • Decoder: The decoder reads the encoded input sequence from the encoder and generates the output sequence.

Both the encoder and the decoder submodels are trained jointly, meaning at the same time.

This is quite a feat as traditionally, challenging natural language problems required the development of separate models that were later strung into a pipeline, allowing error to accumulate during the sequence generation process.

The entire encoded input is used as context for generating each step in the output. Although this works, the fixed-length encoding of the input limits the length of output sequences that can be generated.

An extension of the Encoder-Decoder architecture is to provide a more expressive form of the encoded input sequence and allow the decoder to learn where to pay attention to the encoded input when generating each step of the output sequence.

This extension of the architecture is called attention.
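As a rough illustration of the idea, the sketch below computes a single attention step with NumPy. It assumes simple dot-product scoring between the decoder state and each encoder state; the scoring function, the array names, and the sizes are illustrative assumptions rather than something the architecture prescribes.

```python
import numpy as np

def attention_context(encoder_states, decoder_state):
    """Return (context, weights) for one decoder step using dot-product attention."""
    # one score per input position: similarity between the decoder state
    # and the encoder state at that position
    scores = encoder_states @ decoder_state          # shape: (input_len,)
    # softmax turns the scores into weights that sum to 1
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # the context is the weighted average of the encoder states
    context = weights @ encoder_states               # shape: (hidden,)
    return context, weights

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 4))   # 6 input words, hidden size 4 (assumed)
decoder_state = rng.normal(size=4)
context, weights = attention_context(encoder_states, decoder_state)
print(weights.round(3), context.shape)
```

The context changes at every output step because the decoder state changes, which is what lets the decoder focus on different parts of the input over time.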

The Encoder-Decoder architecture with attention is popular for a suite of natural language processing problems that generate variable length output sequences, such as text summarization.

The application of the architecture to text summarization is as follows:

  • Encoder: The encoder is responsible for reading the source document and encoding it to an internal representation.
  • Decoder: The decoder is a language model responsible for generating each word in the output summary using the encoded representation of the source document.

Text Summarization Encoders

The encoder is where the complexity of the model resides as it is responsible for capturing the meaning of the source document.

Different types of encoders can be used, although more commonly bidirectional recurrent neural networks, such as LSTMs, are used. In cases where recurrent neural networks are used in the encoder, a word embedding is used to provide a distributed representation of words.

Alexander Rush, et al. use a simple bag-of-words encoder that discards word order, as well as convolutional encoders that explicitly try to capture n-grams.

Our most basic model simply uses the bag-of-words of the input sentence embedded down to size H, while ignoring properties of the original order or relationships between neighboring words. […] To address some of the modelling issues with bag-of-words we also consider using a deep convolutional encoder for the input sentence.

A Neural Attention Model for Abstractive Sentence Summarization, 2015.
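To make the bag-of-words idea concrete, here is a small NumPy sketch that encodes a sentence as the average of its word embeddings. The vocabulary size, the embedding size H, the random embedding matrix, and the helper name bow_encode are all assumptions for illustration; in a real model the embeddings would be learned.

```python
import numpy as np

vocab_size, H = 50, 8                            # assumed sizes
rng = np.random.default_rng(7)
embedding = rng.normal(size=(vocab_size, H))     # one H-dimensional row per word

def bow_encode(word_ids):
    """Encode a sentence as the average of its word embeddings."""
    return embedding[word_ids].mean(axis=0)

a = bow_encode([3, 17, 42, 5])
b = bow_encode([5, 42, 17, 3])                   # same words, different order
print(a.shape, np.allclose(a, b))
```

Because the average does not depend on position, both sentences receive the same encoding, which is exactly the loss of word order described in the quote.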

Konstantin Lopyrev uses a deep stack of 4 LSTM recurrent neural networks as the encoder.

The encoder is fed as input the text of a news article one word of a time. Each word is first passed through an embedding layer that transforms the word into a distributed representation. That distributed representation is then combined using a multi-layer neural network

Generating News Headlines with Recurrent Neural Networks, 2015.

Abigail See, et al. use a single-layer bidirectional LSTM as the encoder.

The tokens of the article w(i) are fed one-by-one into the encoder (a single-layer bidirectional LSTM), producing a sequence of encoder hidden states h(i).

Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Ramesh Nallapati, et al. use bidirectional GRU recurrent neural networks in their encoders and incorporate additional information about each word in the input sequence.

The encoder consists of a bidirectional GRU-RNN…

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.

Text Summarization Decoders

The decoder must generate each word in the output sequence given two sources of information:

  1. Context Vector: The encoded representation of the source document provided by the encoder.
  2. Generated Sequence: The word or sequence of words already generated as a summary.

The context vector may be a fixed-length encoding as in the simple Encoder-Decoder architecture, or may be a more expressive form filtered via an attention mechanism.

The generated sequence is provided with little preparation, such as a distributed representation of each generated word via a word embedding.

On each step t, the decoder (a single-layer unidirectional LSTM) receives the word embedding of the previous word (while training, this is the previous word of the reference summary; at test time it is the previous word emitted by the decoder)

Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Alexander Rush, et al. show this cleanly in a diagram where x is the source document, enc is the encoder providing internal representation of the source document, and yc is the sequence of previously generated words.

Example of inputs to the decoder for text summarization.
Taken from “A Neural Attention Model for Abstractive Sentence Summarization”, 2015.

Generating words one at a time requires that the model be run until some maximum number of summary words are generated or a special end-of-sequence token is reached.

The process must be started by providing the model with a special start-of-sequence token in order to generate the first word.
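The generation loop described above can be sketched in a few lines of Python. The token ids, the maximum length, and the toy_next_word_probs function are hypothetical stand-ins; a trained decoder would take the place of the toy function.

```python
import numpy as np

START, END = 1, 2          # assumed integer ids for the special tokens
MAX_SUMMARY_LEN = 10       # assumed cap on the number of generated words

def toy_next_word_probs(source_ids, summary_ids, vocab_size=6):
    """Stand-in for a trained decoder: returns one probability per vocabulary word."""
    probs = np.full(vocab_size, 0.01)
    # toy rule: emit words 3, 4, 5 in turn, then the end token
    script = [3, 4, 5, END]
    step = len(summary_ids) - 1          # words generated so far (START excluded)
    probs[script[min(step, len(script) - 1)]] = 1.0
    return probs / probs.sum()

def greedy_decode(source_ids, next_word_probs):
    summary = [START]                    # the start token primes the decoder
    for _ in range(MAX_SUMMARY_LEN):
        word = int(np.argmax(next_word_probs(source_ids, summary)))
        if word == END:                  # stop at the end-of-sequence token
            break
        summary.append(word)
    return summary[1:]                   # drop the start token

print(greedy_decode([7, 8, 9], toy_next_word_probs))
```

This sketch always picks the most likely word (greedy search); other search strategies, such as beam search, fit the same loop structure.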

The decoder takes as input the hidden layers generated after feeding in the last word of the input text. First, an end-of-sequence symbol is fed in as input, again using an embedding layer to transform the symbol into a distributed representation. […]. After generating each word that same word is fed in as input when generating the next word.

Generating News Headlines with Recurrent Neural Networks, 2015.

Ramesh Nallapati, et al. generate the output sequence using a GRU recurrent neural network.

… the decoder consists of a uni-directional GRU-RNN with the same hidden-state size as that of the encoder

Reading Source Text

There is flexibility in the application of this architecture depending on the specific text summarization problem being addressed.

Most studies focus on one or just a few source sentences in the encoder, but this does not have to be the case.

For example, the encoder could be configured to read and encode the source document in different sized chunks:

  • Sentence.
  • Paragraph.
  • Page.
  • Document.

Equally, the decoder can be configured to summarize each chunk or aggregate the encoded chunks and output a broader summary.
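As a small sketch of what reading in different sized chunks could look like in practice, the helper below splits raw text at three of the levels listed above. The function name chunk_document and the splitting rules are assumptions: paragraphs are taken to be separated by blank lines and sentences are split with a naive punctuation rule. A page level is left out because plain text carries no page boundaries.

```python
import re

def chunk_document(text, level="sentence"):
    """Split raw text into chunks that an encoder could read one at a time."""
    text = text.strip()
    if level == "document":
        return [text]
    # assumption: paragraphs are separated by one or more blank lines
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    if level == "paragraph":
        return paragraphs
    if level == "sentence":
        # naive rule: a sentence ends at ., ! or ? followed by whitespace
        return [s for p in paragraphs
                for s in re.split(r"(?<=[.!?])\s+", p) if s]
    raise ValueError("unknown level: " + level)

doc = "Keras is simple. It is popular.\n\nSummaries are short."
for level in ("sentence", "paragraph", "document"):
    print(level, len(chunk_document(doc, level)))
```

Each chunk would then be integer encoded and padded before being passed to the encoder.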

Some work has been done along this path, where Ramesh Nallapati, et al. use a hierarchical encoder model with attention at both the word and the sentence level.

This model aims to capture this notion of two levels of importance using two bi-directional RNNs on the source side, one at the word level and the other at the sentence level. The attention mechanism operates at both levels simultaneously

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.

Implementation Models

In this section, we will look at how to implement the Encoder-Decoder architecture for text summarization in the Keras deep learning library.

General Model

A simple realization of the model involves an Encoder with an Embedding input followed by an LSTM hidden layer that produces a fixed-length representation of the source document.

The Decoder reads the representation and an Embedding of the last generated word and uses these inputs to generate each word in the output summary.

General Text Summarization Model in Keras

There is a problem.

Keras does not allow recursive loops where the output of the model is fed as input to the model automatically.

This means the model as described above cannot be directly implemented in Keras (but perhaps could in a more flexible platform like TensorFlow).

Instead, we will look at three variations of the model that we can implement in Keras.

Alternate 1: One-Shot Model

The first alternative model is to generate the entire output sequence in a one-shot manner.

That is, the decoder uses the context vector alone to generate the output sequence.

Alternate 1 – One-Shot Text Summarization Model

Here is some sample code for this approach in Keras using the functional API.
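A minimal sketch of the one-shot model follows. The sizes (vocab_size, src_txt_length, sum_txt_length) and the 128-unit layers are assumed values chosen for illustration, and the layer arrangement is one plausible reading of the description above rather than a definitive implementation.

```python
from keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

vocab_size = 1000        # assumed vocabulary size
src_txt_length = 50      # assumed source document length in words
sum_txt_length = 10      # assumed summary length in words

# encoder: embed the source words and compress them into one context vector
inputs = Input(shape=(src_txt_length,))
encoder1 = Embedding(vocab_size, 128)(inputs)
encoder2 = LSTM(128)(encoder1)
# repeat the context vector once per output time step
encoder3 = RepeatVector(sum_txt_length)(encoder2)

# decoder: one softmax over the vocabulary for each summary position
decoder1 = LSTM(128, return_sequences=True)(encoder3)
outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoder1)

# tie it together
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.output_shape)
```

Note that the output is three-dimensional: samples, summary positions, vocabulary. With the categorical cross entropy loss, the target summaries must therefore be one hot encoded to the same shape; passing a two-dimensional array of padded integer sequences as the target leads to a shape error when fitting.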

This model puts a heavy burden on the decoder.

It is likely that the decoder will not have sufficient context for generating a coherent output sequence as it must choose the words and their order.

Alternate 2: Recursive Model A

A second alternative is to develop a model that generates a single-word forecast and to call it recursively.

That is, the decoder uses the context vector and the distributed representation of all words generated so far as input in order to generate the next word.

A language model can be used to interpret the sequence of words generated so far to provide a second context vector to combine with the representation of the source document in order to generate the next word in the sequence.

The summary is built up by recursively calling the model with the previously generated word appended (or, more specifically, the expected previous word during training).

The context vectors could be concatenated or added together to provide a broader context for the decoder to interpret and output the next word.
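To make the recursive training scheme concrete, the sketch below expands one source and summary pair into one training sample per summary word. The helper name make_training_pairs, the token ids, and the choice to pad the summary prefix on the left with zeros are assumptions for illustration.

```python
import numpy as np

def make_training_pairs(source_ids, summary_ids, max_summary_len):
    """Expand one (source, summary) example into one sample per summary word."""
    X_src, X_sum, y = [], [], []
    for i in range(1, len(summary_ids)):
        prefix = summary_ids[:i]                  # the expected words so far
        # left-pad the prefix with zeros to a fixed length
        padded = [0] * (max_summary_len - len(prefix)) + prefix
        X_src.append(source_ids)
        X_sum.append(padded)
        y.append(summary_ids[i])                  # the next word to predict
    return np.array(X_src), np.array(X_sum), np.array(y)

# 1 = start token, 2 = end token (assumed ids)
X_src, X_sum, y = make_training_pairs([11, 12, 13, 14], [1, 5, 6, 2], max_summary_len=4)
print(X_sum)
print(y)
```

Each row pairs the full source document with a growing prefix of the summary, and the target is the single word that follows that prefix. The integer targets would still need to be one hot encoded if a categorical cross entropy loss is used.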

Alternate 2 – Recursive Text Summarization Model A

Here is some sample code for this approach in Keras using the functional API.
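A minimal sketch of Recursive Model A is shown below. As before, the sizes and the 128-unit layers are assumed values, and merging by concatenation is one of the two options mentioned above; treat it as an illustration of the description rather than a definitive implementation.

```python
from keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from keras.models import Model

vocab_size = 1000        # assumed vocabulary size
src_txt_length = 50      # assumed source document length in words
sum_txt_length = 10      # assumed summary length in words

# source text input model: encodes the document into a context vector
inputs1 = Input(shape=(src_txt_length,))
am1 = Embedding(vocab_size, 128)(inputs1)
am2 = LSTM(128)(am1)

# summary input model: a language model over the words generated so far
inputs2 = Input(shape=(sum_txt_length,))
sm1 = Embedding(vocab_size, 128)(inputs2)
sm2 = LSTM(128)(sm1)

# decoder output model: merge both context vectors and predict the next word
decoder1 = Concatenate()([am2, sm2])
outputs = Dense(vocab_size, activation='softmax')(decoder1)

# tie it together: [article, summary so far] -> [next word]
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.output_shape)
```

Because the model has two inputs, both arrays must be supplied as a list when fitting and when predicting, for example model.predict([source, summary_so_far]).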

This is better as the decoder is given an opportunity to use the previously generated words and the source document as a context for generating the next word.

It does put a burden on the merge operation and decoder to interpret where it is up to in generating the output sequence.

Alternate 3: Recursive Model B

In this third alternative, the Encoder generates a context vector representation of the source document.

This context vector is fed to the decoder at each step of the generated output sequence, together with the words generated so far. This allows the decoder to build up the same internal state as was used to generate the words in the output sequence so that it is primed to generate the next word in the sequence.

This process is then repeated by calling the model again and again for each word in the output sequence until a maximum length is reached or an end-of-sequence token is generated.

Alternate 3 – Recursive Text Summarization Model B

Here is some sample code for this approach in Keras using the functional API.
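A minimal sketch of Recursive Model B follows, again with assumed sizes and 128-unit layers. Here the repeated context vector is concatenated with the embedding of each summary word so that a single decoder LSTM reads both at every step; this is one plausible reading of the description, not a definitive implementation.

```python
from keras.layers import Input, Embedding, LSTM, Dense, RepeatVector, Concatenate
from keras.models import Model

vocab_size = 1000        # assumed vocabulary size
src_txt_length = 50      # assumed source document length in words
sum_txt_length = 10      # assumed summary length in words

# article input model: encode the document and repeat it for every summary step
inputs1 = Input(shape=(src_txt_length,))
article1 = Embedding(vocab_size, 128)(inputs1)
article2 = LSTM(128)(article1)
article3 = RepeatVector(sum_txt_length)(article2)

# summary input model: embedding of the words generated so far
inputs2 = Input(shape=(sum_txt_length,))
summ1 = Embedding(vocab_size, 128)(inputs2)

# decoder model: read context and summary word together at each step
decoder1 = Concatenate()([article3, summ1])
decoder2 = LSTM(128)(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

# tie it together: [article, summary so far] -> [next word]
model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
print(model.output_shape)
```

As with Recursive Model A, the model expects a list of two input arrays and predicts one word per call.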

Do you have any other alternate implementation ideas?
Let me know in the comments below.

Summary

In this tutorial, you discovered how to implement the Encoder-Decoder architecture for text summarization in the Keras deep learning library.

Specifically, you learned:

  • How text summarization can be addressed using the Encoder-Decoder recurrent neural network architecture.
  • How different encoders and decoders can be implemented for the problem.
  • Three models that you can use to implement the architecture for text summarization in Keras.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

138 Responses to Encoder-Decoder Models for Text Summarization in Keras

  1. Oswaldo Ludwig December 9, 2017 at 3:53 pm

    Regarding the Recursive Model A, here is a similar approach proposed 5 months ago (with shared embedding): https://github.com/oswaldoludwig/Seq2seq-Chatbot-for-Keras

    The advantage of this model can be seen in Section 3.1 of this paper: https://www.researchgate.net/publication/321347271_End-to-end_Adversarial_Learning_for_Generative_Conversational_Agents

    • Jason Brownlee December 10, 2017 at 5:17 am

      Thanks for the links.

      • Oswaldo Ludwig December 10, 2017 at 5:37 am

        You’re welcome!

  2. Oswaldo Ludwig December 9, 2017 at 4:31 pm

    For the Recursive Model A you can kindly cite the Zenodo document: https://zenodo.org/record/825303#.Wit0jc_TXqA

    or the ArXiv paper: https://arxiv.org/abs/1711.10122

    Thanks in advance,

    Oswaldo Ludwig

  3. viky December 10, 2017 at 10:22 pm

    Sir, could you explain it with an example.??

  4. RJPG December 19, 2017 at 6:11 am

    It would be nice if you provide some example of using autoencoders in simple classification problems. Using encoders/decoders pretrain (with inputs = outputs “unsupervised pretrain”) to have a high abstraction level of information in the middle then split in half this network and use the encoder to feed a dense NN with softmax (for ex) and execute supervised “post train”. Do you think it is possible with keras ?

    • Jason Brownlee December 19, 2017 at 3:55 pm

      Thanks for the suggestion.

      Generally, deep MLPs outperform autoencoders for classification tasks.

  5. RJPG December 19, 2017 at 6:15 am

    something like this in keras would be super : https://www.mathworks.com/help/nnet/examples/training-a-deep-neural-network-for-digit-classification.html

  6. Anita December 19, 2017 at 7:58 pm

    Please provide the code links for Encoder-Decoder Models for Text Summarization in Keras

  7. Andy K January 5, 2018 at 12:23 pm

    How well does this work for the following cases?

    1) Messy data. e.g. Say I want a summary of a chat conversation I missed.
    2) Long content. Can it summarize a book?

    Is this actually used in industry or just academic?

    • Jason Brownlee January 6, 2018 at 5:51 am

      Great question, but too hard to answer. I’d recommend either diving into some papers to see examples or run some experiments on your data.

  8. Daniel January 19, 2018 at 2:31 am

    Hi Jason,

    thank you for the article.

    I tried combining the first approach with the dataset from your article about preparation of news articles for text summarization (https://machinelearningmastery.com/prepare-news-articles-text-summarization/).

    Unfortunately, I cannot get the Encoder-Decoder architecture to work, maybe you can provide some help.

    After the code from the preparation article I added the following code:

    X, y = [' '.join(t['story']) for t in stories], [' '.join(t['highlights']) for t in stories]

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences


    tokenizer = Tokenizer()
    total = X + y
    sequences_X = tokenizer.texts_to_sequences(X)
    sequences_y = tokenizer.texts_to_sequences(y)

    word_index = tokenizer.word_index

    data = pad_sequences(sequences_X, maxlen=MAX_SEQUENCE_LENGTH)
    labels = pad_sequences(sequences_y, maxlen=100) # test with maxlen=100

    # train/test split

    TEST_SIZE = 5
    X_train, y_train, X_test, y_test = data[:-TEST_SIZE], labels[:-TEST_SIZE], data[-TEST_SIZE:], labels[-TEST_SIZE:]

    # create model

    # encoder input model
    inputs = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

    encoder1 = Embedding(len(word_index) + 1, 128, input_length=MAX_SEQUENCE_LENGTH)(inputs)
    encoder2 = LSTM(128)(encoder1)
    encoder3 = RepeatVector(2)(encoder2)

    # decoder output model
    decoder1 = LSTM(128, return_sequences=True)(encoder3)
    outputs = TimeDistributed(Dense(len(word_index) + 1, activation='softmax'))(decoder1)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    batch_size = 32
    epochs = 4

    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, verbose=1)

    The error I get is: Error when checking target: expected time_distributed_2 to have 3 dimensions, but got array with shape (3000, 100)

    Thank you in advance.

    • Jason Brownlee January 19, 2018 at 6:35 am

      Sorry, I cannot debug your code for you.

    • coco March 12, 2018 at 1:00 am

      Hi Daniel,

      I have the same problem as you. Were you able to solve it ?


      • Utkarsh June 2, 2018 at 5:49 am

        hey did your code work I am having a problem in fitting it

        • Amine Biad July 19, 2018 at 5:08 pm

          I’m having the same problem here. Did someone manage to solve this ?
          Thank you!

    • Koushik J February 27, 2019 at 3:06 pm

      Hey Daniel,did u debug it?

  9. Akash February 1, 2018 at 4:34 pm

    Hello Jason,
    May I please know that shall we use conv2dlstm for language modelling

    • Jason Brownlee February 2, 2018 at 8:06 am

      I would recommend using an LSTM for a language model.

  10. Anirban Konar February 14, 2018 at 4:25 pm

    Have a basic query, what we get as final output of Decoder is a Vector rt ? How do we convert this to English words please. Do we find cosine similarity with words in vocabulary for ex? …thanks

    • Jason Brownlee February 15, 2018 at 8:38 am

      You can output a one hot encoded vector and map this to an integer, then a word in your vocabulary.

      • Anirban February 19, 2018 at 6:09 pm

        Thanks for your help, all the three models now work for me, thanks also to your article on How to generate a Neural Language Model. I am getting some words in output, but this is far from summary.
        However, I am having one more basic query. In your second and third model, there are two inputs : the article, and the summary. Now, while training the model, this is fine. But when predicting the output or generating the summary, the summary will not be known. So how do we input the second input, should it be zeros in place of word indexes ? Please suggest.

        • Jason Brownlee February 21, 2018 at 6:24 am

          Yes, start with zeros, and build up the summary word by word in order to output the following word.

          • Geo_Mehe June 27, 2019 at 4:04 am

            Hi Jason,
            Thanks for all the knowledge shared here and the books.

            I have 2 questions: one is on prediction and when to save the model. Second is on the padded values.

            Rec. Model A works during training (I see the words being added). However, prediction is a bit challenging. When predicting the first word I am not getting anything. I am loading a current summary of numpy.zeros and the source document and expect to predict the first word to update the current summary. This in my mind is the very first iteration for prediction. I hope I am not wrong thus far.

            At first I thought that maybe there is a problem with the model weights so I saved the model during training for each word. And I am loading the model saved during the last step of training. model.save("model.h5"), model_load… so I am not losing the weights. Model summary prints the architecture correctly, but the result is no word being predicted. Am I correct assuming that the model trained is not correctly saved?

            Second question: The sequences are padded with 0 values 'post'. During training, the model gets these 0 values last and predicts 0 values. What happens to the model and its weights during the last part of the training when it gets these zeros? Does it lose the weights and training?

            Thanks again and keep up the good work,

          • Jason Brownlee June 27, 2019 at 8:03 am

            The model will predict integers that must be mapped to words.

            If the model is predicting all zeros, then perhaps the model requires further tuning?

            You can use a Masking layer to skip/ignore the padded values.

        • VJ March 19, 2018 at 10:46 pm

          @Anirban, would you mind sharing the working example codes for 3 models which worked for you, in the article above

  11. Bhavya Popli March 4, 2018 at 9:38 pm

    Hi Jason, can you please provide a sample running code for one of the architectures on a small text dataset. I shall be grateful to you for the same.

  12. vikas dixit March 10, 2018 at 4:03 pm

    Sir, I have summaries of shape (3000,100,12000), i.e. 3000 -> examples, 100 -> maximum length of summary and 12000 -> vocab size. Now I need to convert this shape into categorical values to find the loss using keras to_categorical. But I am getting a memory error. I have 8 GB RAM. Please provide an appropriate solution.

    Is it necessary to convert summaries into categorical or can’t we use embedding on summaries too.If we can then what should be loss because for categorical cross entropy loss we need to convert our summaries into one hot encodings.

    Many people are facing this problem. plz suggest solution.

    • Jason Brownlee March 11, 2018 at 6:21 am

      When using embedding layers as input, you must provide sequences of integers, where each int is mapped to a word in the vocab.

      I have many examples on the blog.

      If you are running out of memory, try using progressive loading. I have an example of this for photo captioning.

      • vikas dixit March 14, 2018 at 4:24 pm

        yeah sir, but the main problem lies while converting target summaries into categorical data as num_classes in my case is 12000.

        AND i have one more doubt.

        While training the model, I am facing an unknown issue where my training and validation loss is decreasing continuously but accuracy has become constant after some time.

        Plz answer to second question. I m really stuck in this problem. can’t understand logic behind this issue.

        • Jason Brownlee March 15, 2018 at 6:26 am

          Yes, the output would be a one hot encoding. An output vocab of 12K is very small.

          Accuracy is a bad metric for text data. Ignore it.

          • vikas dixit March 15, 2018 at 1:04 pm

            that means for text data, i should focus on decreasing loss only because in my case, training and validation loss both are decreasing continuously but training and validation accuracy are stuck at a point nearly 0.50.

          • Jason Brownlee March 15, 2018 at 2:51 pm

            Yes, focus on loss. Do not measure accuracy on text problems.

  13. vikas dixit March 26, 2018 at 10:00 pm

    hi sir, just wanted to know what is the logic behind abstractive text summarization how does our model is capable to summarize test data as generally in case of classification we have a curve which generalizes our data but in case of text , generalization is not possible since every test data is a new and different from training data.

    • Jason Brownlee March 27, 2018 at 6:37 am

      Great question. The model learns the concepts in the text and how to describe those concepts concisely.

      • vikas dixit March 27, 2018 at 12:28 pm

        So, in that case there is very little probability that our summaries would match the gold summaries for test data. Then why do we use BLEU or ROUGE metrics for evaluation of our model?

  14. Nigi April 3, 2018 at 6:49 am

    For all the approaches I get the following error:
    Error when checking target: expected time_distributed_38 to have 3 dimensions, but got array with shape (222, 811)

    How can I fix this problem?

    • Jason Brownlee April 3, 2018 at 12:13 pm

      Either change your data to meet the expectations of the model or change the model to meet the expectations of the data.

      • Nigi April 3, 2018 at 5:07 pm

        Hallo Jason.
        Thank you for your answer. I am beginner and try to learn how to summarize texts.
        I wrote a simple example as you see below. but it has problems with shapes.
        Could you please tell me how and where in this example, I have to change the data or the model.

        # Modules
        import os

        import numpy as np
        from keras import *
        from keras.layers import *
        from keras.models import Sequential
        from keras.preprocessing.text import Tokenizer, one_hot
        from keras.preprocessing.sequence import pad_sequences

        src_txt = ["I read a nice book yesterday", "I have been living in canada for a long time", "I study computer science this year in frence"]
        sum_txt = ["I read a book", "I have been in canada", "I study computer science"]

        vocab_size = len(set((" ".join(src_txt)).split() + (" ".join(sum_txt)).split()))
        print("Vocab_size: " + str(vocab_size))
        src_txt_length = max([len(item) for item in src_txt])
        print("src_txt_length: " + str(src_txt_length))
        sum_txt_length = max([len(item) for item in sum_txt])
        print("sum_txt_length: " + str(sum_txt_length))

        # integer encode the documents
        encoded_articles = [one_hot(d, vocab_size) for d in src_txt]
        encoded_summaries = [one_hot(d, vocab_size) for d in sum_txt]

        # pad documents to a max length of 4 words
        padded_articles = pad_sequences(encoded_articles, maxlen=10, padding='post')
        padded_summaries = pad_sequences(encoded_summaries, maxlen=5, padding='post')

        print("padded_articles: {}".format(padded_articles.shape))
        print("padded_summaries: {}".format(padded_summaries.shape))

        # encoder input model
        inputs = Input(shape=(src_txt_length,))
        encoder1 = Embedding(vocab_size, 128)(inputs)
        encoder2 = LSTM(128)(encoder1)
        encoder3 = RepeatVector(sum_txt_length)(encoder2)

        # decoder output model
        decoder1 = LSTM(128, return_sequences=True)(encoder3)
        outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoder1)

        # tie it together
        model = Model(inputs=inputs, outputs=outputs)
        model.compile(loss='categorical_crossentropy', optimizer='adam')

        # Fitting the model
        model.fit(padded_articles, padded_summaries, epochs=10, batch_size=32)
        #model.fit(padded_articles, padded_summaries)

        ValueError Traceback (most recent call last)
        in ()
        1 # Fitting the model
        —-> 2 model.fit(padded_articles, padded_summaries, epochs=10, batch_size=32)
        3 #model.fit(padded_articles, padded_summaries)

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
        1579 class_weight=class_weight,
        1580 check_batch_axis=False,
        -> 1581 batch_size=batch_size)
        1582 # Prepare validation data.
        1583 do_validation = False

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
        1412 self._feed_input_shapes,
        1413 check_batch_axis=False,
        -> 1414 exception_prefix=’input’)
        1415 y = _standardize_input_data(y, self._feed_output_names,
        1416 output_shapes,

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
        151 ‘ to have shape ‘ + str(shapes[i]) +
        152 ‘ but got array with shape ‘ +
        –> 153 str(array.shape))
        154 return arrays

        ValueError: Error when checking input: expected input_23 to have shape (None, 44) but got array with shape (3, 10)

        • Jason Brownlee April 4, 2018 at 6:08 am

          Sorry, I don’t have the capacity to debug code for you. Perhaps post to stackoverflow?

        • Madhan October 23, 2018 at 3:39 am

          did you solve it?

  15. Nigi April 4, 2018 at 7:43 am

    Ok, thank you.
    I have learnt a lot here.

  16. anand pratap singh April 6, 2018 at 9:48 pm

    Hi jason!

    i want to implement text summarizer using hierarchical LSTM using only tensorflow. I am a beginner and i have got the dataset https://github.com/SignalMedia/Signal-1M-Tools/blob/master/README.md but i am not able to use this dataset as it is too large to handle with my laptop can you tell me how to preprocess this data so that i can tokenize it and use pre trained glove model as embeddings..

  17. Andrea April 13, 2018 at 8:21 pm

    Hi Jason, thank you for your post, it has been very helpful to approach the problem.

    What I don’t understand is the network’s training process. I mean, we don’t have the summary of our document (in the test set), so I suppose that, at the beginning, we start with a zero sequence as summary input.

    Then, during the training phase, my guess is that we need to cycle all the words in the summary and set those words as output. So, in terms of number of iteration, we need to perform a “train” operation for each word in the summary, for each entry in our dataset and repeat the process for some epochs.

    In some way the process is not really clear in my mind. Could you point me to some resources to understand the training process?

    Thank you

  18. Avatar
    Jing Wen May 8, 2018 at 1:30 pm #

    Hi, Jason
    What is the purpose of the merge layer in Alternate 2? Is it used to give the Dense layer knowledge of both the input and the output?

    • Avatar
      Jason Brownlee May 8, 2018 at 2:57 pm #

      To merge the representation of the document and the text generated so far into a single representation that serves as the basis for predicting the next word.

  19. Avatar
    Emna May 17, 2018 at 7:28 pm #

    Hi, thank you for sharing this with us. Can you explain why we are using the vocab_size variable, and how it is involved in the AutoEncoder architecture? If we have a text as an example, will vocab_size represent the number of unique tokens in the text? Thank you in advance.

    • Avatar
      Jason Brownlee May 18, 2018 at 6:22 am #

      The vocab size is the number of words we wish to model.

      There is no autoencoder in this example.
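
As a rough illustration of what "the number of words we wish to model" can mean in practice, here is a minimal pure-Python sketch. The helper `build_word_index` and the toy sentences are hypothetical and are not taken from the tutorial. It assumes the common convention of reserving index 0 for padding, which is why the vocabulary size is one more than the number of unique words.

```python
def build_word_index(texts):
    """Map each unique lowercase word to an integer, starting at 1."""
    index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                # Index 0 is left unused so it can stand for padding.
                index[word] = len(index) + 1
    return index


texts = ["the cat sat", "the dog sat down"]  # toy data, for illustration only
word_index = build_word_index(texts)

# One extra slot for the reserved padding index 0.
vocab_size = len(word_index) + 1
print(word_index)
print(vocab_size)
```

A value computed this way is the kind of number that could be passed wherever the tutorial code expects vocab_size.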

  20. Avatar
    Nick June 7, 2018 at 12:10 pm #

    Hi Jason, I don’t understand what’s the loss function that’s being used by the decoder. I know that in language translation, we have labelled training data (x, y) of (source language sentence, target language translation). However, in text summarization, do we actually have labelled training data of (source sentence, target summary)? I know that in your code you use the categorical_crossentropy loss function, but what label is the loss computed against?

    • Avatar
      Jason Brownlee June 8, 2018 at 6:04 am #

      Yes, you need source and target text to train the model.

  21. Avatar
    Sameer July 16, 2018 at 6:37 pm #

    When I implement Recursive Model B, I face an issue with the summary input layer.

    How do I make Keras push the output word into the summary layer input?

    Printed error: when checking model input: Expected to see 2 array(s), but instead got the following list of 1 arrays

    My current .fit implementation:
    model.fit(X_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

    Model implementation:

    # summary
    inputs_summary = Input(shape=(count_output,))
    layer_summary = Embedding(vocab_size, embed_size)(inputs_summary)
    # encoder
    input_comm = Input(shape=(input_size,))
    layer_comm = Embedding(vocab_size, embed_size)(input_comm)
    # decoder
    decoder1 = concatenate([layer_comm, layer_titles])
    model = Model(inputs=[input_comm, inputs_summary], outputs=[layer_out])

  22. Avatar
    Ricardo Martinez July 23, 2018 at 8:41 pm #

    Greetings Jason,

    I was hoping you could figure out the following:

    In the Recursive Models A & B, I notice that you use two Embedding layers, one for each input:


    I know that an embedding layer is built from the information (shape and weights) of a trained word2vec or GloVe model, and I have trained a word2vec model on the source text. What I don’t know is whether the Embedding1 and Embedding2 layers in your code are the same, or whether I have to train a word2vec model for the summaries as well to create the Embedding2 layer.

    • Avatar
      Jason Brownlee July 24, 2018 at 6:16 am #

      They could be the same, if that makes sense for the specific problem.

  23. Avatar
    Chiclayo July 25, 2018 at 12:38 am #

    In the third model, does the “article3” layer output the hidden states at each time step, which are then concatenated with the output of the embedding layer? What is the tensor shape of “article3”?

    Thank you

    • Avatar
      Jason Brownlee July 25, 2018 at 6:20 am #

      You can print the shapes of the layers as follows:

  24. Avatar
    Chibuzor Agubulom August 5, 2018 at 2:56 pm #

    Hi Jason,

    Is there a way to train a model to summarize text without having a target summary during training?

  25. Avatar
    Ray August 13, 2018 at 9:09 pm #

    For Alternate 3 model, what will inputs1 and inputs2 be? The code doesn’t do what you describe in the figure.

  26. Avatar
    Swetha November 27, 2018 at 4:28 pm #

    Hi Jason,
    Does the decoder output a meaningful summary of sentences or a bag of words?

  27. Avatar
    Chanuwas February 1, 2019 at 8:24 pm #

    Can’t you use an encoder-decoder architecture similar to the one in another article you wrote before? Maybe you could add an Embedding layer to it.

  28. Avatar
    tanisha February 3, 2019 at 7:44 pm #

    I do not understand the use of RepeatVector. Please explain.

    • Avatar
      Jason Brownlee February 4, 2019 at 5:45 am #

      It repeats the output of a layer, e.g. a vector, multiple times as input to the subsequent layer.

      Does that help?
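
To make the reply above concrete, here is a small pure-Python sketch of the idea. The function `repeat_vector` is a hypothetical stand-in that uses plain lists, not the Keras layer itself. It assumes a single sample whose encoder output is one fixed-length vector.

```python
def repeat_vector(vector, n):
    """Return n copies of `vector`, one per output time step.

    This mimics, for a single sample, turning a vector of shape
    (features,) into a sequence of shape (n, features).
    """
    return [list(vector) for _ in range(n)]


encoded = [0.1, 0.5, 0.9]             # pretend encoder output for one document
repeated = repeat_vector(encoded, 4)  # one copy per summary time step

for step, row in enumerate(repeated):
    print(step, row)
```

Each decoder time step then sees the same document representation, which is what allows a sequence-producing LSTM to follow an encoder that emits only one vector.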

  29. Avatar
    Nick March 20, 2019 at 5:12 am #

    Hi Jason, can you please help? My understanding is that you developed the three models (Alternate 1, 2, and 3) due to the problem you mentioned: “Keras does not allow recursive loops where the output of the model is fed as input to the model automatically.”

    However, I don’t see why you wrote that statement. Isn’t that looping exactly what you implemented for the language model text generation in your article “How to Develop a Word-Level Neural Language Model and Use it to Generate Text” at:


    In that article, as part of text generation, you: (a) created a loop over N words to generate; (b) called model.predict() to generate the next word; (c) added the generated word to a window of generated words; and (d) used the generated words as input to the next call of model.predict().

    Can you please explain the difference?

    • Avatar
      Jason Brownlee March 20, 2019 at 8:37 am #

      Yes, Keras does not do it, we have to write a loop to do it.
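
The loop described in this exchange can be sketched in plain Python as follows. This is only an illustration: `generate_summary` and `fake_predict_next` are hypothetical names, the stub replaces a real call to model.predict(), and the startseq/endseq markers are assumed here as the start and end tokens.

```python
def generate_summary(predict_next, source, max_len, start="startseq", end="endseq"):
    """Generate a summary one word at a time by calling `predict_next` in a loop."""
    summary = [start]
    for _ in range(max_len):
        # Feed the source and everything generated so far back into the model.
        word = predict_next(source, summary)
        if word == end:
            break
        summary.append(word)
    return summary[1:]  # drop the start marker


def fake_predict_next(source, summary_so_far):
    """Stub standing in for a trained model; returns canned words in order."""
    canned = ["a", "short", "summary", "endseq"]
    return canned[len(summary_so_far) - 1]


result = generate_summary(fake_predict_next, ["some", "source", "words"], max_len=10)
print(result)
```

With a real model, `predict_next` would encode its two arguments, call model.predict(), and map the most likely index back to a word.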

  30. Avatar
    Abisk Raioj March 29, 2019 at 1:26 am #

    Why is the model usually fit on “un-embedded” outputs?
    I have seen a lot of code (almost all of it) where the input is passed through an “Embedding” layer (in the functional API style) and the model is fit on the real output sequences.
    That would mean we are mapping the embedded inputs against the real outputs (that is, training the network parameters on that mapping), which seems a bit strange.

    Does that mean the Dense layer takes care of the un-embedding part?
    That would agree with the fact that the “units” parameter is set to the number of embedding features!

    Also, how is the “units” parameter of an LSTM selected?
    Is it set to the length of the padded sequences (input and output sequences alike)?

  31. Avatar
    Sinan Ozdemir April 27, 2019 at 12:48 am #

    Hi Jason,

    I have a security dataset and I would like to use either ANN or LSTMN to predict if a website is malicious.

    IP Cont | OS | Domain | Attack | Sig | Threat
    US | Windows | somedomain.net | Comp Server | 899238erdjshgh90ds | Yes

    What is the best way of encoding these columns? Do you have a tutorial link that you could suggest? And do you think ANN or LSTMN would be better for predicting/classifying the outcome, such as Threat yes or no?

    Thank you so much.

    • Avatar
      Jason Brownlee April 27, 2019 at 6:34 am #

      Sounds like a great project.

      I recommend testing a suite of representations in order to discover what works best.

      Perhaps a word embedding would be a useful approach?

  32. Avatar
    Shashi June 28, 2019 at 3:57 pm #

    Sir, since word embeddings are already fixed-length vectors, can I use them directly with decoders? The encoder, however, converts them to fixed-length vectors.

    • Avatar
      Jason Brownlee June 29, 2019 at 6:35 am #

      You will end up with a sequence of embeddings; this sequence must also be fixed length.

  33. Avatar
    Sara Salem July 18, 2019 at 8:42 am #

    Hi Jason,

    I am currently working on a text-to-image project and I want to train on the captions (texts) and images of my dataset to get a pickle file. I am still researching this problem but I have found nothing so far, so I want to ask for your help in getting a Text2Image Encoder-Decoder. Do you have any ideas about that?

    Thanks in advance.

    • Avatar
      Jason Brownlee July 18, 2019 at 2:12 pm #

      I have seen some interesting papers on GANs for this task of text to image.

      I recommend checking for some of the latest work on the topic via google scholar.

      • Avatar
        Sara Salem July 20, 2019 at 4:27 am #

        Yes, I got it, and I worked with the Stack-GAN algorithm, but it already comes with a text and image encoder file (char-CNN-RNN text embeddings.pickle) and I want to train it from scratch on my own dataset. Could you tell me how to preprocess this file?

        I’m looking forward to your reply!

  34. Avatar
    Hariharan August 28, 2019 at 9:39 pm #

    Do you have any working code for the three models mentioned above? If so, it would be helpful.

  35. Avatar
    Swarna Gupta September 26, 2019 at 4:39 am #

    Hi Jason,

    Thanks for this wonderful article. I have one question regarding model 2, “Alternate 2: Recursive Model A”. Does it follow the “teacher forcing” strategy, since you are using the already generated summary information along with the representation generated by the encoder?

  36. Avatar
    Marshal September 30, 2019 at 8:49 am #

    Hi Jason,

    Thanks for the post! It is very insightful. I like how you described the pros and cons of each structure as well.

    For the Recursive Model B, could you help me understand how should I go about preparing the data?

    Given the structure, I am a little confused about what the input should look like. Below are my thoughts:

    src_txt_length = 8
    sum_txt_length = 4

    Then the traditional way of text preparation would end up with something like (I gave the first value for each row as an example):

    padded sequence for text:
    [w1, w2, w3, w4, w5, 0, 0, 0]

    padded sequence for summary:
    [s1, s2, 0, 0, 0]

    Apparently the above cannot be used directly to Recursive B’s structure (Maybe I’m wrong here).

    I thought we will need to take a step further and have something like this:

    Input 1:
    [[w1, w2, w3, w4,w5, 0, 0, 0], [w1, w2, w3, w4, w5, 0, 0, 0], [w1, w2, w3, w4, w5, 0, 0, 0]]

    Input 2:
    [[0,0,0,0,0], [s1,0,0,0,0], [s1,s2,0,0,0]]

    Output:
    [[one-hot encoded vector for s1], [one-hot encoded vector for s2],[one-hot encoded vector for 0]]

    In this way, a batch_size = 1 training, for example, will be using:

    [w1, w2, w3, w4,w5, 0, 0, 0] and [0,0,0,0,0] to infer a vector which will be optimized against the one-hot encoded vector for s1

    Am I on the right track here? I would appreciate it if you could point me in the right direction!

    Thanks in advance!

    • Avatar
      Jason Brownlee September 30, 2019 at 2:25 pm #

      Perhaps mock up some test examples and try feeding them into the model?

      • Avatar
        Marshal October 1, 2019 at 8:13 am #

        Hi Jason,

        Yes, I did build up test examples and the model fits without error.

        However, the transformation process I mentioned above is quite tedious. So I wonder if you have other ideas in mind about how to prepare data for such a structure.

        • Avatar
          Jason Brownlee October 1, 2019 at 2:16 pm #

          Sorry, I cannot prepare customized examples – I just don’t have the capacity.
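
The expansion Marshal describes in this thread can be automated with a few lines of plain Python. The sketch below is hypothetical and is not the tutorial's code: `expand_pair` is an illustrative name, integer ids stand in for words, 0 is assumed to be the padding id, and the summary is assumed to be no longer than `summary_len`.

```python
def expand_pair(source, summary, summary_len, pad=0):
    """Expand one (source, summary) pair into one training row per step.

    Each row is (source, padded summary so far, next word id).
    """
    rows = []
    targets = list(summary) + [pad]  # the final row predicts the padding id
    for i, target in enumerate(targets):
        prefix = list(summary[:i])
        partial = prefix + [pad] * (summary_len - len(prefix))
        rows.append((list(source), partial, target))
    return rows


source = [11, 12, 13, 14, 15, 0, 0, 0]  # padded source ids (w1..w5)
summary = [21, 22]                       # summary ids (s1, s2)

rows = expand_pair(source, summary, summary_len=5)
for src, partial, target in rows:
    print(src, partial, target)
```

The integer targets would still need to be one-hot encoded (for example with Keras's to_categorical) before being used with a categorical_crossentropy loss.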

  37. Avatar
    Eddie October 17, 2019 at 1:01 pm #

    Thank you for this guide.
    What would be the training and target data for fitting the model?
    I built a model with the following structure

    Model: "model_1"
    Layer (type)                    Output Shape        Param #   Connected to
    input_3 (InputLayer)            [(None, 5000)]      0
    embedding_2 (Embedding)         (None, 5000, 128)   796928    input_3[0][0]
    lstm_2 (LSTM)                   (None, 64)          49408     embedding_2[0][0]
    input_4 (InputLayer)            [(None, 30)]        0
    repeat_vector_1 (RepeatVector)  (None, 30, 64)      0         lstm_2[0][0]
    embedding_3 (Embedding)         (None, 30, 128)     796928    input_4[0][0]
    concatenate_1 (Concatenate)     (None, 30, 192)     0         repeat_vector_1[0][0]
    lstm_3 (LSTM)                   (None, 128)         164352    concatenate_1[0][0]
    dense_1 (Dense)                 (None, 6226)        803154    lstm_3[0][0]
    Total params: 2,610,770
    Trainable params: 2,610,770
    Non-trainable params: 0

    When I try
    model.fit([texts, summaries], validation_split = 0.1, epochs=epochs, batch_size=batch_size, verbose=1, callbacks=[LossHistory()])

    I get an IndexError: list index out of range when processing the first batch, probably because of the missing target data.

  38. Avatar
    Vaibhav Kotwal October 30, 2019 at 8:34 am #

    Can we not use the teacher forcing method in the general model that was explained first? This would circumvent the recursive looping blockade.

  39. Avatar
    black7375 January 9, 2020 at 10:16 pm #

    Recursive Model B can be implemented very simply.

    • Avatar
      Jason Brownlee January 10, 2020 at 7:27 am #


    • Avatar
      MaryJ September 28, 2021 at 6:16 am #

      Have you used this model on any dataset?

  40. Avatar
    Arindam Sengupta January 16, 2020 at 9:15 am #

    Hi Jason – thanks for this post, it’s great! I just had a question about what inputs1, inputs2 and outputs mean in the sample code. My understanding is as follows:

    inputs1: The entire source sequence [Eg: Say a paragraph with 300 words (post tokenization)]
    inputs2: Technically this should be a single word [Predicted word from the previous time-step].
    outputs: This should be the entire summary. But would we need this to be TimeDistributed to obtain the summary word-by-word?

    Is my understanding correct? Or for inputs2, would that be a sequence of *all* the words until the last step and not just a single word? Thanks in advance!

    • Avatar
      Jason Brownlee January 16, 2020 at 1:34 pm #

      Yes, or inputs2 would be the whole sequence generated so far.

  41. Avatar
    pemfir January 20, 2020 at 11:08 am #

    Thank you much for this very useful post.

    I would greatly appreciate it if you could also provide some code for (1) fitting and (2) prediction for Recursive Models A and B (Alternates 2 and 3).

    In particular, for fitting, the confusing part is how to generate inputs2. It seems the output is just a single prediction (i.e., Dense layer shape = (vocabulary size)).

    For prediction, how is the model called, especially since inputs2 is not known and the model again predicts one output?

    Initially, I thought something like this would work:

    for fitting the model:
    outputs = [‘this’ , ‘is’ , ‘a’ , ‘summary’]
    inputs2_during_fitting = [‘start summarizing’, ‘this’ , ‘is’ , ‘a’]

    but I am confused about how this would work when the model produces one prediction/word every time it is called.
    So it cannot train on the entire sequence at once.

    For prediction of the first word, I thought I would pass something like this to the model:
    outputs = [‘this’ , ‘is’ , ‘a’ , ‘summary’]
    inputs2_during_prediction = [‘start summarizing’, ‘unknown’, ‘unknown’, ‘unknown’]

    But how would the model learn that I am asking for the first output word and not the second output word?

    Thank you !

    • Avatar
      Jason Brownlee January 20, 2020 at 2:07 pm #

      Thanks for the suggestion, I may cover it in the future.

  42. Avatar
    aico June 16, 2020 at 6:03 pm #

    Hey Jason, regarding Recursive Model B, I don’t understand the workflow very well. In the picture it looks like a loop. I have implemented it just like in the example above, so does it loop or not?

  43. Avatar
    Mohammad javad July 9, 2020 at 2:25 am #

    Hi Jason, thanks for the amazing article. I have a question about the third architecture. If we want to use the internal representation of a source document, should we use the output of the article2 layer?

  44. Avatar
    Red July 13, 2020 at 8:01 am #

    Hi Jason, I prepared my data using another of your articles. How do I use those pkl files here?

    • Avatar
      Jason Brownlee July 13, 2020 at 1:36 pm #

      I don’t know what your data is or how to load it.

  45. Avatar
    karishma dahal September 15, 2020 at 5:07 pm #

    Hi Jason,

    In an encoder-decoder with attention, the decoder processes the input one time step after another. The first time step takes the last state of the encoder as its initial state and outputs the decoder output and hidden state, and that hidden state serves as the initial state for the next time step.
    How is this iterative process captured in an encoder-decoder with attention architecture using TensorFlow? Do we apply a loop over the time steps of the decoder? Also, what about the embedding vector of the decoder? Is it calculated outside the loop and then accessed, or is it calculated inside the loop?

    The reason I am asking is that I didn’t see any loop or feedback in your decoder architecture.

  46. Avatar
    Shukur Alam April 26, 2021 at 3:36 am #

    Model: "model_4"
    Layer (type)                  Output Shape        Param #
    input_5 (InputLayer)          [(None, 5000)]      0
    embedding_4 (Embedding)       (None, 5000, 128)   799488
    lstm_8 (LSTM)                 (None, 128)         131584
    repeat_vector_4 (RepeatVecto  (None, 125, 128)    0
    lstm_9 (LSTM)                 (None, 125, 128)    131584
    time_distributed_4 (TimeDist  (None, 125, 6246)   805734
    Total params: 1,868,390
    Trainable params: 1,868,390
    Non-trainable params: 0

    How do I fit my data?

  47. Avatar
    Mary July 21, 2021 at 4:13 am #

    Hi Jason,
    I tried to use the last architecture with the CNN news dataset that you introduced. Everything matches the tutorial, but when I try to train the model I get an error that I can’t find documented anywhere:
    “ValueError: Layer model_7 expects 2 input(s), but it received 1 input tensors. Inputs received: []”

  48. Avatar
    nabilaberrabia July 26, 2021 at 6:34 am #

    Can you share the code with us, please?

  49. Avatar
    Mary August 13, 2021 at 1:53 am #

    Hi Jason,
    I think if we assume that source and summary have different vocabulary sizes, the Dense layer should be like this: outputs = Dense(summary_vocab_size, activation='softmax')(decoder2)

    • Adrian Tam
      Adrian Tam August 13, 2021 at 2:12 am #

      Yes, you’re right.

      • Avatar
        Mary September 28, 2021 at 5:13 am #

        Then please edit that in the code, so people don’t get confused.

  50. Avatar
    Francesco August 29, 2021 at 9:52 pm #

    Hello Jason, I have a question about this tutorial. You set an embedding layer for the encoder (hence, the original texts) and another one for the decoder (the summaries). I deduce that these two embedding models are independent. It means that for the same word, the two models will generate two different embeddings. Why are we OK with this?

    • Adrian Tam
      Adrian Tam September 1, 2021 at 7:25 am #

      Simply put, training on the data will make this work. The two embeddings may produce different vectors, and that is not a problem because the text and the summary may have different vocabularies. What matters is the LSTM layers that receive the embedding output. After training, they will know how to use them.

      • Avatar
        Francesco September 1, 2021 at 5:56 pm #

        thank you, Adrian!

  51. Avatar
    Mary September 28, 2021 at 5:10 am #

    Hi Jason,

    I used an encoder-decoder model to generate news summaries, but the predicted sequence looks like this:

    actual: [[‘startseq as of thursday facebook allows users to edit comments rather than retype them each comment will show its editing history in a dropdown menu to give users context editing will be rolled out to users gradually over the next few days endseq’]]
    predicted: [‘startseq the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the’]

    As you can see, it generated a lot of “the”. Have you ever encountered such a problem?
    Do you have any suggestions?
