Encoder-Decoder Models for Text Summarization in Keras

Text summarization is the task in natural language processing of creating a short, accurate, and fluent summary of a source document.

The Encoder-Decoder recurrent neural network architecture developed for machine translation has proven effective when applied to the problem of text summarization.

It can be difficult to apply this architecture in the Keras deep learning library, given some of the flexibility sacrificed to make the library clean, simple, and easy to use.

In this tutorial, you will discover how to implement the Encoder-Decoder architecture for text summarization in Keras.

After completing this tutorial, you will know:

  • How text summarization can be addressed using the Encoder-Decoder recurrent neural network architecture.
  • How different encoders and decoders can be implemented for the problem.
  • Three models that you can use to implement the architecture for text summarization in Keras.

Let’s get started.

Encoder-Decoder Models for Text Summarization in Keras
Photo by Diogo Freire, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Encoder-Decoder Architecture
  2. Text Summarization Encoders
  3. Text Summarization Decoders
  4. Reading Source Text
  5. Implementation Models

Encoder-Decoder Architecture

The Encoder-Decoder architecture is a way of organizing recurrent neural networks for sequence prediction problems that have a variable number of inputs, outputs, or both inputs and outputs.

The architecture involves two components: an encoder and a decoder.

  • Encoder: The encoder reads the entire input sequence and encodes it into an internal representation, often a fixed-length vector called the context vector.
  • Decoder: The decoder reads the encoded input sequence from the encoder and generates the output sequence.

For more about the Encoder-Decoder architecture, see the post:

Both the encoder and the decoder submodels are trained jointly, meaning at the same time.

This is quite a feat as traditionally, challenging natural language problems required the development of separate models that were later strung into a pipeline, allowing error to accumulate during the sequence generation process.

The entire encoded input is used as context for generating each step in the output. Although this works, the fixed-length encoding of the input is a bottleneck that limits how well the architecture handles long sequences.

An extension of the Encoder-Decoder architecture is to provide a more expressive form of the encoded input sequence and allow the decoder to learn where to pay attention to the encoded input when generating each step of the output sequence.

This extension of the architecture is called attention.

For more about Attention in the Encoder-Decoder architecture, see the post:

The Encoder-Decoder architecture with attention is popular for a suite of natural language processing problems that generate variable length output sequences, such as text summarization.

The application of the architecture to text summarization is as follows:

  • Encoder: The encoder is responsible for reading the source document and encoding it to an internal representation.
  • Decoder: The decoder is a language model responsible for generating each word in the output summary using the encoded representation of the source document.

Text Summarization Encoders

The encoder is where the complexity of the model resides as it is responsible for capturing the meaning of the source document.

Different types of encoders can be used, although bidirectional recurrent neural networks, such as LSTMs, are most common. Where recurrent neural networks are used in the encoder, a word embedding provides a distributed representation of the words.
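
As a concrete illustration, below is a minimal sketch of such an encoder in Keras. It is not taken from any of the papers discussed here; vocab_size, src_txt_length, and the layer sizes are placeholder values you would set for your own dataset.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional

# placeholder sizes for illustration only
vocab_size = 10000       # number of words in the source vocabulary
src_txt_length = 200     # length of the padded source document, in words

# the source document arrives as a sequence of word indexes
article = Input(shape=(src_txt_length,))
# a word embedding provides a distributed representation of each word
embedded = Embedding(vocab_size, 128)(article)
# a bidirectional LSTM reads the sequence and returns a fixed-length encoding
encoded = Bidirectional(LSTM(128))(embedded)

encoder = Model(inputs=article, outputs=encoded)
encoder.summary()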

Alexander Rush, et al. use a simple bag-of-words encoder that discards word order, as well as convolutional encoders that explicitly try to capture n-grams.

Our most basic model simply uses the bag-of-words of the input sentence embedded down to size H, while ignoring properties of the original order or relationships between neighboring words. […] To address some of the modelling issues with bag-of-words we also consider using a deep convolutional encoder for the input sentence.

A Neural Attention Model for Abstractive Sentence Summarization, 2015.

Konstantin Lopyrev uses a deep stack of 4 LSTM recurrent neural networks as the encoder.

The encoder is fed as input the text of a news article one word of a time. Each word is first passed through an embedding layer that transforms the word into a distributed representation. That distributed representation is then combined using a multi-layer neural network

Generating News Headlines with Recurrent Neural Networks, 2015.

Abigail See, et al. use a single-layer bidirectional LSTM as the encoder.

The tokens of the article w(i) are fed one-by-one into the encoder (a single-layer bidirectional LSTM), producing a sequence of encoder hidden states h(i).

Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Ramesh Nallapati, et al. use bidirectional GRU recurrent neural networks in their encoders and incorporate additional information about each word in the input sequence.

The encoder consists of a bidirectional GRU-RNN…

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.

Text Summarization Decoders

The decoder must generate each word in the output sequence given two sources of information:

  1. Context Vector: The encoded representation of the source document provided by the encoder.
  2. Generated Sequence: The word or sequence of words already generated as a summary.

The context vector may be a fixed-length encoding as in the simple Encoder-Decoder architecture, or may be a more expressive form filtered via an attention mechanism.

The generated sequence needs little preparation beyond a distributed representation of each generated word via a word embedding.

On each step t, the decoder (a single-layer unidirectional LSTM) receives the word embedding of the previous word (while training, this is the previous word of the reference summary; at test time it is the previous word emitted by the decoder)

Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Alexander Rush, et al. show this cleanly in a diagram where x is the source document, enc is the encoder providing an internal representation of the source document, and yc is the sequence of previously generated words.

Example of inputs to the decoder for text summarization.
Taken from “A Neural Attention Model for Abstractive Sentence Summarization”, 2015.

Generating words one at a time requires that the model be run until some maximum number of summary words are generated or a special end-of-sequence token is reached.

The process must be started by providing the model with a special start-of-sequence token in order to generate the first word.

The decoder takes as input the hidden layers generated after feeding in the last word of the input text. First, an end-of-sequence symbol is fed in as input, again using an embedding layer to transform the symbol into a distributed representation. […]. After generating each word that same word is fed in as input when generating the next word.

Generating News Headlines with Recurrent Neural Networks, 2015.
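
To make this generation loop concrete, below is a minimal sketch of greedy decoding, assuming a trained Keras model that maps a padded source sequence and the summary generated so far to a probability distribution over the next word (as in the recursive models later in this tutorial). The names model, word_to_id, id_to_word, 'startseq', and 'endseq' are hypothetical placeholders, not part of any specific library or paper.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_summary(model, source_sequence, word_to_id, id_to_word, max_summary_length=50):
    # source_sequence is assumed to be a padded array of shape (1, source length)
    # seed the summary with the special start-of-sequence token
    summary = [word_to_id['startseq']]
    for _ in range(max_summary_length):
        # pad the summary generated so far to the fixed decoder input length
        partial = pad_sequences([summary], maxlen=max_summary_length)
        # predict a probability distribution over the next word
        probs = model.predict([source_sequence, partial], verbose=0)[0]
        next_id = int(np.argmax(probs))
        # stop if the special end-of-sequence token is generated
        if id_to_word.get(next_id) == 'endseq':
            break
        summary.append(next_id)
    # drop the start token and map the word indexes back to words
    return ' '.join(id_to_word[i] for i in summary[1:])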

Ramesh Nallapati, et al. generate the output sequence using a GRU recurrent neural network.

… the decoder consists of a uni-directional GRU-RNN with the same hidden-state size as that of the encoder

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.

Reading Source Text

There is flexibility in the application of this architecture depending on the specific text summarization problem being addressed.

Most studies focus on one or just a few source sentences in the encoder, but this does not have to be the case.

For example, the encoder could be configured to read and encode the source document in different sized chunks:

  • Sentence.
  • Paragraph.
  • Page.
  • Document.

Equally, the decoder can be configured to summarize each chunk or aggregate the encoded chunks and output a broader summary.

Some work has been done along this path, where Ramesh Nallapati, et al. use a hierarchical encoder model with attention at both the word and the sentence level.

This model aims to capture this notion of two levels of importance using two bi-directional RNNs on the source side, one at the word level and the other at the sentence level. The attention mechanism operates at both levels simultaneously

Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.

Implementation Models

In this section, we will look at how to implement the Encoder-Decoder architecture for text summarization in the Keras deep learning library.

General Model

A simple realization of the model involves an Encoder with an Embedding input followed by an LSTM hidden layer that produces a fixed-length representation of the source document.

The Decoder reads the representation and an Embedding of the last generated word and uses these inputs to generate each word in the output summary.

General Text Summarization Model in Keras

There is a problem.

Keras does not allow recursive loops where the output of the model is fed as input to the model automatically.

This means the model as described above cannot be directly implemented in Keras (but perhaps could in a more flexible platform like TensorFlow).

Instead, we will look at three variations of the model that we can implement in Keras.

Alternate 1: One-Shot Model

The first alternative model is to generate the entire output sequence in a one-shot manner.

That is, the decoder uses the context vector alone to generate the output sequence.

Alternate 1 – One-Shot Text Summarization Model

Here is some sample code for this approach in Keras using the functional API.
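
The snippet below is a minimal sketch of how such a model could be defined; vocab_size, src_txt_length, and sum_txt_length are placeholders for the vocabulary size and the padded document and summary lengths of your own dataset, and the layer sizes are arbitrary.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense

# placeholder problem sizes for illustration
vocab_size = 10000       # size of the shared vocabulary
src_txt_length = 200     # length of the padded source documents
sum_txt_length = 30      # length of the padded summaries

# encoder input model: read the source document into a fixed-length vector
inputs = Input(shape=(src_txt_length,))
encoder1 = Embedding(vocab_size, 128)(inputs)
encoder2 = LSTM(128)(encoder1)
# repeat the context vector once per output time step
encoder3 = RepeatVector(sum_txt_length)(encoder2)

# decoder output model: generate the whole summary in one shot
decoder1 = LSTM(128, return_sequences=True)(encoder3)
outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoder1)

# tie it together
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()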

This model puts a heavy burden on the decoder.

It is likely that the decoder will not have sufficient context for generating a coherent output sequence as it must choose the words and their order.

Alternate 2: Recursive Model A

A second alternative model is to develop a model that generates a single-word prediction and to call it recursively.

That is, the decoder uses the context vector and the distributed representation of all words generated so far as input in order to generate the next word.

A language model can be used to interpret the sequence of words generated so far to provide a second context vector to combine with the representation of the source document in order to generate the next word in the sequence.

The summary is built up by recursively calling the model with the previously generated word appended (or, more specifically, the expected previous word during training).

The context vectors could be concatenated or added together to provide a broader context for the decoder to interpret and output the next word.

Alternate 2 – Recursive Text Summarization Model A

Here is some sample code for this approach in Keras using the functional API.
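
The snippet below is a minimal sketch of this recursive model; as before, vocab_size, src_txt_length, and sum_txt_length are placeholder values, and the summary input is the padded sequence of words generated so far.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, concatenate

# placeholder problem sizes for illustration
vocab_size = 10000
src_txt_length = 200
sum_txt_length = 30

# source document encoder: encode the article into a context vector
article_input = Input(shape=(src_txt_length,))
article1 = Embedding(vocab_size, 128)(article_input)
article2 = LSTM(128)(article1)

# summary encoder (language model): encode the words generated so far
summary_input = Input(shape=(sum_txt_length,))
summary1 = Embedding(vocab_size, 128)(summary_input)
summary2 = LSTM(128)(summary1)

# merge the two context vectors and predict the next word
merged = concatenate([article2, summary2])
outputs = Dense(vocab_size, activation='softmax')(merged)

# model maps [article, summary so far] -> next word
model = Model(inputs=[article_input, summary_input], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Here the two context vectors are merged with a simple concatenation; adding them together or passing the merged vector through an extra Dense layer are equally valid design choices.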

This is better as the decoder is given an opportunity to use the previously generated words and the source document as a context for generating the next word.

It does put a burden on the merge operation and the decoder to keep track of how far along it is in generating the output sequence.

Alternate 3: Recursive Model B

In this third alternative, the Encoder generates a context vector representation of the source document.

This encoding is fed to the decoder at each step of the generated output sequence, along with the sequence of words generated so far. This allows the decoder to build up the same internal state that was used to generate the previous words, so that it is primed to generate the next word in the sequence.

This process is then repeated by calling the model again and again for each word in the output sequence until a maximum length or end-of-sequence token is generated.

Alternate 3 – Recursive Text Summarization Model B

Here is some sample code for this approach in Keras using the functional API.
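
The snippet below is a minimal sketch of this variation; the placeholder values vocab_size, src_txt_length, and sum_txt_length are the same assumptions as in the previous models.

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, RepeatVector, Dense, concatenate

# placeholder problem sizes for illustration
vocab_size = 10000
src_txt_length = 200
sum_txt_length = 30

# encoder: encode the source document and repeat it for each output step
article_input = Input(shape=(src_txt_length,))
article1 = Embedding(vocab_size, 128)(article_input)
article2 = LSTM(128)(article1)
article3 = RepeatVector(sum_txt_length)(article2)

# embed the summary generated so far
summary_input = Input(shape=(sum_txt_length,))
summary1 = Embedding(vocab_size, 128)(summary_input)

# decoder: read the repeated context vector alongside the summary embedding
merged = concatenate([article3, summary1])
decoder1 = LSTM(128)(merged)
outputs = Dense(vocab_size, activation='softmax')(decoder1)

# model maps [article, summary so far] -> next word
model = Model(inputs=[article_input, summary_input], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()

Compared to Model A, the summary embedding is not collapsed into a single vector; instead, the decoder LSTM reads the repeated document encoding alongside the embedded summary sequence at every output step.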

Do you have any other alternate implementation ideas?
Let me know in the comments below.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Papers

  • A Neural Attention Model for Abstractive Sentence Summarization, 2015.
  • Generating News Headlines with Recurrent Neural Networks, 2015.
  • Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond, 2016.
  • Get To The Point: Summarization with Pointer-Generator Networks, 2017.

Summary

In this tutorial, you discovered how to implement the Encoder-Decoder architecture for text summarization in the Keras deep learning library.

Specifically, you learned:

  • How text summarization can be addressed using the Encoder-Decoder recurrent neural network architecture.
  • How different encoders and decoders can be implemented for the problem.
  • Three models that you can use to implement the architecture for text summarization in Keras.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




56 Responses to Encoder-Decoder Models for Text Summarization in Keras

  1. Oswaldo Ludwig December 9, 2017 at 3:53 pm #

    Regarding the Recursive Model A, here is a similar approach proposed 5 months ago (with shared embedding): https://github.com/oswaldoludwig/Seq2seq-Chatbot-for-Keras

    The advantage of this model can be seen in Section 3.1 of this paper: https://www.researchgate.net/publication/321347271_End-to-end_Adversarial_Learning_for_Generative_Conversational_Agents

    • Jason Brownlee December 10, 2017 at 5:17 am #

      Thanks for the links.

      • Oswaldo Ludwig December 10, 2017 at 5:37 am #

        You’re welcome!

  2. Oswaldo Ludwig December 9, 2017 at 4:31 pm #

    For the Recursive Model A you can kindly cite the Zenodo document: https://zenodo.org/record/825303#.Wit0jc_TXqA

    or the ArXiv paper: https://arxiv.org/abs/1711.10122

    Thanks in advance,

    Oswaldo Ludwig

  3. viky December 10, 2017 at 10:22 pm #

    Sir, could you explain it with an example.??

  4. RJPG December 19, 2017 at 6:11 am #

    It would be nice if you provide some example of using autoencoders in simple classification problems. Using encoders/decoders pretrain (with inputs = outputs “unsupervised pretrain”) to have a high abstraction level of information in the middle then split in half this network and use the encoder to feed a dense NN with softmax (for ex) and execute supervised “post train”. Do you think it is possible with keras ?

    • Jason Brownlee December 19, 2017 at 3:55 pm #

      Thanks for the suggestion.

      Generally, deep MLPs outperform autoencoders for classification tasks.

  5. RJPG December 19, 2017 at 6:15 am #

    something like this in keras would be super : https://www.mathworks.com/help/nnet/examples/training-a-deep-neural-network-for-digit-classification.html

  6. Anita December 19, 2017 at 7:58 pm #

    Please provide the code links for Encoder-Decoder Models for Text Summarization in Keras

  7. Andy K January 5, 2018 at 12:23 pm #

    How well does this work for the following cases?

    1) Messy data. e.g. Say I want a summary of a chat conversation I missed.
    2) Long content. Can it summarize a book?

    Is this actually used in industry or just academic?

    • Jason Brownlee January 6, 2018 at 5:51 am #

      Great question, but too hard to answer. I’d recommend either diving into some papers to see examples or run some experiments on your data.

  8. Daniel January 19, 2018 at 2:31 am #

    Hi Jason,

    thank you for the article.

    I tried combining the first approach with the dataset from your article about preparation of news articles for text summarization (https://machinelearningmastery.com/prepare-news-articles-text-summarization/).

    Unfortunately, I cannot get the Encoder-Decoder architecture to work, maybe you can provide some help.

    After the code from the preparation article I added the following code:


    X, y = [' '.join(t['story']) for t in stories], [' '.join(t['highlights']) for t in stories]

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    MAX_SEQUENCE_LENGTH = 1000

    tokenizer = Tokenizer()
    total = X + y
    tokenizer.fit_on_texts(total)
    sequences_X = tokenizer.texts_to_sequences(X)
    sequences_y = tokenizer.texts_to_sequences(y)

    word_index = tokenizer.word_index

    data = pad_sequences(sequences_X, maxlen=MAX_SEQUENCE_LENGTH)
    labels = pad_sequences(sequences_y, maxlen=100) # test with maxlen=100

    # train/test split

    TEST_SIZE = 5
    X_train, y_train, X_test, y_test = data[:-TEST_SIZE], labels[:-TEST_SIZE], data[-TEST_SIZE:], labels[-TEST_SIZE:]

    # create model

    # encoder input model
    inputs = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

    encoder1 = Embedding(len(word_index) + 1, 128, input_length=MAX_SEQUENCE_LENGTH)(inputs)
    encoder2 = LSTM(128)(encoder1)
    encoder3 = RepeatVector(2)(encoder2)

    # decoder output model
    decoder1 = LSTM(128, return_sequences=True)(encoder3)
    outputs = TimeDistributed(Dense(len(word_index) + 1, activation='softmax'))(decoder1)

    model = Model(inputs=inputs, outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    batch_size = 32
    epochs = 4

    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, verbose=1)

    The error I get is: Error when checking target: expected time_distributed_2 to have 3 dimensions, but got array with shape (3000, 100)

    Thank you in advance.

    • Jason Brownlee January 19, 2018 at 6:35 am #

      Sorry, I cannot debug your code for you.

    • coco March 12, 2018 at 1:00 am #

      Hi Daniel,

      I have the same problem as you. Were you able to solve it ?

      Thanks.

  9. Akash February 1, 2018 at 4:34 pm #

    Hello Jason,
    May I please know that shall we use conv2dlstm for language modelling

    • Jason Brownlee February 2, 2018 at 8:06 am #

      I would recommend using an LSTM for a language model.

  10. Anirban Konar February 14, 2018 at 4:25 pm #

    Have a basic query, what we get as final output of Decoder is a Vector rt ? How do we convert this to English words please. Do we find cosine similarity with words in vocabulary for ex? …thanks

    • Jason Brownlee February 15, 2018 at 8:38 am #

      You can output a one hot encoded vector and map this to an integer, then a word in your vocabulary.

      • Anirban February 19, 2018 at 6:09 pm #

        Thanks for your help, all the three models now work for me, thanks also to your article on How to generate a Neural Language Model. I am getting some words in output, but this is far from summary.
        However, I am having one more basic query. In your second and third model, there are two inputs : the article, and the summary. Now, while training the model, this is fine. But when predicting the output or generating the summary, the summary will not be known. So how do we input the second input, should it be zeros in place of word indexes ? Please suggest.

        • Jason Brownlee February 21, 2018 at 6:24 am #

          Yes, start with zeros, and build up the summary word by word in order to output the following word.

        • VJ March 19, 2018 at 10:46 pm #

          @Anirban, would you mind sharing the working example codes for 3 models which worked for you, in the article above

  11. Bhavya Popli March 4, 2018 at 9:38 pm #

    Hi Jason, can you please provide a sample running code for one of the architectures on a small text dataset. I shall be grateful to you for the same.

  12. vikas dixit March 10, 2018 at 4:03 pm #

    Sir, i have summary of shape (3000,100,12000) i.e 3000-> examples, 100-> maximum length of summary and 12000-> vocab size. Now i need to convert this shape into categorical values to fins loss using keras to_categorical. But i am getting memory error. I have 8 gb RAM. Please provide appropriate solution.

    Is it necessary to convert summaries into categorical or can’t we use embedding on summaries too.If we can then what should be loss because for categorical cross entropy loss we need to convert our summaries into one hot encodings.

    Many people are facing this problem. plz suggest solution.

    • Jason Brownlee March 11, 2018 at 6:21 am #

      When using embedding layers as input, you must provide sequences of integers, where each int is mapped to a word in the vocab.

      I have many examples on the blog.

      If you are running out of memory, try using progressive loading. I have an example of this for photo captioning.

      • vikas dixit March 14, 2018 at 4:24 pm #

        yeah sir, but the main problem lies while converting target summaries into categorical data as num_classes in my case is 12000.

        AND i have one more doubt.

        while training model, i an facing a unknown issue where my training and validation loss is decreasing continously but accuracy has become constant after some time.

        Plz answer to second question. I m really stuck in this problem. can’t understand logic behind this issue.

        • Jason Brownlee March 15, 2018 at 6:26 am #

          Yes, the output would be a one hot encoding. An output vocab of 12K is very small.

          Accuracy is a bad metric for text data. Ignore it.

          • vikas dixit March 15, 2018 at 1:04 pm #

            that means for text data, i should focus on decreasing loss only because in my case, training and validation loss both are decreasing continuously but training and validation accuracy are stuck at a point nearly 0.50.

          • Jason Brownlee March 15, 2018 at 2:51 pm #

            Yes, focus on loss. Do not measure accuracy on text problems.

  13. vikas dixit March 26, 2018 at 10:00 pm #

    hi sir, just wanted to know what is the logic behind abstractive text summarization how does our model is capable to summarize test data as generally in case of classification we have a curve which generalizes our data but in case of text , generalization is not possible since every test data is a new and different from training data.

    • Jason Brownlee March 27, 2018 at 6:37 am #

      Great question. The model learns the concepts in the text and how to describe those concepts concisely.

      • vikas dixit March 27, 2018 at 12:28 pm #

        So, in that case there is very less probability that our summaries would match gold- summaries for test data. Then why do we use Bleu or Rouge matrixes for evaluation of our model.

  14. Nigi April 3, 2018 at 6:49 am #

    For all the approaches I become following error:
    Error when checking target: expected time_distributed_38 to have 3 dimensions, but got array with shape (222, 811)

    How can I fix this problem?

    • Jason Brownlee April 3, 2018 at 12:13 pm #

      Either change your data to meet the expectations of the model or change the model to meet the expectations of the data.

      • Nigi April 3, 2018 at 5:07 pm #

        Hallo Jason.
        Thank you for your answer. I am beginner and try to learn how to summarize texts.
        I wrote a simple example as you see below. but it has problems with shapes.
        Could you please tell me how and where in this example, I have to change the data or the model.
        Thanks

        # Modules
        import os

        import numpy as np
        from keras import *
        from keras.layers import *
        from keras.models import Sequential
        from keras.preprocessing.text import Tokenizer, one_hot
        from keras.preprocessing.sequence import pad_sequences

        src_txt = ["I read a nice book yesterday", "I have been living in canada for a long time", "I study computer science this year in frence"]
        sum_txt = ["I read a book", "I have been in canada", "I study computer science"]

        vocab_size = len(set((" ".join(src_txt)).split() + (" ".join(sum_txt)).split()))
        print("Vocab_size: " + str(vocab_size))
        src_txt_length = max([len(item) for item in src_txt])
        print("src_txt_length: " + str(src_txt_length))
        sum_txt_length = max([len(item) for item in sum_txt])
        print("sum_txt_length: " + str(sum_txt_length))

        # integer encode the documents
        encoded_articles = [one_hot(d, vocab_size) for d in src_txt]
        encoded_summaries = [one_hot(d, vocab_size) for d in sum_txt]

        # pad documents to a max length of 4 words
        padded_articles = pad_sequences(encoded_articles, maxlen=10, padding='post')
        padded_summaries = pad_sequences(encoded_summaries, maxlen=5, padding='post')

        print("padded_articles: {}".format(padded_articles.shape))
        print("padded_summaries: {}".format(padded_summaries.shape))

        # encoder input model
        inputs = Input(shape=(src_txt_length,))
        encoder1 = Embedding(vocab_size, 128)(inputs)
        encoder2 = LSTM(128)(encoder1)
        encoder3 = RepeatVector(sum_txt_length)(encoder2)

        # decoder output model
        decoder1 = LSTM(128, return_sequences=True)(encoder3)
        outputs = TimeDistributed(Dense(vocab_size, activation='softmax'))(decoder1)

        # tie it together
        model = Model(inputs=inputs, outputs=outputs)
        model.compile(loss='categorical_crossentropy', optimizer='adam')

        # Fitting the model
        model.fit(padded_articles, padded_summaries, epochs=10, batch_size=32)
        #model.fit(padded_articles, padded_summaries)

        —————————————————————————
        ValueError Traceback (most recent call last)
        in ()
        1 # Fitting the model
        —-> 2 model.fit(padded_articles, padded_summaries, epochs=10, batch_size=32)
        3 #model.fit(padded_articles, padded_summaries)

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
        1579 class_weight=class_weight,
        1580 check_batch_axis=False,
        -> 1581 batch_size=batch_size)
        1582 # Prepare validation data.
        1583 do_validation = False

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_batch_axis, batch_size)
        1412 self._feed_input_shapes,
        1413 check_batch_axis=False,
        -> 1414 exception_prefix=’input’)
        1415 y = _standardize_input_data(y, self._feed_output_names,
        1416 output_shapes,

        C:\Users\diyakopalizi\AppData\Local\Continuum\Anaconda3\lib\site-packages\keras\engine\training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
        151 ‘ to have shape ‘ + str(shapes[i]) +
        152 ‘ but got array with shape ‘ +
        –> 153 str(array.shape))
        154 return arrays
        155

        ValueError: Error when checking input: expected input_23 to have shape (None, 44) but got array with shape (3, 10)

        • Jason Brownlee April 4, 2018 at 6:08 am #

          Sorry, I don’t have the capacity to debug code for you. Perhaps post to stackoverflow?

  15. Nigi April 4, 2018 at 7:43 am #

    Ok, thank you.
    I have learnt a lot here.

  16. anand pratap singh April 6, 2018 at 9:48 pm #

    Hi jason!

    i want to implement text summarizer using hierarchical LSTM using only tensorflow. I am a beginner and i have got the dataset https://github.com/SignalMedia/Signal-1M-Tools/blob/master/README.md but i am not able to use this dataset as it is too large to handle with my laptop can you tell me how to preprocess this data so that i can tokenize it and use pre trained glove model as embeddings..

  17. Andrea April 13, 2018 at 8:21 pm #

    Hi Jason, thank you for your post, it has been very helpful to approach the problem.

    What I don’t understand is the network’s training process. I mean, we don’t have the summary of our document (in the test set), so I suppose that, at the beginning, we start with a zero sequence as summary input.

    Then, during the training phase, my guess is that we need to cycle all the words in the summary and set those words as output. So, in terms of number of iteration, we need to perform a “train” operation for each word in the summary, for each entry in our dataset and repeat the process for some epochs.

    In some way the process is not really clear in my mind. Could you point me to some resources to understand the training process?

    Thank you

  18. Jing Wen May 8, 2018 at 1:30 pm #

    Hi, Jason
    What is the purpose of merge layer in alternate 2? Is that used enable the Dense layer to have knowledge on both input and output ?

    • Jason Brownlee May 8, 2018 at 2:57 pm #

      To merge the representation of the document and the text generated so far into a single representation for the basis for predicting the next word.

  19. Emna May 17, 2018 at 7:28 pm #

    Hi, Thank you for sharing this with us. Can you explain for me why we are using the vocab_size variable ? and how this involves in the AutoEncoder architecture ? If we have a text as example, the vocab_size will represent the number of unique tokens in the text ? Thank you in advance.

    • Jason Brownlee May 18, 2018 at 6:22 am #

      The vocab size is the number of words we wish to model.

      There is no autoencoder in this example.
