How to Develop a Word-Level Neural Language Model and Use it to Generate Text

Last Updated on October 8, 2020

A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence.

Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions.

In this tutorial, you will discover how to develop a statistical language model using deep learning in Python.

After completing this tutorial, you will know:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Let’s get started.

  • Update Apr/2018: Fixed typo in model description
  • Update May/2020: Fixed a typo in the expectation of the model.

Photo by Carlo Raso, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. The Republic by Plato
  2. Data Preparation
  3. Train Language Model
  4. Use Language Model

The Republic by Plato

The Republic is the classical Greek philosopher Plato’s most famous work.

It is structured as a dialog (i.e. a conversation) on the topic of order and justice within a city state.

The entire text is available for free in the public domain. It is available on the Project Gutenberg website in a number of formats.

You can download the ASCII text version of the entire book (or books) here:

Download the book text and place it in your current working directory with the filename ‘republic.txt’.

Open the file in a text editor and delete the front and back matter. This includes details about the book at the beginning, a long analysis, and license information at the end.

The text should begin with:


I went down yesterday to the Piraeus with Glaucon the son of Ariston,

And end with

And it shall be well with us both in this life and in the pilgrimage of a thousand years which we have been describing.

Here is a direct link to the clean version of the data file:

Save the cleaned version as ‘republic_clean.txt’ in your current working directory. The file should be about 15,802 lines of text.

Now we can develop a language model from this text.

Data Preparation

We will start by preparing the data for modeling.

The first step is to look at the data.

Review the Text

Open the text in an editor and just look at the text data.

For example, here is the first piece of dialog:


I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.

I turned round, and asked him where his master was.

There he is, said the youth, coming after you, if you will only wait.

Certainly we will, said Glaucon; and in a few minutes Polemarchus
appeared, and with him Adeimantus, Glaucon’s brother, Niceratus the son
of Nicias, and several others who had been at the procession.

Polemarchus said to me: I perceive, Socrates, that you and your
companion are already on your way to the city.

You are not far wrong, I said.

What do you see that we will need to handle in preparing the data?

Here’s what I see from a quick look:

  • Book/Chapter headings (e.g. “BOOK I.”).
  • British English spelling (e.g. “honoured”)
  • Lots of punctuation (e.g. “–“, “;–“, “?–“, and more)
  • Strange names (e.g. “Polemarchus”).
  • Some long monologues that go on for hundreds of lines.
  • Some quoted dialog (e.g. ‘…’)

These observations, and more, suggest ways that we may wish to prepare the text data.

The specific way we prepare the data really depends on how we intend to model it, which in turn depends on how we intend to use it.

Language Model Design

In this tutorial, we will develop a model of the text that we can then use to generate new sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will be fed in as input to in turn generate the next word.

A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict. This input length will also define the length of seed text used to generate new sequences when we use the model.

There is no correct answer. With enough time and resources, we could explore the ability of the model to learn with differently sized input sequences.

Instead, we will pick a length of 50 words for the length of the input sequences, somewhat arbitrarily.

We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.

Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

Now that we have a model design, we can look at transforming the raw text into sequences of 50 input words to 1 output word, ready to fit a model.

Load Text

The first step is to load the text into memory.

We can develop a small function to load the entire text file into memory and return it. The function is called load_doc() and is listed below. Given a filename, it returns the loaded text as a single string.

Using this function, we can load the cleaner version of the document in the file ‘republic_clean.txt‘ as follows:
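A minimal sketch of this loading step follows. The body of load_doc() is an assumption based on the description above (open the file, read everything, close it), and the existence check is added only so the snippet still runs before the file has been prepared.

```python
import os


def load_doc(filename):
    """Read an entire text file into memory and return it as one string."""
    # the context manager closes the file even if reading fails
    with open(filename, 'r', encoding='utf-8') as file:
        return file.read()


in_filename = 'republic_clean.txt'
if os.path.exists(in_filename):
    doc = load_doc(in_filename)
    # sanity check: show the first 200 characters
    print(doc[:200])
```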

Running this snippet loads the document and prints the first 200 characters as a sanity check.


I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what

So far, so good. Next, let’s clean the text.

Clean Text

We need to transform the raw text into a sequence of tokens or words that we can use as a source to train the model.

Based on reviewing the raw text (above), below are some specific operations we will perform to clean the text. You may want to explore more cleaning operations yourself as an extension.

  • Replace ‘–‘ with a white space so we can split words better.
  • Split words based on white space.
  • Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
  • Remove all words that are not alphabetic to remove standalone punctuation tokens.
  • Normalize all words to lowercase to reduce the vocabulary size.

Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.

We can implement each of these cleaning operations in this order in a function. Below is the function clean_doc() that takes a loaded document as an argument and returns a list of clean tokens.
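The sketch below applies the five operations in the order listed. It assumes that the dash in the first bullet appears in the plain-text file as a double hyphen ('--'), and the sample sentence is made up for illustration.

```python
import string


def clean_doc(doc):
    """Turn raw text into a list of lowercase, alphabetic tokens."""
    # assumption: the dash is stored as a double hyphen in the text file
    doc = doc.replace('--', ' ')
    # split into tokens on white space
    tokens = doc.split()
    # strip ASCII punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # drop tokens that are not purely alphabetic (numbers, empty strings)
    tokens = [word for word in tokens if word.isalpha()]
    # normalize case
    return [word.lower() for word in tokens]


# made-up sample text, not taken from the book
sample = "What?--said Glaucon; BOOK I. I went down yesterday, 2 days ago."
tokens = clean_doc(sample)
print(tokens)
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
```

On the real document you would call clean_doc(doc) on the string returned by load_doc() instead of the sample.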

We can run this cleaning operation on our loaded document and print out some of the tokens and statistics as a sanity check.

First, we can see a nice list of tokens that look cleaner than the raw text. We could remove the ‘Book I‘ chapter markers and more, but this is a good start.

We also get some statistics about the clean document.

We can see that there are just under 120,000 words in the clean text and a vocabulary of just under 7,500 words. This is smallish and models fit on this data should be manageable on modest hardware.

Next, we can look at shaping the tokens into sequences and saving them to file.

Save Clean Text

We can organize the long list of tokens into sequences of 50 input words and 1 output word.

That is, sequences of 51 words.

We can do this by iterating over the list of tokens from token 51 onwards and taking the 51 tokens before that position as a sequence, then repeating this process to the end of the list of tokens.

We will transform the tokens into space-separated strings for later storage in a file.

The code to split the list of clean tokens into sequences with a length of 51 tokens is listed below.
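A sketch of the windowing step is given here. The helper name build_sequences() and the toy token list are hypothetical; a window of 4 is used in the demonstration only to keep the output readable, whereas the tutorial uses 51.

```python
def build_sequences(tokens, length=51):
    """Slide a window of `length` tokens over the list, one token at a time."""
    sequences = []
    for i in range(length, len(tokens)):
        # the `length` tokens that end just before position i
        seq = tokens[i - length:i]
        # store each sequence as a space-separated string
        sequences.append(' '.join(seq))
    return sequences


# toy data: ten placeholder tokens and a short window
toy_tokens = ['w%d' % n for n in range(10)]
toy_sequences = build_sequences(toy_tokens, length=4)
print('Total Sequences: %d' % len(toy_sequences))
print(toy_sequences[0])
```

With this loop the number of sequences is the number of tokens minus the window length, so the very last possible window in the list is not emitted.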

Running this piece creates a long list of lines.

Printing statistics on the list, we can see that we will have exactly 118,633 training patterns to fit our model.

Next, we can save the sequences to a new file for later loading.

We can define a new function for saving lines of text to a file. This new function is called save_doc() and is listed below. It takes as input a list of lines and a filename. The lines are written, one per line, in ASCII format.

We can call this function and save our training sequences to the file ‘republic_sequences.txt‘.
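A possible version of save_doc() is sketched below. The demonstration writes two toy sequences to a temporary directory so that it cannot overwrite a real data file; the demo file name is an assumption made for this sketch.

```python
import os
import tempfile


def save_doc(lines, filename):
    """Write a list of strings to a file, one per line."""
    data = '\n'.join(lines)
    # plain ASCII tokens are encoded identically under UTF-8
    with open(filename, 'w', encoding='utf-8') as file:
        file.write(data)


# toy sequences written to a throwaway location (assumed data);
# for the real run: save_doc(sequences, 'republic_sequences.txt')
demo_sequences = ['book i i went', 'i i went down']
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_sequences.txt')
    save_doc(demo_sequences, demo_path)
    with open(demo_path, encoding='utf-8') as file:
        saved_text = file.read()
print(saved_text)
```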

Take a look at the file with your text editor.

You will see that each line is shifted along one word, with a new word at the end to be predicted; for example, here are the first 3 lines in truncated form:

book i i … catch sight of
i i went … sight of us
i went down … of us from

Complete Example

Tying all of this together, the complete code listing is provided below.

You should now have training data stored in the file ‘republic_sequences.txt‘ in your current working directory.

Next, let’s look at how to fit a language model to this data.

Train Language Model

We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

  • It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
  • It learns the representation at the same time as learning the model.
  • It learns to predict the probability for the next word using the context of the last 50 words.

Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

Let’s start by loading our training data.

Load Sequences

We can load our training data using the load_doc() function we developed in the previous section.

Once loaded, we can split the data into separate training sequences by splitting based on new lines.

The snippet below will load the ‘republic_sequences.txt‘ data file from the current working directory.
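One way to write this step is sketched here. The helper name load_sequences() is hypothetical; it inlines the same file read that load_doc() performs and then splits on new lines.

```python
import os


def load_sequences(filename):
    """Load a sequence file and return its lines as a list of strings."""
    with open(filename, 'r', encoding='utf-8') as file:
        doc = file.read()
    # one training sequence per line
    return doc.split('\n')


in_filename = 'republic_sequences.txt'
if os.path.exists(in_filename):
    lines = load_sequences(in_filename)
    print('Loaded %d sequences' % len(lines))
```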

Next, we can encode the training data.

Encode Sequences

The word embedding layer expects input sequences to be comprised of integers.

We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping.

To do this encoding, we will use the Tokenizer class in the Keras API.

First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.

We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (e.g. 7,409). The Embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 to the largest index. Because array indexing is zero-offset, the word at the end of the vocabulary has index 7,409, which means the array must be 7,409 + 1 in length.

Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than the actual vocabulary.
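To make the mapping concrete, here is a plain-Python stand-in for the encoding step. It is not the Keras class: in Keras the corresponding calls are Tokenizer.fit_on_texts() and Tokenizer.texts_to_sequences(), with the mapping exposed as word_index. The helper names and the three toy lines are assumptions for illustration; like the Keras Tokenizer, the stand-in numbers words from 1 in order of decreasing frequency.

```python
from collections import Counter


def fit_word_index(lines):
    """Map each word to an integer, starting at 1, most frequent word first."""
    counts = Counter(word for line in lines for word in line.split())
    # most_common() sorts by descending count; ties keep first-seen order
    ranked = counts.most_common()
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


def encode(lines, word_index):
    """Replace every word in every line by its integer."""
    return [[word_index[w] for w in line.split()] for line in lines]


# toy training lines (assumed data)
lines = ['book i i went', 'i i went down', 'i went down yesterday']
word_index = fit_word_index(lines)
sequences = encode(lines, word_index)
# index 0 is never assigned to a word, so the array needs one extra slot
vocab_size = len(word_index) + 1
print(word_index)
print(sequences)
print(vocab_size)
```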

Sequence Inputs and Output

Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.

We can do this with array slicing.

After separating, we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 to indicate the specific word at the index of the word’s integer value.

This is so that the model learns to predict the probability distribution for the next word; the ground truth from which to learn is 0 for all words except the actual word that comes next.

Keras provides the to_categorical() function that can be used to one hot encode the output words for each input-output sequence pair.

Finally, we need to specify to the Embedding layer how long input sequences are. We know that there are 50 words because we designed the model, but a good generic way to specify that is to use the second dimension (number of columns) of the input data’s shape. That way, if you change the length of sequences when preparing data, you do not need to change this data loading code; it is generic.
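The slicing and encoding can be sketched with NumPy alone. The toy array and vocabulary size are assumed values, and the two lines that build y_onehot are a NumPy stand-in for what to_categorical(y, num_classes=vocab_size) would produce.

```python
import numpy as np

# three toy encoded sequences: 3 input words + 1 output word each (assumed values)
sequences = np.array([[4, 1, 1, 2],
                      [1, 1, 2, 3],
                      [1, 2, 3, 5]])
vocab_size = 6  # 5 words + 1 for the unused index 0

# every column but the last is input, the last column is the word to predict
X, y = sequences[:, :-1], sequences[:, -1]

# one hot encode y: row i gets a 1 in column y[i] and 0 everywhere else
y_onehot = np.zeros((len(y), vocab_size))
y_onehot[np.arange(len(y)), y] = 1

# read the input length from the data rather than hard-coding it
seq_length = X.shape[1]
print(X.shape, y_onehot.shape, seq_length)
```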

Fit Model

We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input sequences as previously discussed. It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or larger values.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

A summary of the defined network is printed as a sanity check to ensure we have constructed what we intended.
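The architecture described above could be defined roughly as follows. This is a sketch under assumptions: the function name define_model() is hypothetical, the relu activation on the 100-neuron dense layer is not stated in the text, and the input length is declared with an explicit Input layer rather than through an argument on the Embedding layer.

```python
def define_model(vocab_size, seq_length):
    """Embedding -> LSTM -> LSTM -> Dense -> softmax over the vocabulary."""
    # imported here so the sketch can be loaded without starting TensorFlow
    from tensorflow.keras import Input
    from tensorflow.keras.layers import Dense, Embedding, LSTM
    from tensorflow.keras.models import Sequential

    model = Sequential([
        # each sample is a sequence of seq_length word integers
        Input(shape=(seq_length,)),
        # 50-dimensional learned vector for each of vocab_size indices
        Embedding(vocab_size, 50),
        # first LSTM returns the full sequence so a second LSTM can follow
        LSTM(100, return_sequences=True),
        LSTM(100),
        # assumption: relu activation on the interpreting dense layer
        Dense(100, activation='relu'),
        # one probability per word in the vocabulary
        Dense(vocab_size, activation='softmax'),
    ])
    return model
```

Calling define_model(vocab_size, seq_length) and then model.summary() on the result prints the layer table used as the sanity check.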

Next, the model is compiled, specifying the categorical cross entropy loss needed to fit the model. Technically, the model is learning a multi-class classification and this is the suitable loss function for this type of problem. The efficient Adam implementation of mini-batch gradient descent is used, and the accuracy of the model is evaluated.

Finally, the model is fit on the data for 100 training epochs with a modest batch size of 128 to speed things up.
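The compile and fit steps can be wrapped as shown in this sketch. The function name compile_and_fit() is hypothetical; it expects an already defined Keras model plus the X and one hot encoded y arrays prepared earlier, and uses only the standard compile() and fit() methods.

```python
def compile_and_fit(model, X, y, epochs=100, batch_size=128):
    """Compile a Keras model for next-word classification and train it."""
    # categorical cross entropy matches the one hot encoded targets
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    # 100 epochs with a batch size of 128, as described in the text
    model.fit(X, y, batch_size=batch_size, epochs=epochs)
    return model
```

Lowering epochs or raising batch_size in the call is the simplest way to shorten a trial run.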

Training may take a few hours on modern hardware without GPUs. You can speed it up with a larger batch size and/or fewer training epochs.

During training, you will see a summary of performance, including the loss and accuracy evaluated from the training data at the end of each batch update.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

You will get different results, but perhaps an accuracy of just over 50% of predicting the next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a model that memorized the text), but rather a model that captures the essence of the text.

Save Model

At the end of the run, the trained model is saved to file.

Here, we use the Keras model API to save the model to the file ‘model.h5‘ in the current working directory.

Later, when we load the model to make predictions, we will also need the mapping of words to integers. This is in the Tokenizer object, and we can save that too using Pickle.
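Both save steps are sketched below. The function name save_artifacts() and the tokenizer file name 'tokenizer.pkl' are assumptions made for this sketch; only 'model.h5' is named in the text.

```python
from pickle import dump


def save_artifacts(model, tokenizer,
                   model_path='model.h5',
                   tokenizer_path='tokenizer.pkl'):
    """Save the trained model and the fitted tokenizer side by side."""
    # Keras writes the architecture and weights to a single file
    model.save(model_path)
    # the tokenizer is a plain Python object, so pickle can serialize it
    with open(tokenizer_path, 'wb') as file:
        dump(tokenizer, file)
```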

Complete Example

We can put all of this together; the complete example for fitting the language model is listed below.

Use Language Model

Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical properties as the source text.

This is not practical, at least not for this example, but it gives a concrete example of what the language model has learned.

We will start by loading the training sequences again.

Load Data

We can use the same code from the previous section to load the training data sequences of text.

Specifically, the load_doc() function.

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text.

The model will require 50 words as input.

Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.
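The length calculation is a one-liner. In this sketch the three toy lines stand in for the loaded file, so the result is 3 rather than the 50 you would get from the real data.

```python
# toy stand-in for the lines loaded from the sequence file (assumed data)
lines = ['book i i went', 'i i went down', 'i went down yesterday']

# words on one line, minus the final word that is the prediction target
seq_length = len(lines[0].split()) - 1
print(seq_length)
```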

Load Model

We can now load the model from file.

Keras provides the load_model() function for loading the model, ready for use.

We can also load the tokenizer from file using the Pickle API.
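A sketch of the two loading steps follows. The function names and the default file name 'tokenizer.pkl' are assumptions; load_model() is the Keras function mentioned above, imported inside the helper so that loading the tokenizer alone does not start TensorFlow.

```python
from pickle import load


def load_tokenizer(tokenizer_path='tokenizer.pkl'):
    """Restore the pickled tokenizer saved after training."""
    with open(tokenizer_path, 'rb') as file:
        return load(file)


def load_trained_model(model_path='model.h5'):
    """Restore the trained Keras model from file."""
    # imported lazily; requires TensorFlow/Keras to be installed
    from tensorflow.keras.models import load_model
    return load_model(model_path)
```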

We are ready to use the loaded model.

Generate Text

The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we will print it so that we have some idea of what was used.
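Seed selection could look like the sketch below, again with toy lines standing in for the loaded file. randrange() is used because its upper bound is exclusive, so the index can never run past the end of the list.

```python
from random import randrange

# toy stand-in for the lines loaded from the sequence file (assumed data)
lines = ['book i i went', 'i i went down', 'i went down yesterday']

# pick one line at random to act as the seed
seed_text = lines[randrange(len(lines))]
print(seed_text + '\n')
```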

Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used when training the model.

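The model can predict the next word directly by calling model.predict_classes(), which returns the index of the word with the highest probability. In recent Keras releases, where predict_classes() is no longer available, applying argmax() to the output of model.predict() gives the same index.

We can then look up the index in the Tokenizer’s mapping to get the associated word.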

We can then append this word to the seed text and repeat the process.

Importantly, the input sequence is going to get too long. We can truncate it to the desired length after the input sequence has been encoded to integers. Keras provides the pad_sequences() function that we can use to perform this truncation.

We can wrap all of this into a function called generate_seq() that takes as input the model, the tokenizer, input sequence length, the seed text, and the number of words to generate. It then returns a sequence of words generated by the model.
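A sketch of generate_seq() is given here under a few stated substitutions. It takes the argmax of model.predict() instead of calling predict_classes(), it uses the tokenizer's index_word dictionary for the reverse lookup, and it replaces the pad_sequences() call with a plain list slice that mirrors that function's default behaviour (pad with 0 on the left, keep the most recent values).

```python
import numpy as np


def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    """Greedily generate n_words, feeding each predicted word back as input."""
    result = []
    in_text = seed_text
    for _ in range(n_words):
        # encode the running text with the tokenizer used during training
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # keep the last seq_length integers, left-padding with 0 if too short
        encoded = ([0] * seq_length + list(encoded))[-seq_length:]
        # probability of every vocabulary index being the next word
        probs = model.predict(np.array([encoded]), verbose=0)
        yhat = int(np.argmax(probs, axis=-1)[0])
        # reverse lookup: integer -> word ('' for the unused index 0)
        out_word = tokenizer.index_word.get(yhat, '')
        # the prediction becomes part of the next input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
```

A call such as generate_seq(model, tokenizer, seq_length, seed_text, 50) returns the 50 generated words as one string. Because the most likely word is always chosen, the same seed gives the same output for a given trained model.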

We are now ready to generate a sequence of new words given some seed text.

Putting this all together, the complete code listing for generating text from the learned language model is listed below.

Running the example first prints the seed text.

when he said that a man when he grows old may learn many things for he can no more learn much than he can run much youth is the time for any extraordinary toil of course and therefore calculation and geometry and all the other elements of instruction which are a

Then 50 words of generated text are printed.

preparation for dialectic should be presented to the name of idle spendthrifts of whom the other is the manifold and the unjust and is the best and the other which delighted to be the opening of the soul of the soul and the embroiderer will have to be said at

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

You can see that the text seems reasonable. In fact, the addition of concatenation would help in interpreting the seed and the generated text. Nevertheless, the generated text gets the right kind of words in the right kind of order.

Try running the example a few times to see other examples of generated text. Let me know in the comments below if you see anything interesting.


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed length (e.g. the longest sentence length).
  • Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words removed.
  • Tune Model. Tune the model, such as the size of the embedding or number of memory cells in the hidden layer, to see if you can develop a better model.
  • Deeper Model. Extend the model to have multiple LSTM hidden layers, perhaps with dropout to see if you can develop a better model.
  • Pre-Trained Word Embedding. Extend the model to use pre-trained word2vec or GloVe vectors to see if it results in a better model.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


Summary

In this tutorial, you discovered how to develop a word-based language model using a word embedding and a recurrent neural network.

Specifically, you learned:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

302 Responses to How to Develop a Word-Level Neural Language Model and Use it to Generate Text

  1. nike November 10, 2017 at 12:59 pm

    Thank you for providing this blog, have u use rnn to do recommend ? like use rnn recommend movies ,use the user consume movies sequences

    • Jason Brownlee November 11, 2017 at 9:15 am

      I have not used RNNs for recommender systems, sorry.

      • Mike April 29, 2020 at 10:49 pm

        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word

        Why are you doing sequential search on a dictionary?

        • Jason Brownlee April 30, 2020 at 6:44 am

          It is a reverse lookup, by value not key.

          • Mike April 30, 2020 at 7:14 pm

            I see. But isn’t there a tokenizer.index_word (:: index -> word) dictionary for this purpose?

          • Jason Brownlee May 1, 2020 at 6:34 am

            It might be, great tip! Perhaps it wasn’t around back when I wrote this, or I didn’t notice it.

      • Mike May 5, 2020 at 3:08 am

        Hey Jason, I have a question. I do not understand why the Embedding (input) layer and output layer are +1 larger than the number of words in our vocabulary.

        I am pretty certain that the output layer (and the input layer of one-hot vectors) must be the exact size of our vocabulary so that each output value maps 1-1 with each of our vocabulary word. If we add +1 to this size, where (to which word) does the extra output value map to?

        • Jason Brownlee May 5, 2020 at 6:34 am

          We add one “word” for “none” at index 0. This is for all words that we don’t know or that we want to map to “don’t know”.

  2. Stuart November 11, 2017 at 6:01 am

    Outstanding article, thank you! Two questions:

    1. How much benefit is gained by removing punctuation? Wouldn’t a few more punctuation-based tokens be a fairly trivial addition to several thousand word tokens?

    2. Based on your experience, which method would you say is better at generating meaningful text on modest hardware (the cheaper gpu AWS options) in a reasonable amount of time (within several hours): word-level or character-level generation?

    Also it would be great if you could include your hardware setup, python/keras versions, and how long it took to generate your example text output.

  3. Raoul November 12, 2017 at 11:55 pm

    Great tutorial and thank you for sharing!

    Am I correct in assuming that the model always spits out the same output text given a specific seed text?

    • Jason Brownlee November 13, 2017 at 10:17 am

      A trained model will, yes. The model will be different each time it is trained though:

      • Gustavo February 15, 2019 at 3:27 am

        Not always, you can generate new text by fitting a multinomial distribution, where you take the probability of a character occurring and not the maximum probability of a character. This allows more diversity to the generated text, and you can combine with “temperature” parameters to control this diversity.

  4. Sarang November 13, 2017 at 1:15 am


    Thanks for the excellent blogs, what do you think about the future of deep learning?
    Do you think deep learning is here to stay for another 10 years?

    • Jason Brownlee November 13, 2017 at 10:18 am

      Yes. The results are too valuable across a wide set of domains.

  5. Roger January 3, 2018 at 8:40 pm

    Hi Jason, great blog!
    Still, I got a question when running “model.add(LSTM(100, return_sequences=True))”.
    TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.
    Could you please help? Thanks.

    • Roger January 3, 2018 at 9:15 pm

      by the way I am using os system, py3.5

      • Kirstie January 5, 2018 at 1:03 am

        Hi Roger. I had the same issue, updating Tensorflow with pip install --upgrade tensorflow worked for me.

    • Jason Brownlee January 4, 2018 at 8:10 am

      Sorry I have not seen this error before. Perhaps try posting to stackoverflow?

      It may be the version of your libraries? Ensure everything is up to date.

  6. Roger January 8, 2018 at 6:14 pm

    Hi Jason,
    Is it possible to use your codes above as a language model instead of predicting next word?
    What I want is to judge “I am eating an apple” is more commonly used than “I an apple am eating”.
    For short sentence, may be I don’t have 50 words as input.
    Also, is it possible for Keras to output the probability with its index like the biggest probability for next word is “republic” and I want to get the index for “republic” which can be matched in tokenizer.word_index.

    • Roger January 8, 2018 at 8:53 pm

      do you have any suggestions if I want to use 3 previous words as input to predict next word? Thanks.

      • Jason Brownlee January 9, 2018 at 5:28 am

        You can frame your problem, prepare the data and train the model.

        3 words as input might not be enough.

      • Mike April 30, 2020 at 8:09 pm

        This is a Markov Assumption. The point of a recurrent NN model is to avoid that. If you’re only going to use 3 words to predict the next then use an n-gram or a feedforward model (like Bengio’s). No need for a recurrent model.

    • Jason Brownlee January 9, 2018 at 5:25 am

      Not sure about your application sorry.

      Keras can predict probabilities across the vocabulary and you can use argmax() to get the index of the word with the largest probability.

  7. Cid February 14, 2018 at 3:47 am

    Hey Jason,
    Thanks for the post. I noticed in the extensions part you mention Sentence-Wise Modelling.
    I understand the technique of padding (after reading your other blog post). But how does it incorporate a full stop when generating text. Is it a matter of post-processing the text? Could it be possible/more convenient to tokenize a full stop prior to embedding?

    • Jason Brownlee February 14, 2018 at 8:23 am

      I generally recommend removing punctuation. It balloons the size of the vocab and in turn slows down modeling.

      • Cid February 14, 2018 at 8:00 pm

        OK thanks for the advice, how could I incorporate a sentence structure into my model then?

        • Jason Brownlee February 15, 2018 at 8:40 am

          Each sentence could be one “sample” or sequence of words as input.

  8. Maria February 16, 2018 at 10:32 pm

    Hi Jason, I tried to use your model and train it with a corpus I had, everything seemed to work fine, but at the end I have this error:
    34 sequences = array(sequences)
    35 #print(sequences)
    ---> 36 X, y = sequences[:,:-1], sequences[:,-1]
    37 # print(sequences[:,:-1])
    38 # X, y = sequences[:-1], sequences[-1]

    IndexError: too many indices for array

    regarding the sliding of the sequences. Do you know how to fix it?
    Thanks so much!

    • Jason Brownlee February 17, 2018 at 8:45 am

      Perhaps double check your loaded data has the shape that you expect?

      • Ray March 1, 2018 at 7:31 am

        Hi Jason and Maria.
        I am having the exact same problem too.
        I was hoping Jason might have better suggestion
        here is the error:

        IndexError Traceback (most recent call last)
        in ()
        1 # separate into input and output
        2 sequences = array(sequences)
        ----> 3 X, y = sequences[:,:-1], sequences[:,-1]
        4 y = to_categorical(y, num_classes=vocab_size)
        5 seq_length = X.shape[1]

        IndexError: too many indices for array

        • Jason Brownlee March 1, 2018 at 3:05 pm

          Are you able to confirm that your Python 3 environment is up to date, including Keras and tensorflow?

          For example, here are the current versions I’m running with:

          • Doron Ben Elazar March 20, 2018 at 8:14 am

            It means that your input is not even and np.array doesn’t parse it properly (the author created paragraphs of 50 tokens each), a possible fix would be:

            original_sequences = tokenizer.texts_to_sequences(text_chunk)

            vocab_size = len(tokenizer.word_index) + 1

            aligned_sequences = []
            for sequence in original_sequences:
                # zero-pad each sequence on the right up to max_len
                aligned_sequence = np.zeros(max_len, dtype=np.int64)
                aligned_sequence[:len(sequence)] = np.array(sequence, dtype=np.int64)
                aligned_sequences.append(aligned_sequence)

            sequences = np.array(aligned_sequences)

          • Cathy October 18, 2019 at 10:30 pm

            Do you use gensim for generating this code?
            If you use gensim, where are the gensim commands you used in your code?

          • Jason Brownlee October 19, 2019 at 6:37 am

            Gensim is not used in this tutorial.

        • Basil June 28, 2018 at 6:36 pm

          Don’t know if it is too late to respond. The issue arises because you have by mistake typed

          tokenizer.fit_on_sequences instead of tokenizer.texts_to_sequences.

          Hope this helps others who come to this page in the future!

          Thanks Jason.

    • Avatar
      Naetmul October 28, 2018 at 1:06 pm #

      The main problem is that tokenizer.texts_to_sequences(lines) returns List of List, not a guaranteed rectangular 2D list,
      which means that sequences may have this form:
      [[1, 2, 3], [2, 3, 4, 5]]
      but the example in this article assumes that sequences should be a rectangular shaped list like:
      [[1, 2, 3, 4], [2, 3, 4, 5]]

      If you used a custom clean_doc function, you may need to use custom filter parameter for the Tokenizer(), like tokenizer = Tokenizer(filters='\n').

      The constructor of Tokenizer() has an optional parameter
      filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.

      The example already removed all the punctuation and whitespace, so it will not be a problem in the example.
      However, if you used a custom one, then it can be a problem.
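
The rectangular-versus-ragged distinction described above can be checked before calling array(sequences). This is a minimal sketch in plain Python; the helper name find_ragged is hypothetical, and it assumes that the most common line length is the intended one.

```python
from collections import Counter

def find_ragged(sequences):
    """Return (expected_length, indices of rows whose length differs)."""
    lengths = [len(seq) for seq in sequences]
    # Treat the most common length as the expected one (an assumption).
    expected = Counter(lengths).most_common(1)[0][0]
    bad = [i for i, n in enumerate(lengths) if n != expected]
    return expected, bad

# Two rows of 4 integers and one row of 3: the last row is the odd one out.
ragged = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 3]]
expected, bad = find_ragged(ragged)
print(expected, bad)
```

If the list of bad indices is not empty, inspecting those lines usually shows which characters the tokenizer split or filtered differently from the cleaning step.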

  9. Avatar
    vikas dixit March 5, 2018 at 9:26 pm #

    Sir, I have a vocabulary size of 12000, and when I use to_categorical the system throws a MemoryError as shown below:

    /usr/local/lib/python2.7/dist-packages/keras/utils/np_utils.pyc in to_categorical(y, num_classes)
    22 num_classes = np.max(y) + 1
    23 n = y.shape[0]
    ---> 24 categorical = np.zeros((n, num_classes))
    25 categorical[np.arange(n), y] = 1
    26 return categorical


    How to solve this error??

    • Avatar
      Jason Brownlee March 6, 2018 at 6:12 am #

      Perhaps try running the code on a machine with more RAM, such as on AWS EC2?

      Perhaps try mapping the vocab to integers manually with more memory efficient code?
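
A rough estimate shows why the one-hot matrix in the traceback above runs out of memory. np.zeros defaults to 8-byte floats, so the matrix needs n × num_classes × 8 bytes. In this sketch the vocabulary size comes from the comment, while the sample count is a hypothetical figure chosen only for illustration.

```python
def one_hot_bytes(n_samples, vocab_size, bytes_per_value=8):
    """Bytes needed for a dense (n_samples, vocab_size) one-hot matrix."""
    return n_samples * vocab_size * bytes_per_value

# Hypothetical sample count; the 12,000-word vocabulary is from the comment.
n_samples = 100_000
vocab_size = 12_000
gib = one_hot_bytes(n_samples, vocab_size) / 2**30
print(f"{gib:.1f} GiB")
```

One commonly used way to avoid building this matrix at all is to keep y as integers and compile the model with Keras's sparse_categorical_crossentropy loss. That is a change from the tutorial's code, not part of it.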

  10. Avatar
    Eli Mshomi March 7, 2018 at 9:40 am #

    Can you generate an article based on other related articles that is human readable?

    • Avatar
      Jason Brownlee March 7, 2018 at 3:05 pm #

      Yes, I believe so. The model will have to be large and carefully trained.

  11. Avatar
    Adam March 12, 2018 at 9:52 pm #

    Hi Jason,
    I just tried out your code here with my own text sample (blog posts from a tumblr blog) and trained it and it’s now gotten to the point where text is no longer “generated”, but rather just sent back verbatim.
    The sample set I’m using is rather small (each individual post is fairly small, so I made each sequence 10 input 1 output) giving me around 25,000 sequences. Everything else was kept the same code from your tutorial – after around 150 epochs, the accuracy was around 0.99 so I cut it off to try generation.

    When I change the seed text from something in the sample to something else from the vocabulary (i.e. not a full line but a “random” line) then the text is fairly random, which is what I wanted. When the seed text is changed to something outside of the vocabulary, the same text is generated each time.

    What should I do if I want something more random? Should I change something like the memory cells in the LSTM layers? Reduce the sequence length? Just stop training at a higher loss/lower accuracy?

    Thanks a ton for your tutorials, I’ve learned a lot through them.

    • Avatar
      Jason Brownlee March 13, 2018 at 6:28 am #

      Perhaps the model is overfit and could be trained less?

      Perhaps a larger dataset is required?

      • Avatar
        Adam March 13, 2018 at 9:17 am #

        I think it might just be overfit. Sadly, there isn’t more data that I can grab (at least that i know of currently) so I can’t grab much more data which sucks – that’s why I reduced the sequence length to 10.

        I checkpointed every epoch so that I can play around with what gives the best results. Thanks for your advice!

        • Avatar
          Jason Brownlee March 13, 2018 at 3:04 pm #

          Perhaps also try a blend/ensemble of some of the checkpointed models or models from multiple runs to see if it can give a small lift.
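
Besides training less, another option for the "more random" question in this thread is to sample the next word from the predicted distribution instead of always taking the most likely one. This is not something the tutorial does. The sketch below assumes probs is the model's softmax output for one step; the function name and the temperature parameter are illustrative.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after rescaling by temperature."""
    # Rescale log-probabilities; a small epsilon avoids log(0).
    logits = [math.log(p + 1e-12) / temperature for p in probs]
    top = max(logits)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

# Higher temperature gives more varied picks; lower is closer to argmax.
probs = [0.1, 0.7, 0.2]
print(sample_with_temperature(probs, temperature=1.5))
```

A temperature near zero behaves almost like picking the most likely word, while values above 1 flatten the distribution and produce more varied text.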

  12. Avatar
    Prateek March 30, 2018 at 4:05 pm #

    Hi Jason,

    I check the accuracy and loss in TensorFlow. I would like to know what exactly you mean by accuracy in NLP?

    In computer vision, if we wish to predict cat and the predicted output of the model is cat, then we can say that the accuracy of the model is greater than 95%.

    I want to understand physically what do we mean by accuracy in NLP models. Can you please explain?

  13. Avatar
    Jin Zhou April 14, 2018 at 12:10 pm #

    Hi, I have a question about evaluating this model. As I know, perplexity is a very popular method. For BLEU and perplexity, which one do you think is better? Could you give an example about how to evaluate this model in keras.
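
For reference, perplexity can be computed from the probability the model assigned to each word that actually came next: it is the exponential of the average negative log probability. This is a minimal sketch, assuming those per-word probabilities have already been read from the model's softmax output on held-out text.

```python
import math

def perplexity(word_probs):
    """Perplexity from the probabilities assigned to each actual next word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# A model that always gives the correct word probability 0.25 behaves like
# a uniform guess among 4 words, so its perplexity is about 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Lower values mean the model was less surprised by the text.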

  14. Avatar
    Jin April 23, 2018 at 7:11 pm #

    Hi Jason, I am a little confused, you told us that we need 100 words as input, but your X_train is only 50 words per line. Could you explain that a little?

  15. Avatar
    jason May 1, 2018 at 9:54 am #


    You have an embedding layer as part of the model. The embedding weights will be trained along with the other layers. Why not separate it and train it independently? In other words, first use an Embedding alone to train word vectors/embedding weights. Then run your model with the embedding layer initialized with those weights and trainable=False. I see most people use your approach to train a model. But this is kind of against the purpose of an embedding because the output is not word context but 0/1 labels. Why not replace the embedding with an ordinary layer with linear activation?

    Another jason

  16. Avatar
    johhnybravo May 19, 2018 at 2:33 am #

    ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array with shape (1,) while predicting output

  17. Avatar
    Nitin Mukesh May 30, 2018 at 6:33 pm #

    Hi, I want to develop Image Captioning in Keras. What are the prerequisites for this? I have done your tutorials for object detection using CNN. What should I do next?

  18. Avatar
    mel_dagh June 1, 2018 at 1:59 am #

    Hi Jason,

    1.) Can we use this approach to predict if a word in a given sequence of the training data is highly odd..i.e. that it does not occur in that position of the sequence with high probability.

    2.) Is there a way to integrate pre-trained word embeddings (glove/word2vec) in the embedding layer?

  19. Avatar
    Tanaya June 8, 2018 at 7:29 pm #

    Hello, this is simply an amazing post for beginners in NLP like me.

    I have generated a language model, which is further generating text like I want it to.
    My question is, after this generation, how do I filter out all the text that does not make sense, syntactically or semantically?

    • Avatar
      Jason Brownlee June 9, 2018 at 6:49 am #


      You might need another model to classify text, correct text, or filter text prior to using it.

      • Avatar
        Tanaya June 12, 2018 at 4:40 pm #

        Will this also be using Keras? Do you recommend using nltk or SpaCy?

  20. Avatar
    Venkat June 12, 2018 at 4:38 am #

    Hi Jason,

    Thanks for the amazing post.

    I’m working on words correction in a sentence. Ideally the model should generate number of output words equal to input words with correction of error word in the sentence.

    Would a language model help with this? If it does, please leave some hints on the model.

    • Avatar
      Jason Brownlee June 12, 2018 at 6:48 am #

      Sounds like a fun project. A language model might be useful. Sorry, I don’t have a worked example. I recommend checking the literature.

  21. Avatar
    Amar June 20, 2018 at 8:53 pm #

    Hi Jason,

    I am working on a highly imbalanced text classification problem.
    Ex : Classify a “Genuine Email ” vs “SPAM ” based on text in body of email.

    Genuine email text = 30 k lines
    Spam email text = 1k lines

    I need to classify whether the next email is Genuine or SPAM.

    I train the model using the same example as above.
    The training data I feed to the model is only “Genuine text”.

    Will I be able to classify the next sentence I feed to the model, from the generated text and probability of words : as “GENUINE Email” vs “SPAM”?

    (I am assuming that the model has never seen SPAM data, and hence the probability of the generated text will be very low.)

    Do you see any way I can achieve this with Language model? What would be an alternative otherwise when it comes to rare event scenario for NLP use cases.

    Thank you!!

  22. Avatar
    musa June 29, 2018 at 2:36 am #

    Hi Jason,

    Could you comment on overfitting when training language models? I’ve built a sentence-based LSTM language model and have split training:validation into an 80:20 split. I’m not seeing any improvements on my validation data whilst the accuracy of the model seems to be improving.


    • Avatar
      Jason Brownlee June 29, 2018 at 6:13 am #

      Overfitting a language model really only matters in the context of the problem for which you are using the model.

      A language model used alone is not really that useful, so overfitting doesn’t matter. It may be purely a descriptive model rather than predictive.

      Often, they are used in the input or output of another model, in which case, overfitting may limit predictive skill.

  23. Avatar
    Marc June 30, 2018 at 8:42 am #

    What would the implications of returning the hidden and cell states be here?

    I’d imagine if the input sequence was long enough it wouldn’t matter too much as the temporal relationships would be captured, but if we had shorter sequences or really long documents we would consider doing this to improve the model’s ability to learn.. am I thinking about this correctly?

    What would the drawback be to returning the hidden sequence and cell state and piping that into the next observation?

    • Avatar
      Jason Brownlee July 1, 2018 at 6:21 am #

      I don’t follow, why would you return the cell state externally at all? It has no meaning outside of the network.

      • Avatar
        star July 5, 2018 at 2:25 am #

        Hope you are doing well. I have a question which returns to my understanding from embedding vectors.
        For example if I have this sentence “ the weather is nice” and the goal of my model is predicting “nice”, when I want to use pre trained google word embedding model, I must search embedding google matrix and find the embedding vector related to words “the” “weather” “is” “nice” and feed them as input to my model? Am I right?

  24. Avatar
    Fatemeh July 4, 2018 at 2:34 am #

    Hello, Thank you for nice description. I want to use pre-trained google word embedding vectors, So I think I don’t need to do sequence encoding, for example if I want to create sequences with the length 10, I have to search embedding matrix and find the equivalence embedding vector for each of the 10 words, right?

    • Avatar
      Jason Brownlee July 4, 2018 at 8:28 am #

      Correct. You must map each word to its distributed representation when preparing the embedding or the encoding.

      • Avatar
        Fatemeh July 5, 2018 at 2:43 am #

        Thank you, do you have sample code for that in your book? I purchased your professional package

        • Avatar
          Jason Brownlee July 5, 2018 at 8:01 am #

          All sample code is provided with the PDF in the code/ directory.

          • Avatar
            fatemeh July 6, 2018 at 1:57 am #

            Thank you. When I want to fit the model, I have to specify X and y, and I have used the pre-trained Google embedding matrix, so every word is mapped to a vector, and my inputs are actually the sentences (with the length 4). Now I don’t understand the equivalent values for X. For example, imagine the first sentence is “the weather is nice”, so the X will be “the weather is” and the y is “nice”. When I want to convert X to integers, will every word in X be mapped to one vector? For example, if the equivalent vectors for the words in the sentence in the Google model are “the”=0.9,0.6,0.8, “weather”=0.6,0.5,0.2, “is”=0.3,0.1,0.5 and “nice”=0.4,0.3,0.5, will the input X be [[0.9,0.6,0.8],[0.6,0.5,0.2],[0.3,0.1,0.5]] and the output y be [0.4,0.3,0.5]?

          • Avatar
            Jason Brownlee July 6, 2018 at 6:45 am #

            You can specify or learn the mapping, but after that the model will map integers to their vectors.
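
The point about integers and vectors can be illustrated with the numbers from the question above. In this sketch X stays a list of integers, and the lookup function only imitates what an embedding layer does inside the model. The word-to-integer mapping is hypothetical, and y is the integer of the target word (one hot encoded in the tutorial), not its vector.

```python
# Hypothetical word-to-integer mapping; index 0 is left unused for padding.
word_index = {"the": 1, "weather": 2, "is": 3, "nice": 4}

# Row i holds the vector for the word encoded as integer i
# (values taken from the example in the comment above).
embedding_matrix = [
    [0.0, 0.0, 0.0],
    [0.9, 0.6, 0.8],
    [0.6, 0.5, 0.2],
    [0.3, 0.1, 0.5],
    [0.4, 0.3, 0.5],
]

def encode(words):
    """Map words to integers; this is what gets passed to the model as X."""
    return [word_index[w] for w in words]

def embed(encoded):
    """Imitate an embedding layer: look up one row per integer."""
    return [embedding_matrix[i] for i in encoded]

X = encode(["the", "weather", "is"])
y = word_index["nice"]
print(X, y, embed(X))
```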

  25. Avatar
    mh July 5, 2018 at 4:19 am #

    Hello Jason, why didn’t you convert the X input to one-hot vector format? You only did this for the y output.

  26. Avatar
    Anshuman Mahapatra July 5, 2018 at 10:43 pm #

    Hi. First of all, I would like to thank you for the detailed explanation of the concept. I was executing this model step by step to have a better understanding but am stuck while doing the predictions
    yhat = model.predict_classes(encoded, verbose=0)
    I am getting the below error:-
    ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got array with shape (1,)

  27. Avatar
    NLP_enthusiast August 3, 2018 at 5:08 pm #

    Hello, nice tutorial!

    When calling: model.fit(X, y, batch_size=128, epochs=100)
    what is X.shape and y.shape at this point?

    I’m getting the error:
    Error when checking input: expected embedding_1_input to have shape (500,) but got array with shape (1,)

    In my case:
    X.shape = (500,) # 500 samples in my case
    y.shape = (500, 200) # this is after y= to_categorical(y, num_classes=vocab_size)

    Using Theano backend.

      • Avatar
        NLP_enthusiast August 4, 2018 at 7:46 am #

        Thank you. Potentially useful to others: X.shape should be of the form (a, b) where a is the length of “sequences” and b is the input sequence length used to make forward predictions. Note that modification/omission of string.maketrans() is likely necessary if using Python 2.x (instead of Python 3.x) and that the Theano backend may also alleviate potential dimension errors from TensorFlow.

        • Avatar
          Jason Brownlee August 5, 2018 at 5:21 am #


        • Avatar
          Mike April 30, 2020 at 8:15 pm #

          I don’t understand why the length of each sequence must be the same (i.e. fixed)? Isn’t the point of RNNs to handle variable length inputs by taking as input one word at a time and have the rest represented in the hidden state.

          Is it used for optimization when we happen to use the same size for all but it’s not actually necessary for us to do so? That would make more sense.

          • Avatar
            Jason Brownlee May 1, 2020 at 6:36 am #

            You are correct, and a dynamic RNN can do this.

            Most of my implementations use a fixed length input/output for efficiency and zero padding with masking to ignore the padding.

    • Avatar
      usman September 4, 2018 at 5:35 pm #

      @NLP_enthusiast How did you solve this error?

      • Avatar
        Carlos Aguayo November 20, 2018 at 10:49 am #

        You can just replace these:

        table = string.maketrans(string.punctuation, ' ')
        tokens = [w.translate(table) for w in tokens]

        with this:

        tokens = [' ' if w in string.punctuation else w for w in tokens]

  28. Avatar
    Eric August 20, 2018 at 9:15 pm #

    Does anyone have an example of how to predict based on a user-provided text string instead of random sample data? I’m kind of new to this so my apologies in advance for such a simple question. Thank you!

    • Avatar
      Jason Brownlee August 21, 2018 at 6:15 am #

      The new input data must be prepared in the same way as the training data for the model.

      E.g. the same data preparation steps and tokenizer.
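
As a rough illustration of preparing user input the same way as the training data, the sketch below cleans a string and maps it to integers with the training vocabulary. It assumes the cleaning steps are stripping punctuation, dropping non-alphabetic tokens, and lowercasing. The function name is hypothetical, and the dictionary stands in for the word_index of the tokenizer that was fit on the training text.

```python
import string

def encode_seed(text, word_index):
    """Clean text like the training data, then map known words to integers."""
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table).lower() for w in text.split()]
    tokens = [w for w in tokens if w.isalpha()]
    # Words missing from the training vocabulary are skipped here.
    return [word_index[w] for w in tokens if w in word_index]

# Hypothetical vocabulary standing in for tokenizer.word_index.
word_index = {"the": 1, "weather": 2, "is": 3}
print(encode_seed("The weather, is NICE!", word_index))
```

With the real tokenizer, calling texts_to_sequences on the cleaned string plays the same role; by default it also drops words it has not seen.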

  29. Avatar
    Santosh August 21, 2018 at 6:28 am #

    Can we implement this thing in Android platform to run trained model for given set of words from user.

    • Avatar
      Jason Brownlee August 21, 2018 at 6:38 am #

      Maybe, I don’t have any examples on Android, sorry.

  30. Avatar
    anirban August 31, 2018 at 11:29 pm #

    One of the extensions suggested in the blog is
    Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed length (e.g. the longest sentence length).
    So if I do sentence-wise splitting, do I retain the punctuation or remove it?

  31. Avatar
    wu.zheng September 6, 2018 at 1:40 pm #

    # separate into input and output
    sequences = array(sequences)
    X, y = sequences[:,:-1], sequences[:,-1]

    Does this model only use the 50 preceding words to predict the last word?

    the context length is fixed ?
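
To make the slicing in the snippet above concrete, here is the same split on a toy example in plain Python. Every line contributes all but its last word as input and its last word as output, so the context length equals the line length minus one. The toy lines have 5 words each; per the comments here, the tutorial's lines give 50 input words.

```python
# Three toy lines of 5 encoded words each.
sequences = [
    [1, 2, 3, 4, 5],
    [2, 3, 4, 5, 6],
    [3, 4, 5, 6, 7],
]

# Pure-Python equivalent of sequences[:, :-1] and sequences[:, -1].
X = [row[:-1] for row in sequences]
y = [row[-1] for row in sequences]
seq_length = len(X[0])
print(X, y, seq_length)
```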

  32. Avatar
    Anshul Patel October 3, 2018 at 5:55 pm #

    Hello Jason, I am working on word predictor using RNN’s too. However I have been encountering the same problem faced by many others i.e. INDEXERROR : Too many Indices

    lines = training_set.split('\n')
    tokenizer = Tokenizer()
    sequences = tokenizer.texts_to_sequences(lines)

    vocab_size = len(tokenizer.word_index) + 1

    sequences = array(sequences)
    X_train = sequences[:, :-1]
    y_train = sequences[:,-1]

    I have gone through all the comments related to this error; however, none of them solve my issue. I wonder if there is any problem with the text I imported; it is the Pride and Prejudice book from Gutenberg.

    Please help me : THANKS in Advance !!!

    • Avatar
      Jason Brownlee October 4, 2018 at 6:12 am #

      I have not seen this error, are you able to confirm that your libraries are up to date and that you copied all of the code from the tutorial?
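
One likely cause of the "too many indices" error in this thread is that lines taken straight from a book have different word counts, so array(sequences) cannot form a 2D array. A sliding window over the cleaned tokens gives every line the same length. This is a minimal sketch; the function name is hypothetical and the default of 50 input words plus 1 output word follows the figures mentioned in the comments.

```python
def make_sequences(tokens, seq_length=50):
    """Slide a window over tokens: seq_length inputs plus 1 output word."""
    window = seq_length + 1
    return [tokens[i - window:i] for i in range(window, len(tokens) + 1)]

# Toy example with 3 input words per line.
tokens = list("abcdefg")
for line in make_sequences(tokens, seq_length=3):
    print(" ".join(line))
```

Each resulting line can be joined with spaces and saved one per line, so that every line encodes to the same number of integers. The tokenizer also needs to be fit on those lines (fit_on_texts) before texts_to_sequences is called.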

  33. Avatar
    Andrew October 16, 2018 at 4:51 pm #

    Very nice article.

    How long did it take to train model like this on 118633 sequences of 50 symbols from 7410 elements dictionary?

    Can you share on what machine did you train the model? ram, cpu, os?

  34. Avatar
    ervin October 20, 2018 at 3:36 am #

    Great article!

    2 questions:

    1. In your ‘Extension’ section, you mentioned trying dropout. Since there is no validation/holdout set, why should we use dropout? Isn’t that just a regularization technique that doesn’t help with training data accuracy?

    2. Speaking of accuracy — I trained my model to 99% accuracy. When I generate text and use exact lines from PLATO as seed text, I should be getting almost an exact replica of PLATO right? Since my model has 99% of getting the next word right in my seed text. I’m finding that this is not the case. Am I interpreting this 99% accuracy right? What’s keeping it from making a replica.

    • Avatar
      Jason Brownlee October 20, 2018 at 5:59 am #

      Yes, it is a regularization method. Yes, it can help as the model is trained using supervised learning.

      Accuracy is not a valid measure for a language model.

      We don’t want the model to memorize Plato, we want the model to learn how to make Plato-like text.

  35. Avatar
    OkRo November 29, 2018 at 9:17 pm #

    Hi , Great article!
    I have a question: how can I use the model to get the probability of a word in a sentence?
    e.g. P(Piraeus | went down yesterday to the)?

    • Avatar
      Jason Brownlee November 30, 2018 at 6:31 am #

      I don’t think the model can be used in this way.

  36. Avatar
    Emi December 10, 2018 at 12:36 pm #

    I am a big fan of your tutorials and I used to search your tutorials first when I want to learn things in deep learning.

    I currently have a problem as follows.

    I have a huge corpus of unstructured text (where I have already cleaned and tokenised as; word1, word2, word3, … wordN). I also have a target word list which should be outputted by analysing these cleaned text (e.g., word88, word34, word 48, word10002, … , word8).

    I want to build a language model that correctly outputs these target words using my cleaned text. I am just wondering if it is possible to do this using deep learning techniques? Please let me know your thoughts.

    If you have a tutorial related to the above scenario, please share it with me (as I could not find one).

    Thank you for the great tutorials.


    • Avatar
      Jason Brownlee December 10, 2018 at 2:19 pm #

      Almost sounds like a translation or text summarization task?

      Perhaps models used for those problems would be a good starting point?

      • Avatar
        Emi December 10, 2018 at 4:36 pm #

        Thank you for your reply. No, it is not translation or summarisation. I have given an example below with more details 🙂


        My current input preparation process is as follows:
        Unstructured text -> cleaning the data -> get only the informative words -> calculate different features

        Example (consider we have only 5 words):
        Informative words = {“Deep Learning”, “SVM”, “LSTM”, “Data Mining”, ‘Python’}

        For each word I also have features (consider we have only 3 words)
        Features = {Frequency, TF-IDF, MI}
        However, I am not sure if I need these features when constructing a deep learning model.


        My output is a ranked list of informative words.
        Target output = {‘SVM’, ‘Data Mining’, ‘Deep Learning’, ‘Python’, ‘LSTM’}

        I am just trying to figure out if there is a way to obtain my target output using deep learning or ML model? Please let me know your thoughts.


        • Avatar
          Jason Brownlee December 11, 2018 at 7:39 am #

          Sounds like you want a model to output the same input, but ordered in some way.

          Perhaps try an encoder-decoder?

          • Avatar
            Emi December 11, 2018 at 12:48 pm #

            Hi Jason,

            Thank you very much for your suggestion. I followed the following tutorial of yours today related to encoder-decoder:

            It is very well explained and I would like to use it for my task. However, I got one small problem.

            In your example, you are using 100000 training examples as mentioned below.

            X=[22, 17, 23, 5, 29, 11] y=[23, 17, 22]
            X=[28, 2, 46, 12, 21, 6] y=[46, 2, 28]
            X=[12, 20, 45, 28, 18, 42] y=[45, 20, 12]
            X=[3, 43, 45, 4, 33, 27] y=[45, 43, 3]

            X=[34, 50, 21, 20, 11, 6] y=[21, 50, 34]

            However, in my task I only have one input sequence and target sequence as shown below (same input, but ordered in different way).

            Informative words = {“Deep Learning”, “SVM”, “LSTM”, “Data Mining”, ‘Python’}
            Target output = {‘SVM’, ‘Data Mining’, ‘Deep Learning’, ‘Python’, ‘LSTM’}

            Would it be a problem?

            PS: However, my input and target sequence are very long in my real dataset (around 10000 words of length).

          • Avatar
            Jason Brownlee December 11, 2018 at 2:33 pm #

            You will have to adapt the model to your problem and run tests in order to discover whether the model is suitable or not.

  37. Avatar
    Emi December 11, 2018 at 10:38 am #

    Thank you very much for your valuable suggestion. I truly appreciate it 🙂

    I have followed your tutorial of ‘Encoder-Decoder LSTM’ for time-series analysis. Do you have any tutorial of ‘encoder-decoder’ that is close to my task? If so, can you please share the link with me?

  38. Avatar
    Jin December 14, 2018 at 8:23 pm #

    Hi Jason, I have a question about the pre-trained word vectors. I know I should set the embedding layer with weights=[pre_embedding], but how should I decide the order of pre_embedding? That is, which word does the vector in a certain row represent? Also, should the first row always be all zeros?

    • Avatar
      Jason Brownlee December 15, 2018 at 6:11 am #

      The index of each vector must match the encoded integer of each word in the vocabulary.

      That is why we collect the vectors needed for each word in the vocab incrementally.
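
The row ordering described above can be sketched as follows. The function name and the toy inputs are hypothetical; word_index stands in for the tokenizer's word-to-integer mapping, which starts at 1, and vectors stands in for a dictionary of loaded pre-trained vectors.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector for the word whose encoded integer is i."""
    # Row 0 stays all zeros because integer 0 is not assigned to any word.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, i in word_index.items():
        vector = vectors.get(word)
        if vector is not None:  # words without a pre-trained vector stay zero
            matrix[i] = list(vector)
    return matrix

word_index = {"the": 1, "cat": 2}
vectors = {"the": [0.1, 0.2]}
print(build_embedding_matrix(word_index, vectors, dim=2))
```

The resulting matrix has one row more than the vocabulary, which matches a vocab_size computed as len(tokenizer.word_index) + 1.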

  39. Avatar
    Lee January 9, 2019 at 1:42 pm #

    I am running the model as described here and in the book, but the loss goes to nan frequently. My computer (18 cores + Vega 64 graphics card) also takes much longer to run an epoch than shown here. All cpu leads to 1 hour finishing time. Encoding as int8 and using the GPU via PlaidML speeds it up to ~376 seconds, but nan’s out.

    Any advice? The code is exactly as is used both here and the book, but I just can’t get it to finish a run.

    • Avatar
      Jason Brownlee January 10, 2019 at 7:46 am #

      I have two ideas:

      Did you try running the code file provided with the book?
      Are you able to confirm that your libraries are up to date?

  40. Avatar
    Palani January 20, 2019 at 6:11 pm #

    Great tutorial for a beginner Jason! Thanks!

  41. Avatar
    Matt Lust February 5, 2019 at 9:17 am #

    I’m running this in Google Colab (albeit with a different and larger data set). The Colab system crashes and my runtimes are basically reset.

    I broke down each section as you do in this example and found that it crashes at this code:

    # separate into input and output
    sequences = array(sequences)
    X, y = sequences[:,:-1], sequences[:,-1]
    y = to_categorical(y, num_classes=vocab_size)
    seq_length = X.shape[1]

    I think the issue is that my dataset might be too large but I’m not sure.

    Here’s a link to the Colab notebook.

    • Avatar
      Jason Brownlee February 5, 2019 at 2:20 pm #

      Sorry, I don’t have any experience with that environment.

      I recommend running all code on your own workstation.

    • Avatar
      Valerie May 2, 2019 at 11:06 pm #

      Same problem(((( google colab doesnt have enough RAM for such a big matrix.

  42. Avatar
    vijay February 13, 2019 at 4:49 pm #

    Hey Jason,
    I have two questions,

    1.what will happen when we test with new sequence,instead of trying out same sequence already there in training data?

    2.why could you use voc_size-1?

    • Avatar
      Jason Brownlee February 14, 2019 at 8:38 am #

      Good question, not sure. The model is tailored to the specific dataset, it might just generate garbage or iterate back to what it knows.

      Not sure about your second question, what are you referring to exactly?

  43. Avatar
    saria March 13, 2019 at 5:19 am #

    Hi Jason, I hope you can help me with my confusion.
    So when we feed the data into an LSTM, one dimension is the features and another is the time steps.
    If I am interested in keeping the context as one paragraph, and the longest paragraph I have is 200 words, I should set the number of time steps to 200.
    But what will happen to the features?
    What will be the features fed to the model?
    Are the features here words or paragraphs?
    Sorry, this confused me a lot; I am not sure how to prepare my text data.
    If my features here are words, then why do I even need to split by paragraph? In that case I can split by words with 200 time steps, so a context of 200 will be kept for my case.
    Am I right?

    Thanks for taking the time:)

    • Avatar
      Jason Brownlee March 13, 2019 at 8:01 am #

      If you are feeding words in, a feature will be one word, either one hot encoded or encoded using a word embedding.

  44. Avatar
    islam March 29, 2019 at 11:23 pm #

    Thanks for every other informative website. Where else may
    I get that kind of info written in such a perfect means? I
    have a mission that I am just now working on, and I’ve been on the look
    out for such information.

  45. Avatar
    torr March 31, 2019 at 3:51 pm #

    Can you help me with code and good article for grammar checker.

    • Avatar
      Jason Brownlee April 1, 2019 at 7:47 am #

      Sorry, I don’t have an example of a grammar checker.

  46. Avatar
    Aksha April 6, 2019 at 3:45 am #

    I am new to NLP realm. If you have an input text “The price of orange has increased” and output text “Increase the production of orange”. Can we make our RNN model to predict the output text? Or what algorithm should I use? Could you please let me know what algorithm to use for mapping input sentence to output sentence.

  47. Avatar
    Thomas L. Packer April 27, 2019 at 7:48 am #

    For those who want to use a neural language model to calculate probabilities of sentences, look here:
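
Independently of the resource referred to above, the general idea can be sketched with the chain rule: the log probability of a sentence is the sum of the log probabilities of each word given the words before it. In this sketch, next_word_probs is a hypothetical stand-in for a trained model that returns a word-to-probability dictionary for a given context.

```python
import math

def sentence_log_prob(words, next_word_probs):
    """Sum log P(word | preceding words) over the sentence (chain rule)."""
    total = 0.0
    for i, word in enumerate(words):
        probs = next_word_probs(words[:i])  # dict: candidate word -> probability
        total += math.log(probs[word])
    return total

# Hypothetical stand-in for a trained model: a fixed distribution.
def toy_model(context):
    return {"the": 0.5, "cat": 0.25, "sat": 0.25}

print(math.exp(sentence_log_prob(["the", "cat"], toy_model)))
```

With a real next-word model, the dictionary would come from pairing its softmax output with the tokenizer's vocabulary.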

  48. Avatar
    Thomas L. Packer April 30, 2019 at 3:51 am #

    Instead of writing a loop:

    why not use the tokenizer’s other dict:

    • Avatar
      Thomas L. Packer April 30, 2019 at 3:52 am #

      I’m new to this website. How do you mark a code block in a comment? How do you add a profile picture?

      • Avatar
        Jason Brownlee April 30, 2019 at 7:03 am #

        You can use pre HTML tags (I fixed up your prior comment for you).

        Profile pictures are based on gravatar, like any wordpress blog you might come across:

    • Avatar
      Jason Brownlee April 30, 2019 at 7:02 am #

      Sounds good off the cuff, does it work?

      • Avatar
        Thomas L. Packer May 3, 2019 at 3:13 am #

        Thanks. I did try the other dict and it seemed to both work and run faster.
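
The snippets from this exchange are not shown, so the following is only an assumption about what was meant: looking up the predicted integer in a reverse dictionary (the Keras Tokenizer exposes one as index_word) instead of looping over word_index for every generated word. The toy mapping below is hypothetical.

```python
# Hypothetical word-to-integer mapping, in the style of Tokenizer.word_index.
word_index = {"the": 1, "weather": 2, "is": 3, "nice": 4}

def word_from_index_loop(yhat):
    """Linear scan over the mapping for each predicted integer."""
    for word, index in word_index.items():
        if index == yhat:
            return word
    return ""

# Reverse mapping built once; each lookup is then a single dict access.
index_word = {index: word for word, index in word_index.items()}

print(word_from_index_loop(3), index_word[3])
```

Both approaches return the same word; the dictionary avoids scanning the whole vocabulary for every generated word, which is consistent with the speed-up reported above.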

  49. Avatar
    Shivam Bhati June 5, 2019 at 10:37 pm #

    First of all, thank you for such a great project.
    I have worked on this project and I got stuck at predicting the values.

    the error says,
    ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got array with shape (1,)

    Can you please help.

    • Avatar
      Jason Brownlee June 6, 2019 at 6:30 am #

      The error suggests a mismatch between your data and the model’s expectation.

      You can change your data or change the expectations of the model.
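
For the shape error reported in this thread, "changing the data" usually means making the encoded seed text exactly as long as the model's input. This is a minimal sketch in plain Python that keeps the most recent words and left-pads with zeros. The function name is hypothetical; seq_length is 50 in the error message above.

```python
def fit_to_length(encoded, seq_length, pad_value=0):
    """Keep the last seq_length integers, left-padding with pad_value if short."""
    encoded = encoded[-seq_length:]
    return [pad_value] * (seq_length - len(encoded)) + encoded

print(fit_to_length([1, 2, 3], 5))
print(fit_to_length([1, 2, 3, 4, 5, 6], 4))
```

Keras's pad_sequences with maxlen=seq_length and truncating='pre' does the same for a list of sequences and returns a 2D array. The model expects a 2D batch of shape (1, seq_length) for a single seed, so wrapping the encoded seed in a list before padding matters.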

  50. Avatar
    Sidharth June 18, 2019 at 6:11 pm #

    Hi! Thanks for your code
    Is there a way to convert a Keras model to a tflite or TensorFlow model? The official documentation shows that tflite does not support LSTM layers.

    Thanks !

  51. Avatar
    Shashank June 24, 2019 at 10:35 pm #

    Sir, please help me. I’m working on text summarization. Can I do it using language modelling? I don’t have much knowledge about neural networks, so if you have any suggestions or ideas, please tell me. I have around 20 days to complete the project.
    Thanks a lot!

  52. Avatar
    Shashank June 24, 2019 at 11:40 pm #

    Sir, how does the language model handle numeric data like money, dates and so on?
    If that is a problem, how should I approach it? I’m working on text summarization and such numeric data may be important for summarization. How can I address it?
    Thanks a lot for the blog! Love your posts. Seriously, very, very helpful!

    • Avatar
      Jason Brownlee June 25, 2019 at 6:22 am #

      For small datasets, it is a good idea to normalize these to a single form, e.g. words.

  53. Avatar
    Kakoli August 1, 2019 at 4:12 am #

    Thanks again for the wonderful post.
    I tried word-level modelling as given here in Alice’s Adventures in Wonderland from Project Gutenberg. But the loss was too big starting at 6.39 and did not reduce much. Was 6.18-6.19 for the first 10 epochs. Any suggestions?

  54. Avatar
    Kuro August 29, 2019 at 6:33 am #

    I modified clean_doc so that it generates stand-alone tokens for punctuation, except when a single quote is used as an apostrophe as in “don’t”, “Tom’s”.

    On my first run of model making, I changed the batch_size and epochs parameters to 512 and 25, thinking that it might speed up the process. It ended in 2 hours on MacBook Pro but running the sequence generation program generated the text that mostly repeats “and the same, ” like this:

    to be the best , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and

    I changed batch_size and epochs back to the original values (125, 100) and ran the model building program overnight. Then the generated sequences looked more reasonable. For example:

    be a fortress to be justice and not yours , as we were saying that the just man will be enough to support the laws and loves among nature which we have been describing , and the disregard which he saw the concupiscent and be the truest pleasures , and

    Is there an intuitive interpretation of the bad result of my first try? The loss value at the last epoch was 4.125 for the first try and 2.2130 for the second (good result). I forgot to record the accuracy.

    • Avatar
      Jason Brownlee August 29, 2019 at 1:30 pm #

      Hmmm, off the cuff, adding a lot more tokens may require an order of magnitude more data, training and a larger model.

  55. Avatar
    Kuro August 31, 2019 at 8:36 am #

    I’m trying to apply these programs to a collection of song lyrics of a certain genre. A song is typically made of 50 to 200 words. Since they are from the same genre, the vocabulary size is relatively small (talking about lost loves, soul etc.). In this case, would it make sense to reduce the sequence size from 50? I’m thinking of something like 20.
    The goal of my experiment is to generate a lyric by giving the first 5 – 10 words, just for fun.

    • Avatar
      Jason Brownlee September 1, 2019 at 5:33 am #

      Sounds fun.

      Perhaps experiment and see what works best for your specific dataset.

      I’d love to see what you discover.

  56. Avatar
    Aly September 1, 2019 at 5:59 am #

    I’m implementing this on a corpus of Arabic text, but whenever it runs I am getting the same word repeated for the entire text generation process. I’m training on ~50,000 words with ~16,000 unique words. I’ve messed around with the number of layers, more data (which is an issue because the number of unique words also increases, an interesting find that feels like a problem between the Tokenizer and non-English text), and epochs.

    Any way to fix this?

  57. Avatar
    Irfan Danish October 2, 2019 at 1:51 am #

    Is there any way we can generate 2 or 3 different sample texts from a single seed?
    For example, we input “Hello I’m” and the model gives us
    Hello I’m interested in
    Hello I’m a bit confused
    Hello I’m not sure
    Instead of generating just one output, it gives the 2 to 3 best outputs.
    I trained your model with a sequence length of three words instead of 50. Now, when I input a seed of three words, I want to generate 2 to 3 sequences that are correlated with that seed instead of just one. Is that possible? Please let me know, I would be very thankful to you!

  58. Avatar
    Arjun October 17, 2019 at 4:38 pm #

    what would the X and y be like?
    and could it be done by splitting the X and y into training and testing?
    also when the model is created what would be the inputs for embedding layer?
    then fit the model on X_train?

    • Avatar
      Jason Brownlee October 18, 2019 at 5:46 am #

      Sorry, I don’t understand. What are you referring to exactly?

      • Avatar
        Arjun October 18, 2019 at 3:26 pm #

        while fitting the model i seem to get an error:
        ValueError: Error when checking input: expected embedding_input to have 2 dimensions, but got array with shape (264, 5, 1)
        but the input should be 3D right?
        so i had a doubt maybe something was wrong with the input i had given in the embedding layer?
        if so what are all the inputs to be given in the embedding layer?

        • Avatar
          Jason Brownlee October 19, 2019 at 6:26 am #

          Input to the embedding is 2d, each sample is a vector of integers that map to a word.

          • Avatar
            Arjun October 21, 2019 at 4:52 pm #

            ok, so after the model is formed then we make the X_train 3D? when fitting?

          • Avatar
            Jason Brownlee October 22, 2019 at 5:42 am #

            No. Input to an embedding is 2d.
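
            To illustrate with NumPy only (no model involved), here is a small sketch that takes an array with the shape from the error message, (264, 5, 1), and reshapes it to the 2D [samples, timesteps] layout of integer word indices. The values in the array are made up for the example.

```python
import numpy as np

# Stand-in for data shaped like the array in the error message:
# (samples, timesteps, 1). The integer values are arbitrary.
X = np.arange(264 * 5).reshape((264, 5, 1))

# Drop the trailing axis so each sample is a plain vector of word indices,
# i.e. a 2D array of shape (samples, timesteps).
X_2d = X.reshape((X.shape[0], X.shape[1]))

print(X.shape, "->", X_2d.shape)
```

            The 3D [samples, timesteps, features] shape is what an LSTM expects when it is the first layer; when an Embedding layer comes first, it produces that 3D output itself from the 2D integer input.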

  59. Avatar
    Arjun October 21, 2019 at 5:00 pm #

    Hi, I would like to know if you have any ideas about neural melody composition from lyrics using an RNN.
    A published paper says that two encoders are used and fed to a single decoder.
    I wonder if you could provide any insight on it?
    This is the paper:

  60. Avatar
    Arjun October 23, 2019 at 3:50 pm #

    It was not completely specific to my doubt, but thank you for helping anyway.

  61. Avatar
    Arjun October 24, 2019 at 4:23 pm #

    hi, if i had two sequences as input and i have training and testing for both sequence inputs.
    i managed to concatenate both the inputs and create a model.
    but when it comes to fitting the model, how is it possible to give X and y of two sequences ?

    • Avatar
      Jason Brownlee October 25, 2019 at 6:35 am #

      If you have a multi-input model, then the fit() function will take a list with each array of samples.


      X1 = …
      X2 = …
      model.fit([X1, X2], y, ….)

  62. Avatar
    Arjun October 28, 2019 at 3:23 pm #

    What if we had two different inputs and we need a model with both these inputs aligned?
    Could we give both these inputs in a single model or create two model with corresponding inputs and then combine both models at the end?

    • Avatar
      Jason Brownlee October 29, 2019 at 5:19 am #

      Sure, there are many different ways to solve a problem. Perhaps explore a few framings and see what works best for your dataset?

  63. Avatar
    Arjun October 29, 2019 at 3:12 pm #

    Thank you..

  64. Avatar
    Arjun October 31, 2019 at 3:30 pm #

    how can we know if two inputs have been aligned ?
    can we do it by merging two models?
    either way we can know the result only after testing it right?

    • Avatar
      Jason Brownlee November 1, 2019 at 5:25 am #


      The idea is to build trust in your model beforehand using verification.

      • Avatar
        Arjun November 1, 2019 at 3:47 pm #

        verification in the sense?

        • Avatar
          Jason Brownlee November 2, 2019 at 6:38 am #

          Confirming the model produces sensible outputs for a test set.

  65. Avatar
    Arjun November 4, 2019 at 4:46 pm #

    Hi Jason,
    I was working on a text generation problem.
    Seems I had a problem while I was fitting X_train and y_train.
    It was related to incompatible shapes. But then I set the batch size to 1 and it ran.
    But when it reached the evaluation part, it showed the previous error.

    InvalidArgumentError: 2 root error(s) found.
    (0) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
    [[{{node metrics/mean_absolute_error/sub}}]]
    (1) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
    [[{{node metrics/mean_absolute_error/sub}}]]
    0 successful operations.
    0 derived errors ignored.
    What could be the possible reason behind this?

  66. Avatar
    Arjun November 5, 2019 at 3:25 pm #

    I am just confused why it would run while fitting but does not run while evaluating?
    I mean we can’t do much tweaking with the arguments in evaluation?

    • Avatar
      Jason Brownlee November 6, 2019 at 6:30 am #

      I don’t understand your question, sorry. Perhaps you can elaborate?

  67. Avatar
    Arjun November 6, 2019 at 3:09 pm #

    ooh sorry my reply was based on the previous comment.

    InvalidArgumentError: 2 root error(s) found.
    (0) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
    [[{{node metrics/mean_absolute_error/sub}}]]
    (1) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
    [[{{node metrics/mean_absolute_error/sub}}]]
    0 successful operations.
    0 derived errors ignored.

    This error was found when i was fitting the model. But when I passed the batch size as 1, the model fitted without any problem.
    But when I tried to evaluate it the same previous error showed up.
    Do you have any idea why it would work while fitting but not while evaluating..?

    • Avatar
      Jason Brownlee November 7, 2019 at 6:34 am #

      Sorry, I have not seen this error.

      Perhaps try posting your code and error to stackoverflow?

      • Avatar
        Arjun November 7, 2019 at 3:25 pm #

        ok sure..
        Thank you

  68. Avatar
    Arjun November 6, 2019 at 5:13 pm #

    Hi jason,
    Could you please give some insight on attention mechanism in keras?

    • Avatar
      Jason Brownlee November 7, 2019 at 6:35 am #

      Keras does not support attention at this stage.

      • Avatar
        Arjun November 7, 2019 at 3:17 pm #

        Is there any other way to implement attention mechanism?

        • Avatar
          Jason Brownlee November 8, 2019 at 6:38 am #


          – implement it manually
          – use a 3rd party implementation
          – use tensorflow directly
          – use pytorch

  69. Avatar
    Arjun November 7, 2019 at 8:50 pm #

    Is there any reason why the validation accuracy decreases while the training accuracy increases?
    Is that a case of overfitting?

    • Avatar
      Jason Brownlee November 8, 2019 at 6:40 am #

      Accuracy is noisy.

      Look at train and validation loss.

      • Avatar
        Arjun November 8, 2019 at 3:19 pm #

        Even the validation loss seems to be fluctuating.
        So that might be a case of overfitting, right?
        If so, how can we solve this problem?

        • Avatar
          Jason Brownlee November 9, 2019 at 6:10 am #

          Or the case that the validation dataset is too small and/or not representative of the training dataset.

          • Avatar
            Arjun November 11, 2019 at 3:20 pm #

            I was training on the imdb dataset for sentiment analysis. I was training on 100000 words.
            Everything seems to be going okay until the training part, where the loss and accuracy keep fluctuating.
            The validation dataset is split from the whole dataset, so I don’t think that’s the issue.

          • Avatar
            Jason Brownlee November 12, 2019 at 6:33 am #

            Perhaps try an alternate model?
            Perhaps try an alternate data preparation?
            Perhaps try an alternate configuration?

  70. Avatar
    Arjun November 13, 2019 at 11:41 pm #

    how can we know the total number of words in the imdb dataset?
    not the vocabulary but the size of the dataset?

    • Avatar
      Jason Brownlee November 14, 2019 at 8:03 am #

      Sum the length of all samples.
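
      A minimal sketch of that sum, using a toy list of integer-encoded reviews as a stand-in for the real data (the values are made up):

```python
# Toy stand-in for integer-encoded reviews: each inner list is one review
# and each integer is one word.
samples = [
    [1, 14, 22],
    [1, 194, 1153, 194],
    [1, 14],
]

# Total number of words in the dataset (not the vocabulary size).
total_words = sum(len(sample) for sample in samples)
print(total_words)
```

      With the real dataset, the same expression can be applied to the train and test review lists and the two totals added together.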

      • Avatar
        Arjun November 14, 2019 at 3:36 pm #

        One sample in this dataset is one review. And each review contains different number of words. We are considering like 10000 or 100000 words for the dataset and splitting it into training and testing. So i need to like get the total number of words.

  71. Avatar
    Augusto December 28, 2019 at 9:07 am #

    Hi Jason,

    I did the exercise from your post “Text Generation With LSTM Recurrent Neural Networks in Python with Keras”, but the alternative you describe here, using a language model, produces more coherent text. Could you please elaborate on when to use one technique over the other?

    Thanks in advance,

    • Avatar
      Jason Brownlee December 29, 2019 at 5:59 am #

      Good question.

      There are no good heuristics. Perhaps follow preference, or model skill for a specific dataset and metric.

  72. Avatar
    Fred January 3, 2020 at 6:48 am #

    Hi! I’m trying to convert this example to make a simple proof-of-concept model to do word prediction that can do inference both backwards and forwards using the same trained model. (Without duplicating the data)

    I want to try split the text lines in the middle and have my target word there. Like this:

    X1 y X2
    1. [down yesterday to the piraeus with glaucon the son of ariston]
    2. [yesterday to the piraeus with glaucon the son of ariston that]
    3. [to the piraeus with glaucon the son of ariston that i]
    (X1 and X2 are actually 20 words each)

    I keep getting various data formatting errors and I feel like I have tried so many things, but obviously there are still plenty of permutations left and the correct way to do this still eludes me.

    This is roughly my code,

    X1 = X1.reshape((n_lines+1, 20))
    X2 = X2.reshape((n_lines+1, 20))
    y = y.reshape((n_lines+1, vocab_size))

    model = Sequential()
    model.add(Embedding(vocab_size, 40, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True, input_shape=(20, 1)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(np.array([X1, X2]), np.array(y), batch_size=128, epochs=10)

    Do you think this should be close to working and do you know why I can’t seem to be able to feed both the ‘X’ features?

    Cheers, Fred

  73. Avatar
    Wei Jiang January 25, 2020 at 8:06 pm #

    I want to take everything into account, including punctuation, so I commented out the following line:

    tokens = [word for word in tokens if word.isalpha()]

    But when I run the training, I get the following error:
    Traceback (most recent call last):
    File “”, line 34, in
    X, y = sequences[:,:-1], sequences[:,-1]
    IndexError: too many indices for array

    Any idea?

  74. Avatar
    sachin February 27, 2020 at 1:52 pm #

    Sir, when we are considering the context of a sentence to classify it to a class, which neural network architecture should I use.

    For eg: I want to classify like this,

    They killed many people: Non-Toxic

    I will kill them: Toxic

  75. Avatar
    Dylan Lunde March 12, 2020 at 8:05 am #

    Was the too many indices for array issue ever explicitly solved by anyone?

    IndexError Traceback (most recent call last)
    in ()
    1 sequences = np.array(sequences)
    —-> 2 X, y = sequences[:,:-1], sequences[:,-1]
    3 y = to_categorical(y, num_classes=vocab_size)
    4 seq_length = X.shape[1]

    IndexError: too many indices for array

  76. Avatar
    Carson March 27, 2020 at 2:21 pm #

    I have a question. You input 50 words into your neural net and get one output word, if I am not wrong, but how can you get a 50-word text when you only put in a 50-word text?

  77. Avatar
    Arsal April 26, 2020 at 8:43 am #

    Can you make a tutorial on text generation using GANs?

    • Avatar
      Jason Brownlee April 27, 2020 at 5:23 am #

      Thanks for the suggestion.

      Language models are used for text generation, GANs are used for generating images.

  78. Avatar
    Efstathios Chatzikyriakidis May 15, 2020 at 3:35 am #

    Hi Jason,

    Two issues:

    “The model will require 100 words as input.”

    I think it is 50.


    “Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words removed.”

    This is usually done in text classification. Doing such a thing in a language model and using it for text generation will lead to bad results. Stop words are important for catching basic groups of words, e.g. “I went to the”.

    • Avatar
      Jason Brownlee May 15, 2020 at 6:12 am #

      Thanks for the typo.

      Sure, change it any way you like!

  79. Avatar
    Vipul May 30, 2020 at 5:03 pm #

    I need a deep neural network which selects a word out of predefined candidates. Please suggest a solution.

  80. Avatar
    Prem June 1, 2020 at 9:58 am #

    For sentence-wise training, does model 2 from the following post essentially show it?

  81. Avatar
    Laura June 4, 2020 at 8:29 pm #

    Hi Jason! Thanks for your post.
    I need to build a neural network which detects anomalies in syscall execution as well as in the arguments these syscalls receive. Which solution could be suitable for this problem?
    Thanks in advance!

    • Avatar
      Jason Brownlee June 5, 2020 at 8:09 am #

      Perhaps start with a text input and class label output, e.g. text classification models. Test bag of words and embedding representations to see what works well.

  82. Avatar
    Laura June 5, 2020 at 12:04 am #

    Hi Jason!
    Thanks for your post!
    I need to build a neural network to detect anomalies in syscall execution as well as in the arguments they receive. Which solution would you recommend for this purpose?
    Thanks in advance!

    • Avatar
      Jason Brownlee June 5, 2020 at 8:14 am #

      I recommend prototyping and systematically evaluating a suite of different models and discover what works well for your dataset.

  83. Avatar
    Neha June 10, 2020 at 1:32 pm #

    Hello sir, thank you for such a nice post. How do I work with CSV files, i.e. how do I load and save them? I am new to deep learning; can you give me an idea of the syntax?

  84. Avatar
    Kaalu June 16, 2020 at 12:15 pm #

    Hi Jason,

    Thanks for your step by step tutorial with relevant explanations. I am trying to use this technique in generating snort rules for a specific malware family/ type (somehow similar to firewall rules / Intrusion detection rules). Do you think this is possible? can you give me any pointers to consider? Will it be possible since such rules need to follow a specific format or sequence with keywords.

    this is how a sample rule looks like.

    “alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:”MALWARE-BACKDOOR Win.Backdoor.Demtranc variant outbound connection”; flow:to_server,established; content:”GET”; nocase; http_method; content:”/AES”; fast_pattern; nocase; http_uri; pcre:”/\/AES\d+O\d+\.jsp\?[a-z0-9=\x2b\x2f]{20}/iU”; metadata:policy balanced-ips drop, policy max-detect-ips drop, policy security-ips drop, service http; reference:url,; classtype:trojan-activity; sid:24115; rev:3;)

    Your advice would be highly appreciated.

    • Avatar
      Jason Brownlee June 16, 2020 at 1:40 pm #

      You’re welcome.

      Perhaps try it and see?

      My best advice is to review the literature and see how others that have come before addressed the same type of problem. It will save a ton of time and likely give good results quickly.

      • Avatar
        Kaalu June 16, 2020 at 11:37 pm #

        Hi Jason,

        Thanks very much. Sadly haven’t found any literature where they have anything similar 🙁 . That’s why I reached out to you.

        Will keep searching..

        • Avatar
          Jason Brownlee June 17, 2020 at 6:24 am #

          Hang in there, perhaps search for another project that is “close enough” and mine it for ideas.

          • Avatar
            Ebenezer A. Laryea June 19, 2020 at 1:36 am #


  85. Avatar
    Hilal Ozer August 26, 2020 at 7:54 am #

    Hi Jason,

    Thanks for the great post. I used your code for morpheme prediction. At first I implemented it with fixed sequence length correctly but then I have to make it with variable sequence length. So, I used stateful LSTM with batch size 1 and set sequence length None.

    I tried to fit the model one sample at a time. However I got the “ValueError: Input arrays should have the same number of samples as target arrays. Found 1 input samples and 113 target samples.”

    The input and output sample sizes are actually equal and “113” is the one hot vector’s size of the output. The target output implementation is totally same with your code and runs correctly in my first implementation with fixed sequence.
    Do you have any idea why the model does not recognize one hot encoding?

    Thanks in advance.

    • Avatar
      Jason Brownlee August 26, 2020 at 1:42 pm #

      You’re welcome.

      If you are using a stateful LSTM you may need to make the target 3d instead of 2d, e.g. [samples, timesteps, features]. I could be wrong, but I recall that might be an issue to consider.

  86. Avatar
    Hilal Ozer August 27, 2020 at 7:14 am #

    Thank you for your response. When I make it 2d, it ran successfully. It was 1d by mistake.

  87. Avatar
    riyaz September 5, 2020 at 9:59 pm #

    If you want to learn more, you can also check out the Keras Team’s text generation implementation on GitHub:

    Have a look at this code; it’s well presented.

  88. Avatar
    Minura Punchihewa September 13, 2020 at 4:09 am #

    Hi Jason,

    I have used a similar RNN architecture to develop a language model. What I would like to do now is, when a complete sentence is provided to the model, to be able to generate the probability of it.

    Note: This is the probability of the entire sentence that I am referring to, not just the next word.

    Do you happen to know how to do this?

    From what I have gathered, this mechanism is used in the implementation of speech recognition software.

    • Avatar
      Jason Brownlee September 13, 2020 at 6:12 am #

      Good question, I don’t have an example.

      Perhaps calculate how to do this manually then implement it?
      Perhaps check the literature to see if anyone has done this before and if so how?
      Perhaps check open source libraries to see if they offer this capability and see how they did it?
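
      As a starting point for the manual calculation, one common approach is the chain rule: the probability of a sentence is the product of the probability of each word given the words before it, usually summed as log probabilities to avoid underflow. The sketch below assumes a function that returns a next-word probability distribution for a given context; here it is a hypothetical uniform stand-in rather than a trained model, and the first word is treated as given.

```python
import numpy as np

def sentence_log_prob(next_word_probs, token_ids):
    """Sum log P(w_t | w_1..w_{t-1}) for t = 2..n (chain rule).

    next_word_probs: callable taking a list of token ids (the context) and
        returning a 1D array of probabilities over the vocabulary.
    token_ids: integer-encoded sentence; the first token is treated as given.
    """
    log_prob = 0.0
    for t in range(1, len(token_ids)):
        probs = next_word_probs(token_ids[:t])
        log_prob += np.log(probs[token_ids[t]])
    return log_prob

# Hypothetical stand-in for a trained model: uniform over a 4-word vocabulary.
def uniform_model(context):
    return np.full(4, 0.25)

log_p = sentence_log_prob(uniform_model, [0, 2, 1, 3])
print(log_p, np.exp(log_p))
```

      With a trained model, next_word_probs would wrap a call that pads or truncates the context to the model’s input length and returns the softmax output for that context.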

  89. Avatar
    Minura Punchihewa September 14, 2020 at 5:34 am #

    I’ve been checking, still struggling to find an answer.

  90. Avatar
    Pranav September 28, 2020 at 1:04 am #

    Hello Jason Sir,
    Thank you for providing such an amazing and informative blog on text generation. On first reading the title I thought it was going to be difficult, but the explanations as well as the code were concise and easy to grasp. Looking forward to reading more blogs from you!!

  91. Avatar
    John Bueno October 7, 2020 at 2:43 pm #

    I’ve followed the steps and am almost finished but am stuck on this error

    Traceback (most recent call last):
    File “C:/Users/andya/PycharmProjects/finalfinalFINALCHAIN/venv/Scripts/”, line 45, in
    yhat = model.predict_classes(encoded)
    File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-packages\keras\”, line 1138, in predict_classes
    File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-packages\keras\”, line 1025, in predict
    File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-packages\keras\engine\”, line 1830, in predict
    File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-packages\keras\engine\”, line 129, in _standardize_input_data
    ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array with shape (51, 1)

    For reference I explicitly used the same versions of just about everything that you did. Everything works except for the first line to state

    yhat = model.predict_classes(encoded, verbose=0)

    I’ve tinkered with the code but sadly I am not quite mathematically and software inclined enough to find a proper solution. You may want to keep in mind that I have altered the text cleaner to keep numbers and punctuation, although reverting it back to normal doesn’t appear to fix anything. It may also be worth noting that for testing purposes I’ve set the epoch count to 1, but I doubt that should affect anything. Outside of that there shouldn’t be any important deviations.

  92. Avatar
    Payton F October 27, 2020 at 9:18 am #

    Hi Jason,

    What would your approach be to building a model trained on multiple different sources of text. For example, if I want to train a model on speech transcripts so I can generate text in the style of a certain speaker, would I store all the speeches in a single .txt file? I worry that if I do this, I will have some misleading sequences such as when the sequence begins with words from one speech and ends with words from the beginning of the next speech. Would it be better to somehow train the model on one speech at a time rather than on a larger file of all speeches combined?

    • Avatar
      Jason Brownlee October 27, 2020 at 1:01 pm #

      Perhaps fit a separate model on each source then use an ensemble of the models / stacking to combine.

  93. Avatar
    123 November 6, 2020 at 6:43 am #

    I am now not sure where you’re getting your information, but good topic.
    I must spend a while learning much more or understanding more.
    Thanks for wonderful info I used to be in search of this information for my mission.

  94. Avatar
    cnsn8 April 6, 2021 at 12:43 am #

    thanks for great tutorials Jason. How can i add a simple control to this language model ? Such as positive- negative text generation.

    • Avatar
      Jason Brownlee April 6, 2021 at 5:18 am #

      You’re welcome.

      Perhaps develop one model for each?

  95. Avatar
    Eric April 27, 2021 at 8:35 pm #

    Hi Jason,

    At this step, I receive these codes.

    # define model
    model = Sequential()
    model.add(Embedding(vocab_size, 50, input_length=seq_length))
    model.add(LSTM(100, return_sequences=True))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(vocab_size, activation='softmax'))

    Total Sequences: 118633
    2021-04-27 06:24:25.190966: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
    2021-04-27 06:24:25.191304: I tensorflow/stream_executor/cuda/] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2021-04-27 06:24:33.866815: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library nvcuda.dll
    2021-04-27 06:24:34.937609: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties:
    pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
    coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 104.43GiB/s
    2021-04-27 06:24:34.940037: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
    2021-04-27 06:24:34.941955: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cublas64_11.dll’; dlerror: cublas64_11.dll not found
    2021-04-27 06:24:34.943931: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cublasLt64_11.dll’; dlerror: cublasLt64_11.dll not found
    2021-04-27 06:24:34.945872: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cufft64_10.dll’; dlerror: cufft64_10.dll not found
    2021-04-27 06:24:34.947770: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘curand64_10.dll’; dlerror: curand64_10.dll not found
    2021-04-27 06:24:34.949522: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cusolver64_11.dll’; dlerror: cusolver64_11.dll not found
    2021-04-27 06:24:34.951167: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cusparse64_11.dll’; dlerror: cusparse64_11.dll not found
    2021-04-27 06:24:34.952449: W tensorflow/stream_executor/platform/default/] Could not load dynamic library ‘cudnn64_8.dll’; dlerror: cudnn64_8.dll not found
    2021-04-27 06:24:34.952766: W tensorflow/core/common_runtime/gpu/] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.
    Skipping registering GPU devices…
    2021-04-27 06:24:34.954395: I tensorflow/core/platform/] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2021-04-27 06:24:34.955743: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
    2021-04-27 06:24:34.956035: I tensorflow/core/common_runtime/gpu/]
    Traceback (most recent call last):
    model.add(LSTM(100, return_sequences=True))
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\training\tracking\”, line 522, in _method_wrapper
    result = method(self, *args, **kwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\engine\”, line 223, in add
    output_tensor = layer(self.outputs[0])
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 660, in __call__
    return super(RNN, self).__call__(inputs, **kwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\engine\”, line 945, in __call__
    return self._functional_construction_call(inputs, args, kwargs,
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\engine\”, line 1083, in _functional_construction_call
    outputs = self._keras_tensor_symbolic_call(
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\engine\”, line 816, in _keras_tensor_symbolic_call
    return self._infer_output_signature(inputs, args, kwargs, input_masks)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\engine\”, line 856, in _infer_output_signature
    outputs = call_fn(inputs, *args, **kwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 1139, in call
    inputs, initial_state, _ = self._process_inputs(inputs, initial_state, None)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 860, in _process_inputs
    initial_state = self.get_initial_state(inputs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 642, in get_initial_state
    init_state = get_initial_state_fn(
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 2508, in get_initial_state
    return list(_generate_zero_filled_state_for_cell(
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 2990, in _generate_zero_filled_state_for_cell
    return _generate_zero_filled_state(batch_size, cell.state_size, dtype)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 3006, in _generate_zero_filled_state
    return tf.nest.map_structure(create_zeros, state_size)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\util\”, line 867, in map_structure
    structure[0], [func(*x) for x in entries],
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\util\”, line 867, in
    structure[0], [func(*x) for x in entries],
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\keras\layers\”, line 3003, in create_zeros
    return tf.zeros(init_state_size, dtype=dtype)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\util\”, line 206, in wrapper
    return target(*args, **kwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\ops\”, line 2911, in wrapped
    tensor = fun(*args, **kwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\ops\”, line 2960, in zeros
    output = _constant_if_small(zero, shape, dtype, name)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\ops\”, line 2896, in _constant_if_small
    if < 1000:
    File "”, line 5, in prod
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\numpy\core\”, line 3030, in prod
    return _wrapreduction(a, np.multiply, ‘prod’, axis, dtype, out,
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\numpy\core\”, line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
    File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\framework\”, line 867, in __array__
    raise NotImplementedError(
    NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array. This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported

    My libraries are :
    Python 3.9
    Theano 1.0.5
    Numpy 1.20.2
    pip 21.1
    keras 2.4.3
    tensorflow 2.4.1
    matplotlib 3.4.1
    pandas 1.2.4

    Using PyCharm. After seeing so many error messages, I wanted to wait before running the script from outside pycharm.

    I got a GPU on my computer, a DELL G3 with NVIDIA Discrete Graphic, but I haven’t done any settings or config with it, not thinking I need this to run this tutorial.

    Your help will be very appreciated!
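
A possible cause, offered as an assumption rather than a confirmed diagnosis: this NotImplementedError is commonly reported when TensorFlow 2.4.x is installed alongside NumPy 1.20 or later, because TensorFlow 2.4 was built against the NumPy 1.19 series. The versions listed above (tensorflow 2.4.1, Numpy 1.20.2) match that pattern. The sketch below only inspects a version string, and the helper name is hypothetical. The remedy usually reported is to pin NumPy below 1.20 in the virtual environment, or to move to a newer TensorFlow release.

```python
def numpy_newer_than_tf24_expects(numpy_version):
    """Return True if the NumPy version is 1.20 or later.

    Hypothetical helper: TensorFlow 2.4.x is assumed here to expect the
    NumPy 1.19 series, so anything newer is flagged.
    """
    major, minor = (int(part) for part in numpy_version.split(".")[:2])
    return (major, minor) >= (1, 20)

# Versions taken from the comment above, plus a 1.19 release for contrast.
flagged = numpy_newer_than_tf24_expects("1.20.2")
not_flagged = numpy_newer_than_tf24_expects("1.19.5")
print(flagged, not_flagged)
```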

  96. Avatar
    lp October 7, 2021 at 10:55 pm #

    Thanks for the tutorial. It would be great to offer a TF2 version of the decoder, since predict_classes() has been deprecated.


    • Avatar
      Adrian Tam October 12, 2021 at 12:19 am #

      Thanks for your suggestion. That should be using np.argmax(model.predict(encoded), axis=-1) instead.

      • Avatar
        Amrit July 10, 2022 at 12:52 pm #

        Tried to replace but code is not working.
        model.predict_classes(encoded, verbose=0) -> np.argmax(model.predict_class(encoded))

        Is there a solution to this? Thanks.

        • Avatar
          James Carmichael July 11, 2022 at 4:24 am #

          Hi Amrit…What error did you receive? This will allow us to better assist you.
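
For readers hitting the same problem, here is a minimal sketch of the replacement, assuming the model ends in a softmax layer so that `model.predict(...)` returns one row of class probabilities per input sequence. `predict_classes_compat` is a hypothetical helper name, and the probability array below is a stand-in for a real `model.predict(encoded, verbose=0)` call.

```python
import numpy as np

def predict_classes_compat(probabilities):
    """Return the index of the most likely class for each row of probabilities."""
    probabilities = np.asarray(probabilities)
    # axis=-1 picks the best class per row instead of flattening the array.
    return np.argmax(probabilities, axis=-1)

# Stand-in for model.predict(encoded, verbose=0): a single row of softmax
# probabilities over a hypothetical 5-word vocabulary.
fake_probs = np.array([[0.05, 0.10, 0.60, 0.20, 0.05]])
yhat = predict_classes_compat(fake_probs)
print(yhat)
```

Note that the result is an array with one entry per input row, so code that compares the prediction with a single word index would use `yhat[0]`.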

  97. Avatar
    Manuela November 10, 2021 at 3:04 am #

    How long does it take to run this example on a “simple” laptop with a 1.10 GHz CPU? Is it feasible?

    • Avatar
      Adrian Tam November 14, 2021 at 1:14 pm #

      I believe so, unless you have very little memory.

  98. Avatar
    John LI January 28, 2022 at 12:58 pm #

    Hello Jason, I have a question and hope you can find some time to answer it. In this example, the model was trained to predict the 51st word of each training sequence. How, then, can it produce a continuous string of words with some meaning at the end?

    • Avatar
      James Carmichael January 31, 2022 at 11:08 am #

      Hi John…Please reword or elaborate on your question so that I may better assist you.

      • Avatar
        John LI January 31, 2022 at 1:17 pm #

        Thank you for your prompt reply, James.
        Let me rephrase my question as below:
        1. In the training stage, we fed lines of 50 words each to the LSTM, compared the output (a single word) with the target (the actual 51st word following the 50-word line), and repeated this until the weights and biases were optimized. In short, one string of words was fed to the model and one word was predicted.
        2. But in the ‘Use Language Model’ stage, we give a seed text and generate 50 words of text.

        3. So in training it is text -> word, but when the model is applied it is text -> text. My question is: how do we get from 1 to 2? I am not a native English speaker, so I hope my question is clear. Thank you so much for your time and patience.
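
The step from "text -> word" to "text -> text" can be sketched as a loop: predict one word, append it to the input, keep only the most recent words, and repeat. The code below is an illustration under stated assumptions, not the tutorial's code. `next_word` is a stand-in for the trained model, and the toy lookup table is hypothetical.

```python
def generate(next_word, seed_words, seq_length, n_words):
    """Generate n_words by repeatedly predicting one word and feeding it back."""
    window = list(seed_words)
    generated = []
    for _ in range(n_words):
        # The model only ever sees the most recent seq_length words.
        context = window[-seq_length:]
        word = next_word(context)
        generated.append(word)
        # The predicted word becomes part of the input for the next step.
        window.append(word)
    return generated

# Toy stand-in for a trained model: it looks only at the last word.
toy_table = {"the": "just", "just": "man", "man": "is", "is": "the"}

def toy_next_word(context):
    return toy_table[context[-1]]

out = generate(toy_next_word, ["is", "the"], seq_length=2, n_words=5)
print(" ".join(out))
```

Each pass through the loop is the same single-word prediction used in training; the longer text comes only from calling it many times on a sliding window.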

  99. Avatar
    Aniket Saxena March 23, 2022 at 3:43 am #

    Hi Jason,

    I tried to explore the possibility of using pre-trained word embeddings (GloVe vectors) to improve the accuracy of the text generation task (The Republic by Plato dataset) you’ve discussed here. However, when I downloaded the zip file, I found that more than 90 words of the aforementioned dataset had no learned embedding vectors in any of the available text files (glove.6B.50d.txt, glove.6B.100d.txt, glove.6B.200d.txt, and glove.6B.300d.txt).

    One workaround for this problem could be to remove all of these 90 words from the dataset altogether and then train the model by freezing the Embedding Layer, but it doesn’t seem to solve the core problem. Since the pre-trained GloVe vectors zip file has no presence of pre-learned embedding vectors for these 90 words, I don’t have any clue as to how to extend the model to use these pre-trained GloVe vectors for the entire The Republic by Plato dataset.

    Could you please suggest a way to perform transfer learning on this dataset in order to improve the accuracy (as you advised under the Extensions section of this blog), if that is possible?

    Aniket Saxena
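
One common way to handle words missing from a pre-trained vocabulary, sketched here as an assumption rather than as the tutorial's method, is to keep those words and give them small random vectors, then leave the embedding layer trainable so the missing rows can be learned with the rest of the model. `build_embedding_matrix`, `word_index`, and `pretrained` are hypothetical names, and the tiny dictionaries stand in for the tokenizer's word index and the parsed GloVe file.

```python
import numpy as np

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Build a (vocab_size + 1, dim) matrix; row 0 is reserved for padding."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 1, dim))
    missing = []
    for word, i in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            # Word found in the pre-trained vectors: copy it across.
            matrix[i] = vector
        else:
            # Word not found: start from a small random vector instead of dropping it.
            matrix[i] = rng.normal(scale=0.1, size=dim)
            missing.append(word)
    return matrix, missing

# Hypothetical stand-ins for the tokenizer index and the loaded GloVe vectors.
word_index = {"justice": 1, "glaucon": 2}
pretrained = {"justice": np.ones(3)}
matrix, missing = build_embedding_matrix(word_index, pretrained, dim=3)
print(matrix.shape, missing)
```

The resulting matrix could then be used as the initial weights of the embedding layer. Whether to fine-tune it or freeze it is a design choice worth testing on the dataset.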

  100. Avatar
    Feron July 1, 2022 at 12:27 am #

    Hi, do you think a Flatten layer would be of any use in improving the accuracy of the model? Someone always reminds me to use a Flatten layer after the last layer of each model.

  101. Avatar
    Fati Taherin August 16, 2022 at 11:43 pm #

    Thank you for your great tutorials. Love them all.
    I implemented the model in Keras for a Persian dataset and it worked great. Then I tried to implement a PyTorch version, but the model doesn’t learn anything. I could not figure out the problem, but I think it might be the LSTM layers. I should mention that I made my own tokenizer and integer dictionary for both Keras and PyTorch using dictionaries and lists. I would appreciate any help.
    Thank you very much.
    Here is my model…

    class seqModel(nn.Module):
        def __init__(self, vocab_size1=vocab_size, embedding_dim=50, hidden_size=100):
            super(seqModel, self).__init__()
            self.encoder = nn.Embedding(vocab_size1, embedding_dim)
            self.lstm1 = nn.LSTM(embedding_dim, hidden_size)
            self.lstm2 = nn.LSTM(hidden_size, hidden_size)
            self.linear1 = nn.Linear(hidden_size, hidden_size)
            self.linear2 = nn.Linear(hidden_size, vocab_size1)
            self.act1 = nn.ReLU()
            self.act2 = nn.Softmax(dim=1)

        def forward(self, x):
            output = self.encoder(x)
            output, _ = self.lstm1(output)
            output = output[:,-1,:]
            output, _ = self.lstm2(output)
            output = self.linear1(output)
            output = self.act1(output)
            output = self.linear2(output)
            output = self.act2(output)
            return output

    • Avatar
      James Carmichael August 17, 2022 at 6:20 am #

      Hi Fati…You are very welcome! Please provide more detail on what you are experiencing with your model’s performance so that we may better assist you. Perhaps your model would benefit from hyperparameter optimization.

  102. Avatar
    Fati Taherin August 18, 2022 at 4:33 pm #

    Thanks for the reply. Here is my training loop:

    def train3(train_loader, sModel, optimizer, criterion, epochs):
        loss_track = []
        leastLoss = 100.000
        for epoch in range(epochs):
            epoch_loss = []
            for x_batch, y_batch in train_loader:
                x_batch =
                y_batch =
                x_batch =

                pred = model(x_batch)

                y_batch =

                loss = criterion(pred,y_batch)
                # Backpropagation

            print(" Epoch {} | Train Cross Entropy Loss: ".format(epoch),np.mean(epoch_loss))

            if ((np.mean(epoch_loss)) < leastLoss):, '/path/Model/')

    Training the model gives the results below, and it continues like this for up to 20 epochs without any improvement (the loss just fluctuates within a small range).

    Epoch 0 | Train Cross Entropy Loss: 9.027385948332194

    Epoch 1 | Train Cross Entropy Loss: 9.027385958870033

    Epoch 2 | Train Cross Entropy Loss: 9.027385963260798

    Epoch 3 | Train Cross Entropy Loss: 9.027385970944637

    Epoch 4 | Train Cross Entropy Loss: 9.027385972481406

    Epoch 5 | Train Cross Entropy Loss: 9.027385962821722

    Epoch 6 | Train Cross Entropy Loss: 9.027385953381575

    Epoch 7 | Train Cross Entropy Loss: 9.027385970944637


    Is there a problem with my loss function? Printing the layer sizes, I found that the output has to be reshaped from (16,7,100) to (16,100) after the LSTM layers, where 16 is the batch size, 7 is the sequence length and 100 is the output size of the LSTM layer. At the end of the model, the shape of the output is (16,vocab_size), which matches the y_batch shape. Is it possible that eliminating a dimension from the tensor in the model is causing this?

    I have also tried using one LSTM layer, detaching its states in every batch and initializing the states at the beginning of each epoch, as below:

    for epoch in range(epochs):
    states = (torch.zeros(num_layers = 1, batch_size, hidden_size).to(device),
    torch.zeros(num_layers, batch_size, hidden_size).to(device))
    epoch_loss = []
    for x_batch, y_batch in train_loader:
    if (x_batch.size(0) < 16):
    x_batch =
    y_batch =
    x_batch =

    states = detach(states)
    #the states are passed to the lstm layer like
    #output , (h,c) = self.lstm(output,states)
    pred, states = model(x_batch, states)

    y_batch =


    loss = criterion(pred,y_batch)
    # Backpropagation
    nn.utils.clip_grad_norm_(model.parameters(), 0.5)

    This also gives the same loss for every epoch I have trained it for.
    * One more thing: I have batch_first=True in my LSTM layers.

    Thanks again.
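
One thing that may be worth checking, assuming `criterion` is `nn.CrossEntropyLoss`: that loss applies log-softmax internally and expects raw scores, so a model whose last layer is already `nn.Softmax` ends up applying softmax twice. The second softmax sees values squeezed into [0, 1], so its output is nearly uniform and the loss stays close to ln(vocab_size) with very small gradients. The reported loss of about 9.027 is the natural log of roughly 8,300, which would fit a vocabulary of about that size. The NumPy sketch below illustrates the effect with a hypothetical vocabulary of 1,000 words and random scores; it is not the commenter's code.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy_from_scores(scores, targets):
    """Mean cross-entropy that applies softmax to its input itself."""
    log_probs = np.log(softmax(scores))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
vocab_size = 1000  # hypothetical vocabulary size
logits = rng.normal(scale=5.0, size=(16, vocab_size))
targets = logits.argmax(axis=-1)  # pretend the model already ranks the right word first

# Raw scores passed to the loss: the intended arrangement.
loss_single = cross_entropy_from_scores(logits, targets)
# Softmax output passed to the loss: softmax is applied twice.
loss_double = cross_entropy_from_scores(softmax(logits), targets)

print(loss_single, loss_double, np.log(vocab_size))
```

Even though every target is the top-ranked word, the doubly-softmaxed loss stays within 1 of ln(vocab_size). If this is the cause, returning the output of the last linear layer directly and keeping the cross-entropy loss would be the usual arrangement. Separately, the forward pass slices the last time step before the second LSTM, so that layer receives a 2-D tensor and may treat the batch as a single sequence; this is also worth checking.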

  103. Avatar
    Nada September 7, 2022 at 1:09 am #

    Hello, thank you so much for this tutorial.
    I tried to replicate the same steps, but the model predicts the same word every time, and I don’t know what the problem could be. Can you please give me a hint?
