How to Develop Word-Based Neural Language Models in Python with Keras

Last Updated on September 3, 2020

Language modeling involves predicting the next word in a sequence given the sequence of words already present.

A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.

In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.

After completing this tutorial, you will know:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Let’s get started.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Framing Language Modeling
  2. Jack and Jill Nursery Rhyme
  3. Model 1: One-Word-In, One-Word-Out Sequences
  4. Model 2: Line-by-Line Sequence
  5. Model 3: Two-Words-In, One-Word-Out Sequence

Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.

Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input that are processed one at a time where a prediction can be made and learned for each input sequence.

Similarly, when making predictions, the process can be seeded with one or a few words, then predicted words can be gathered and presented as input on subsequent predictions in order to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.

There are many ways to frame the sequences from a source text for language modeling.

In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.

There is no single best approach, just different framings that may suit different applications.

Jack and Jill Nursery Rhyme

Jack and Jill is a simple nursery rhyme.

It is comprised of 4 lines, as follows:

Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based language model.

We can define this text in Python as follows:
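A minimal way to hold the rhyme is a single string with one line of the rhyme per line of text. The variable name data is an assumption used only for illustration.

```python
# source text: the four lines of the rhyme, separated by newlines
data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

print(data)
```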

Model 1: One-Word-In, One-Word-Out Sequences

We can start with a very simple model.

Given one word as input, the model will learn to predict the next word in the sequence.

For example:

X,      y
Jack,   and
and,    Jill
Jill,   went
...

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the word_index attribute.
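The integer encoding can be sketched in plain Python so that it runs without Keras installed. This is a stand-in for the Tokenizer, not the Tokenizer itself: it assumes words are lowercased, split on whitespace, and numbered from 1 in order of frequency, and the exact integers produced by Keras may differ. The names word_index, encoded and vocab_size are chosen for illustration.

```python
from collections import Counter

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# lowercase and split on whitespace (the rhyme has no punctuation to strip)
words = data.lower().split()

# map each unique word to an integer starting at 1; 0 is left unused
counts = Counter(words)
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# convert the sequence of words to a sequence of integers
encoded = [word_index[word] for word in words]

# vocabulary size is the number of unique words plus one for the unused 0
vocab_size = len(word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```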

Running this example, we can see that the size of the vocabulary is 21 words.

We add one because the integers assigned to words start at 1, and the largest encoded integer must be usable as an array index: words encoded 1 to 21 need array indices 0 to 21, or 22 positions.

Next, we need to create sequences of words to fit the model with one word as input and one word as output.
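A sketch of building the one-word-in, one-word-out pairs is shown below. The encoding step is a plain-Python stand-in for the Tokenizer (an assumption, as are the variable names); the pairing itself simply slides a window of two integers along the encoded text.

```python
from collections import Counter

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# stand-in integer encoding: words numbered from 1 by frequency
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
encoded = [word_index[w] for w in words]

# pair each word with the word that follows it
sequences = [encoded[i - 1:i + 1] for i in range(1, len(encoded))]
print('Total Sequences: %d' % len(sequences))
```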

Running this piece shows that we have a total of 24 input-output pairs to train the network.

We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual next word. This gives the network a ground truth to aim for from which we can calculate error and update the model.

Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.
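To make the transformation concrete, the sketch below splits the pairs into X and y and builds the one hot vectors by hand in plain Python. It illustrates the kind of output the encoding step produces rather than calling Keras; the helper one_hot() and the variable names are assumptions for illustration.

```python
from collections import Counter

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# stand-in integer encoding and one-word-in, one-word-out pairs
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
encoded = [word_index[w] for w in words]
vocab_size = len(word_index) + 1
sequences = [encoded[i - 1:i + 1] for i in range(1, len(encoded))]

# first column is the input word, second column is the output word
X = [pair[0] for pair in sequences]
y_int = [pair[1] for pair in sequences]


def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at the given index."""
    row = [0] * num_classes
    row[index] = 1
    return row


# one row per output word, one column per vocabulary position
y = [one_hot(index, vocab_size) for index in y_int]
print(len(X), len(y), len(y[0]))
```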

We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore the input_length=1.

The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

The structure of the network can be summarized as follows:
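As a rough check on the size of this network, the number of weights in each layer can be worked out by hand. This is an arithmetic sketch rather than framework output, and it assumes a standard LSTM layer with four gates, each with input weights, recurrent weights and a bias.

```python
vocab_size = 22      # 21 words plus the unused index 0
embedding_dim = 10   # length of each word vector
lstm_units = 50      # units in the hidden LSTM layer

# one vector of length embedding_dim per vocabulary position
embedding_params = vocab_size * embedding_dim

# 4 gates, each with input weights, recurrent weights and a bias per unit
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)

# one weight per (LSTM unit, output word) pair plus one bias per output word
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print('Embedding: %d' % embedding_params)
print('LSTM: %d' % lstm_params)
print('Dense: %d' % dense_params)
print('Total: %d' % total_params)
```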

We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore using the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.
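The configuration described above might be written as follows. This is a sketch that assumes TensorFlow 2.x is installed and uses the tensorflow.keras import paths; the function name define_model() is an assumption. The input_length argument mentioned earlier is left out here because the layer can work out the sequence length from the data it receives. The fit call is shown only as a comment so the sketch runs quickly.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM

vocab_size = 22  # 21 words plus the unused index 0


def define_model(vocab_size):
    """Embedding -> LSTM -> softmax over the vocabulary."""
    model = Sequential()
    model.add(Embedding(vocab_size, 10))                # 10-dimensional word vectors
    model.add(LSTM(50))                                 # single hidden LSTM layer
    model.add(Dense(vocab_size, activation='softmax'))  # one output per word
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model


model = define_model(vocab_size)

# an untrained model still returns one probability per vocabulary position
probs = model.predict(np.array([[1]]), verbose=0)
print(probs.shape)

# training as described in the text, given prepared X and y:
# model.fit(X, y, epochs=500, verbose=2)
```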

The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack’ by encoding it and calling model.predict_classes() to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word. Note that predict_classes() was removed in TensorFlow 2.6; on newer versions, take the argmax of the output of model.predict() instead.

This process could then be repeated a few times to build up a generated sequence of words.

To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.
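The generation loop can be sketched without a trained network. In the example below, BigramCounts is a hypothetical stand-in for the fit model: it only counts which word follows which, and exposes a predict() method that returns one row of probabilities per input word. It is not the LSTM from this tutorial. The generate_seq() function shows the loop itself: encode the current word, predict, take the argmax, look the word up, and feed it back in.

```python
from collections import Counter, defaultdict

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# stand-in integer encoding plus a reverse mapping for lookups
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
index_word = {i: w for w, i in word_index.items()}
vocab_size = len(word_index) + 1


class BigramCounts:
    """Hypothetical stand-in for the trained network: next-word frequencies."""

    def __init__(self, encoded, vocab_size):
        self.vocab_size = vocab_size
        self.counts = defaultdict(Counter)
        for current, following in zip(encoded, encoded[1:]):
            self.counts[current][following] += 1

    def predict(self, batch):
        # one probability row per input word, shaped like a softmax output
        rows = []
        for current in batch:
            total = sum(self.counts[current].values()) or 1
            rows.append([self.counts[current][i] / total
                         for i in range(self.vocab_size)])
        return rows


def generate_seq(model, word_index, index_word, seed_text, n_words):
    """Generate n_words by feeding each predicted word back in as input."""
    in_text, result = seed_text, seed_text
    for _ in range(n_words):
        encoded = word_index[in_text.lower()]
        probs = model.predict([encoded])[0]
        # argmax over the probabilities gives the predicted word index
        yhat = max(range(len(probs)), key=probs.__getitem__)
        out_word = index_word.get(yhat)
        if out_word is None:  # index 0 is not a word, so stop generating
            break
        in_text, result = out_word, result + ' ' + out_word
    return result


model = BigramCounts([word_index[w] for w in words], vocab_size)
generated = generate_seq(model, word_index, index_word, 'Jack', 6)
print(generated)
```

With a real Keras model, the same loop applies, except that the input would be passed as an array and the argmax taken over the array returned by predict().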

We can tie all of this together. The complete code listing is provided below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example prints the loss and accuracy each training epoch.

We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences, for example:

jack => and
jack => fell

And so on.
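To list every input word that has more than one possible next word in the rhyme, we can collect the followers of each word. This is a small plain-Python sketch; the names followers and ambiguous are chosen for illustration.

```python
from collections import defaultdict

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

words = data.lower().split()

# collect the set of words that follow each word in the source text
followers = defaultdict(set)
for current, following in zip(words, words[1:]):
    followers[current].add(following)

# an input word is ambiguous if more than one word can follow it
ambiguous = {word: sorted(nxt) for word, nxt in followers.items() if len(nxt) > 1}
for word, nxt in ambiguous.items():
    print(word, '=>', ', '.join(nxt))
```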

At the end of the run, ‘Jack’ is passed in and a prediction or new sequence is generated.

We get a reasonable sequence as output that has some elements of the source.

This is a good first cut language model, but does not take full advantage of the LSTM’s ability to handle sequences of input and disambiguate some of the ambiguous pairwise sequences by using a broader context.

Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.

For example:

X,                          y
_, _, _, _, _, Jack,        and
_, _, _, _, Jack, and,      Jill
_, _, _, Jack, and, Jill,   went
...

This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.

In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras.

First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.

Next, we can pad the prepared sequences. We can do this using the pad_sequences() function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.

Next, we can split the sequences into input and output elements, much like before.
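The three preparation steps can be sketched in plain Python as follows. The encoding is a stand-in for the Tokenizer and the padding is done by hand rather than with pad_sequences(); both are assumptions made so the sketch runs without Keras. Zeros are added at the front of each sequence so that the word to predict is always in the last position.

```python
from collections import Counter

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# stand-in integer encoding: words numbered from 1 by frequency
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}

# build up each line one word at a time: [w1, w2], [w1, w2, w3], ...
sequences = []
for line in data.lower().splitlines():
    encoded = [word_index[w] for w in line.split()]
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i + 1])
print('Total Sequences: %d' % len(sequences))

# pad every sequence at the front with zeros to the longest length
max_length = max(len(seq) for seq in sequences)
padded = [[0] * (max_length - len(seq)) + seq for seq in sequences]
print('Max Sequence Length: %d' % max_length)

# all but the last integer is the input, the last integer is the output
X = [seq[:-1] for seq in padded]
y = [seq[-1] for seq in padded]
```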

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

We can use the model to generate new sequences as before. The generate_seq() function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
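The key change is that the growing list of encoded words must be brought to a fixed length of max_length-1 before each prediction. A minimal sketch of that step is shown below; the helper name prepare_input() is hypothetical, and the choice to pad and truncate at the front mirrors what I understand to be the defaults of pad_sequences().

```python
def prepare_input(encoded, length):
    """Pre-pad with zeros, or keep only the most recent `length` integers."""
    encoded = encoded[-length:]
    return [0] * (length - len(encoded)) + encoded


max_length = 7  # longest line-based sequence, input and output words included

# a short seed is padded at the front
short_seed = prepare_input([2, 1], max_length - 1)
print(short_seed)

# a long generated sequence keeps only its most recent words
long_seed = prepare_input([1, 2, 3, 4, 5, 6, 7, 8], max_length - 1)
print(long_seed)
```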

Tying all of this together, the complete code example is provided below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.

There are still two lines of text that start with ‘Jack’ that may still be a problem for the network.

At the end of the run, we generate two sequences with different seed words: ‘Jack’ and ‘Jill’.

The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw ‘Jill’ within an input sequence, not at the beginning of a sequence, so it has forced an output that uses the word ‘Jill’, i.e. the last line of the rhyme.

This was a good example of how the framing may result in better new lines, but not good partial lines of input.

Model 3: Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in sub-sequences of words as input.

This will provide a trade-off between the two framings, allowing new lines to be generated and generation to be picked up mid-line.

We will use 2 words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:
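A sketch of the two-words-in, one-word-out preparation is given below. As before, the encoding is a plain-Python stand-in for the Tokenizer and the variable names are assumptions; the only change from the one-word framing is that the window now covers three integers.

```python
from collections import Counter

data = ("Jack and Jill went up the hill\n"
        "To fetch a pail of water\n"
        "Jack fell down and broke his crown\n"
        "And Jill came tumbling after\n")

# stand-in integer encoding: words numbered from 1 by frequency
words = data.lower().split()
word_index = {w: i for i, (w, _) in enumerate(Counter(words).most_common(), start=1)}
encoded = [word_index[w] for w in words]

# slide a window of three integers: two input words and one output word
sequences = [encoded[i - 2:i + 1] for i in range(2, len(encoded))]
print('Total Sequences: %d' % len(sequences))

# first two columns are the input, the last column is the output
X = [seq[:2] for seq in sequences]
y = [seq[2] for seq in sequences]
```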

The complete example is listed below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example again gets a good fit on the source text at around 95% accuracy.

We look at 4 generation examples, two start of line cases and two starting mid line.

The first start of line case generated correctly, but the second did not. The second case was an example from the 4th line, which is ambiguous with content from the first line. Perhaps a further expansion to 3 input words would be better.

The two mid-line generation examples were generated correctly, matching the source text.

We can see that the choice of how the language model is framed and the requirements on how the model will be used must be compatible. Careful design is required when using language models in general, perhaps followed up by spot testing with sequence generation to confirm that the model requirements have been met.


Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the seed of the first word; demonstrate this.
  • Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
  • Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.


Summary

In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.

Specifically, you learned:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

93 Responses to How to Develop Word-Based Neural Language Models in Python with Keras

  1. Aaron November 3, 2017 at 5:46 am #

    Hi Jason – Thanks for this. How can a language model be used to “score” different text sentences. Suppose there is a speech recognition engine that outputs real words but they don’t make sense when combined together as a sentence. Could we use a language model to “score” each sentence to see which is more likely to occur? Thanks!

    • Jason Brownlee November 3, 2017 at 2:14 pm #

      Great question.

      Rather than score, the language model can take the raw input and predict the expected sequence or sequences and these outcomes can then be explored using a beam search.

      • Aaron November 4, 2017 at 1:02 am #

        Thanks, I’d love to see an example of this as an appendix to this post. By the way – I really enjoy your blog, can’t thank you enough for these examples.

        • Jason Brownlee November 4, 2017 at 5:32 am #

          Thanks. I have a post on beam search scheduled.

      • Mezgebu Abebe November 20, 2019 at 11:11 pm #

        Hello there, I’m trying to develop a next word prediction model with a GUI using Python 3.x, but I can’t. Can anyone help me?

        Thanks a lot!

        • Jason Brownlee November 21, 2019 at 6:06 am #

          Perhaps develop a language model and get it working standalone, then integrate it into your app later.

  2. had November 3, 2017 at 9:38 am #

    What does the second argument in the embedding mean?
    Did I understand correctly that each word is encoded as a number from 0 to 10?

    I created a network for predicting the words with a large number of words, the loss decreases too slowly, so I think I did something wrong.

    Maybe it should be, I don’t know (in by char generation it was a lot faster), I would be grateful for advice.

    • Jason Brownlee November 3, 2017 at 2:16 pm #

      The second argument is the dimensionality of the embedding, the number of dimensions for the encoded vector representation of each word.

      Common values are 50, 100, 500 or 1000.

  3. Nadeem Pasha November 3, 2017 at 3:57 pm #

    How can I extract transcriptions from the TIMIT database in Python?

    • Jason Brownlee November 4, 2017 at 5:25 am #

      Sorry, I don’t have examples of working with the TIMIT dataset.

  4. Anubhab Majumdar November 9, 2017 at 2:17 am #

    Thanks for the amazing post. A novice query – I have a large dataset of books and I want to train a LSTM on that. However, I am getting memoryerror when I try to use the entire dataset for training at once. Is there a way to break up the data and train the model using the parts? Or do I have to throw hardware at the problem?

    • Jason Brownlee November 9, 2017 at 10:03 am #

      You can use progressive loading in Keras to only load or yield one batch of data at a time.

      I have a post scheduled on this, but until then, read up on Keras data generators.

  5. Christoph Aurnhammer November 23, 2017 at 5:39 am #

    Dear Jason,
    Thank you very much for this post. I am trying to use your “Model 2: Line-by-Line Sequence” and scale it up to create an RNN language model. I have two questions about the way the data is represented:

    1. Is there a more efficient way to train an Embedding+RNN language model than splitting up a single sentence into several instances, with a single word added at each step?

    2. In this representation we need to feed part of the same sequence into the model over and over again. By presenting the words at the beginning of the sentence more often (as X), do we bias the model towards knowing sentence-initial-parts better than words occurring more frequently at the end of sentences?

    Kind regards and thank you,

    • Jason Brownlee November 23, 2017 at 10:41 am #

      I’d encourage you to explore alternate framings and see how they compare. There is no one true way.

      It may bias the model, perhaps you could test this.

  6. Carl S January 26, 2018 at 9:30 am #

    Hi Jason, what if you have multiple sentences to train in batches? In that case, your input would be 3 dimensional and the fit would return an error because the embedding layer only accepts 2 dimensions.
    Is there an efficient way to deal with it other than send the training set in batches with 1 sentence at a time?

    I could of course act as if all words were part of 1 sentence but how would the LSTM detect the end of a sentence?

    Thank you!

    • Jason Brownlee January 27, 2018 at 5:49 am #

      You could provide each sentence as a sample, group samples into a batch and the LSTM will reset states at the end of each batch.

      • Carl S January 30, 2018 at 9:00 am #

        Thank you for your reply Jason! I understand that the LSTM will reset states at the end of the batch, but shouldn’t we make it reset states after each sentence/sample in each batch?

        • Jason Brownlee January 30, 2018 at 10:01 am #

          Perhaps. Try it and see if it lifts model skill.

          I find it has much less effect than one would expect.

          • Carl S January 31, 2018 at 5:12 am #

            I am not able to do it as there will be a dimensionality issue preventing the Keras Embedding layer from giving correct output. If you have a workaround I would love to see your code.

  7. Onjule March 19, 2018 at 5:18 am #

    Amazing post! But I was working on something which requires an rnn language model built without libraries. Can the Keras functionalities used in the code here be replaced with self-written code, and has someone already done this? Is there any Github repository for the same?

    • Jason Brownlee March 19, 2018 at 6:08 am #

      It would require a lot of work, re-implementing systems that already are fast and reliable. Sounds like a bad idea.

      What is your motivation exactly?

      • Anjali Bhavan March 25, 2018 at 8:12 pm #

        Never mind, sir, I myself realized how bad an idea that is. Thank you for this amazing article tho!

  8. Dilip March 25, 2018 at 5:37 pm #

    How do i implement the same script to return me all possible sentences for a particular context.

    ex : If my data set contains a list of places i visited.

    I have visited India , I have visited USA,I have visited Germany ..

    The above script returns me the first possible match . how do i make the script return all the places ?

    is it possible ?

  9. Husam April 20, 2018 at 2:25 pm #


    I appreciate if you can share ideas about how I can improve the model or the parameters to predict words form larger text, say a novel. Is adding another LSTM layer or more will be good idea? or is it enough to increase the size of LSTM?

    Thank you again for all your posts, very helpful

  10. Yasir Hussain May 20, 2018 at 2:54 pm #

    How can we calculate cross_entropy and perplexity?

    • Jason Brownlee May 21, 2018 at 6:26 am #

      Keras can calculate cross entropy.

      Sorry, I do not have an example of calculating perplexity.

  11. Baron May 21, 2018 at 7:56 am #

    Hi Mr. Jason how can I calculate the perplexity measure in this algorithm?.

    • Jason Brownlee May 21, 2018 at 2:29 pm #

      Sorry, I don’t have an example of calculating perplexity.

  12. Talat May 24, 2018 at 5:55 pm #

    Hi, i tried to save my model as:

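    # serialize model to JSON
    model_json = model.to_json()
    with open("new_model_OneinOneOut.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5 (hypothetical file name)
    model.save_weights("new_model_OneinOneOut.h5")
    print("Saved model to disk")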

    But i couldnt load it and use it. How can i do that? Am i saving it right?

    • Jason Brownlee May 25, 2018 at 9:20 am #

      You must load the json and h5.

      What problem did you have exactly?

  13. Raghav May 31, 2018 at 9:07 pm #


    You seem to use one hot vector for the output vectors. This would be a huge problem in case of a very large vocabulary size. What do you suggest we should do instead?

    • Jason Brownlee June 1, 2018 at 8:20 am #

      Not as big a problem as you would think, it does scale to 10K and 100K vocabs fine.

      You can use search methods on the resulting probability vectors to get multiple different output sequences.

      You can also use hierarchical versions of softmax to improve efficiency.

  14. Jamil June 26, 2018 at 12:25 am #

    Hi Jason,

    Thanks for the great post. I have two questions. The corpus I’m working with has sentences of varying lengths, some 3 words long and others 30 words long. I want to train a sentence based language model, i.e. training data should not try to combine two or more sentences from the corpus.

    I’m slightly confused as to how to set up the training data. At the moment I have pre-padded with 0’s the shorter sentences so as to to match the size of the longest sentence. Example:

    sentence : I like fish – this sentence would be split up as follows:

    0 0 0 —-> I
    00 I —-> like
    0 I like —->fish
    I like fish —->

    This approach gives me roughly 110,000 training points, yet with an architecture an LSTM with 100 nodes my accuracy converges to 50%. Do you think I’ve incorrectly set up my data?

    A second point is could you advise us how to combine pretrained word embeddings with an LSTM language model in keras.


  15. Hoang Cuong August 10, 2018 at 11:42 am #


    I was wondering why we need to use:

    print(generate_seq(model, tokenizer, **max_length-1**, ‘Jack and’, 5))

    instead of

    print(generate_seq(model, tokenizer, **max_length**, ‘Jack and’, 5))

    at test time. Without doing minus 1 it does not work indeed. Why is it the case?

    Many thanks!

    • Jason Brownlee August 10, 2018 at 2:18 pm #

      As explained in the post:

      The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

  16. Nikolas August 30, 2018 at 6:59 pm #

    Hi Jason

    Is it possible to use these models for punctuation or article prediction (an LSTM neural network, where the y (punctuation/article/something else) depends on a specific number of previous/next words)? What is your advice about this task?

    Thank you!

    • Jason Brownlee August 31, 2018 at 8:09 am #

      Sure, an LSTM would be a great approach.

      • Nikolas August 31, 2018 at 6:01 pm #

        Do you make X_test X_train split for tasks like this? If there will be a words in the new text (X_test here) which are not tokenized in keras for X_train, how to deal with this (applying a trained model for text with new words)?

        • Jason Brownlee September 1, 2018 at 6:16 am #

          Yes. You need to ensure that your training dataset is representative of the problem, as in all ml problems.

      • Kowsher January 2, 2021 at 7:57 am #

        Hello this is nice,
        For a particular task i’m facing one problems.
        For example i have one data in my training set

        ‘Hey jack are you going to College ‘

        Now i have a sequence of text
        ‘Hey jack are you…’
        I have 2 options
        1. going
        2. coming

        I have to find the probability of the next word going and coming. Obviously the probability of going is 1 and coming is zero
        How can i check the next word probability from my options

        • Jason Brownlee January 2, 2021 at 12:05 pm #

          I would recommend a word-based language model that gives the probability for each word in the vocab for the next word in the sequence.

  17. bhb October 4, 2018 at 12:29 am #

    Dear Dr. Jason, I have been following your tutorials, and they are so interesting.
    now, I have the following questions on the topic of OCR.
    1. could you give me a simple example how to implement CNN + LSTM +CTC for scanned text image recognition( e.g if the scanned image is ” near the door” and the equivalent text is ‘near the door’ the how to give the image and the text for training?) please?

  18. Ishay November 6, 2018 at 6:50 am #

    Hi Jason,

    I have two questions:
    1. I have a project of next-word prediction, and I want to use your examples as the basis for it.
    My data includes multiple documents. One approach I thought of is to concatenate all documents into one list of tokens (with a beginning-of-sentence token), and then cut slices of fixed size as input for the model. The second approach is to work on each sentence separately using padding. Which approach would work better?

    2. If I want to use this language model for other purposes later on, how does it work? Do I use it like pre-trained embedding (like word2vec for instance)? Do you have an example for it? How does the input look like? (for example in pre-trained embedding the input is a vector for each word)

    Thank you

    • Jason Brownlee November 6, 2018 at 2:16 pm #

      Perhaps try both approaches and see what works best for your data and model.

      Yes, you could save the model weights and load them later and use them as part of an input or output language model.

  19. Ishay November 7, 2018 at 1:38 am #

    Hi Jason,

    Why are we converting the y to one-hot-encoding (to_categorical)? Is it a must? Why don’t we just leave it as an integer? I have a big vocabulary and it gives me a memory error.

    And also – why do we add ‘+1’ to the length of the word_index when creating the vocab_size?

    Thanks a lot. The post is really helpful

    • Jason Brownlee November 7, 2018 at 6:08 am #

      So we can predict the probability of each word and choose the next word as the word with the highest probability.

      It is not required, you could predict integers for words, but one hot encoding often works better.

      I add +1 to make room for “0” which is “I don’t know” or “unknown”.

  20. John December 12, 2018 at 5:10 pm #

    What exactly is this for:

    for word, index in tokenizer.word_index.items():
    if index == yhat:
    out_word = word

    Are you looping over the dictionary here every time you made a prediction, to look up the word corresponding to the index? Why not just reverse the dictionary once and look up the value??

    • Jason Brownlee December 13, 2018 at 7:43 am #

      Yes. Yes, I’m sure there are more efficient ways to write this, perhaps you could share some?
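One way to avoid the loop, as the question suggests, is to invert the word-to-integer mapping once and then look up each prediction directly. The sketch below uses a small made-up word_index and prediction rather than a fitted Tokenizer; the name index_word is chosen for illustration.

```python
# illustrative subset of a word-to-integer mapping (not from a fitted Tokenizer)
word_index = {'and': 1, 'jack': 2, 'jill': 3}

# invert the mapping once, outside the generation loop
index_word = {index: word for word, index in word_index.items()}

# look up a predicted integer directly; fall back to '' if it is unknown
yhat = 3
out_word = index_word.get(yhat, '')
print(out_word)
```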

      • Barry DeCicco February 6, 2020 at 5:52 am #

        Jason, I’ve been following an article at:,
        by Ashu Prasad. At one point, he does this (search for ‘We reverse the dictionary containing the encoded words using a helper function which facilitates us to plot the embeddings.’).

        [I can’t print the code because it’s an image. ]

        It relies on pulling the weights from the model; I’ve tried to duplicate it, but have failed.

        If somebody can get it working, it’s probably what people are looking for here.

        If you do, please let me know.

        • Jason Brownlee February 6, 2020 at 8:33 am #

          Perhaps contact the authors of the article directly?

  21. Ali January 7, 2019 at 5:17 am #

    Hi Jason,

    Thank you for the great article. I have 2 questions:

    1- If I have the model trained and after that I need to add new words to it, what is the best way to do that without retraining from the beginning?

    2- If I have trained the model with a wrong sentence. For example, I used ‘Hi Jason, hooo are you?’ but the correct one is ‘Hi Jason, how are you?’ and I want to fix that without retraining from the beginning. What is the best way to do that kind of reinforcement learning?

    • Jason Brownlee January 7, 2019 at 6:41 am #

      The easiest way: mark the new words as “unknown”.

      Another approach is to use the model weights as a starting point and re-train the model with a small learning rate and new/updated data.

  22. Suraj Chandrasekhar January 10, 2019 at 10:47 am #

    Hello Jason,

    This was a very well done article thank you.

    1. I was wondering, is there a way to generate text using an RNN/LSTM model without giving in the 1st input word like you have in the generate_seq method, similar to the markovify package, specifically the make_sentence()/make_short_sentence(char_length) functions?

    2. Also, would using word embeddings such as Word2Vec or GloVe embeddings allow us to use words not in the training corpus?

    • Jason Brownlee January 11, 2019 at 7:37 am #

      Yes, you can frame the problem any way you wish, e.g. feed one word and get a sentence or paragraph.

      The model can only be trained on words in the training corpus. New words are marked “unknown”.

  23. kokimoshida March 7, 2019 at 2:29 am #

    i wanna build a article recommendation system based on article titles and abstract, how can i use language modeling to measure the similarity between a user profile and the articles,
    Thank you

    • Jason Brownlee March 7, 2019 at 6:56 am #

      I don’t have an example of this. Perhaps the sum of the difference between the word vectors?

  24. Kirill March 16, 2019 at 12:18 am #

    Jason, very good post! I’m making the same model to predict future words in a text, but faced the problem of validation loss increasing. I split my data into train and test, and while the train loss is decreasing, the validation loss is increasing. So, I think it means overfitting. Even in your example, if we add the validation_split param to the fit method, we will see that validation loss is increasing too. I think it’s not ok. What is your opinion?

  25. Alex March 30, 2019 at 8:43 pm #

    Hello, Jason!

    Thank you for such a detailed article. I have two questions:

    1. What effect will changing COUNT in LSTM(units=COUNT) have on this word prediction network?

    2. Do I understand correctly that if I delete sequences with the same inputs and output, making a list with a unique set of sequences, it will reduce the number of patterns to be learned and will not affect the final result? (optimization of training time)

  26. hardik April 5, 2019 at 2:03 am #

    How can I extract a car VIN number from an image of the VIN that also contains other information?

    • Jason Brownlee April 5, 2019 at 6:19 am #

      Perhaps use classic computer vision techniques to isolate the text, then extract the text.

      I think OpenCV in Python might be a good place to start.

  27. Sravan Malla April 5, 2019 at 5:43 pm #

    Instead of one prediction, how can I make it produce a couple of predictions and allow the user to pick one among them?

    • Jason Brownlee April 6, 2019 at 6:43 am #

      Some ideas:

      Perhaps you can sample the output probabilities in order to generate a few different outputs.
      Perhaps you can try running the model a few times to get different outputs?
      Perhaps you can train and use a few parallel models to get different outputs?
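
      The first idea, sampling the output probabilities, could be sketched as below. This is only an illustration under stated assumptions: `sample_words`, the small vocabulary, and the probability vector are hypothetical stand-ins for the softmax output of a fit model, not code from the tutorial.

```python
import random

def sample_words(probs, index_to_word, n=3, seed=None):
    """Draw n words at random, weighted by their predicted probabilities."""
    rng = random.Random(seed)
    indices = list(range(len(probs)))
    # random.choices samples with replacement, so a likely word can repeat
    picked = rng.choices(indices, weights=probs, k=n)
    return [index_to_word[i] for i in picked]

# Hypothetical model output over a 4-word vocabulary (assumed values)
index_to_word = {0: "jill", 1: "fell", 2: "went", 3: "came"}
probs = [0.5, 0.3, 0.15, 0.05]

samples = sample_words(probs, index_to_word, n=5, seed=1)
print(samples)
```

      Because the draw is random, repeated calls without a fixed seed can give different candidate words, which is what makes this usable for offering several alternatives.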

  28. Mars Wayne August 21, 2019 at 7:28 pm #

    If I want to predict the 3 most probable words after inputting two words, how do I change the code? This model generates the next word and considers the whole string for the next word prediction. Currently I’m working on making a keyboard out of this.

    For example:
    If I input “I read”, the model should generate words like “it”, “book” and “your”.

    • Jason Brownlee August 22, 2019 at 6:27 am #

      You could look at the probabilities for the next word, and select those 3 words with the highest probability.
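
      A minimal sketch of that selection step, assuming you already have the probability vector from the model’s softmax output and a mapping from integer index to word. The helper `top_k_words`, the vocabulary, and the probabilities here are made up for illustration.

```python
def top_k_words(probs, index_to_word, k=3):
    """Return the k (word, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(index_to_word[i], probs[i]) for i in ranked[:k]]

# Hypothetical next-word probabilities after the input "I read" (assumed values)
index_to_word = {0: "hill", 1: "it", 2: "your", 3: "book"}
probs = [0.05, 0.5, 0.15, 0.3]

top3 = top_k_words(probs, index_to_word, k=3)
print(top3)
```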

  29. Pijush Biswas October 16, 2019 at 1:15 am #

    Hi, it is really a good article. I have gone through each example and started liking it.
    Would you please provide the syntax for a ‘previous word’ sequence that can be trained? Most of the examples I find on the web are next word predictors. My requirement is to predict the previous word. You already mentioned using an LSTM, but it would be helpful if you could provide an X, y sequence.

    • Jason Brownlee October 16, 2019 at 8:07 am #

      I don’t understand, sorry. Can you elaborate?

      • Pijush Biswas October 18, 2019 at 11:17 pm #

        Hi, I was looking at model 2:

        X, y
        _, _, _, _, _, Jack, and
        _, _, _, _, Jack, and, Jill
        _, _, _, Jack, and, Jill, went
        _, _, Jack, and, Jill, went, up
        _, Jack, and, Jill, went, up, the
        Jack, and, Jill, went, up, the, hill

        # build one growing input sequence per position in each line
        sequences = list()
        for line in data.split('\n'):
            encoded = tokenizer.texts_to_sequences([line])[0]
            for i in range(1, len(encoded)):
                sequence = encoded[:i+1]
                sequences.append(sequence)

        My doubt is how I can rewrite these to predict the “previous” word, so that y becomes the previous word:

        Will it work? Or how should it work?

        X, y
        Jack, and, Jill, went, up, the, hill, newline
        and, Jill, went, up, the, hill, _, Jack
        Jill, went, up, the, hill, _, _, and
        went, up, the, hill, _, _, _, Jill
        up, the, hill, _, _, _, _, went
        the, hill, _, _, _, _, _, up

        • Jason Brownlee October 19, 2019 at 6:39 am #

          No need to predict the previous word as it is already available.

          If you want to learn how to predict a prior word given no other information, you can simply reverse the order of the input sequences when training.
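
          As a rough sketch of the reversal idea, the snippet below builds training pairs from a reversed line so that the target is the word that comes before the context in the original text. The helper `reversed_pairs` is hypothetical; integer encoding and padding would follow as in the tutorial.

```python
def reversed_pairs(line):
    """Build (context, target) pairs where the target is the word that
    comes BEFORE the context words in the original line."""
    words = line.split()[::-1]  # reverse the word order
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]  # words that follow the target in the original text
        target = words[i]    # the prior word in the original text
        pairs.append((context, target))
    return pairs

pairs = reversed_pairs("Jack and Jill went up the hill")
for context, target in pairs:
    print(context, "->", target)
```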

          • Pijush Biswas October 25, 2019 at 4:48 am #

            Thanks Jason for the help. I think that would help. Would you like to see what my exact question is?

            The problem I am trying to solve is:

            I have these lines:

            Line1: Jack and Jill went up the hill
            Line2: To fetch a pail of water
            Line3: Jack fell down and broke his crown
            Line4: And Jill came tumbling after

            Now I want to rewrite Line4 so that it rhymes with the word “water”. In my case, “mother” will be the right word.
            Line4: And _, _ I love my mother

            Or I want to change the word “tumbling” and find the best fit at that position:
            Line4: And Jill came “_” after

            To achieve that, I can reverse the line and train the model. And then I have to keep another model for next word prediction.

            I want to understand whether any layer or technique has built-in support for both next and prior word prediction. I have not fully understood the LSTM; I just thought the LSTM can take care of remembering the previous word?

          • Jason Brownlee October 25, 2019 at 6:52 am #


            There are many ways to frame the problem.

            A simple/naive way – that might work – would be to input the text as is and have the model predict the missing word or words directly.

            Try that as a first step. Use a special token to represent missing words.
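
            One way this framing might be prepared is sketched below. The token name `<mask>` and the helper `masked_examples` are assumptions made for illustration; any token that does not appear in the vocabulary would do.

```python
MASK = "<mask>"  # hypothetical special token standing in for a missing word

def masked_examples(line):
    """For each position, return (words with that position masked, missing word)."""
    words = line.split()
    examples = []
    for i, word in enumerate(words):
        masked = words[:i] + [MASK] + words[i + 1:]
        examples.append((masked, word))
    return examples

examples = masked_examples("And Jill came tumbling after")
for masked, target in examples:
    print(" ".join(masked), "->", target)
```

            Each pair gives the model the full line with one gap as input and the missing word as the output to predict.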

  30. Yonatan October 27, 2019 at 3:11 am #

    What is the vocabulary size if we use the Tokenizer with num_words?

    If I use the Tokenizer with num_words:
    tokenizer = Tokenizer(num_words=num_words, lower=True)

    Now we have this line:
    y = to_categorical(y, num_classes=vocab_size)

    Should I call it with:
    y = to_categorical(y, num_classes=num_words)

    That’s because the actual number of words should be smaller.

    I have a vocabulary size of ~ 800K words and pad_sequences always raises a MemoryError. That’s why I’m asking.

    • Jason Brownlee October 27, 2019 at 5:47 am #

      You might have the terms the wrong way around?

      The vocab size will be much smaller than the number of words, as the number of words includes duplicates.
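
      To make the distinction concrete, here is a small sketch that counts both quantities for two lines of the rhyme using only the standard library. The lowercasing and whitespace split are simplifying assumptions; the tutorial itself takes the vocabulary size from the Keras Tokenizer’s word_index plus one.

```python
from collections import Counter

text = "Jack and Jill went up the hill Jack fell down and broke his crown"
words = text.lower().split()

counts = Counter(words)
total_words = len(words)   # every occurrence, duplicates included
vocab_size = len(counts)   # unique words only

print(total_words, vocab_size)
```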

  31. Efstathios Chatzikyriakidis May 14, 2020 at 6:15 am #

    It is overkill to use LSTM in One-Word-In, One-Word-Out framing since no sequence is used (the length is 1).

    We can use just a Flatten layer after Embedding and connect it to a Dense layer.

    • Jason Brownlee May 14, 2020 at 1:23 pm #

      Sure. Think of the example as a starting point for your own projects.

  32. Efstathios Chatzikyriakidis May 14, 2020 at 6:48 am #

    The One-Word-In -> One-Word-Out implementation also creates the following 2-grams:



    Jack and Jill went up the hill
    To fetch a pail of water

    Which is incorrect.

    We need to create 2-grams per line.

    Also, if the text is a paragraph, we need to segment the paragraph into sentences and then do the 2-grams extraction for the dataset.
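
    A minimal sketch of per-line 2-gram extraction, assuming lines are separated by newline characters and words by whitespace; `bigrams_per_line` is a hypothetical helper. Pairs are only formed inside a line, so no pair joins the last word of one line to the first word of the next.

```python
def bigrams_per_line(text):
    """Collect (word, next_word) pairs without crossing line boundaries."""
    pairs = []
    for line in text.split("\n"):
        words = line.split()
        # zip stops at the shorter input, so each word pairs with its successor
        pairs.extend(zip(words, words[1:]))
    return pairs

text = "Jack and Jill went up the hill\nTo fetch a pail of water"
pairs = bigrams_per_line(text)
print(pairs)
```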

  33. Siddharth July 11, 2020 at 11:40 pm #

    Hi Jason,

    Your write-up is pretty clean and understandable. I followed this article and created a next word/sequence prediction model. I am facing an issue with the outputs inferred by the model.

    For example, if I feed the model “Where can I buy”, I get the outputs “Where can I buy a bicycle” & “Where can I buy spare parts for my bicycle”. These 2 are perfect.
    I also get a couple of grammatically incorrect outputs – “Where can I buy of bicycle”, “Where can I buy went to bicycle”.
    Do you have any ideas on how to filter out the grammatically incorrect outputs so that we are left with only good sentences in output? Thanks for your help.

    FYI – Training Data Creation –
    The approach I followed is to use trigrams as the input. For example, for the sentence “I am reading this article”, I used the data below for training.

    (I, am, reading) > (this)
    (am, reading, this) > (article)

    • Jason Brownlee July 12, 2020 at 5:52 am #


      Not really, other than training a better model that makes fewer errors.

  34. seraj June 6, 2021 at 5:17 am #

    Hi Jason
    Thanks for the informative tutorial.

    I was wondering about this :
    model.add(Embedding(vocab_size, 10, input_length=1))
    Should the number of units in the single hidden LSTM layer (50) be equal to the length of the Embedding layer, I mean the sequence input_length?

    • Jason Brownlee June 6, 2021 at 5:52 am #

      The size of the embedding and the number of units in the LSTM layer are not related.

  35. Karina August 3, 2021 at 10:16 am #

    Hi, how are you? Sorry, I have a query. I am using your example to predict the next word with a corpus of data (“sentences”), which I concatenate to form a single text before performing the procedure. However, my network is not training: the accuracy starts at “0.04” and stays almost the same across epochs. I have checked everything, and even the word processing is fine. I don’t know what to do.

  36. Mona August 17, 2021 at 10:44 am #

    How can I use the presented language model to correct the speech recognizer results?

    • Adrian Tam August 17, 2021 at 11:51 am #

      Speech recognition is a different topic, but you may consider that the recognizer does not recognize just one word; it produces multiple candidate words with different probabilities. Normally you take the single candidate with the highest probability as the output, but with a language model you can instead choose the output with the highest probability in the sequence, taking the words before the current one into consideration.
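
      A toy sketch of this rescoring idea follows. Everything in it is assumed for illustration: the candidate words, the recognizer probabilities, the bigram probabilities, the `<s>` start marker, and the greedy word-by-word strategy in `rescore`. It is not how any particular recognizer works.

```python
def rescore(candidates, bigram_prob, start="<s>", floor=1e-6):
    """Greedily pick, at each position, the candidate that maximizes
    recognizer probability times language-model probability given the previous word."""
    previous, output = start, []
    for options in candidates:
        best = max(
            options,
            key=lambda w: options[w] * bigram_prob.get((previous, w), floor),
        )
        output.append(best)
        previous = best
    return output

# Hypothetical recognizer output: candidate words with probabilities per position
candidates = [
    {"jack": 0.9, "jock": 0.1},
    {"fell": 0.4, "fill": 0.6},
    {"down": 0.8, "dawn": 0.2},
]
# Hypothetical language-model probabilities P(word | previous word)
bigram_prob = {
    ("<s>", "jack"): 0.5,
    ("jack", "fell"): 0.6,
    ("jack", "fill"): 0.05,
    ("fell", "down"): 0.7,
    ("fell", "dawn"): 0.01,
}

result = rescore(candidates, bigram_prob)
print(result)
```

      In this made-up data the recognizer alone prefers “fill” at the second position, while combining it with the language model selects “fell” because it is far more likely after “jack”.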
