How to Develop Word-Based Neural Language Models in Python with Keras

Language modeling involves predicting the next word in a sequence given the sequence of words already present.

A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.

In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.

After completing this tutorial, you will know:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Let’s get started.

Photo by Stephanie Chapman, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Framing Language Modeling
  2. Jack and Jill Nursery Rhyme
  3. Model 1: One-Word-In, One-Word-Out Sequences
  4. Model 2: Line-by-Line Sequence
  5. Model 3: Two-Words-In, One-Word-Out Sequence


Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.

Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input that are processed one at a time where a prediction can be made and learned for each input sequence.

Similarly, when making predictions, the process can be seeded with one or a few words; predicted words are then gathered and presented as input for subsequent predictions in order to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.

There are many ways to frame the sequences from a source text for language modeling.

In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.

There is no single best approach, just different framings that may suit different applications.

Jack and Jill Nursery Rhyme

Jack and Jill is a simple nursery rhyme.

It is comprised of 4 lines, as follows:

Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based language model.

We can define this text in Python as follows:
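As a minimal sketch, the rhyme can be held in a single string, with newline escapes separating the four lines (the variable name `data` is our choice):

```python
# source text: the four lines of the rhyme, separated by newlines
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
```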

Model 1: One-Word-In, One-Word-Out Sequences

We can start with a very simple model.

Given one word as input, the model will learn to predict the next word in the sequence.

For example: given the input ‘Jack‘, the model predicts ‘and‘; given ‘and‘, it predicts ‘Jill‘; and so on through the rhyme.

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the word_index attribute.
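A sketch of the encoding step, assuming the `tensorflow.keras` import path (the original tutorial used the standalone `keras` package):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""

# integer encode the text: fit the word-to-integer mapping, then convert
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# vocabulary size, plus one so the largest word integer is a valid array index
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```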

Running this example, we can see that the size of the vocabulary is 21 words.

We add one, because we will need to specify the integer for the largest encoded word as an array index, e.g. words encoded 1 to 21 require array indices 0 to 21, i.e. 22 positions.

Next, we need to create sequences of words to fit the model with one word as input and one word as output.
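A sketch of the pair construction; the encoding step is repeated here so the snippet runs on its own:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# encode the source text as integers, as described above
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# create one-word-in, one-word-out pairs: (word[i-1], word[i])
sequences = list()
for i in range(1, len(encoded)):
    sequences.append(encoded[i-1:i+1])
print('Total Sequences: %d' % len(sequences))
```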

Running this piece shows that we have a total of 24 input-output pairs to train the network.

We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding: a vector with a 0 for every word in the vocabulary and a 1 for the actual word that the integer value represents. This gives the network a ground truth to aim for, from which we can calculate error and update the model.

Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.
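A sketch of the split and the one hot encoding, again with the data preparation repeated for completeness:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# prepare the encoded word pairs as before
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
vocab_size = len(tokenizer.word_index) + 1
sequences = np.array([encoded[i-1:i+1] for i in range(1, len(encoded))])

# first column is the input word, second column is the word to predict
X, y = sequences[:, 0], sequences[:, 1]
# one hot encode the outputs across the whole vocabulary
y = to_categorical(y, num_classes=vocab_size)
```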

We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore the input_length=1.

The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

The structure of the network can be summarized as follows:
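A sketch of the model definition. Note that the original tutorial also passed `input_length=1` to the `Embedding` layer; that argument is omitted here because recent Keras versions no longer accept it.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 22  # 21 words plus one, from the tokenizer step

# one word in -> 10-dimensional embedding -> LSTM -> softmax over the vocabulary
model = Sequential()
model.add(Embedding(vocab_size, 10))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))

# run one dummy input through the model to build it, then print the structure
model.predict(np.zeros((1, 1)), verbose=0)
model.summary()
```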

We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore using the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.

The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack‘ by encoding it and calling model.predict_classes() to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

This process could then be repeated a few times to build up a generated sequence of words.

To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.
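A sketch of such a function. The original tutorial called `model.predict_classes()`, which was removed in later Keras releases; taking the argmax of `model.predict()` is the equivalent:

```python
import numpy as np

# generate a sequence of words by repeatedly predicting the next word
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    for _ in range(n_words):
        # integer encode the current input word, shaped as a batch of one
        encoded = np.array(tokenizer.texts_to_sequences([in_text])[0]).reshape(1, -1)
        # predict the integer index of the next word
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # map the predicted integer back to its word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # use the prediction as the next input, and append it to the result
        in_text, result = out_word, result + ' ' + out_word
    return result
```

For example, `generate_seq(model, tokenizer, 'Jack', 6)` would return ‘Jack‘ followed by six predicted words.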

We can tie all of this together. The complete program simply combines the steps above: defining the text, encoding it, building the input-output pairs, defining and fitting the model, and generating a sequence from a seed word.

Running the example prints the loss and accuracy each training epoch.

We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences. For example, ‘Jack‘ is followed by ‘and‘ in the first line but by ‘fell‘ in the third line, so the same input word maps to two different outputs. And so on.

At the end of the run, ‘Jack‘ is passed in and a prediction or new sequence is generated.

We get a reasonable sequence as output that has some elements of the source.

This is a good first cut language model, but does not take full advantage of the LSTM’s ability to handle sequences of input and disambiguate some of the ambiguous pairwise sequences by using a broader context.

Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.

For example: from the first line, the input ‘Jack‘ predicts ‘and‘, ‘Jack and‘ predicts ‘Jill‘, ‘Jack and Jill‘ predicts ‘went‘, and so on, with the input growing by one word each time.

This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.

This comes at the cost of not predicting words across line boundaries, which might be fine for now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will need to pad the sequences so that they all meet a fixed input length. This is a requirement of Keras.

First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.

Next, we can pad the prepared sequences. We can do this using the pad_sequences() function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.
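A sketch of the line-by-line preparation, assuming the `pad_sequences()` helper from `tensorflow.keras.preprocessing.sequence`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# build growing sub-sequences from each line: first 2 words, first 3 words, ...
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i+1])
print('Total Sequences: %d' % len(sequences))

# find the longest sequence, then pre-pad all others with zeros to that length
max_length = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
```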

Next, we can split the sequences into input and output elements, much like before.

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

We can use the model to generate new sequences as before. The generate_seq() function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
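A sketch of the updated function; as before, the argmax of `model.predict()` stands in for the older `predict_classes()` call:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# generate a sequence from a model trained on fixed-length, pre-padded inputs
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # integer encode everything generated so far
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad to the fixed input length the model was trained with
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the next word and look it up in the vocabulary
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append the prediction so it becomes part of the next input
        in_text += ' ' + out_word
    return in_text
```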

Tying all of this together, the complete example combines the line-by-line data preparation with the same model definition, training, and generation steps used previously.

Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.

There are still two lines of text that start with ‘Jack‘, which may be a problem for the network.

At the end of the run, we generate two sequences with different seed words: ‘Jack‘ and ‘Jill‘.

The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw ‘Jill‘ within an input sequence, not at the beginning of the sequence, so it has forced an output to use the word ‘Jill‘, i.e. the last line of the rhyme.

This is a good example of how this framing may result in better new lines, but it does not handle partial lines of input well.

Model 3: Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in sub-sequences of words as input.

This will provide a trade-off between the two framings, allowing new lines to be generated and allowing generation to be picked up mid-line.

We will use two words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:
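A sketch of the two-words-in preparation; only the slice offsets change relative to Model 1:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# build two-words-in, one-word-out triples: (word[i-2], word[i-1], word[i])
sequences = list()
for i in range(2, len(encoded)):
    sequences.append(encoded[i-2:i+1])
print('Total Sequences: %d' % len(sequences))
```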

The complete example follows the same structure as before: prepare the sequences, define and fit the model, and generate text from seed words.

Running the example again gets a good fit on the source text at around 95% accuracy.

We look at 4 generation examples, two start of line cases and two starting mid line.

The first start of line case generated correctly, but the second did not. The second case was an example from the 4th line, which is ambiguous with content from the first line. Perhaps a further expansion to 3 input words would be better.

The two mid-line generation examples were generated correctly, matching the source text.

We can see that the choice of how the language model is framed must be compatible with the requirements on how the model will be used. Careful design is required when using language models in general, perhaps followed up by spot testing with sequence generation to confirm that the model requirements have been met.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the first word as a seed; demonstrate this.
  • Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
  • Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.

Specifically, you learned:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




43 Responses to How to Develop Word-Based Neural Language Models in Python with Keras

  1. Aaron November 3, 2017 at 5:46 am #

    Hi Jason – Thanks for this. How can a language model be used to “score” different text sentences. Suppose there is a speech recognition engine that outputs real words but they don’t make sense when combined together as a sentence. Could we use a language model to “score” each sentence to see which is more likely to occur? Thanks!

    • Jason Brownlee November 3, 2017 at 2:14 pm #

      Great question.

      Rather than score, the language model can take the raw input and predict the expected sequence or sequences and these outcomes can then be explored using a beam search.

      • Aaron November 4, 2017 at 1:02 am #

        Thanks, I’d love to see an example of this as an appendix to this post. By the way – I really enjoy your blog, can’t thank you enough for these examples.

        • Jason Brownlee November 4, 2017 at 5:32 am #

          Thanks. I have a post on beam search scheduled.

  2. had November 3, 2017 at 9:38 am #

    What does the second argument in Embedding mean?
    Did I understand correctly that each word is encoded as a number from 0 to 10?

    I created a network for predicting the words with a large number of words, the loss decreases too slowly, so I think I did something wrong.

    Maybe it should be, I don’t know (in by char generation it was a lot faster), I would be grateful for advice.

    https://pastebin.com/PPWiuMXf

    • Jason Brownlee November 3, 2017 at 2:16 pm #

      The second argument is the dimensionality of the embedding, the number of dimensions for the encoded vector representation of each word.

      Common values are 50, 100, 500 or 1000.

  3. Nadeem Pasha November 3, 2017 at 3:57 pm #

    How can I extract transcriptions from the TIMIT database in Python?

    • Jason Brownlee November 4, 2017 at 5:25 am #

      Sorry, I don’t have examples of working with the TIMIT dataset.

  4. Anubhab Majumdar November 9, 2017 at 2:17 am #

    Thanks for the amazing post. A novice query – I have a large dataset of books and I want to train a LSTM on that. However, I am getting memoryerror when I try to use the entire dataset for training at once. Is there a way to break up the data and train the model using the parts? Or do I have to throw hardware at the problem?

    • Jason Brownlee November 9, 2017 at 10:03 am #

      You can use progressive loading in Keras to only load or yield one batch of data at a time.

      I have a post scheduled on this, but until then, read up on Keras data generators.

  5. Christoph Aurnhammer November 23, 2017 at 5:39 am #

    Dear Jason,
    Thank you very much for this post. I am trying to use your “Model 2: Line-by-Line Sequence” and scale it up to create an RNN language model. I have two questions about the way the data is represented:

    1. Is there a more efficient way to train an Embedding+RNN language model than splitting up a single sentence into several instances, with a single word added at each step?

    2. In this representation we need to feed part of the same sequence into the model over and over again. By presenting the words at the beginning of the sentence more often (as X), do we bias the model towards knowing sentence-initial-parts better than words occurring more frequently at the end of sentences?

    Kind regards and thank you,
    Christoph

    • Jason Brownlee November 23, 2017 at 10:41 am #

      I’d encourage you to explore alternate framings and see how they compare. There is no one true way.

      It may bias the model, perhaps you could test this.

  6. Carl S January 26, 2018 at 9:30 am #

    Hi Jason, what if you have multiple sentences to train in batches? In that case, your input would be 3 dimensional and the fit would return an error because the embedding layer only accepts 2 dimensions.
    Is there an efficient way to deal with it other than send the training set in batches with 1 sentence at a time?

    I could of course act as if all words were part of 1 sentence but how would the LSTM detect the end of a sentence?

    Thank you!

    • Jason Brownlee January 27, 2018 at 5:49 am #

      You could provide each sentence as a sample, group samples into a batch and the LSTM will reset states at the end of each batch.

      • Carl S January 30, 2018 at 9:00 am #

        Thank you for your reply Jason! I understand that the LSTM will reset states at the end of the batch, but shouldn’t we make it reset states after each sentence/sample in each batch?

        • Jason Brownlee January 30, 2018 at 10:01 am #

          Perhaps. Try it and see if it lifts model skill.

          I find it has much less effect than one would expect.

          • Carl S January 31, 2018 at 5:12 am #

            I am not able to do it as there will be a dimensionality issue preventing the Keras Embedding layer from giving correct output. If you have a workaround I would love to see your code.

  7. Onjule March 19, 2018 at 5:18 am #

    Amazing post! But I was working on something which requires an rnn language model built without libraries. Can the Keras functionalities used in the code here be replaced with self-written code, and has someone already done this? Is there any Github repository for the same?

    • Jason Brownlee March 19, 2018 at 6:08 am #

      It would require a lot of work, re-implementing systems that already are fast and reliable. Sounds like a bad idea.

      What is your motivation exactly?

      • Anjali Bhavan March 25, 2018 at 8:12 pm #

        Never mind, sir, I myself realized how bad an idea that is. Thank you for this amazing article though!

  8. Dilip March 25, 2018 at 5:37 pm #

    How do i implement the same script to return me all possible sentences for a particular context.

    ex : If my data set contains a list of places i visited.

    I have visited India , I have visited USA,I have visited Germany ..

    The above script returns me the first possible match . how do i make the script return all the places ?

    is it possible ?

  9. Husam April 20, 2018 at 2:25 pm #

    Awesome!!!

    I appreciate if you can share ideas about how I can improve the model or the parameters to predict words form larger text, say a novel. Is adding another LSTM layer or more will be good idea? or is it enough to increase the size of LSTM?

    Thank you again for all your posts, very helpful

  10. Yasir Hussain May 20, 2018 at 2:54 pm #

    How can we calculate cross_entropy and perplexity?

    • Jason Brownlee May 21, 2018 at 6:26 am #

      Keras can calculate cross entropy.

      Sorry, I do not have an example of calculating perplexity.

  11. Baron May 21, 2018 at 7:56 am #

    Hi Mr. Jason how can I calculate the perplexity measure in this algorithm?.

    • Jason Brownlee May 21, 2018 at 2:29 pm #

      Sorry, I don’t have an example of calculating perplexity.

  12. Talat May 24, 2018 at 5:55 pm #

    Hi, I tried to save my model as:

    # serialize model to JSON
    model_json = model.to_json()
    with open("new_model_OneinOneOut.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights("weights_OneinOneOut.h5")
    print("Saved model to disk")

    But I couldn’t load it and use it. How can I do that? Am I saving it right?

    • Jason Brownlee May 25, 2018 at 9:20 am #

      You must load the json and h5.

      What problem did you have exactly?

  13. Raghav May 31, 2018 at 9:07 pm #

    Hi,

    You seem to use one hot vector for the output vectors. This would be a huge problem in case of a very large vocabulary size. What do you suggest we should do instead?

    • Jason Brownlee June 1, 2018 at 8:20 am #

      Not as big a problem as you would think, it does scale to 10K and 100K vocabs fine.

      You can use search methods on the resulting probability vectors to get multiple different output sequences.

      You can also use hierarchical versions of softmax to improve efficiency.

  14. Jamil June 26, 2018 at 12:25 am #

    Hi Jason,

    Thanks for the great post. I have two questions. The corpus I’m working with has sentences of varying lengths, some 3 words long and others 30 words long. I want to train a sentence based language model, i.e. training data should not try to combine two or more sentences from the corpus.

    I’m slightly confused as to how to set up the training data. At the moment I have pre-padded the shorter sentences with 0’s so as to match the size of the longest sentence. Example:

    sentence : I like fish – this sentence would be split up as follows:

    0 0 0 ---> I
    0 0 I ---> like
    0 I like ---> fish
    I like fish --->

    This approach gives me roughly 110,000 training points, yet with an LSTM architecture with 100 nodes my accuracy converges to 50%. Do you think I’ve incorrectly set up my data?

    A second point is could you advise us how to combine pretrained word embeddings with an LSTM language model in keras.

    Thanks

  15. Hoang Cuong August 10, 2018 at 11:42 am #

    Hi,

    I was wondering why we need to use:

    print(generate_seq(model, tokenizer, **max_length-1**, ‘Jack and’, 5))

    instead of

    print(generate_seq(model, tokenizer, **max_length**, ‘Jack and’, 5))

    at test time. Without doing minus 1 it does not work indeed. Why is it the case?

    Many thanks!

    • Jason Brownlee August 10, 2018 at 2:18 pm #

      As explained in the post:

      The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

  16. Nikolas August 30, 2018 at 6:59 pm #

    Hi Jason

    Is it possible to use these models for punctuation or article prediction (an LSTM neural network where the y, i.e. the punctuation/article/something else, depends on a specific number of previous/next words)? What is your advice about this task?

    Thank you!

    • Jason Brownlee August 31, 2018 at 8:09 am #

      Sure, an LSTM would be a great approach.

      • Nikolas August 31, 2018 at 6:01 pm #

        Do you make X_test X_train split for tasks like this? If there will be a words in the new text (X_test here) which are not tokenized in keras for X_train, how to deal with this (applying a trained model for text with new words)?

        • Jason Brownlee September 1, 2018 at 6:16 am #

          Yes. You need to ensure that your training dataset is representative of the problem, as in all ml problems.

  17. bhb October 4, 2018 at 12:29 am #

    Dear Dr. Jason, I have been followed your tutorial, and it is so interesting.
    now, I have the following questions on the topic of OCR.
    1. Could you give me a simple example of how to implement CNN + LSTM + CTC for scanned text image recognition? (e.g. if the scanned image is “near the door” and the equivalent text is ‘near the door’, how do I give the image and the text for training?)
