Last Updated on September 3, 2020
Language modeling involves predicting the next word in a sequence given the sequence of words already present.
A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.
In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.
After completing this tutorial, you will know:
- The challenge of developing a good framing of a word-based language model for a given application.
- How to develop one-word, two-word, and line-based framings for word-based language models.
- How to generate sequences using a fit language model.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.

How to Develop Word-Based Neural Language Models in Python with Keras
Photo by Stephanie Chapman, some rights reserved.
Tutorial Overview
This tutorial is divided into 5 parts; they are:
- Framing Language Modeling
- Jack and Jill Nursery Rhyme
- Model 1: One-Word-In, One-Word-Out Sequences
- Model 2: Line-by-Line Sequence
- Model 3: Two-Words-In, One-Word-Out Sequence
Framing Language Modeling
A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.
Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.
Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input that are processed one at a time where a prediction can be made and learned for each input sequence.
Similarly, when making predictions, the process can be seeded with one or a few words, then predicted words can be gathered and presented as input on subsequent predictions in order to build up a generated output sequence.
Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.
There are many ways to frame the sequences from a source text for language modeling.
In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.
There is no single best approach, just different framings that may suit different applications.
Jack and Jill Nursery Rhyme
Jack and Jill is a simple nursery rhyme.
It is comprised of 4 lines, as follows:
Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after
We will use this as our source text for exploring different framings of a word-based language model.
We can define this text in Python as follows:
# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """
Model 1: One-Word-In, One-Word-Out Sequences
We can start with a very simple model.
Given one word as input, the model will learn to predict the next word in the sequence.
For example:
X,      y
Jack,   and
and,    Jill
Jill,   went
...
The first step is to encode the text as integers.
Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.
Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.
The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the word_index attribute.
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
Running this example, we can see that the size of the vocabulary is 21 words.
We add one because we will need to use the integer for the largest encoded word as an array index; words encoded 1 to 21 require array indices 0 to 21, i.e. 22 positions.
Next, we need to create sequences of words to fit the model with one word as input and one word as output.
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
Running this piece shows that we have a total of 24 input-output pairs to train the network.
Total Sequences: 24
We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]
We will fit our model to predict a probability distribution across all words in the vocabulary. That means we need to turn the output element from a single integer into a one hot encoding, with a 0 for every word in the vocabulary and a 1 for the actual word. This gives the network a ground truth to aim for, from which we can calculate error and update the model.
Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
We are now ready to define the neural network model.
The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore the input_length=1.
The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
The structure of the network can be summarized as follows:
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 1, 10)             220
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122
=================================================================
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.
Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore using the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.
The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack‘ by encoding it and calling model.predict_classes() to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.
# evaluate
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)
This process could then be repeated a few times to build up a generated sequence of words.
To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result
We can tie all of this together. The complete code listing is provided below.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example prints the loss and accuracy each training epoch.
...
Epoch 496/500
0s - loss: 0.2358 - acc: 0.8750
Epoch 497/500
0s - loss: 0.2355 - acc: 0.8750
Epoch 498/500
0s - loss: 0.2352 - acc: 0.8750
Epoch 499/500
0s - loss: 0.2349 - acc: 0.8750
Epoch 500/500
0s - loss: 0.2346 - acc: 0.8750
We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences, for example:
jack => and
jack => fell
And so on.
At the end of the run, ‘Jack‘ is passed in and a prediction or new sequence is generated.
We get a reasonable sequence as output that has some elements of the source.
Jack and jill came tumbling after down
This is a good first cut language model, but does not take full advantage of the LSTM’s ability to handle sequences of input and disambiguate some of the ambiguous pairwise sequences by using a broader context.
Model 2: Line-by-Line Sequence
Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.
For example:
X,                                y
_, _, _, _, _, Jack,              and
_, _, _, _, Jack, and,            Jill
_, _, _, Jack, and, Jill,         went
_, _, Jack, and, Jill, went,      up
_, Jack, and, Jill, went, up,     the
Jack, and, Jill, went, up, the,   hill
This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.
In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.
Note that in this representation, we will require a padding of sequences to ensure they meet a fixed length input. This is a requirement when using Keras.
First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
Next, we can pad the prepared sequences. We can do this using the pad_sequences() function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
Next, we can split the sequences into input and output elements, much like before.
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length; the -1 is because the maximum length we calculated included both the input and output elements of each sequence.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
We can use the model to generate new sequences as before. The generate_seq() function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text
Tying all of this together, the complete code example is provided below.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.
There are still two lines of text that start with ‘Jack‘ that may still be a problem for the network.
...
Epoch 496/500
0s - loss: 0.1039 - acc: 0.9524
Epoch 497/500
0s - loss: 0.1037 - acc: 0.9524
Epoch 498/500
0s - loss: 0.1035 - acc: 0.9524
Epoch 499/500
0s - loss: 0.1033 - acc: 0.9524
Epoch 500/500
0s - loss: 0.1032 - acc: 0.9524
At the end of the run, we generate two sequences with different seed words: ‘Jack‘ and ‘Jill‘.
The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw ‘Jill‘ within an input sequence, not at the beginning of the sequence, so it has forced an output to use the word ‘Jill‘, i.e. the last line of the rhyme.
Jack fell down and broke
Jill jill came tumbling after
This was a good example of how the framing may result in better new lines, but not good partial lines of input.
Model 3: Two-Words-In, One-Word-Out Sequence
We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in sub-sequences of words as input.
This provides a trade-off between the two framings, allowing new lines to be generated and allowing generation to be picked up mid-line.
We will use two words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
The complete example is listed below.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the index of the next word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
    To fetch a pail of water\n
    Jack fell down and broke his crown\n
    And Jill came tumbling after\n """
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example again gets a good fit on the source text at around 95% accuracy.
...
Epoch 496/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 497/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 498/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 499/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 500/500
0s - loss: 0.0684 - acc: 0.9565
We look at four generation examples: two start-of-line cases and two starting mid-line.
Jack and jill went up the hill
And Jill went up the
fell down and broke his crown and
pail of water jack fell down and
The first start-of-line case generated correctly, but the second did not. The second case is an example from the fourth line, which is ambiguous with content from the first line. Perhaps a further expansion to three input words would be better.
The two mid-line generation examples were generated correctly, matching the source text.
We can see that the choice of how the language model is framed and the requirements on how the model will be used must be compatible. Careful design is required when using language models in general, perhaps followed up by spot-testing with sequence generation to confirm that the model requirements have been met.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire rhyme given the first word as a seed; demonstrate this (see the sketch after this list).
- Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
- Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.
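For the first extension, a minimal sketch of the data preparation is shown below. This is an assumption about one way to do it, reusing the tokenizer and source text from the tutorial; padding, model definition, and generation would then follow Model 2 unchanged.

# treat the whole rhyme as one long sequence of cumulative prefixes
encoded = tokenizer.texts_to_sequences([data])[0]
sequences = list()
for i in range(1, len(encoded)):
    # each sequence is the rhyme so far plus the next word to predict
    sequences.append(encoded[:i+1])
# pad, split into X and y, and fit exactly as in Model 2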
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Jack and Jill on Wikipedia
- Language Model on Wikipedia
- Keras Embedding Layer API
- Keras Text Processing API
- Keras Sequence Processing API
- Keras Utils API
Summary
In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.
Specifically, you learned:
- The challenge of developing a good framing of a word-based language model for a given application.
- How to develop one-word, two-word, and line-based framings for word-based language models.
- How to generate sequences using a fit language model.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Hi Jason – Thanks for this. How can a language model be used to “score” different text sentences? Suppose there is a speech recognition engine that outputs real words but they don’t make sense when combined together as a sentence. Could we use a language model to “score” each sentence to see which is more likely to occur? Thanks!
Great question.
Rather than score, the language model can take the raw input and predict the expected sequence or sequences and these outcomes can then be explored using a beam search.
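For the scoring idea itself, here is a rough sketch that is not from the post: sum the log-probability that the fitted Model 1 network (and its tokenizer) assigns to each next word, then compare candidate sentences by score. A beam search would explore the highest-scoring sequences in the same spirit.

from numpy import array, log

# score a sentence under the one-word-in language model (higher is more likely)
# note: words not in the training vocabulary are silently dropped by the tokenizer
def score_sentence(model, tokenizer, sentence):
    encoded = tokenizer.texts_to_sequences([sentence])[0]
    score = 0.0
    for i in range(1, len(encoded)):
        # probability distribution over the vocabulary given the previous word
        probs = model.predict(array([encoded[i-1:i]]), verbose=0)[0]
        score += log(probs[encoded[i]])
    return score

print(score_sentence(model, tokenizer, 'jack and jill went up the hill'))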
Thanks, I’d love to see an example of this as an appendix to this post. By the way – I really enjoy your blog, can’t thank you enough for these examples.
Thanks. I have a post on beam search scheduled.
Hello there, I'm trying to develop a next-word prediction model with a GUI using Python 3.x but I can't. Can anyone help me?
Thanks a lot!
Perhaps develop a language model and get it working standalone, then integrate it into your app later.
What does the second argument in Embedding mean?
Did I understand correctly that each word is encoded as a number from 0 to 10?
I created a network for predicting words with a large vocabulary, and the loss decreases too slowly, so I think I did something wrong.
Maybe that is how it should be, I don't know (char-by-char generation was a lot faster); I would be grateful for advice.
https://pastebin.com/PPWiuMXf
The second argument is the dimensionality of the embedding, the number of dimensions for the encoded vector representation of each word.
Common values are 50, 100, 500 or 1000.
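For example, with purely hypothetical numbers to illustrate the arguments:

from keras.models import Sequential
from keras.layers import Embedding

# a vocabulary of 1000 words, each mapped to a 100-dimensional vector
# (the second argument), with 5 words per input sequence
model = Sequential()
model.add(Embedding(1000, 100, input_length=5))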
How can I extract transcriptions from the TIMIT database in Python?
Sorry, I don’t have examples of working with the TIMIT dataset.
Thanks for the amazing post. A novice query – I have a large dataset of books and I want to train an LSTM on it. However, I am getting a MemoryError when I try to use the entire dataset for training at once. Is there a way to break up the data and train the model using the parts? Or do I have to throw hardware at the problem?
You can use progressive loading in Keras to only load or yield one batch of data at a time.
I have a post scheduled on this, but until then, read up on Keras data generators.
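Until then, a rough sketch of the idea, assuming X and y are prepared as in the tutorial (the generator and batch size here are illustrative, not from the post):

# yield one batch of samples at a time instead of holding everything in memory
def batch_generator(X, y, batch_size):
    while True:
        for i in range(0, len(X), batch_size):
            yield X[i:i+batch_size], y[i:i+batch_size]

# fit_generator() in older Keras; newer versions accept a generator directly in fit()
model.fit_generator(batch_generator(X, y, 32), steps_per_epoch=max(1, len(X)//32), epochs=500)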
Dear Jason,
Thank you very much for this post. I am trying to use your “Model 2: Line-by-Line Sequence” and scale it up to create an RNN language model. I have two questions about the way the data is represented:
1. Is there a more efficient way to train an Embedding+RNN language model than splitting up a single sentence into several instances, with a single word added at each step?
2. In this representation we need to feed part of the same sequence into the model over and over again. By presenting the words at the beginning of the sentence more often (as X), do we bias the model towards knowing sentence-initial-parts better than words occurring more frequently at the end of sentences?
Kind regards and thank you,
Christoph
I’d encourage you to explore alternate framings and see how they compare. There is no one true way.
It may bias the model, perhaps you could test this.
Hi Jason, what if you have multiple sentences to train in batches? In that case, your input would be 3 dimensional and the fit would return an error because the embedding layer only accepts 2 dimensions.
Is there an efficient way to deal with it other than sending the training set in batches with 1 sentence at a time?
I could of course act as if all words were part of 1 sentence but how would the LSTM detect the end of a sentence?
Thank you!
You could provide each sentence as a sample, group samples into a batch and the LSTM will reset states at the end of each batch.
Thank you for your reply Jason! I understand that the LSTM will reset states at the end of the batch, but shouldn't we make it reset states after each sentence/sample in each batch?
Perhaps. Try it and see if it lifts model skill.
I find it has much less effect than one would expect.
I am not able to do it as there will be a dimensionality issue preventing the Keras Embedding layer from giving correct output. If you have a workaround I would love to see your code.
Amazing post! But I was working on something which requires an RNN language model built without libraries. Can the Keras functionalities used in the code here be replaced with self-written code, and has someone already done this? Is there any GitHub repository for the same?
It would require a lot of work, re-implementing systems that already are fast and reliable. Sounds like a bad idea.
What is your motivation exactly?
Never mind, sir, I myself realized how bad an idea that is. Thank you for this amazing article though!
No problem.
How do I implement the same script to return all possible sentences for a particular context?
For example, if my dataset contains a list of places I visited:
I have visited India, I have visited USA, I have visited Germany ...
The above script returns the first possible match. How do I make the script return all the places?
Is it possible?
Sounds like you might be interested in entity extraction:
https://en.wikipedia.org/wiki/Named-entity_recognition
Awesome!!!
I would appreciate it if you could share ideas about how I can improve the model or the parameters to predict words from a larger text, say a novel. Would adding another LSTM layer (or more) be a good idea? Or is it enough to increase the size of the LSTM?
Thank you again for all your posts, very helpful
Good question.
I have general advice about tuning deep learning models here:
https://machinelearningmastery.com/improve-deep-learning-performance/
I have advice on best practices for model config here:
https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
How can we calculate cross_entropy and perplexity?
Keras can calculate cross entropy.
Sorry, I do not have an example of calculating perplexity.
Hi Mr. Jason how can I calculate the perplexity measure in this algorithm?.
Sorry, I don’t have an example of calculating perplexity.
Hi, I tried to save my model as:
# serialize model to JSON
model_json = model.to_json()
with open("new_model_OneinOneOut.json", "w") as json_file:
    json_file.write(model_json)
# serialize weights to HDF5
model.save_weights("weights_OneinOneOut.h5")
print("Saved model to disk")
But I couldn't load it and use it. How can I do that? Am I saving it right?
You must load the json and h5.
What problem did you have exactly?
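Loading is the mirror image of saving; a minimal sketch using the file names above, with the standard Keras model_from_json() and load_weights() calls:

from keras.models import model_from_json

# load the architecture, then the weights
with open("new_model_OneinOneOut.json", "r") as json_file:
    model = model_from_json(json_file.read())
model.load_weights("weights_OneinOneOut.h5")
# compile before using the model for training or evaluation
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])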
Hi,
You seem to use a one hot vector for the output vectors. This would be a huge problem in the case of a very large vocabulary size. What do you suggest we should do instead?
Not as big a problem as you would think, it does scale to 10K and 100K vocabs fine.
You can use search methods on the resulting probability vectors to get multiple different output sequences.
You can also use hierarchical versions of softmax to improve efficiency.
Hi Jason,
Thanks for the great post. I have two questions. The corpus I’m working with has sentences of varying lengths, some 3 words long and others 30 words long. I want to train a sentence based language model, i.e. training data should not try to combine two or more sentences from the corpus.
I’m slightly confused as to how to set up the training data. At the moment I have pre-padded with 0’s the shorter sentences so as to to match the size of the longest sentence. Example:
sentence : I like fish – this sentence would be split up as follows:
0 0 0 —-> I
00 I —-> like
0 I like —->fish
I like fish —->
This approach gives me roughly 110,000 training points, yet with an architecture an LSTM with 100 nodes my accuracy converges to 50%. Do you think I’ve incorrectly set up my data?
A second point is could you advise us how to combine pretrained word embeddings with an LSTM language model in keras.
Thanks
Padding is the way to go, then use a masking layer to ignore the zero padding.
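A small sketch of the masking idea, based on standard Keras usage rather than code from the post, and assuming vocab_size and max_length are prepared as in the tutorial:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# mask_zero=True tells downstream layers to ignore the 0 padding values
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1, mask_zero=True))
model.add(LSTM(100))
model.add(Dense(vocab_size, activation='softmax'))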
I have many examples of using pre-trained word embeddings, here is a good start:
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
Hi,
I was wondering why we need to use:
print(generate_seq(model, tokenizer, **max_length-1**, ‘Jack and’, 5))
instead of
print(generate_seq(model, tokenizer, **max_length**, ‘Jack and’, 5))
at test time. Without doing minus 1 it does not work indeed. Why is it the case?
Many thanks!
As explained in the post:
Hi Jason
Is it possible to use these models for punctuation or article prediction (an LSTM neural network, where the y, i.e. punctuation/article/something else, depends on a specific number of previous/next words)? What is your advice about this task?
Thank you!
Sure, an LSTM would be a great approach.
Do you make an X_train/X_test split for tasks like this? If there are words in the new text (X_test here) which were not tokenized by Keras for X_train, how do you deal with this (applying a trained model to text with new words)?
Yes. You need to ensure that your training dataset is representative of the problem, as in all ml problems.
Hello this is nice,
For a particular task I'm facing a problem.
For example, I have this data in my training set:
‘Hey jack are you going to College ‘
Now I have a sequence of text:
‘Hey jack are you…’
I have 2 options:
1. going
2. coming
I have to find the probability of the next words "going" and "coming". Obviously the probability of going is 1 and coming is zero.
How can I check the next-word probability for my options?
I would recommend a word-based language model that gives the probability for each word in the vocab for the next word in the sequence.
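For example, with a line-based model like Model 2 fit on your own sentences, the predicted distribution can be queried directly for each candidate word. This is a sketch, assuming the model, tokenizer, and max_length variables from the tutorial, and that the candidates appear in the training vocabulary:

from keras.preprocessing.sequence import pad_sequences

def next_word_probability(model, tokenizer, max_length, context, candidate):
    encoded = tokenizer.texts_to_sequences([context])[0]
    encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
    # full probability distribution over the vocabulary for the next word
    probs = model.predict(encoded, verbose=0)[0]
    return probs[tokenizer.word_index[candidate.lower()]]

print(next_word_probability(model, tokenizer, max_length-1, 'Hey jack are you', 'going'))
print(next_word_probability(model, tokenizer, max_length-1, 'Hey jack are you', 'coming'))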
Dear Dr. Jason, I have been followed your tutorial, and it is so interesting.
now, I have the following questions on the topic of OCR.
1. could you give me a simple example how to implement CNN + LSTM +CTC for scanned text image recognition( e.g if the scanned image is ” near the door” and the equivalent text is ‘near the door’ the how to give the image and the text for training?) please?
Sorry, I don’t have such a specific example.
Hi Jason,
I have two questions:
1. I have a project of next-word prediction, and I want to use your examples as the basis for it.
My data includes multiple documents. One approach I thought of is to concatenate all documents to one list of tokens (with beginning-of-sentence token), and then cut slices in fixed size as an input for the model. Second aproach is to work on each sentence separately using padding. Which approach would work better?
2. If I want to use this language model for other purposes later on, how does it work? Do I use it like pre-trained embedding (like word2vec for instance)? Do you have an example for it? How does the input look like? (for example in pre-trained embedding the input is a vector for each word)
Thank you
Perhaps try both approaches and see what works best for your data and model.
Yes, you could save the model weights and load them later and use them as part of an input or output language model.
Hi Jason,
Why are we converting the y to a one hot encoding (to_categorical)? Is it a must? Why don't we just leave it as an integer? I have a big vocabulary and it gives me a memory error.
And also, why do we add '+1' to the length of the word_index when creating the vocab_size?
Thanks a lot. The post is really helpful
So we can predict the probability of each word and choose the next word as the word with the highest probability.
It is not required, you could predict integers for words, but one hot encoding often works better.
I add +1 to make room for “0” which is “I don’t know” or “unknown”.
I had the same problem:
Instead of predicting integers, we can use the 'sparse_categorical_crossentropy' loss, and then we do not have to one-hot encode y, which saves you from having to deal with the memory error.
We sure can!
I don’t do this myself out of old habits I guess.
More on this here:
https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/
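In code, the change is small; a sketch based on the snippets above, assuming X and y are prepared as in the tutorial but with the to_categorical() step skipped:

# keep y as the integer word indices (no one hot encoding needed)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, verbose=2)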
What exactly is this for:
for word, index in tokenizer.word_index.items():
if index == yhat:
out_word = word
break
Are you looping over the dictionary here every time you made a prediction, to look up the word corresponding to the index? Why not just reverse the dictionary once and look up the value??
Yes. Yes, I’m sure there are more efficient ways to write this, perhaps you could share some?
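For reference, the reverse lookup only needs to be built once; a small sketch, assuming yhat comes from predict_classes() as in the tutorial:

# build the index -> word mapping once, then reuse it for every prediction
index_word = dict((index, word) for word, index in tokenizer.word_index.items())
out_word = index_word.get(yhat[0], '')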
Jason, I’ve been following an article at: https://towardsdatascience.com/natural-language-processing-with-tensorflow-e0a701ef5cef,
by Ashu Prasad. At one point, he does this (search for ‘We reverse the dictionary containing the encoded words using a helper function which facilitates us to plot the embeddings.’).
[I can’t print the code because it’s an image. ]
It relies on pulling the weights from the model; I’ve tried to duplicate it, but have failed.
If somebody can get it working, it’s probably what people are looking for here.
If you do, please let me know: [email protected]
Perhaps contact the authors of the article directly?
Hi Jason,
Thank you for the great article. I have 2 questions:
1. If I have the model trained and after that I need to add new words to it, what is the best way to do that without retraining from the beginning?
2. If I have trained the model with a wrong sentence, for example I used 'Hi Jason, hooo are you?' but the correct one is 'Hi Jason, how are you?', and I want to fix that without retraining from the beginning, what is the best way to do that kind of reinforcement learning?
The easiest way: mark the new words as “unknown”.
Another approach is to use the model weights as a starting point and re-train the model with a small learning rate and new/updated data.
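A sketch of that second approach; the file name, learning rate, and X_new/y_new (the updated training data) are hypothetical placeholders, not from the post:

from keras.optimizers import Adam

# start from the previously trained weights and continue training slowly on the new data
model.load_weights('existing_weights.h5')  # hypothetical file name
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.0001), metrics=['accuracy'])
model.fit(X_new, y_new, epochs=50, verbose=2)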
Hello Jason,
This was a very well done article thank you.
1. I was wondering, is there a way to generate text using an RNN/LSTM model without giving the 1st input word like you have in the generate_seq method, similar to the markovify package, specifically the make_sentence()/make_short_sentence(char_length) functions.
2. Also, would using word embeddings such as Word2Vec or GloVe embeddings allow us to use words not in the training corpus?
Yes, you can frame the problem any way you wish, e.g. feed one word and get a sentence or paragraph.
The model can only be trained on words in the training corpus. New words are marked “unknown”.
Hello,
I want to build an article recommendation system based on article titles and abstracts. How can I use language modeling to measure the similarity between a user profile and the articles?
Thank you
I don’t have an example of this. Perhaps the sum of the difference between the word vectors?
Jason, very good post! I'm making the same model to predict future words in a text, but I am faced with the problem of validation loss increasing. I split my data into train and test, and while the train loss is decreasing, the validation loss is increasing. So, I think it means overfitting. Even in your example, if we add the validation_split param to the fit method, we will see that the validation loss is increasing too. I think that's not OK. What is your opinion?
My best advice for diagnosing and improving a deep learning model is here:
https://machinelearningmastery.com/start-here/#better
Hello, Jason!
Thank you for such a detailed article. I have two questions:
1. What effect will changing COUNT in LSTM(units=COUNT) have on this word-prediction neural network?
2. Do I understand correctly that if I delete sequences with the same inputs and output, making a list with a unique set of sequences, it will reduce the number of patterns to be learned and will not affect the final result? (optimization of training time)
Good question, more nodes means the model has more capacity, I explain this here:
https://machinelearningmastery.com/how-to-control-neural-network-model-capacity-with-nodes-and-layers/
It may, sounds like a fun experiment Alex.
How can I extract a car VIN number from an image of the VIN that contains other information too?
Perhaps use classic computer vision techniques to isolate the text, then extract the text.
I think opencv in python might be a good place to start.
Instead of one prediction, how can I make it give a couple of predictions and allow the user to pick one among them?
Some ideas:
Perhaps you can sample the output probabilities in order to generate a few different outputs.
Perhaps you can try running the model a few times to get different outputs?
Perhaps you can train and use a few parallel models to get different outputs?
Hey,
If I want to predict the 3 most probable words after inputting two words, how do I change the code? This model generates the next word and considers the whole string for the next word prediction. Currently I'm working on making a keyboard out of this.
For example:
If I input “I read”, the model should generate words like “it”, “book” and “your”.
You could look at the probabilities for the next word, and select those 3 words with the highest probability.
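A sketch of that selection, assuming a line-based model fit on your own keyboard corpus, with the tokenizer and max_length variables as in the tutorial:

from numpy import argsort
from keras.preprocessing.sequence import pad_sequences

encoded = tokenizer.texts_to_sequences(['I read'])[0]
encoded = pad_sequences([encoded], maxlen=max_length-1, padding='pre')
# full probability distribution over the vocabulary for the next word
probs = model.predict(encoded, verbose=0)[0]
index_word = dict((index, word) for word, index in tokenizer.word_index.items())
# indices of the 3 most probable next words, highest first
top3 = argsort(probs)[-3:][::-1]
print([index_word.get(i, '?') for i in top3])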
Hi, it is really a good article; I have gone through each example and started liking it.
Would you please provide the syntax for a 'previous word' sequence which can be trained? Most of the examples I find on the web are next-word predictors. My requirement is to predict the previous word; you already mentioned using an LSTM, but it would help if you could provide an X, y sequence.
I don’t understand, sorry. Can you elaborate?
Hi, I was looking at Model 2:
X, y
_, _, _, _, _, Jack, and
_, _, _, _, Jack, and Jill
_, _, _, Jack, and, Jill, went
_, _, Jack, and, Jill, went, up
_, Jack, and, Jill, went, up, the
Jack, and, Jill, went, up, the, hill
————
sequences = list()
for line in data.split(‘\n’):
encoded = tokenizer.texts_to_sequences([line])[0]
for i in range(1, len(encoded)):
sequence = encoded[:i+1]
sequences.append(sequence)
—
The doubt I have here is: how can I write these to predict the "previous" word, where y becomes the previous word?
Will it work? Or how should it work?
X, y
Jack,and, Jill, went, up, the, hill newline
and, Jill, went, up, the, hill, _ Jack
Jill, went, up, the, hill, _, _ and
went, up, the, _ , _, _ Jill
up,the,_, _ , _, _ went
the,_, _ , _, _,_ up
No need to predict the previous word as it is already available.
If you want to learn how to predict a prior word given no other information, you can simply reverse the order of the input sequences when training.
Thanks for the help, Jason. I think that would help. Would you like to see my exact question?
The problem I am trying to solve is:
I have line:
Line1: Jack and Jill went up the hill
Line2: To fetch a pail of water
Line3: Jack fell down and broke his crown
Line4: And Jill came tumbling after
Now I want to rewrite line 4 with a word that rhymes with "water". In my case, "mother" would be the right word.
Line4 : And _, _ I love my mother
Or I want to change the word "tumbling"; what is the best fit at that position?
Line4: And Jill came "_" after
If I have to achieve that, I can reverse the line and train the model. And then I have to keep another model for next-word prediction.
I want to understand whether we have any built-in features in any layer/technique for both next- and prior-word prediction. I have not fully understood the LSTM; I just thought the LSTM can take care of remembering the previous word?
Thanks.
There are many ways to frame the problem.
A simple/naive way – that might work – would be to input the text as is and the output of the model is to predict the missing word or words directly.
Try that as a first step. Use a special token to represent missing words.
What is the vocabulary size if we use the Tokenizer with num_words?
If I use the Tokenizer with num_words:
tokenizer = Tokenizer(num_words=num_words, lower=True)
Now we have this line:
y = to_categorical(y, num_classes=vocab_size)
Should I call it with:
y = to_categorical(y, num_classes=num_words)
?
That's because the actual number of words should be smaller.
I have a vocabulary size of ~800K words and pad_sequences always gives a MemoryError. That's why I'm asking.
Thanks!
You might have the terms around the wrong way?
The vocab size will be much smaller than the number of words, as the number of words includes duplicates.
It is overkill to use an LSTM in the One-Word-In, One-Word-Out framing since no sequence is used (the length is 1).
We can use just a Flatten layer after Embedding and connect it to a Dense layer.
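For example, a minimal sketch of that alternative (not code from the post, and assuming vocab_size as in the tutorial):

from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

# no recurrence needed when the input is a single word
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(Flatten())
model.add(Dense(vocab_size, activation='softmax'))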
Sure. Think of the example as a starting point for your own projects.
The One-Word-In -> One-Word-Out implementation also creates the following 2-grams:
hill->to
from:
Jack and Jill went up the hill
To fetch a pail of water
Which is incorrect.
We need to create 2-grams per line.
Also, if the text is a paragraph we need to segment the paragraph in sentences and then do the 2-grams extraction for the dataset.
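A sketch of per-line pair creation, adapting the tutorial's line-based loop (assuming the same data and tokenizer):

# create 2-grams within each line only, so no pair spans a line break
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequences.append(encoded[i-1:i+1])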
Hi Jason,
Your write-up is pretty clean and understandable. I followed this article and created the next word/sequence prediction model. I am facing an issue with the outputs inferred via the model.
Example, if I feed to the model – “Where can I buy”, I get outputs – “Where can I buy a bicycle” & “Where can I buy spare parts for my bicycle”. These 2 are perfect.
I also get a couple of grammatically incorrect outputs – “Where can I buy of bicycle”, “Where can I buy went to bicycle”.
Do you have any ideas on how to filter out the grammatically incorrect outputs so that we are left with only good sentences in output? Thanks for your help.
FYI – Training Data Creation –
The approach I followed is trigrams in the input. For example, for the sentence "I am reading this article", I used the data below for training.
(I, am, reading) > (this)
(am, reading, this) > (article)
Thanks.
Not really, other than train a better model that makes fewer errors.
Hi Jason
Thanks for the informative tutorial.
I was wondering about this:
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
Should the single hidden LSTM layer's 50 units be equal to the length of the embedding layer, I mean the sequence input_length?
The size of the embedding and the number of units in the LSTM layer are not related.
Hi, how are you? Sorry, I have a query. I am using your example to predict the next word with a corpus of data ("sentences"), which I concatenate to form a single text before running the procedure. However, my network is not training: the accuracy starts at 0.04 and stays almost the same across epochs. I have checked everything and even the word processing is fine. I don't know what to do.
Perhaps these tutorials will help you improve performance:
https://machinelearningmastery.com/start-here/#better
I applied the above, but I still have problems with the model.
Hi,
How can I use the presented language model to correct the speech recognizer results?
The speech recognizer is a different topic, but you may consider that the recognizer is not recognizing one word but multiple candidate words with different probabilities. Normally you take the single word with the highest probability as the output, but with the language model you can instead base the output on the highest probability of the whole sequence, with the words before the current one taken into consideration.