How to Develop Word-Based Neural Language Models in Python with Keras

Language modeling involves predicting the next word in a sequence given the sequence of words already present.

A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.

In this tutorial, you will discover how the framing of a language model affects the skill of the model when generating short sequences from a nursery rhyme.

After completing this tutorial, you will know:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Let’s get started.

Photo by Stephanie Chapman, some rights reserved.

Tutorial Overview

This tutorial is divided into 5 parts; they are:

  1. Framing Language Modeling
  2. Jack and Jill Nursery Rhyme
  3. Model 1: One-Word-In, One-Word-Out Sequences
  4. Model 2: Line-by-Line Sequence
  5. Model 3: Two-Words-In, One-Word-Out Sequence


Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence.

Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input that are processed one at a time where a prediction can be made and learned for each input sequence.

Similarly, when making predictions, the process can be seeded with one or a few words; predicted words are then gathered and presented as input for subsequent predictions in order to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words.

There are many ways to frame the sequences from a source text for language modeling.

In this tutorial, we will explore 3 different ways of developing word-based language models in the Keras deep learning library.

There is no single best approach, just different framings that may suit different applications.

Jack and Jill Nursery Rhyme

Jack and Jill is a simple nursery rhyme.

It is comprised of 4 lines, as follows:

Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based language model.

We can define this text in Python as follows:
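As a minimal sketch, the rhyme can be held in a single string, with newline escapes separating the four lines (the variable name `data` is our choice):

```python
# source text: the four lines of the rhyme, separated by newlines
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
```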

Model 1: One-Word-In, One-Word-Out Sequences

We can start with a very simple model.

Given one word as input, the model will learn to predict the next word in the sequence.

For example: given the input ‘Jack‘, the model predicts ‘and‘; given ‘and‘, it predicts ‘Jill‘; and so on through the rhyme.

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the texts_to_sequences() function.

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding.

The size of the vocabulary can be retrieved from the trained Tokenizer by accessing the word_index attribute.
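A sketch of the encoding step, assuming the `tensorflow.keras` import path (the original tutorial used the standalone `keras` package):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""

# integer encode the text: fit the word-to-integer mapping, then convert
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# vocabulary size, plus one so the largest word integer is a valid array index
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```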

Running this example, we can see that the size of the vocabulary is 21 words.

We add one, because we will need to specify the integer for the largest encoded word as an array index, e.g. words encoded 1 to 21 require array indices 0 to 21, i.e. 22 positions.

Next, we need to create sequences of words to fit the model with one word as input and one word as output.
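A sketch of the pair construction; the encoding step is repeated here so the snippet runs on its own:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# encode the source text as integers, as described above
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# create one-word-in, one-word-out pairs: (word[i-1], word[i])
sequences = list()
for i in range(1, len(encoded)):
    sequences.append(encoded[i-1:i+1])
print('Total Sequences: %d' % len(sequences))
```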

Running this piece shows that we have a total of 24 input-output pairs to train the network.

We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding: a vector with a 0 for every word in the vocabulary and a 1 for the actual word that the integer value represents. This gives the network a ground truth to aim for, from which we can calculate error and update the model.

Keras provides the to_categorical() function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.
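A sketch of the split and the one hot encoding, again with the data preparation repeated for completeness:

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

# prepare the encoded word pairs as before
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
vocab_size = len(tokenizer.word_index) + 1
sequences = np.array([encoded[i-1:i+1] for i in range(1, len(encoded))])

# first column is the input word, second column is the word to predict
X, y = sequences[:, 0], sequences[:, 1]
# one hot encode the outputs across the whole vocabulary
y = to_categorical(y, num_classes=vocab_size)
```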

We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore the input_length=1.

The model has a single hidden LSTM layer with 50 units. This is far more than is needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.

The structure of the network can be summarized as follows:
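A sketch of the model definition. Note that the original tutorial also passed `input_length=1` to the `Embedding` layer; that argument is omitted here because recent Keras versions no longer accept it.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

vocab_size = 22  # 21 words plus one, from the tokenizer step

# one word in -> 10-dimensional embedding -> LSTM -> softmax over the vocabulary
model = Sequential()
model.add(Embedding(vocab_size, 10))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))

# run one dummy input through the model to build it, then print the structure
model.predict(np.zeros((1, 1)), verbose=0)
model.summary()
```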

We will use this same general network structure for each example in this tutorial, with minor changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are modeling a multi-class classification problem (predict the word in the vocabulary), therefore using the categorical cross entropy loss function. We use the efficient Adam implementation of gradient descent and track accuracy at the end of each epoch. The model is fit for 500 training epochs, again, perhaps more than is needed.

The network configuration was not tuned for this and later experiments; an over-prescribed configuration was chosen to ensure that we could focus on the framing of the language model.

After the model is fit, we test it by passing it a given word from the vocabulary and having the model predict the next word. Here we pass in ‘Jack‘ by encoding it and calling model.predict_classes() to get the integer output for the predicted word. This is then looked up in the vocabulary mapping to give the associated word.

This process could then be repeated a few times to build up a generated sequence of words.

To make this easier, we wrap up the behavior in a function that we can call by passing in our model and the seed word.
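A sketch of such a function. The original tutorial called `model.predict_classes()`, which was removed in later Keras releases; taking the argmax of `model.predict()` is the equivalent:

```python
import numpy as np

# generate a sequence of words by repeatedly predicting the next word
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    for _ in range(n_words):
        # integer encode the current input word, shaped as a batch of one
        encoded = np.array(tokenizer.texts_to_sequences([in_text])[0]).reshape(1, -1)
        # predict the integer index of the next word
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # map the predicted integer back to its word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # use the prediction as the next input, and append it to the result
        in_text, result = out_word, result + ' ' + out_word
    return result
```

For example, `generate_seq(model, tokenizer, 'Jack', 6)` would return ‘Jack‘ followed by six predicted words.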

We can tie all of this together. The complete program simply combines the steps above: defining the text, encoding it, building the input-output pairs, defining and fitting the model, and generating a sequence from a seed word.

Running the example prints the loss and accuracy each training epoch.

We can see that the model does not memorize the source sequences, likely because there is some ambiguity in the input sequences. For example, ‘Jack‘ is followed by ‘and‘ in the first line but by ‘fell‘ in the third line, so the same input word maps to two different outputs. And so on.

At the end of the run, ‘Jack‘ is passed in and a prediction or new sequence is generated.

We get a reasonable sequence as output that has some elements of the source.

This is a good first cut language model, but does not take full advantage of the LSTM’s ability to handle sequences of input and disambiguate some of the ambiguous pairwise sequences by using a broader context.

Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up.

For example: from the first line, the input ‘Jack‘ predicts ‘and‘, ‘Jack and‘ predicts ‘Jill‘, ‘Jack and Jill‘ predicts ‘went‘, and so on, with the input growing by one word each time.

This approach may allow the model to use the context of each line to help the model in those cases where a simple one-word-in-and-out model creates ambiguity.

This comes at the cost of not predicting words across line boundaries, which might be fine for now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will need to pad the sequences so that they all meet a fixed input length. This is a requirement of Keras.

First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.

Next, we can pad the prepared sequences. We can do this using the pad_sequences() function provided in Keras. This first involves finding the longest sequence, then using that as the length by which to pad-out all other sequences.
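A sketch of the line-by-line preparation, assuming the `pad_sequences()` helper from `tensorflow.keras.preprocessing.sequence`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# build growing sub-sequences from each line: first 2 words, first 3 words, ...
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequences.append(encoded[:i+1])
print('Total Sequences: %d' % len(sequences))

# find the longest sequence, then pre-pad all others with zeros to that length
max_length = max(len(seq) for seq in sequences)
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
```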

Next, we can split the sequences into input and output elements, much like before.

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

We can use the model to generate new sequences as before. The generate_seq() function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
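A sketch of the updated function; as before, the argmax of `model.predict()` stands in for the older `predict_classes()` call:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# generate a sequence from a model trained on fixed-length, pre-padded inputs
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # integer encode everything generated so far
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad to the fixed input length the model was trained with
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict the next word and look it up in the vocabulary
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append the prediction so it becomes part of the next input
        in_text += ' ' + out_word
    return in_text
```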

Tying all of this together, the complete example combines the line-by-line data preparation with the same model definition, training, and generation steps used previously.

Running the example achieves a better fit on the source data. The added context has allowed the model to disambiguate some of the examples.

There are still two lines of text that start with ‘Jack‘, which may be a problem for the network.

At the end of the run, we generate two sequences with different seed words: ‘Jack‘ and ‘Jill‘.

The first generated line looks good, directly matching the source text. The second is a bit strange. This makes sense, because the network only ever saw ‘Jill‘ within an input sequence, not at the beginning of the sequence, so it has forced an output to use the word ‘Jill‘, i.e. the last line of the rhyme.

This is a good example of how this framing may result in better new lines, but it does not handle partial lines of input well.

Model 3: Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in sub-sequences of words as input.

This will provide a trade-off between the two framings, allowing new lines to be generated and allowing generation to be picked up mid-line.

We will use two words as input to predict one word as output. The preparation of the sequences is much like the first example, except with different offsets in the source sequence arrays, as follows:
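A sketch of the two-words-in preparation; only the slice offsets change relative to Model 1:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after"""
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# build two-words-in, one-word-out triples: (word[i-2], word[i-1], word[i])
sequences = list()
for i in range(2, len(encoded)):
    sequences.append(encoded[i-2:i+1])
print('Total Sequences: %d' % len(sequences))
```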

The complete example follows the same structure as before: prepare the sequences, define and fit the model, and generate text from seed words.

Running the example again gets a good fit on the source text at around 95% accuracy.

We look at 4 generation examples, two start of line cases and two starting mid line.

The first start of line case generated correctly, but the second did not. The second case was an example from the 4th line, which is ambiguous with content from the first line. Perhaps a further expansion to 3 input words would be better.

The two mid-line generation examples were generated correctly, matching the source text.

We can see that the choice of how the language model is framed must be compatible with the requirements on how the model will be used. Careful design is required when using language models in general, perhaps followed up by spot testing with sequence generation to confirm that the model requirements have been met.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the first word as a seed; demonstrate this.
  • Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice.
  • Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how to develop different word-based language models for a simple nursery rhyme.

Specifically, you learned:

  • The challenge of developing a good framing of a word-based language model for a given application.
  • How to develop one-word, two-word, and line-based framings for word-based language models.
  • How to generate sequences using a fit language model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




43 Responses to How to Develop Word-Based Neural Language Models in Python with Keras

  1. Aaron November 3, 2017 at 5:46 am #

    Hi Jason – Thanks for this. How can a language model be used to “score” different text sentences. Suppose there is a speech recognition engine that outputs real words but they don’t make sense when combined together as a sentence. Could we use a language model to “score” each sentence to see which is more likely to occur? Thanks!

    • Jason Brownlee November 3, 2017 at 2:14 pm #

      Great question.

      Rather than score, the language model can take the raw input and predict the expected sequence or sequences and these outcomes can then be explored using a beam search.

      • Aaron November 4, 2017 at 1:02 am #

        Thanks, I’d love to see an example of this as an appendix to this post. By the way – I really enjoy your blog, can’t thank you enough for these examples.

        • Jason Brownlee November 4, 2017 at 5:32 am #

          Thanks. I have a post on beam search scheduled.

  2. had November 3, 2017 at 9:38 am #

    What does the second argument in Embedding mean?
    Did I understand correctly that each word is encoded as a number from 0 to 10?

    I created a network for predicting the words with a large number of words, the loss decreases too slowly, so I think I did something wrong.

    Maybe it should be, I don’t know (in by char generation it was a lot faster), I would be grateful for advice.

    https://pastebin.com/PPWiuMXf

    • Jason Brownlee November 3, 2017 at 2:16 pm #

      The second argument is the dimensionality of the embedding, the number of dimensions for the encoded vector representation of each word.

      Common values are 50, 100, 500 or 1000.

  3. Nadeem Pasha November 3, 2017 at 3:57 pm #

    How can I extract transcriptions from the TIMIT database in Python?

    • Jason Brownlee November 4, 2017 at 5:25 am #

      Sorry, I don’t have examples of working with the TIMIT dataset.

  4. Anubhab Majumdar November 9, 2017 at 2:17 am #

    Thanks for the amazing post. A novice query – I have a large dataset of books and I want to train a LSTM on that. However, I am getting memoryerror when I try to use the entire dataset for training at once. Is there a way to break up the data and train the model using the parts? Or do I have to throw hardware at the problem?

    • Jason Brownlee November 9, 2017 at 10:03 am #

      You can use progressive loading in Keras to only load or yield one batch of data at a time.

      I have a post scheduled on this, but until then, read up on Keras data generators.

  5. Christoph Aurnhammer November 23, 2017 at 5:39 am #

    Dear Jason,
    Thank you very much for this post. I am trying to use your “Model 2: Line-by-Line Sequence” and scale it up to create an RNN language model. I have two questions about the way the data is represented:

    1. Is there a more efficient way to train an Embedding+RNN language model than splitting up a single sentence into several instances, with a single word added at each step?

    2. In this representation we need to feed part of the same sequence into the model over and over again. By presenting the words at the beginning of the sentence more often (as X), do we bias the model towards knowing sentence-initial-parts better than words occurring more frequently at the end of sentences?

    Kind regards and thank you,
    Christoph

    • Jason Brownlee November 23, 2017 at 10:41 am #

      I’d encourage you to explore alternate framings and see how they compare. There is no one true way.

      It may bias the model, perhaps you could test this.

  6. Carl S January 26, 2018 at 9:30 am #

    Hi Jason, what if you have multiple sentences to train in batches? In that case, your input would be 3 dimensional and the fit would return an error because the embedding layer only accepts 2 dimensions.
    Is there an efficient way to deal with it other than send the training set in batches with 1 sentence at a time?

    I could of course act as if all words were part of 1 sentence but how would the LSTM detect the end of a sentence?

    Thank you!

    • Jason Brownlee January 27, 2018 at 5:49 am #

      You could provide each sentence as a sample, group samples into a batch and the LSTM will reset states at the end of each batch.

      • Carl S January 30, 2018 at 9:00 am #

        Thank you for your reply Jason! I understand that the LSTM will reset states at the end of the batch, but shouldn’t we make it reset states after each sentence/sample in each batch?

        • Jason Brownlee January 30, 2018 at 10:01 am #

          Perhaps. Try it and see if it lifts model skill.

          I find it has much less effect than one would expect.

          • Carl S January 31, 2018 at 5:12 am #

            I am not able to do it as there will be a dimensionality issue preventing the Keras Embedding layer from giving correct output. If you have a workaround I would love to see your code.

  7. Onjule March 19, 2018 at 5:18 am #

    Amazing post! But I was working on something which requires an rnn language model built without libraries. Can the Keras functionalities used in the code here be replaced with self-written code, and has someone already done this? Is there any Github repository for the same?

    • Jason Brownlee March 19, 2018 at 6:08 am #

      It would require a lot of work, re-implementing systems that already are fast and reliable. Sounds like a bad idea.

      What is your motivation exactly?

      • Anjali Bhavan March 25, 2018 at 8:12 pm #

        Never mind, sir, I myself realized how bad an idea that is. Thank you for this amazing article though!

  8. Dilip March 25, 2018 at 5:37 pm #

    How do i implement the same script to return me all possible sentences for a particular context.

    ex : If my data set contains a list of places i visited.

    I have visited India , I have visited USA,I have visited Germany ..

    The above script returns me the first possible match . how do i make the script return all the places ?

    is it possible ?

  9. Husam April 20, 2018 at 2:25 pm #

    Awesome!!!

    I appreciate if you can share ideas about how I can improve the model or the parameters to predict words form larger text, say a novel. Is adding another LSTM layer or more will be good idea? or is it enough to increase the size of LSTM?

    Thank you again for all your posts, very helpful

  10. Yasir Hussain May 20, 2018 at 2:54 pm #

    How can we calculate cross_entropy and perplexity?

    • Jason Brownlee May 21, 2018 at 6:26 am #

      Keras can calculate cross entropy.

      Sorry, I do not have an example of calculating perplexity.

  11. Baron May 21, 2018 at 7:56 am #

    Hi Mr. Jason how can I calculate the perplexity measure in this algorithm?.

    • Jason Brownlee May 21, 2018 at 2:29 pm #

      Sorry, I don’t have an example of calculating perplexity.

  12. Talat May 24, 2018 at 5:55 pm #

    Hi, I tried to save my model as:

    # serialize model to JSON
    model_json = model.to_json()
    with open("new_model_OneinOneOut.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights("weights_OneinOneOut.h5")
    print("Saved model to disk")

    But I couldn’t load it and use it. How can I do that? Am I saving it right?

    • Jason Brownlee May 25, 2018 at 9:20 am #

      You must load the json and h5.

      What problem did you have exactly?

  13. Raghav May 31, 2018 at 9:07 pm #

    Hi,

    You seem to use one hot vector for the output vectors. This would be a huge problem in case of a very large vocabulary size. What do you suggest we should do instead?

    • Jason Brownlee June 1, 2018 at 8:20 am #

      Not as big a problem as you would think, it does scale to 10K and 100K vocabs fine.

      You can use search methods on the resulting probability vectors to get multiple different output sequences.

      You can also use hierarchical versions of softmax to improve efficiency.

  14. Jamil June 26, 2018 at 12:25 am #

    Hi Jason,

    Thanks for the great post. I have two questions. The corpus I’m working with has sentences of varying lengths, some 3 words long and others 30 words long. I want to train a sentence based language model, i.e. training data should not try to combine two or more sentences from the corpus.

    I’m slightly confused as to how to set up the training data. At the moment I have pre-padded the shorter sentences with 0’s so as to match the size of the longest sentence. Example:

    sentence : I like fish – this sentence would be split up as follows:

    0 0 0 ---> I
    0 0 I ---> like
    0 I like ---> fish
    I like fish --->

    This approach gives me roughly 110,000 training points, yet with an LSTM architecture with 100 nodes my accuracy converges to 50%. Do you think I’ve incorrectly set up my data?

    A second point is could you advise us how to combine pretrained word embeddings with an LSTM language model in keras.

    Thanks

  15. Hoang Cuong August 10, 2018 at 11:42 am #

    Hi,

    I was wondering why we need to use:

    print(generate_seq(model, tokenizer, **max_length-1**, ‘Jack and’, 5))

    instead of

    print(generate_seq(model, tokenizer, **max_length**, ‘Jack and’, 5))

    at test time. Without doing minus 1 it does not work indeed. Why is it the case?

    Many thanks!

    • Jason Brownlee August 10, 2018 at 2:18 pm #

      As explained in the post:

      The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are max_length-1 in length, -1 because when we calculated the maximum length of sequences, they included the input and output elements.

  16. Nikolas August 30, 2018 at 6:59 pm #

    Hi Jason

    Is it possible to use these models for punctuation or article prediction (an LSTM neural network where the y, i.e. the punctuation/article/something else, depends on a specific number of previous/next words)? What is your advice about this task?

    Thank you!

    • Jason Brownlee August 31, 2018 at 8:09 am #

      Sure, an LSTM would be a great approach.

      • Nikolas August 31, 2018 at 6:01 pm #

        Do you make X_test X_train split for tasks like this? If there will be a words in the new text (X_test here) which are not tokenized in keras for X_train, how to deal with this (applying a trained model for text with new words)?

        • Jason Brownlee September 1, 2018 at 6:16 am #

          Yes. You need to ensure that your training dataset is representative of the problem, as in all ml problems.

  17. bhb October 4, 2018 at 12:29 am #

    Dear Dr. Jason, I have been followed your tutorial, and it is so interesting.
    now, I have the following questions on the topic of OCR.
    1. Could you give me a simple example of how to implement CNN + LSTM + CTC for scanned text image recognition? (e.g. if the scanned image is “near the door” and the equivalent text is ‘near the door’, how do I give the image and the text for training?)
