How to Develop a Word-Level Neural Language Model and Use it to Generate Text

A language model can predict the probability of the next word in the sequence, based on the words already observed in the sequence.

Neural network models are a preferred method for developing statistical language models because they can use a distributed representation where different words with similar meanings have similar representation and because they can use a large context of recently observed words when making predictions.

In this tutorial, you will discover how to develop a statistical language model using deep learning in Python.

After completing this tutorial, you will know:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Let’s get started.

  • Update Apr/2018: Fixed mismatch between 100 input words in description of the model and 50 in the actual model.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. The Republic by Plato
  2. Data Preparation
  3. Train Language Model
  4. Use Language Model

The Republic by Plato

The Republic is the classical Greek philosopher Plato’s most famous work.

It is structured as a dialog (i.e. a conversation) on the topic of order and justice within a city-state.

The entire text is available for free in the public domain. It is available on the Project Gutenberg website in a number of formats.

You can download the ASCII text version of the entire book (or books) here:

Download the book text and place it in your current working directory with the filename ‘republic.txt’.

Open the file in a text editor and delete the front and back matter. This includes details about the book at the beginning, a long analysis, and license information at the end.

The text should begin with:

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,

And end with:


And it shall be well with us both in this life and in the pilgrimage of a thousand years which we have been describing.

Save the cleaned version as ‘republic_clean.txt’ in your current working directory. The file should be about 15,802 lines of text.

Now we can develop a language model from this text.


Data Preparation

We will start by preparing the data for modeling.

The first step is to look at the data.

Review the Text

Open the text in an editor and just look at the text data.

For example, here is the first piece of dialog:

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.

I turned round, and asked him where his master was.

There he is, said the youth, coming after you, if you will only wait.

Certainly we will, said Glaucon; and in a few minutes Polemarchus
appeared, and with him Adeimantus, Glaucon’s brother, Niceratus the son
of Nicias, and several others who had been at the procession.

Polemarchus said to me: I perceive, Socrates, that you and your
companion are already on your way to the city.

You are not far wrong, I said.

What do you see that we will need to handle in preparing the data?

Here’s what I see from a quick look:

  • Book/Chapter headings (e.g. “BOOK I.”).
  • British English spelling (e.g. “honoured”).
  • Lots of punctuation (e.g. “–”, “;–”, “?–”, and more).
  • Strange names (e.g. “Polemarchus”).
  • Some long monologues that go on for hundreds of lines.
  • Some quoted dialog (e.g. ‘…’).

These observations, and more, suggest ways that we may wish to prepare the text data.

The specific way we prepare the data really depends on how we intend to model it, which in turn depends on how we intend to use it.

Language Model Design

In this tutorial, we will develop a model of the text that we can then use to generate new sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will then be fed back in as input to generate the next word in turn.

A key design decision is how long the input sequences should be. They need to be long enough to allow the model to learn the context for the words to predict. This input length will also define the length of seed text used to generate new sequences when we use the model.

There is no correct answer. With enough time and resources, we could explore the ability of the model to learn with differently sized input sequences.

Instead, we will pick a length of 50 words for the length of the input sequences, somewhat arbitrarily.

We could process the data so that the model only ever deals with self-contained sentences and pad or truncate the text to meet this requirement for each input sequence. You could explore this as an extension to this tutorial.

Instead, to keep the example brief, we will let all of the text flow together and train the model to predict the next word across sentences, paragraphs, and even books or chapters in the text.

Now that we have a model design, we can look at transforming the raw text into sequences of 50 input words to 1 output word, ready to fit a model.

Load Text

The first step is to load the text into memory.

We can develop a small function to load the entire text file into memory and return it. The function is called load_doc() and is listed below. Given a filename, it returns a sequence of loaded text.
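
A minimal version of this function might look like the following sketch:

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text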

Using this function, we can load the cleaned version of the document from the file ‘republic_clean.txt‘ as follows:
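
For example, assuming the file sits in the current working directory:

# load the cleaned document and inspect the start of it
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])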

Running this snippet loads the document and prints the first 200 characters as a sanity check.

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what

So far, so good. Next, let’s clean the text.

Clean Text

We need to transform the raw text into a sequence of tokens or words that we can use as a source to train the model.

Based on reviewing the raw text (above), below are some specific operations we will perform to clean the text. You may want to explore more cleaning operations yourself as an extension.

  • Replace ‘–‘ with a white space so we can split words better.
  • Split words based on white space.
  • Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
  • Remove all words that are not alphabetic to remove standalone punctuation tokens.
  • Normalize all words to lowercase to reduce the vocabulary size.

Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.

We can implement each of these cleaning operations in this order in a function. Below is the function clean_doc() that takes a loaded document as an argument and returns an array of clean tokens.
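
One possible implementation, assuming Python 3 (str.maketrans() is used to strip punctuation, and the double hyphen used as a dash in the Gutenberg text is replaced with a space):

import string

# turn a document into clean, lowercase tokens
def clean_doc(doc):
    # replace the '--' dashes with a space
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # normalize to lowercase
    tokens = [word.lower() for word in tokens]
    return tokens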

We can run this cleaning operation on our loaded document and print out some of the tokens and statistics as a sanity check.
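
For example:

# clean the loaded document and report some statistics
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))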

First, we can see a nice list of tokens that look cleaner than the raw text. We could remove the ‘Book I‘ chapter markers and more, but this is a good start.

We also get some statistics about the clean document.

We can see that there are just under 120,000 words in the clean text and a vocabulary of just under 7,500 words. This is smallish and models fit on this data should be manageable on modest hardware.

Next, we can look at shaping the tokens into sequences and saving them to file.

Save Clean Text

We can organize the long list of tokens into sequences of 50 input words and 1 output word.

That is, sequences of 51 words.

We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.

We will transform the tokens into space-separated strings for later storage in a file.

The code to split the list of clean tokens into sequences with a length of 51 tokens is listed below.
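
A sketch of this step:

# organize the tokens into overlapping sequences of 50 input words plus 1 output word
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select a window of 51 tokens ending at position i
    seq = tokens[i-length:i]
    # convert the window into a space-separated line
    line = ' '.join(seq)
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))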

Running this piece creates a long list of lines.

Printing statistics on the list, we can see that we will have exactly 118,633 training patterns to fit our model.

Next, we can save the sequences to a new file for later loading.

We can define a new function for saving lines of text to a file. This new function is called save_doc() and is listed below. It takes as input a list of lines and a filename. The lines are written, one per line, in ASCII format.
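
A simple version of this function could be:

# save lines of text to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()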

We can call this function and save our training sequences to the file ‘republic_sequences.txt‘.
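
For example:

# save the training sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)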

Take a look at the file with your text editor.

You will see that each line is shifted along one word, with a new word at the end to be predicted; for example, here are the first 3 lines in truncated form:

book i i … catch sight of
i i went … sight of us
i went down … of us from

Complete Example

Tying all of this together, the complete code listing is provided below.
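
Pulling the pieces above together, a complete data preparation script might look like this:

import string

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# turn a document into clean, lowercase tokens
def clean_doc(doc):
    doc = doc.replace('--', ' ')
    tokens = doc.split()
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    tokens = [word for word in tokens if word.isalpha()]
    tokens = [word.lower() for word in tokens]
    return tokens

# save lines of text to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load and clean the document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of 50 input words plus 1 output word
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    seq = tokens[i-length:i]
    line = ' '.join(seq)
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save the sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)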

You should now have training data stored in the file ‘republic_sequences.txt‘ in your current working directory.

Next, let’s look at how to fit a language model to this data.

Train Language Model

We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

  • It uses a distributed representation for words so that different words with similar meanings will have a similar representation.
  • It learns the representation at the same time as learning the model.
  • It learns to predict the probability of the next word using the context of the last 50 words.

Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.

Let’s start by loading our training data.

Load Sequences

We can load our training data using the load_doc() function we developed in the previous section.

Once loaded, we can split the data into separate training sequences by splitting based on new lines.

The snippet below will load the ‘republic_sequences.txt‘ data file from the current working directory.
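
Re-using the load_doc() function defined during data preparation, the snippet might look like:

# load the prepared training sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')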

Next, we can encode the training data.

Encode Sequences

The word embedding layer expects input sequences to be comprised of integers.

We can map each word in our vocabulary to a unique integer and encode our input sequences. Later, when we make predictions, we can convert the prediction to numbers and look up their associated words in the same mapping.

To do this encoding, we will use the Tokenizer class in the Keras API.

First, the Tokenizer must be trained on the entire training dataset, which means it finds all of the unique words in the data and assigns each a unique integer.

We can then use the fitted Tokenizer to encode all of the training sequences, converting each sequence from a list of words to a list of integers.
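
A sketch of the encoding step (the imports here assume standalone Keras; under TensorFlow 2 the same classes live under tensorflow.keras):

from keras.preprocessing.text import Tokenizer

# integer encode the sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)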

We can access the mapping of words to integers as a dictionary attribute called word_index on the Tokenizer object.

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (e.g. 7,409). The Embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 up to the largest index. Because array indexing is zero-offset, the word at the end of the vocabulary has index 7,409, which means the array must be 7,409 + 1 in length.

Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1 larger than the actual vocabulary.
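
For example:

# vocabulary size (+1 because word indexes start at 1)
vocab_size = len(tokenizer.word_index) + 1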

Sequence Inputs and Output

Now that we have encoded the input sequences, we need to separate them into input (X) and output (y) elements.

We can do this with array slicing.

After separating, we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 to indicate the specific word at the index of the word's integer value.

This is so that the model learns to predict the probability distribution for the next word, where the ground truth to learn from is 0 for all words except the actual word that comes next.

Keras provides the to_categorical() function that can be used to one hot encode the output words for each input-output sequence pair.

Finally, we need to specify to the Embedding layer how long input sequences are. We know that there are 50 words because we designed the model, but a good generic way to specify that is to use the second dimension (number of columns) of the input data’s shape. That way, if you change the length of sequences when preparing data, you do not need to change this data loading code; it is generic.
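
A sketch of these steps:

from numpy import array
from keras.utils import to_categorical

# separate each sequence into input words and the output word
sequences = array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
# one hot encode the output word
y = to_categorical(y, num_classes=vocab_size)
# input sequence length, taken from the data
seq_length = X.shape[1]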

Fit Model

We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input sequences as previously discussed. It also has a parameter to specify how many dimensions will be used to represent each word. That is, the size of the embedding vector space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or larger values.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to interpret the features extracted from the sequence. The output layer predicts the next word as a single vector the size of the vocabulary with a probability for each word in the vocabulary. A softmax activation function is used to ensure the outputs have the characteristics of normalized probabilities.

A summary of the defined network is printed as a sanity check to ensure we have constructed what we intended.
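
A model definition along these lines matches the architecture described above:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
# print a summary as a sanity check
model.summary()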

Next, the model is compiled, specifying the categorical cross entropy loss needed to fit the model. Technically, the model is learning a multi-class classification and this is the suitable loss function for this type of problem. The efficient Adam implementation of mini-batch gradient descent is used, and the accuracy of the model is evaluated during training.

Finally, the model is fit on the data for 100 training epochs with a modest batch size of 128 to speed things up.
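
For example:

# compile and fit the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=100)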

Training may take a few hours on modern hardware without GPUs. You can speed it up with a larger batch size and/or fewer training epochs.

During training, you will see a summary of performance, including the loss and accuracy evaluated on the training data at the end of each batch update.

You will get different results, but perhaps an accuracy of just over 50% at predicting the next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a model that memorized the text), but rather a model that captures the essence of the text.

Save Model

At the end of the run, the trained model is saved to file.

Here, we use the Keras model API to save the model to the file ‘model.h5‘ in the current working directory.

Later, when we load the model to make predictions, we will also need the mapping of words to integers. This is in the Tokenizer object, and we can save that too using Pickle.
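
A sketch of saving both artifacts (the tokenizer filename used here, ‘tokenizer.pkl‘, is arbitrary):

from pickle import dump

# save the model to file
model.save('model.h5')
# save the tokenizer so the same word-to-integer mapping can be reused later
dump(tokenizer, open('tokenizer.pkl', 'wb'))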

Complete Example

We can put all of this together; the complete example for fitting the language model is listed below.
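
Assembled from the snippets above, a complete training script might look like this:

from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# load the prepared training sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode the sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output, one hot encode the output word
sequences = array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()

# compile and fit the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=100)

# save the model and the tokenizer
model.save('model.h5')
dump(tokenizer, open('tokenizer.pkl', 'wb'))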

Use Language Model

Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical properties as the source text.

This is not practical, at least not for this example, but it gives a concrete example of what the language model has learned.

We will start by loading the training sequences again.

Load Data

We can use the same code from the previous section to load the training data sequences of text.

Specifically, the load_doc() function.

We need the text so that we can choose a source sequence as input to the model for generating a new sequence of text.

The model will require 50 words as input.

Later, we will need to specify the expected length of input. We can determine this from the input sequences by calculating the length of one line of the loaded data and subtracting 1 for the expected output word that is also on the same line.
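
For example, re-using the load_doc() function:

# load the prepared sequences and infer the input length
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1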

Load Model

We can now load the model from file.

Keras provides the load_model() function for loading the model, ready for use.

We can also load the tokenizer from file using the Pickle API.
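
A sketch of loading both, assuming the filenames used when saving:

from pickle import load
from keras.models import load_model

# load the trained model
model = load_model('model.h5')
# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))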

We are ready to use the loaded model.

Generate Text

The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we will print it so that we have some idea of what was used.
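
For example:

from random import randint

# pick a random line of the prepared text as the seed
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')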

Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used when training the model.

The model can predict the next word directly by calling model.predict_classes(), which will return the index of the word with the highest probability.

We can then look up the index in the Tokenizer's mapping to get the associated word.

We can then append this word to the seed text and repeat the process.

Importantly, as we append generated words, the input sequence will become too long. We can truncate it to the desired length after the input sequence has been encoded to integers. Keras provides the pad_sequences() function that we can use to perform this truncation.

We can wrap all of this into a function called generate_seq() that takes as input the model, the tokenizer, input sequence length, the seed text, and the number of words to generate. It then returns a sequence of words generated by the model.
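
A sketch of this function is below; note that predict_classes() is available on Sequential models in the Keras versions this tutorial targets, while newer releases require taking an argmax over model.predict() instead:

from keras.preprocessing.sequence import pad_sequences

# generate a sequence of new words given a seed text
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the current text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate to the fixed input length expected by the model
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the next word
        yhat = model.predict_classes(encoded, verbose=0)[0]
        # map the predicted index back to a word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append the word to the input for the next prediction
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)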

We are now ready to generate a sequence of new words given some seed text.
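
For example, generating 50 new words:

# generate 50 new words of text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)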

Putting this all together, the complete code listing for generating text from the learned language model is listed below.
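
A complete generation script assembled from the pieces above might look like this:

from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# generate a sequence of new words given a seed text
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    for _ in range(n_words):
        # encode the current text as integers and truncate to the fixed length
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict the index of the next word and map it back to a word
        yhat = model.predict_classes(encoded, verbose=0)[0]
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load the prepared sequences and infer the input length
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the trained model and the tokenizer
model = load_model('model.h5')
tokenizer = load(open('tokenizer.pkl', 'rb'))

# pick a random seed text and print it
seed_text = lines[randint(0, len(lines) - 1)]
print(seed_text + '\n')

# generate 50 new words and print them
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)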

Running the example first prints the seed text.

when he said that a man when he grows old may learn many things for he can no more learn much than he can run much youth is the time for any extraordinary toil of course and therefore calculation and geometry and all the other elements of instruction which are a

Then 50 words of generated text are printed.

preparation for dialectic should be presented to the name of idle spendthrifts of whom the other is the manifold and the unjust and is the best and the other which delighted to be the opening of the soul of the soul and the embroiderer will have to be said at

You will get different results. Try running the generation piece a few times.

You can see that the text seems reasonable. In fact, printing the seed and the generated text concatenated together would help in interpreting the output. Nevertheless, the generated text gets the right kind of words in the right kind of order.

Try running the example a few times to see other examples of generated text. Let me know in the comments below if you see anything interesting.

Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

  • Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed length (e.g. the longest sentence length).
  • Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words removed.
  • Tune Model. Tune the model, such as the size of the embedding or number of memory cells in the hidden layer, to see if you can develop a better model.
  • Deeper Model. Extend the model to have multiple LSTM hidden layers, perhaps with dropout to see if you can develop a better model.
  • Pre-Trained Word Embedding. Extend the model to use pre-trained word2vec or GloVe vectors to see if it results in a better model.


Summary

In this tutorial, you discovered how to develop a word-based language model using a word embedding and a recurrent neural network.

Specifically, you learned:

  • How to prepare text for developing a word-based language model.
  • How to design and fit a neural language model with a learned embedding and an LSTM hidden layer.
  • How to use the learned language model to generate new text with similar statistical properties as the source text.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


80 Responses to How to Develop a Word-Level Neural Language Model and Use it to Generate Text

  1. nike November 10, 2017 at 12:59 pm #

    Thank you for providing this blog. Have you used an RNN to build a recommender, e.g. to recommend movies from the sequence of movies a user has watched?

    • Jason Brownlee November 11, 2017 at 9:15 am #

      I have not used RNNs for recommender systems, sorry.

  2. Stuart November 11, 2017 at 6:01 am #

    Outstanding article, thank you! Two questions:

    1. How much benefit is gained by removing punctuation? Wouldn’t a few more punctuation-based tokens be a fairly trivial addition to several thousand word tokens?

    2. Based on your experience, which method would you say is better at generating meaningful text on modest hardware (the cheaper gpu AWS options) in a reasonable amount of time (within several hours): word-level or character-level generation?

    Also it would be great if you could include your hardware setup, python/keras versions, and how long it took to generate your example text output.

  3. Raoul November 12, 2017 at 11:55 pm #

    Great tutorial and thank you for sharing!

    Am I correct in assuming that the model always spits out the same output text given a specific seed text?

  4. Sarang November 13, 2017 at 1:15 am #

    Jason,

    Thanks for the excellent blogs, what do you think about the future of deep learning?
    Do you think deep learning is here to stay for another 10 years?

    • Jason Brownlee November 13, 2017 at 10:18 am #

      Yes. The results are too valuable across a wide set of domains.

  5. Roger January 3, 2018 at 8:40 pm #

    Hi Jason, great blog!
    Still, I got a question when running “model.add(LSTM(100, return_sequences=True))”.
    TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.
    Could you please help? Thanks.

    • Roger January 3, 2018 at 9:15 pm #

      by the way I am using os system, py3.5

      • Kirstie January 5, 2018 at 1:03 am #

        Hi Roger. I had the same issue; updating TensorFlow with pip install --upgrade tensorflow worked for me.

    • Jason Brownlee January 4, 2018 at 8:10 am #

      Sorry I have not seen this error before. Perhaps try posting to stackoverflow?

      It may be the version of your libraries? Ensure everything is up to date.

  6. Roger January 8, 2018 at 6:14 pm #

    Hi Jason,
    Is it possible to use your codes above as a language model instead of predicting next word?
    What I want is to judge “I am eating an apple” is more commonly used than “I an apple am eating”.
    For short sentence, may be I don’t have 50 words as input.
    Also, is it possible for Keras to output the probability with its index like the biggest probability for next word is “republic” and I want to get the index for “republic” which can be matched in tokenizer.word_index.
    Thanks!

    • Roger January 8, 2018 at 8:53 pm #

      do you have any suggestions if I want to use 3 previous words as input to predict next word? Thanks.

      • Jason Brownlee January 9, 2018 at 5:28 am #

        You can frame your problem, prepare the data and train the model.

        3 words as input might not be enough.

    • Jason Brownlee January 9, 2018 at 5:25 am #

      Not sure about your application sorry.

      Keras can predict probabilities across the vocabulary and you can use argmax() to get the index of the word with the largest probability.

  7. Cid February 14, 2018 at 3:47 am #

    Hey Jason,
    Thanks for the post. I noticed in the extensions part you mention Sentence-Wise Modelling.
    I understand the technique of padding (after reading your other blog post). But how does it incorporate a full stop when generating text. Is it a matter of post-processing the text? Could it be possible/more convenient to tokenize a full stop prior to embedding?

    • Jason Brownlee February 14, 2018 at 8:23 am #

      I generally recommend removing punctuation. It balloons the size of the vocab and in turn slows down modeling.

      • Cid February 14, 2018 at 8:00 pm #

        OK, thanks for the advice; how could I incorporate sentence structure into my model then?

        • Jason Brownlee February 15, 2018 at 8:40 am #

          Each sentence could be one “sample” or sequence of words as input.

  8. Maria February 16, 2018 at 10:32 pm #

    Hi Jason, I tried to use your model and train it with a corpus I had. Everything seemed to work fine, but at the end I have this error:
    34 sequences = array(sequences)
    35 #print(sequences)
    —> 36 X, y = sequences[:,:-1], sequences[:,-1]
    37 # print(sequences[:,:-1])
    38 # X, y = sequences[:-1], sequences[-1]

    IndexError: too many indices for array

    regarding the sliding of the sequences. Do you know how to fix it?
    Thanks so much!

    • Jason Brownlee February 17, 2018 at 8:45 am #

      Perhaps double check your loaded data has the shape that you expect?

      • Ray March 1, 2018 at 7:31 am #

        Hi Jason and Maria.
        I am having the exact same problem too.
        I was hoping Jason might have better suggestion
        here is the error:

        IndexError Traceback (most recent call last)
        in ()
        1 # separate into input and output
        2 sequences = array(sequences)
        —-> 3 X, y = sequences[:,:-1], sequences[:,-1]
        4 y = to_categorical(y, num_classes=vocab_size)
        5 seq_length = X.shape[1]

        IndexError: too many indices for array

        • Jason Brownlee March 1, 2018 at 3:05 pm #

          Are you able to confirm that your Python 3 environment is up to date, including Keras and tensorflow?

          For example, here are the current versions I’m running with:

          • Doron Ben Elazar March 20, 2018 at 8:14 am #

            It means that your input sequences are not all the same length, so np.array doesn't parse them properly (the author created sequences of 51 tokens each); a possible fix would be:

            original_sequences = tokenizer.texts_to_sequences(text_chunk)

            vocab_size = len(tokenizer.word_index) + 1

            aligned_sequences = []
            for sequence in original_sequences:
                aligned_sequence = np.zeros(max_len, dtype=np.int64)
                aligned_sequence[:len(sequence)] = np.array(sequence, dtype=np.int64)
                aligned_sequences.append(aligned_sequence)

            sequences = np.array(aligned_sequences)

        • Basil June 28, 2018 at 6:36 pm #

          Don’t know if it is too late to respond. The issue arises because you have by mistake typed tokenizer.fit_on_sequences instead of tokenizer.texts_to_sequences.

          Hope this helps others who come to this page in the future!

          Thanks Jason.

  9. vikas dixit March 5, 2018 at 9:26 pm #

    Sir, i have a vocabulary size of 12000 and when i use to_categorical system throws Memory Error as shown below:

    /usr/local/lib/python2.7/dist-packages/keras/utils/np_utils.pyc in to_categorical(y, num_classes)
    22 num_classes = np.max(y) + 1
    23 n = y.shape[0]
    —> 24 categorical = np.zeros((n, num_classes))
    25 categorical[np.arange(n), y] = 1
    26 return categorical

    MemoryError:

    How to solve this error??

    • Jason Brownlee March 6, 2018 at 6:12 am #

      Perhaps try running the code on a machine with more RAM, such as on S3?

      Perhaps try mapping the vocab to integers manually with more memory efficient code?

  10. Eli Mshomi March 7, 2018 at 9:40 am #

    Can you generate an article based on other related articles which would be human readable?

    • Jason Brownlee March 7, 2018 at 3:05 pm #

      Yes, I believe so. The model will have to be large and carefully trained.

  11. Adam March 12, 2018 at 9:52 pm #

    Hi Jason,
    I just tried out your code here with my own text sample (blog posts from a tumblr blog) and trained it and it’s now gotten to the point where text is no longer “generated”, but rather just sent back verbatim.
    The sample set I’m using is rather small (each individual post is fairly small, so I made each sequence 10 input 1 output) giving me around 25,000 sequences. Everything else was kept the same code from your tutorial – after around 150 epochs, the accuracy was around 0.99 so I cut it off to try generation.

    When I change the seed text from something to the sample to something else from the vocabulary (ie not a full line but a “random” line) then the text is fairly random which is what I wanted. When the seed text is changed to something outside of the vocabulary, the same text is generated each time.

    What should I do if I want something more random? Should I change something like the memory cells in the LSTM layers? Reduce the sequence length? Just stop training at a higher loss/lower accuracy?

    Thanks a ton for your tutorials, I’ve learned a lot through them.

    • Jason Brownlee March 13, 2018 at 6:28 am #

      Perhaps the model is overfit and could be trained less?

      Perhaps a larger dataset is required?

      • Adam March 13, 2018 at 9:17 am #

        I think it might just be overfit. Sadly, there isn’t more data that I can grab (at least that i know of currently) so I can’t grab much more data which sucks – that’s why I reduced the sequence length to 10.

        I checkpointed every epoch so that I can play around with what gives the best results. Thanks for your advice!

        • Jason Brownlee March 13, 2018 at 3:04 pm #

          Perhaps also try a blend/ensemble of some of the checkpointed models or models from multiple runs to see if it can give a small lift.

  12. Prateek March 30, 2018 at 4:05 pm #

    Hi Jason,

    I checked the accuracy and loss in TensorFlow. I would like to know what exactly you mean by accuracy in NLP?

    In computer vision, if we wish to predict cat and the predicted output of the model is cat, then we can say that the accuracy of the model is greater than 95%.

    I want to understand physically what do we mean by accuracy in NLP models. Can you please explain?

  13. Jin Zhou April 14, 2018 at 12:10 pm #

    Hi, I have a question about evaluating this model. As I know, perplexity is a very popular method. For BLEU and perplexity, which one do you think is better? Could you give an example about how to evaluate this model in keras.

  14. Jin April 23, 2018 at 7:11 pm #

    Hi Jason, I am a little confused, you told us that we need 100 words as input, but your X_train is only 50 words per line. Could you explain that a little?

  15. jason May 1, 2018 at 9:54 am #

    Jason,

    You have an embedding layer as part of the model. The embedding weights will be trained along with the other layers. Why not separate it out and train it independently? In other words, first use an Embedding alone to train the word vectors/embedding weights, then run your model with the embedding layer initialized with those weights and trainable=False. I see most people use your approach to train a model, but this is kind of against the purpose of embedding because the output is not word context but 0/1 labels. Why not replace the embedding with an ordinary layer with linear activation?

    Another jason

    • Jason Brownlee May 2, 2018 at 5:36 am #

      You can train them separately. I give examples on the blog.

      Often, I find the model has better skill when the embedding is trained with the net.

  16. johhnybravo May 19, 2018 at 2:33 am #

    ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array with shape (1,) while predicting output

  17. Nitin Mukesh May 30, 2018 at 6:33 pm #

    Hi, I want to develop Image Captioning in keras. What are the pre requisite for this? I have done your tutorials for object detection using CNN. What should I do next?

  18. mel_dagh June 1, 2018 at 1:59 am #

    Hi Jason,

    1.) Can we use this approach to predict if a word in a given sequence of the training data is highly odd..i.e. that it does not occur in that position of the sequence with high probability.

    2.) Is there a way to integrate pre-trained word embeddings (glove/word2vec) in the embedding layer?

  19. Tanaya June 8, 2018 at 7:29 pm #

    Hello, this is simply an amazing post for beginners in NLP like me.

    I have generated a language model, which is further generating text like I want it to.
    My question is, after this generation, how do I filter out all the text that does not make sense, syntactically or semantically?

    • Jason Brownlee June 9, 2018 at 6:49 am #

      Thanks.

      You might need another model to classify text, correct text, or filter text prior to using it.

      • Tanaya June 12, 2018 at 4:40 pm #

        Will this also be using Keras? Do you recommend using nltk or SpaCy?

  20. Venkat June 12, 2018 at 4:38 am #

    Hi Jason,

    Thanks for the amazing post.

    I’m working on word correction in a sentence. Ideally the model should generate a number of output words equal to the number of input words, with the erroneous word in the sentence corrected.

    Does a language model help with this? If it does, please leave some hints on the model.

    • Jason Brownlee June 12, 2018 at 6:48 am #

      Sounds like a fun project. A language model might be useful. Sorry, I don’t have a worked example. I recommend checking the literature.

  21. Amar June 20, 2018 at 8:53 pm #

    Hi Jason,

    I am working on a highly imbalanced text classification problem.
    Ex : Classify a “Genuine Email ” vs “SPAM ” based on text in body of email.

    Genuine email text = 30 k lines
    Spam email text = 1k lines

    I need to classify whether the next email is Genuine or SPAM.

    I train the model using the same example as above.
    The training data I feed to the model is only “Genuine text”.

    Will I be able to classify the next sentence I feed to the model, from the generated text and probability of words : as “GENUINE Email” vs “SPAM”?

    (I am assuming that the model has never seen SPAM data, and hence the probability of the generated text will be very less.)

    Do you see any way I can achieve this with Language model? What would be an alternative otherwise when it comes to rare event scenario for NLP use cases.

    Thank you!!

  22. musa June 29, 2018 at 2:36 am #

    Hi Jason,

    Could you comment on overfitting when training language models? I’ve built a sentence-based LSTM language model and have split training:validation into an 80:20 split. I’m not seeing any improvements on my validation data whilst the accuracy of the model seems to be improving.

    Thanks

    • Jason Brownlee June 29, 2018 at 6:13 am #

      Overfitting a language model really only matters in the context of the problem for which you are using the model.

      A language model used alone is not really that useful, so overfitting doesn’t matter. It may be purely a descriptive model rather than predictive.

      Often, they are used in the input or output of another model, in which case, overfitting may limit predictive skill.

  23. Marc June 30, 2018 at 8:42 am #

    What would the implications of returning the hidden and cell states be here?

    I’d imagine if the input sequence was long enough it wouldn’t matter too much as the temporal relationships would be captured, but if we had shorter sequences or really long documents we would consider doing this to improve the model’s ability to learn.. am I thinking about this correctly?

    What would the drawback be to returning the hidden sequence and cell state and piping that into the next observation?

    • Jason Brownlee July 1, 2018 at 6:21 am #

      I don’t follow, why would you return the cell state externally at all? It has no meaning outside of the network.

      • star July 5, 2018 at 2:25 am #

        Hope you are doing well. I have a question which returns to my understanding from embedding vectors.
        For example if I have this sentence “ the weather is nice” and the goal of my model is predicting “nice”, when I want to use pre trained google word embedding model, I must search embedding google matrix and find the embedding vector related to words “the” “weather” “is” “nice” and feed them as input to my model? Am I right?

  24. Fatemeh July 4, 2018 at 2:34 am #

    Hello, Thank you for nice description. I want to use pre-trained google word embedding vectors, So I think I don’t need to do sequence encoding, for example if I want to create sequences with the length 10, I have to search embedding matrix and find the equivalence embedding vector for each of the 10 words, right?

    • Jason Brownlee July 4, 2018 at 8:28 am #

      Correct. You must map each word to its distributed representation when preparing the embedding or the encoding.

      • Fatemeh July 5, 2018 at 2:43 am #

        Thank you, do you have sample code for that in your book? I purchased your professional package

        • Jason Brownlee July 5, 2018 at 8:01 am #

          All sample code is provided with the PDF in the code/ directory.

          • fatemeh July 6, 2018 at 1:57 am #

            Thank you. When I want to use model.fit, I have to specify X and y, and I have used the pre-trained Google embedding matrix, so every word has been mapped to a vector, and my inputs are actually the sentences (with length 4). Now I don’t understand the equivalent values for X. For example, imagine the first sentence is “the weather is nice”, so the X will be “the weather is” and the y is “nice”. When I want to convert X to integers, will every word in X be mapped to one vector? For example, if the equivalent vectors for the words in the sentence in the Google model are “the”=0.9,0.6,0.8 and “weather”=0.6,0.5,0.2 and “is”=0.3,0.1,0.5, and “nice”=0.4,0.3,0.5, will the input X be [[0.9,0.6,0.8],[0.6,0.5,0.2],[0.3,0.1,0.5]] and the output y be [0.4,0.3,0.5]?

          • Jason Brownlee July 6, 2018 at 6:45 am #

            You can specify or learn the mapping, but after that the model will map integers to their vectors.

  25. mh July 5, 2018 at 4:19 am #

    Hello Jason, why didn’t you convert the X input to one hot vector format? You only did this for the y output.

  26. Anshuman Mahapatra July 5, 2018 at 10:43 pm #

    Hi. First of all, I would like to thank you for the detailed explanation of the concept. I was executing this model step by step to get a better understanding, but am stuck while doing the predictions
    yhat = model.predict_classes(encoded, verbose=0)
    I am getting the below error:-
    ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got array with shape (1,)
