How to Prepare Text Data for Deep Learning with Keras

You cannot feed raw text directly into deep learning models.

Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models.

The Keras deep learning library provides some basic tools to help you prepare your text data.

In this tutorial, you will discover how you can use Keras to prepare your text data.

After completing this tutorial, you will know:

  • About the convenience methods that you can use to quickly prepare text data.
  • The Tokenizer API that can be fit on training data and used to encode training, validation, and test documents.
  • The range of 4 different document encoding schemes offered by the Tokenizer API.

Let’s get started.

How to Prepare Text Data for Deep Learning with Keras
Photo by ActiveSteve, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Split Words with text_to_word_sequence
  2. Encoding with one_hot
  3. Hash Encoding with hashing_trick
  4. Tokenizer API

Need help with Deep Learning for Text Data?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Start Your FREE Crash-Course Now

Split Words with text_to_word_sequence

A good first step when working with text is to split it into words.

Words are called tokens and the process of splitting text into tokens is called tokenization.

Keras provides the text_to_word_sequence() function that you can use to split text into a list of words.

By default, this function automatically does 3 things:

  • Splits words by space (split=" ").
  • Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
  • Converts text to lowercase (lower=True).

You can change any of these defaults by passing arguments to the function.

Below is an example of using the text_to_word_sequence() function to split a document (in this case a simple string) into a list of words.
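
A minimal sketch might look like the following; the sample sentence is an arbitrary choice, and any short string behaves the same way.

from keras.preprocessing.text import text_to_word_sequence
# define a sample document (an arbitrary sentence)
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document into a list of words
result = text_to_word_sequence(text)
print(result)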

Running the example creates a list containing all of the words in the document. The list of words is printed for review.

This is a good first step, but further pre-processing is required before you can work with the text.

Encoding with one_hot

It is popular to represent a document as a sequence of integer values, where each word in the document is represented as a unique integer.

Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step. The name suggests that it will create a one-hot encoding of the document, which is not the case.

Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.

As with the text_to_word_sequence() function in the previous section, the one_hot() function will make the text lower case, filter out punctuation, and split words based on white space.

In addition to the text, the vocabulary size (total words) must be specified. This could be the total number of words in the document, or more if you intend to encode additional documents that contain additional words. The size of the vocabulary defines the hashing space from which words are hashed. Ideally, this should be larger than the vocabulary by some percentage (perhaps 25%) to minimize the number of collisions. By default, the 'hash' function is used, although as we will see in the next section, alternate hash functions can be specified when calling the hashing_trick() function directly.

We can use the text_to_word_sequence() function from the previous section to split the document into words and then use a set to represent only the unique words in the document. The size of this set can be used to estimate the size of the vocabulary for one document.

For example:
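
A minimal sketch, reusing the same arbitrary sample sentence:

from keras.preprocessing.text import text_to_word_sequence
# define a sample document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary from the set of unique words
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)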

We can put this together with the one_hot() function and one hot encode the words in the document. The complete example is listed below.

The vocabulary size is increased by one-third to minimize collisions when hashing words.
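
A minimal sketch of the complete example, again with an arbitrary sample sentence, might look like this:

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define a sample document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document; the hash space is made about a third larger than the vocabulary
result = one_hot(text, round(vocab_size * 1.3))
print(result)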

Running the example first prints the size of the vocabulary as 8. The encoded document is then printed as an array of integer encoded words.

Hash Encoding with hashing_trick

A limitation of integer and count-based encodings is that they must maintain a vocabulary of words and their mapping to integers.

An alternative to this approach is to use a one-way hash function to convert words to integers. This avoids the need to keep track of a vocabulary, which is faster and requires less memory.

Keras provides the hashing_trick() function that tokenizes and then integer encodes the document, just like the one_hot() function. It provides more flexibility, allowing you to specify the hash function as either 'hash' (the default) or other hash functions such as the built-in md5 function or your own function.

Below is an example of integer encoding a document using the md5 hash function.
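
A minimal sketch, once more with an arbitrary sample sentence:

from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define a sample document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document using the md5 hash function
result = hashing_trick(text, round(vocab_size * 1.3), hash_function='md5')
print(result)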

Running the example prints the size of the vocabulary and the integer encoded document.

We can see that the use of a different hash function results in consistent, but different, integers for words than the one_hot() function in the previous section.

Tokenizer API

So far we have looked at one-off convenience methods for preparing text with Keras.

Keras provides a more sophisticated API for preparing text that can be fit and reused to prepare multiple text documents. This may be the preferred approach for large projects.

Keras provides the Tokenizer class for preparing text documents for deep learning. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

For example:
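
A minimal sketch, assuming a handful of short made-up documents:

from keras.preprocessing.text import Tokenizer
# define 5 small documents (made up for illustration)
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)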

Once fit, the Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

  • word_counts: A dictionary of words and their counts.
  • word_docs: A dictionary of words and how many documents each appeared in.
  • word_index: A dictionary of words and their uniquely assigned integers.
  • document_count: An integer count of the total number of documents that were used to fit the Tokenizer.

For example:
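
Continuing the sketch above, the four attributes can be printed after fitting:

from keras.preprocessing.text import Tokenizer
# define 5 small documents (made up for illustration)
docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']
# create and fit the tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)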

Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets.

The texts_to_matrix() function on the Tokenizer can be used to create one vector per document provided as input. The length of the vectors is the total size of the vocabulary.

This function provides a suite of standard bag-of-words model text encoding schemes that can be provided via a mode argument to the function.

The modes available include:

  • 'binary': Whether or not each word is present in the document. This is the default.
  • 'count': The count of each word in the document.
  • 'tfidf': The Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document.
  • 'freq': The frequency of each word as a ratio of words within each document.

We can put all of this together with a worked example.
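
A minimal sketch of such a worked example, using the same made-up documents and the count mode, might look like this:

from keras.preprocessing.text import Tokenizer
# define 5 small documents (made up for illustration)
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# encode the documents as word-count vectors (bag-of-words)
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)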

Running the example fits the Tokenizer with 5 small documents. The details of the fit Tokenizer are printed. Then the 5 documents are encoded using a word count.

Each document is encoded as a 9-element vector: one position for each of the 8 words in the vocabulary, plus an unused position at index 0, which the Tokenizer reserves, with the chosen encoding scheme value in each word position. In this case, a simple word count mode is used.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered how you can use the Keras API to prepare your text data for deep learning.

Specifically, you learned:

  • About the convenience methods that you can use to quickly prepare text data.
  • The Tokenizer API that can be fit on training data and used to encode training, validation, and test documents.
  • The range of 4 different document encoding schemes offered by the Tokenizer API.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


Develop Deep Learning models for Text Data Today!

Deep Learning for Natural Language Processing

Develop Your Own Text models in Minutes

…with just a few lines of Python code

Discover how in my new Ebook:
Deep Learning for Natural Language Processing

It provides self-study tutorials on topics like:
Bag-of-Words, Word Embedding, Language Models, Caption Generation, Text Translation and much more…

Finally Bring Deep Learning to your Natural Language Processing Projects

Skip the Academics. Just Results.

Click to learn more.


19 Responses to How to Prepare Text Data for Deep Learning with Keras

  1. Chiedu October 2, 2017 at 6:40 am #

    Hi Jason,
    Do you have any plans to cover word embeddings using either word2vec or GloVe and how they work with Keras?

    • Jason Brownlee October 2, 2017 at 9:40 am #

      Yes! I have many posts on word embeddings scheduled for the coming days/weeks.

  2. Lalit Parihar October 6, 2017 at 6:59 pm #

    Hello Jason,
    It seems the attributes mentioned for Tokenizer have been typed incorrectly, document_count and word_docs have been inter-changed.

    Thanks,
    Lalit

    • Jason Brownlee October 7, 2017 at 5:52 am #

      Thanks Lalit, in which part of the tutorial exactly?

      • rahul January 15, 2018 at 3:58 pm #

        Once fit, the Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

  3. Gopika Bhardwaj October 16, 2017 at 3:07 am #

    How do we further apply a neural network on this data?

    • Jason Brownlee October 16, 2017 at 5:45 am #

      Great question, I will have many tutorials about how to do that coming out on the blog in coming weeks.

  4. Ray November 10, 2017 at 4:26 am #

    Hi Jason,
    Kudos for all your efforts in these tutorials. I have a question though: in a document classification task, say a csv file where each row contains text from a document, how does one realistically determine the appropriate vocabulary size and max_length for word embedding? Thank you

    • Jason Brownlee November 10, 2017 at 10:38 am #

      Great question!

      Generally, I would recommend testing different sizes/lengths and evaluate their impact on model skill – e.g. a sensitivity analysis.

  5. Manish November 11, 2017 at 10:19 am #

    Thanks for putting this together

  6. David Comfort November 30, 2017 at 5:50 am #

    Hi Jason,

    You have a typo(s) in the above paragraph. You repeat sentences.

    “Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step. The name suggests that it will create a one-hot encoding of the document, which is not the case. Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.

    Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.”

    • Jason Brownlee November 30, 2017 at 8:29 am #

      Fixed, thanks David. It’s my crappy spell check plugin duplicating paragraphs!

  7. Kandambeth December 15, 2017 at 1:24 pm #

    Thanks for the great blog Jason,

    I am trying to compare how hashing is done differently in scikit-learn and in Keras. With HashingVectorizer (scikit-learn) you would be able to convert a document (array of texts) into a matrix of n dimensions. However, when it comes to Keras it is more similar to CountVectorizer (scikit-learn). Isn't this fundamentally different?

    • Jason Brownlee December 15, 2017 at 3:35 pm #

      Hashing and counts are different. One uses a hash function to project the vocab into a smaller dimensional space, the other has one entry for each word and a count of the words next to it.

      Also, keras and sklearn can do both approaches and more.

      Try multiple methods on your dataset and use model skill to help choose the one that works best.

  8. Abdo Shalaby December 31, 2017 at 5:28 am #

    Thank you!

  9. Hasan January 18, 2018 at 5:22 am #

    Hi Jason,

    Thank you so much for your time and effort for preparing tutorials.
    I just have a question regarding this tutorial: I want to use http://archive.ics.uci.edu/ml/datasets/Adult as my dataset. As you can see, it has both strings and integers as input.
    My question is how to deal with this type of data and what is the best way to prepare my data for training?

    Thanks

    • Jason Brownlee January 18, 2018 at 10:14 am #

      The strings look like labels or categories.

      Perhaps you can encode the string values as integers and/or one hot encode.
