How to Prepare Text Data for Deep Learning with Keras

By Jason Brownlee on August 7, 2019 in Deep Learning for Natural Language Processing

You cannot feed raw text directly into deep learning models.

Text data must be encoded as numbers to be used as input or output for machine learning and deep learning models.

The Keras deep learning library provides some basic tools to help you prepare your text data.

In this tutorial, you will discover how you can use Keras to prepare your text data.

After completing this tutorial, you will know:

About the convenience methods that you can use to quickly prepare text data.
The Tokenizer API that can be fit on training data and used to encode training, validation, and test documents.
The range of 4 different document encoding schemes offered by the Tokenizer API.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

Split Words with text_to_word_sequence
Encoding with one_hot
Hash Encoding with hashing_trick
Tokenizer API

Split Words with text_to_word_sequence

A good first step when working with text is to split it into words.

Words are called tokens and the process of splitting text into tokens is called tokenization.

Keras provides the text_to_word_sequence() function that you can use to split text into a list of words.

By default, this function automatically does 3 things:

Splits words by space (split=" ").
Filters out punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n').
Converts text to lowercase (lower=True).

You can change any of these defaults by passing arguments to the function.

Below is an example of using the text_to_word_sequence() function to split a document (in this case a simple string) into a list of words.

from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# tokenize the document
result = text_to_word_sequence(text)
print(result)

Running the example creates a list containing all of the words in the document. The list of words is printed for review.

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

This is a good first step, but further pre-processing is required before you can work with the text.
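The three default steps listed above can be approximated in plain Python to make their effect concrete. The sketch below is not the Keras implementation; simple_word_sequence and DEFAULT_FILTERS are hypothetical names used only for illustration, and the parameters simply mirror the filters, lower, and split arguments described above.

```python
# Hypothetical helper: a pure-Python approximation of the three default steps,
# not the Keras implementation itself.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=' '):
    if lower:
        text = text.lower()
    # replace every filtered character with the split string
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # split and drop the empty strings left by repeated separators
    return [word for word in text.split(split) if word]


text = 'The quick brown fox jumped over the lazy dog.'
print(simple_word_sequence(text))               # defaults: lowercase, no punctuation
print(simple_word_sequence(text, lower=False))  # keep the original case
print(simple_word_sequence('red,green,blue', filters='', split=','))
```

Changing one argument at a time, as in the last two calls, is an easy way to see what each default contributes.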

Encoding with one_hot

It is popular to represent a document as a sequence of integer values, where each word in the document is represented as a unique integer.

Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step. The name suggests that it will create a one-hot encoding of the document, which is not the case.

Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.

As with the text_to_word_sequence() function in the previous section, the one_hot() function will make the text lower case, filter out punctuation, and split words based on white space.

In addition to the text, the vocabulary size (total words) must be specified. This could be the total number of words in the document, or more if you intend to encode additional documents that contain additional words. The size of the vocabulary defines the hashing space into which words are hashed. Ideally, this should be larger than the vocabulary by some percentage (perhaps 25%) to minimize the number of collisions. By default, the 'hash' function is used, although as we will see in the next section, alternate hash functions can be specified when calling the hashing_trick() function directly.
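To see why a larger hashing space reduces collisions, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the Keras code: md5_bucket is a hypothetical helper, and folding the hash into the range 1 to n-1 (leaving 0 unused) is an assumed convention.

```python
import hashlib


def md5_bucket(word, n):
    # stable hash of the word, folded into the range 1..n-1 (0 is left unused)
    digest = int(hashlib.md5(word.encode('utf-8')).hexdigest(), 16)
    return digest % (n - 1) + 1


# the 8 unique words of the example sentence
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']

collisions_by_n = {}
for n in (10, 25, 100, 1000):
    codes = [md5_bucket(w, n) for w in words]
    # a collision is any word that did not get a code of its own
    collisions_by_n[n] = len(words) - len(set(codes))
    print(n, codes, 'collisions:', collisions_by_n[n])
```

Running it with a few values of n gives a feel for how much headroom a given vocabulary needs before collisions become rare.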

We can use the text_to_word_sequence() function from the previous section to split the document into words and then use a set to represent only the unique words in the document. The size of this set can be used to estimate the size of the vocabulary for one document.

For example:

from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)

We can put this together with the one_hot() function and integer encode the words in the document. The complete example is listed below.

The vocabulary size is increased by 30% to minimize collisions when hashing words.

from keras.preprocessing.text import one_hot
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = one_hot(text, round(vocab_size*1.3))
print(result)

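Running the example first prints the size of the vocabulary as 8. The encoded document is then printed as a list of integer encoded words. Note that your specific integers may differ: the default hash function is Python's built-in hash(), which in Python 3 is randomized for strings between interpreter runs unless PYTHONHASHSEED is set.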

8
[5, 9, 8, 7, 9, 1, 5, 3, 8]
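For contrast with the hashed integer codes above, a true one-hot encoding gives every distinct word its own position in a binary vector, so no two words can collide. The sketch below is plain Python rather than a Keras call, and the variable names are illustrative only.

```python
words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

# build an explicit vocabulary: one index per distinct word
vocab = sorted(set(words))
index = {word: i for i, word in enumerate(vocab)}

# one row per word in the document, with a single 1 at the word's index
one_hot_rows = [[1 if index[word] == i else 0 for i in range(len(vocab))]
                for word in words]

for word, row in zip(words, one_hot_rows):
    print(word, row)
```

The price of uniqueness is that the word-to-index mapping has to be stored and reused, which is exactly what the hashing approach avoids.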

Hash Encoding with hashing_trick

A limitation of integer and count-based encodings is that they must maintain a vocabulary of words and their mapping to integers.

An alternative to this approach is to use a one-way hash function to convert words to integers. This avoids the need to keep track of a vocabulary, which is faster and requires less memory.

Keras provides the hashing_trick() function that tokenizes and then integer encodes the document, just like the one_hot() function. It provides more flexibility, allowing you to specify the hash function as either 'hash' (the default), the built-in 'md5' function, or your own function.

Below is an example of integer encoding a document using the md5 hash function.

from keras.preprocessing.text import hashing_trick
from keras.preprocessing.text import text_to_word_sequence
# define the document
text = 'The quick brown fox jumped over the lazy dog.'
# estimate the size of the vocabulary
words = set(text_to_word_sequence(text))
vocab_size = len(words)
print(vocab_size)
# integer encode the document
result = hashing_trick(text, round(vocab_size*1.3), hash_function='md5')
print(result)

Running the example prints the size of the vocabulary and the integer encoded document.

We can see that using a different hash function produces integers that are consistent for each word, but different from those produced by the one_hot() function in the previous section.

8
[6, 4, 1, 2, 7, 5, 6, 2, 6]
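The idea of supplying your own hash function can be sketched without Keras. In the example below, hash_encode and crc32_hash are hypothetical names, CRC32 is swapped in purely as an example of a stable hash, and the fold into 1 to n-1 is an assumed convention. Because a hash is one-way, the only way to see which words share a code is to group them yourself, which the second half of the sketch does.

```python
import zlib
from collections import defaultdict


def crc32_hash(word):
    # a stable hash: the same word gives the same integer on every run
    return zlib.crc32(word.encode('utf-8'))


def hash_encode(words, n, hash_function):
    # fold each word's hash into the range 1..n-1, keeping 0 free
    return [hash_function(word) % (n - 1) + 1 for word in words]


words = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
codes = hash_encode(words, 10, crc32_hash)
print(codes)

# group words by code to inspect which ones collide
groups = defaultdict(set)
for word, code in zip(words, codes):
    groups[code].add(word)
for code in sorted(groups):
    print(code, sorted(groups[code]))
```

Any code that lists more than one word in the grouping is a collision.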

Tokenizer API

So far we have looked at one-off convenience methods for preparing text with Keras.

Keras provides a more sophisticated API for preparing text that can be fit and reused to prepare multiple text documents. This may be the preferred approach for large projects.

Keras provides the Tokenizer class for preparing text documents for deep learning. The Tokenizer must be constructed and then fit on either raw text documents or integer encoded text documents.

For example:

from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)

Once fit, the Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:

word_counts: A dictionary of words and their counts.
word_docs: A dictionary of words and how many documents each appeared in.
word_index: A dictionary of words and their uniquely assigned integers.
document_count: An integer count of the total number of documents that were used to fit the Tokenizer.

For example:

# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
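To make the four attributes concrete, the sketch below computes equivalents from scratch in plain Python. It is an approximation for illustration, not the Tokenizer's code: tokenize is a hypothetical helper with simplified punctuation handling, and ranking words by descending count with ties kept in first-seen order is an assumption that happens to reproduce the word_index shown in this tutorial.

```python
import string
from collections import Counter, OrderedDict

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!']


def tokenize(doc):
    # simplification: lowercase, strip ASCII punctuation, split on whitespace
    return doc.lower().translate(str.maketrans('', '', string.punctuation)).split()


word_counts = OrderedDict()  # word -> total occurrences, in first-seen order
word_docs = Counter()        # word -> number of documents containing it
document_count = 0

for doc in docs:
    tokens = tokenize(doc)
    document_count += 1
    for token in tokens:
        word_counts[token] = word_counts.get(token, 0) + 1
    word_docs.update(set(tokens))

# most frequent word gets index 1; ties keep first-seen order (stable sort)
ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}

print(word_counts)
print(document_count)
print(word_index)
print(dict(word_docs))
```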

Once the Tokenizer has been fit on training data, it can be used to encode documents in the train or test datasets.

The texts_to_matrix() function on the Tokenizer can be used to create one vector per input document. The length of each vector is the size of the vocabulary plus one, because word indexes start at 1 and position 0 is reserved.

This function provides a suite of standard bag-of-words model text encoding schemes that can be provided via a mode argument to the function.

The modes available include:

'binary': Whether or not each word is present in the document. This is the default.
'count': The count of each word in the document.
'tfidf': The Term Frequency-Inverse Document Frequency (TF-IDF) scoring for each word in the document.
'freq': The frequency of each word as a ratio of words within each document.
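The four modes can be compared side by side on a single document with a small pure-Python sketch. The encode function is hypothetical, the word_index and word_docs values are those of the five short documents used in this section, and the TF-IDF weighting shown is one smoothed variant that I am assuming for illustration; the exact formula used by your version of Keras may differ.

```python
import math
from collections import Counter

# word -> index and word -> document frequency for the five example documents
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}
word_docs = {'work': 2, 'well': 1, 'done': 1, 'good': 1,
             'great': 1, 'effort': 1, 'nice': 1, 'excellent': 1}
document_count = 5


def encode(tokens, mode):
    # position 0 is reserved, so the vector has vocabulary size + 1 entries
    vector = [0.0] * (len(word_index) + 1)
    counts = Counter(t for t in tokens if t in word_index)
    total = sum(counts.values())
    for word, c in counts.items():
        j = word_index[word]
        if mode == 'binary':
            vector[j] = 1.0
        elif mode == 'count':
            vector[j] = float(c)
        elif mode == 'freq':
            vector[j] = c / total
        elif mode == 'tfidf':
            # assumed smoothed variant: log-scaled count times log-scaled rarity
            tf = 1 + math.log(c)
            idf = math.log(1 + document_count / (1 + word_docs[word]))
            vector[j] = tf * idf
        else:
            raise ValueError('unknown mode: ' + mode)
    return vector


tokens = ['nice', 'work', 'good', 'work']
for mode in ('binary', 'count', 'freq', 'tfidf'):
    print(mode, encode(tokens, mode))
```

Because 'work' appears twice in this document, the binary and count vectors differ at index 1, which makes the distinction between the modes easy to see.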

We can put all of this together with a worked example.

from keras.preprocessing.text import Tokenizer
# define 5 documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!']
# create the tokenizer
t = Tokenizer()
# fit the tokenizer on the documents
t.fit_on_texts(docs)
# summarize what was learned
print(t.word_counts)
print(t.document_count)
print(t.word_index)
print(t.word_docs)
# encode documents as word count vectors
encoded_docs = t.texts_to_matrix(docs, mode='count')
print(encoded_docs)

Running the example fits the Tokenizer with 5 small documents. The details of the fit Tokenizer are printed. Then the 5 documents are encoded using a word count.

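Each document is encoded as a 9-element vector: one position for each of the 8 words in the vocabulary, plus position 0, which is reserved and never assigned to a word. Each position holds the value of the chosen encoding scheme for that word. In this case, a simple word count mode is used.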

OrderedDict([('well', 1), ('done', 1), ('good', 1), ('work', 2), ('great', 1), ('effort', 1), ('nice', 1), ('excellent', 1)])
5
{'work': 1, 'effort': 6, 'done': 3, 'great': 5, 'good': 4, 'excellent': 8, 'well': 2, 'nice': 7}
{'work': 2, 'effort': 1, 'done': 1, 'well': 1, 'good': 1, 'great': 1, 'excellent': 1, 'nice': 1}
[[ 0.  0.  1.  1.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.]]
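Reusing a fitted mapping on new documents raises the question of words that were never seen during fitting. The sketch below shows two common ways to handle them in plain Python: drop the unknown word, or map it to a reserved index. encode_tokens and oov_index are hypothetical names, and which behavior you get from Keras depends on how the Tokenizer is configured, so treat this as an illustration of the idea rather than a description of the library.

```python
# mapping learned from the training documents only
word_index = {'work': 1, 'well': 2, 'done': 3, 'good': 4,
              'great': 5, 'effort': 6, 'nice': 7, 'excellent': 8}


def encode_tokens(tokens, word_index, oov_index=None):
    sequence = []
    for token in tokens:
        if token in word_index:
            sequence.append(word_index[token])
        elif oov_index is not None:
            # unknown word: use the reserved out-of-vocabulary index
            sequence.append(oov_index)
        # otherwise the unknown word is silently dropped
    return sequence


new_doc = ['fantastic', 'work', 'well', 'done']
dropped = encode_tokens(new_doc, word_index)
flagged = encode_tokens(new_doc, word_index, oov_index=0)
print(dropped)  # 'fantastic' is not in the mapping, so it is skipped
print(flagged)  # 'fantastic' is replaced by the reserved index 0
```

Either way, the mapping for known words never changes, which is what lets the same fitted vocabulary serve training, validation, and test documents.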

Summary

In this tutorial, you discovered how you can use the Keras API to prepare your text data for deep learning.

Specifically, you learned:

About the convenience methods that you can use to quickly prepare text data.
The Tokenizer API that can be fit on training data and used to encode training, validation, and test documents.
The range of 4 different document encoding schemes offered by the Tokenizer API.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

112 Responses to How to Prepare Text Data for Deep Learning with Keras

Chiedu October 2, 2017 at 6:40 am #

Hi Jason,
Do you have any plans to cover word embeddings using either word2vec or GloVe and how they work with Keras?

Reply
- Jason Brownlee October 2, 2017 at 9:40 am #
  
  Yes! I have many posts on word embeddings scheduled for the coming days/weeks.
  
  Reply
Lalit Parihar October 6, 2017 at 6:59 pm #

Hello Jason,
It seems the attributes mentioned for Tokenizer have been typed incorrectly, document_count and word_docs have been inter-changed.

Thanks,
Lalit

Reply
- Jason Brownlee October 7, 2017 at 5:52 am #
  
  Thanks Lalit, in which part of the tutorial exactly?
  
  Reply
  - rahul January 15, 2018 at 3:58 pm #
    
    Once fit, the Tokenizer provides 4 attributes that you can use to query what has been learned about your documents:
    
    Reply
Gopika Bhardwaj October 16, 2017 at 3:07 am #

How do we further apply a neural network on this data?

Reply
- Jason Brownlee October 16, 2017 at 5:45 am #
  
  Great question, I will have many tutorials about how to do that coming out on the blog in coming weeks.
  
  Reply
Ray November 10, 2017 at 4:26 am #

Hi Jason,
Kudos for all your efforts in these tutorials. I have a question though: in a document classification task, say with a CSV file where each row contains text from a document, how does one realistically determine the appropriate vocabulary size and max_length for word embedding? Thank you

Reply
- Jason Brownlee November 10, 2017 at 10:38 am #
  
  Great question!
  
  Generally, I would recommend testing different sizes/lengths and evaluating their impact on model skill – e.g. a sensitivity analysis.
  
  Reply
  - Kim-Ndor October 28, 2018 at 4:36 am #
    
    Hi Jason
    
    Is there any way to perform a sensitivity analysis using python/keras please? I have been looking for the code for a while now.
    
    Many thanks for your help.
    
    Kim
    
    Reply
    - Jason Brownlee October 28, 2018 at 6:14 am #
      
      Sure, pick a variable, then run your analysis.
      
      What is the problem that you’re having exactly?
      
      Reply
Manish November 11, 2017 at 10:19 am #

Thanks for putting this together

Reply
- Jason Brownlee November 12, 2017 at 8:59 am #
  
  You’re welcome.
  
  Reply
David Comfort November 30, 2017 at 5:50 am #

Hi Jason,

You have a typo(s) in the above paragraph. You repeat sentences.

“Keras provides the one_hot() function that you can use to tokenize and integer encode a text document in one step. The name suggests that it will create a one-hot encoding of the document, which is not the case. Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.

Instead, the function is a wrapper for the hashing_trick() function described in the next section. The function returns an integer encoded version of the document. The use of a hash function means that there may be collisions and not all words will be assigned unique integer values.”

Reply
- Jason Brownlee November 30, 2017 at 8:29 am #
  
  Fixed, thanks David. It’s my crappy spell check plugin duplicating paragraphs!
  
  Reply
Kandambeth December 15, 2017 at 1:24 pm #

Thanks for the great blog Jason,

I am trying to compare how hashing is done differently in sklearn and in Keras. With HashingVectorizer (scikit-learn) you would be able to convert a document (an array of texts) into a matrix of n dimensions. However, when it comes to Keras, it is more similar to CountVectorizer (scikit-learn). Isn't this fundamentally different?

Reply
- Jason Brownlee December 15, 2017 at 3:35 pm #
  
  Hashing and counts are different. One uses a hash function to project the vocab into a smaller dimensional space, the other has one entry for each word and a count of the words next to it.
  
  Also, keras and sklearn can do both approaches and more.
  
  Try multiple methods on your dataset and use model skill to help choose the one that works best.
  
  Reply
Abdo Shalaby December 31, 2017 at 5:28 am #

Thank you!

Reply
- Jason Brownlee January 1, 2018 at 5:25 am #
  
  You’re welcome.
  
  Reply
  - Kumari February 25, 2019 at 11:11 pm #
    
    Could you please do text summarization on small datasets using an autoencoder and decoder? Please provide the code, sir.
    
    Reply
    - Jason Brownlee February 26, 2019 at 6:22 am #
      
      Thanks for the suggestion.
      
      Reply
Hasan January 18, 2018 at 5:22 am #

Hi Jason,

Thank you so much for your time and effort for preparing tutorials.
I just have a question regarding this tutorial. I want to use http://archive.ics.uci.edu/ml/datasets/Adult as my dataset. As you can see, it has both strings and ints as input.
My question is how to deal with this type of data and what is the best way to prepare my data for training?

Thanks

Reply
- Jason Brownlee January 18, 2018 at 10:14 am #
  
  The strings look like labels or categories.
  
  Perhaps you can encode the string values as integers and/or one hot encode.
  
  Reply
lina January 26, 2018 at 4:13 pm #

so impressive post:)

I have a question.

If my test set include words which are not in train set, how can I handle it?
Does Keras embed OoV words to unknown vector implicitly?
If so, which function cover it?
Thank you!

Reply
- Jason Brownlee January 27, 2018 at 5:54 am #
  
  If you keep the tokenizer, they will be assigned a 0 value I would expect.
  
  Reply
Ashan March 16, 2018 at 11:11 pm #

Hi Jason,
Thank you for the nice tutorial. I have a question. why is (vocab_size*1.3) 1.3 and not an integer? you said vocab_size should be 25% larger than the vocabulary. So why is it not 8*125/100 = 10

Reply
- Jason Brownlee March 17, 2018 at 8:38 am #
  
  I specified 130% of the size of the vocab. It could have just as easily been 125.
  
  Reply
@xita March 20, 2018 at 3:16 am #

Sir, how can i input a txt file that consists of many paragraphs.

Reply
- Jason Brownlee March 20, 2018 at 6:29 am #
  
  What is the problem you are having exactly?
  
  Reply
  - @xita March 21, 2018 at 12:54 am #
    
    Sir, i’m inputting
    
    text = open('summary.text' , 'r')
    words = list(text.read().split)
    
    and while converting text_to_word_sequence i’m getting error
    
    AttributeError: 'list' object has no attribute 'lower'
    
    Reply
    - Jason Brownlee March 21, 2018 at 6:38 am #
      
      Sorry to hear that. Did you copy all of the code? Confirm Python3 and all libs are up to date?
      
      Reply
pj April 20, 2018 at 9:16 pm #

Thank you for an excellent tutorial which has helped me enormously.

Just to let you know you have a mistake in the text under the Tokenizer Api…
This bit here:

word_counts: A dictionary of words and their counts.
word_docs: An integer count of the total number of documents that were used to fit the Tokenizer.
word_index: A dictionary of words and their uniquely assigned integers.
document_count: A dictionary of words and how many documents each appeared in.

Think you have the explanations for ‘word_docs’ and the ‘document_count’ mixed up…

Reply
- Jason Brownlee April 21, 2018 at 6:48 am #
  
  Thanks, fixed.
  
  Reply
Dr. D April 24, 2018 at 5:37 am #

Jason,

Fantastic website. Lots of great information.

I have this unusual case where I have short sentences (think about the size of a tweet) so there are, at most, say 24 tokens. These sentences have to be categorized into 64 categories.

With on the order of 10E+06 sentences, I can’t really use a counting technique, so I am thinking about using the hashing trick on each token going into a zero-padded vector that maps to a one-hot encoding of the label:

[123. 456. 789. 0. 0. 0. 0.] => [0. 0. 0. 1. 0. 0.]

Is this at all a sound approach?

Reply
- Jason Brownlee April 24, 2018 at 6:37 am #
  
  Sounds like a good start!
  
  Let me know how you go.
  
  Reply
C.D. May 15, 2018 at 11:42 am #

Thanks for the information. Just a quick question – suppose I am done training this on training data, and I am making predictions using test data.

What if test data has words in text that do not appear in training data? I am dealing with review text data so there are going to be many words that will only be in training data. And same question for making new predictions – what if it encounters words that were not there in original training dataset?

Reply
- Jason Brownlee May 15, 2018 at 2:44 pm #
  
  Ideally you want to choose training data that covers your vocab.
  
  If test data contains new words, they will be zero’ed or ignored.
  
  Reply
Mayank Pal May 16, 2018 at 9:51 pm #

I want to build a text classification model. Where the network will predict Yes or No. based upon the input sentence. I can use the above approaches you mentioned to convert them to the real number but not sure how these will give me fixed-length vectors. For example. Okay and Let’s do it can relate to Yes. But if I use above approach then input vector won’t be fixed size. Can you suggest something?

Reply
- Jason Brownlee May 17, 2018 at 6:32 am #
  
  You can truncate or pad the input sequences of words.
  
  Reply
Sonman May 30, 2018 at 6:46 am #

Wonderful site to know NLP coding using python

Reply
- Jason Brownlee May 30, 2018 at 6:46 am #
  
  Thanks.
  
  Reply
Hal July 2, 2018 at 3:24 pm #

Hi,
thank you for the informative tutorial!
in the text ‘The quick brown fox jumped over the lazy dog.’, the only repeating word is ‘the’, but in the one_hot and hashing_trick examples, the tokenized output seems to repeat several words:

one hot:
[5, 9, 8, 7, 9, 1, 5, 3, 8]
Here 5 is used twice, corresponding to ‘the’ appearing twice, but 9 appears twice as well (quick, jumped), as does 8 (brown, dog)

hashing trick:
[6, 4, 1, 2, 7, 5, 6, 2, 6]
here 6 seems to represent (the, the, dog), and 2 is (fox, lazy)

Trying out the code for myself, I got similar results until I increased the vocab size to around 3x, which gave unique numbers to each word.

Am I misunderstanding something here, or is there an error in the examples?

Reply
- Jason Brownlee July 3, 2018 at 6:23 am #
  
  Yes, it is hashing rather than encoding.
  
  You could try a true one hot encoding instead if you like.
  
  Reply
- tejasvi May 24, 2020 at 5:00 pm #
  
  Even I am getting the same results. Different words are coded with the same integer in the encoding.
  
  Reply
  - Jason Brownlee May 25, 2020 at 5:45 am #
    
    Yes, it uses a hash encoding.
    
    Reply
Ashok Kumar J July 5, 2018 at 8:13 pm #

Dr. Jason Brownlee, Thank you for works.

Could you give an idea of the order in which indexing takes place? In the above example, print(t.word_index):

{'work': 1, 'effort': 6, 'done': 3, 'great': 5, 'good': 4, 'excellent': 8, 'well': 2, 'nice': 7}

Reply
- Jason Brownlee July 6, 2018 at 6:40 am #
  
  Order within the text I believe.
  
  Keras will use a hash function though, so it is nonlinear. I recommend using the Tokenizer.
  
  Reply
pranjal August 5, 2018 at 6:19 pm #

Hi Jason, I want to convert text to numbers in the form of their ranking by most-used words, like the IMDb dataset in Keras. Is it possible to do it, and will doing that give me better results?

Reply
- Jason Brownlee August 6, 2018 at 6:26 am #
  
  It may help in choosing the low frequency words to remove.
  
  Reply
Samira October 14, 2018 at 5:24 am #

Hi Jason
I’m interested in Arabic text summarization,
can I use keras to prepare Arabic text?

Reply
- Jason Brownlee October 14, 2018 at 6:05 am #
  
  I don’t see why not.
  
  Reply
Christian October 18, 2018 at 2:26 pm #

Hi, your blog is amazing. Thank you so much for doing it

Reply
- Jason Brownlee October 18, 2018 at 2:34 pm #
  
  Thanks!
  
  Reply
Virtee Parekh November 11, 2018 at 5:48 am #

Hi! Great blog. I have a question. While using fit_on_texts(docs), does docs have to be the training data or both train + test data?

Reply
- Jason Brownlee November 11, 2018 at 6:12 am #
  
  Typically just the training data is used.
  
  Reply
Sriram November 15, 2018 at 1:47 am #

Hi sir,

Could you please help me to convert an English word to Tamil?

Reply
- Jason Brownlee November 15, 2018 at 5:36 am #
  
  Sorry, I don’t have the capacity for new projects.
  
  Reply
Hussain Ravat November 25, 2018 at 6:33 pm #

man thanks a lot once again was breaking my head as I got tokenizer at character level instead of words. To correct it had to pass it as a list
t = Tokenizer()
t.fit_on_texts(['Hello world'])

instead of
t = Tokenizer()
t.fit_on_texts('Hello world')

Reply
- Jason Brownlee November 26, 2018 at 6:16 am #
  
  Perhaps the API has changed?
  
  Reply
Art December 5, 2018 at 7:11 pm #

Hi Jason, in the texts_to_matrix() example, there are 8 words in the vocabulary but the dimension of the generated vectors is 9. Why?

Reply
- Jason Brownlee December 6, 2018 at 5:52 am #
  
  Words start at 1 and an extra space is added for 0 or “unknown”.
  
  Reply
Saurabh December 31, 2018 at 5:34 pm #

How to view the vocabulary created by keras texts_to_matrix function?
encoded_docs = t.texts_to_matrix(docs, mode='tfidf')

Reply
- Jason Brownlee January 1, 2019 at 6:14 am #
  
  The tokenizer has a dictionary that can be viewed.
  
  See the section titled “Tokenizer API” for details.
  
  Reply
Bright Chang January 2, 2019 at 8:12 pm #

Thank you very much for your informative tutorial!

I am currently doing sentiment analysis of the Twitter project. Inspired by the work emoji2vec[1], I try to add the emoji embedding(which is a 100*1 vector) to the Keras Tokenizer. In this way, I could construct the embedding matrix which contains both word embedding and emoji embedding in sentiment analysis. The constructed embedding matrix could be used as weights in the downstream Embedding layer.

However, the Tokenizer is mostly built by given num_words argument, It is undoubtedly true that the frequency of words is much higher than emoji and if I set num_words=20000, not all the emojis are included. Hence, I think I need to add the emoji manually in the Keras Tokenizer API so as to construct the word-emoji embedding matrix. But is it possible in Keras?

[1] https://arxiv.org/abs/1609.08359

Reply
- Jason Brownlee January 3, 2019 at 6:13 am #
  
  Good question.
  
  Perhaps you can create your own approach to carefully map words/emoji to integers in your word and emoji embeddings.
  
  Reply
Enes February 7, 2019 at 1:38 am #

Hello Jason, thank you for the nice post.

I want to tokenize some text into a sequence of tokens and I’m using

tokenizer.fit_on_texts(text_corpus)
sequences = tokenizer.texts_to_sequences(text)

My question is: what is the best way to form this text_corpus? This text_corpus is like a dictionary, and which token corresponds to which word depends on it. I will get more text for tokenization in the future and I need, for example, the word “good” to have the same token every time. So for that, I always need to use the same text_corpus.
Btw, do you have some posts on text preprocessing like removing stop words, lemmatization, etc., and is it a good idea to do it before tokenization?

Reply
- Jason Brownlee February 7, 2019 at 6:42 am #
  
  The text corpus is the training data that should be representative of the problem.
  
  The Keras API tokenizer is not designed to be updated as far as I know. You may need to use a different API or develop your own tokenizer if you need to update it. Or you can refit the tokenizer and model in the future when new data becomes available.
  
  Reply
Anishka February 7, 2019 at 7:48 am #

Hi, I’m working on a text summarizer for an Indian language.
When I use the fit_on_texts function it gives me an attribute error: 'NoneType' object has no attribute 'lower'

tokenizer_outputs = Tokenizer(num_words=MAX_NUM_WORDS, filters='')
tokenizer_outputs.fit_on_texts(target_texts + target_texts_inputs)

Does it work only for English language?

Reply
- Jason Brownlee February 7, 2019 at 2:05 pm #
  
  Perhaps double check your version of Python is 3.5+?
  
  Reply
Anishka February 7, 2019 at 6:06 pm #

I’m using Python 3.6.5

Reply
Rahul February 22, 2019 at 10:02 am #

Hi Jason,

Is there any way to do one-hot-encoding without doing tokenization first?

I have a data set which has a column for location. This can contain multi-word string. e.g.-
JERSEY CITY, NEW JERSEY
ST. LOUIS, MISSOURI
MORRISVILLE, NORTH CAROLINA

I want to one-hot-encode these values. How should I do it using Keras?

Reply
- Jason Brownlee February 22, 2019 at 2:45 pm #
  
  Words must be converted to integers first, then vectors.
  
  You could hash or integer encode for the first step, then one hot encode, bag of words, or use an embedding for the second step.
  
  Reply
SKim April 3, 2019 at 6:39 pm #

Hi Jason, Thank you for great post!

It really helps me to understand preprocessing step for text data.

But I can not understand when ‘hashing trick’ is needed.

I think in most of NLP case, such as text classification, I should choose ‘Encoding’ to avoid collision.

Because if positive words and negative words are mapped to same number, there will be some scope to misclassify.

Why and when is ‘hashing trick’ or ‘one-hot’ needed?

Reply
- Jason Brownlee April 4, 2019 at 7:43 am #
  
  Yes, I think so too, but sometimes you don’t have the space/RAM to handle the whole vocab and you may need the hash trick to work around it.
  
  Additionally, the hash trick lets you seamlessly handle new words in the future.
  
  Reply
Anjali Bhavan April 6, 2019 at 3:27 pm #

Hi,
Great tutorial! I’m actually working on some text in a csv file which has some null/empty entries as well. What should I assign such empty entries so it can be processed further? Removal would not be an option. Would assigning them 'none' or 'empty' be a good choice?

Reply
- Jason Brownlee April 7, 2019 at 5:28 am #
  
  You can assign them a special word, like [MISSING], or assign them a value of 0 when mapping them to integers.
  
  Reply
Tony Gilpin May 24, 2019 at 1:34 am #

I have a dataset which has 289323 rows.
I have a column feature called InstanceDataId which has 25603 unique values.
What is the best way to deal with this one as one of 11 features?

Reply
- Jason Brownlee May 24, 2019 at 7:58 am #
  
  Perhaps compare removing it, integer encode, one hot encode and embedding and use the approach that gives the best model skill.
  
  Reply
Sreedevi June 19, 2019 at 6:41 pm #

Thanks Jason for the wonderful tutorials on the Machine Learning theme.
A question on the one-hot encoding. After (one-hot) encoding the text, is it possible via a keras API to get (& print) the mapping between the integer code and the original word. This is to verify if the Thanks.

Reply
- Jason Brownlee June 20, 2019 at 8:27 am #
  
  Yes, it is available in the Tokenizer.
  
  Reply
  - Sreedevi June 21, 2019 at 5:11 pm #
    
    Thanks Jason. I assume you are referring to Tokenizer.word_index. Wouldn’t that only work if I used Tokenizer.fit_on_texts? How would it work on the text encoded using one_hot function?
    
    Reply
    - Jason Brownlee June 22, 2019 at 6:34 am #
      
      You would need to one hot encode yourself or use the tokenizer to do it, or perhaps the sklearn implementation – something you can reuse/save and operate on text consistently in the future.
      
      Reply
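To make the point in this thread concrete, here is a minimal sketch of an encoding that keeps its own lookup table, so integer codes can be mapped back to words. It is plain Python and only imitates the idea behind a fitted tokenizer's word index; a hash-based encoding keeps no such table, which is why it cannot be inverted.

```python
docs = ["well done", "good work", "great effort", "nice work"]

# Build a word -> integer mapping (index 0 is left free, as is common for padding).
word_index = {}
for doc in docs:
    for word in doc.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1

# Invert it to recover the word behind each integer code.
index_word = {index: word for word, index in word_index.items()}

encoded = [[word_index[w] for w in doc.lower().split()] for doc in docs]
decoded = [" ".join(index_word[i] for i in seq) for seq in encoded]
print(encoded)
print(decoded)
```

Saving `word_index` alongside the model is what allows the same text to be encoded consistently later.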
Amey Chavan June 28, 2019 at 9:49 pm #

Hi Jason, thanks for your great work! I’m gonna enter my career in deep learning.
I have questions:
1. If we are working on a sentiment analysis model that has lots of textual review data, which method should we prefer: Tokenizer or hashing_trick? I think it would be better to use Tokenizer for the features of our model.
2. Which will give better performance, ‘Tokenizer’ or ‘hashing_trick’?
Thanks 🙂

Reply
- Jason Brownlee June 29, 2019 at 6:51 am #
  
  A word embedding would be best for text classification:
  https://machinelearningmastery.com/what-are-word-embeddings/
  
  If you want to try other methods, my best advice in general is to test and let the results guide you.
  
  Reply
Marek Swieton August 10, 2019 at 6:50 pm #

Hello Jason,

Thank you for your excellent tutorials!

May I ask you a question concerning the tokenizer.texts_to_sequences method? Does it assign a unique integer value to every token in the vocabulary? Do you have a more detailed post on this method? I am trying to create encoded text inputs for the embedding layer and I wonder whether it is reasonable to use this method to encode documents.

Reply
- Jason Brownlee August 11, 2019 at 5:56 am #
  
  Yes, fit the Tokenizer on your data, then calls to functions like texts_to_sequences() will transform text using the integer mapping.
  
  You can learn more about the API here:
  https://keras.io/preprocessing/text/
  
  The source code is also helpful:
  https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L139
  
  Reply
Saurab Gupta August 15, 2019 at 2:36 pm #

Hi Jason,

I am trying to classify a special instruction text clause from a multi-page document. Please guide me on how I should go about it.

Reply
- Jason Brownlee August 16, 2019 at 7:44 am #
  
  Perhaps some of the tutorials here will help you to get started:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
  - Saurab Gupta August 24, 2019 at 5:20 pm #
    
    Hi Jason,
    
    I already know the basics of deep learning and NLP, but I am looking for an implementation in which we can classify multi-line comments present in a document using deep learning.
    
    Reply
    - Jason Brownlee August 25, 2019 at 6:35 am #
      
      You can load multiple line comments as a document and classify them as needed.
      
      Reply
Sam August 16, 2019 at 1:09 am #

Hello Jason, great topic and also a nice tutorial:

I have a question:
I trained a word2vec model on my text data (texts) with min_count = 5, so I only have vectors for words that appear at least 5 times.
Now if I’m going to tokenize my text to feed it into a neural network:
Do I have to fit the tokenizer on the initial text data (texts)? And then use this also for the test data (texts_test)? So does it look like the following:

tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(texts)
seq = tokenizer_obj.texts_to_sequences(texts)
seq_test = tokenizer_obj.texts_to_sequences(texts_test)

1. Or do I have to fit a new tokenizer object for my test data?
2. Or do I have to fit the tokenizer object on the dictionary from my word2vec Model like:

dictionary = list(model.wv.vocab.keys())
dictionary

tokenizer_obj_dict = Tokenizer()
tokenizer_obj_dict.fit_on_texts(dictionary)
seq = tokenizer_obj_dict.texts_to_sequences(texts)

I read a lot but I’m getting more and more confused about this topic.

And then another, and to me the most important, question: once my network is trained, how do I tokenize the future data (texts_future) on which I want to run my predictions? Also with the same tokenizer object that I fitted on the training data in the “first” step? So:

seq_future_text = tokenizer_obj.texts_to_sequences(texts_future)

I hope that you can help me with this topic. That would be awesome!
Thanks a lot Jason.

Reply
- Jason Brownlee August 16, 2019 at 7:58 am #
  
  Yes, the first approach would be appropriate. Fit it on train, apply to test. That is how you would use the model in practice.
  
  Once you choose a config, fit the tokenizer and model on all data, save model and tokenizer (and other data prep procedures) and use it to prepare new data in the future. New data must be prepared identically to training data – because that is what the model will expect.
  
  Reply
  - Sam August 16, 2019 at 11:38 pm #
    
    Thanks a lot for the fast reply.
    Is there a reason why I should not fit it on the dictionary that I get from the Word2Vec?
    I thought that it would make sense as I only tokenize then the words that appear at least X times and the others get marked with an additional token.
    In other words, could you maybe explain why I have to fit it on the initial text data and not on the, let’s say, cleaned initial text (initial_text – words_that_appear_only_X-1_times).
    I hope the question is comprehensible.
    
    Reply
    - Jason Brownlee August 17, 2019 at 5:45 am #
      
      You can choose the vocab any way you wish.
      
      Choosing the vocab based on removing infrequent words is an excellent idea.
      
      Does that help?
      
      Reply
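The workflow described in this thread (fit on training data, drop infrequent words, reuse the same mapping for test and future data) might look roughly like the sketch below. It is plain Python, not the Keras Tokenizer; `fit_vocab`, `encode`, the reserved indices 0 and 1, and the use of JSON for saving are all assumptions made for illustration.

```python
import json
from collections import Counter


def fit_vocab(texts, min_count=1):
    """Build a word -> integer mapping from training texts only."""
    counts = Counter(word for text in texts for word in text.lower().split())
    kept = [w for w, c in counts.most_common() if c >= min_count]
    # 0 is reserved for padding, 1 for out-of-vocabulary words.
    return {word: i + 2 for i, word in enumerate(kept)}


def encode(texts, vocab, oov_index=1):
    """Encode texts with an existing mapping; unknown words get the OOV index."""
    return [[vocab.get(w, oov_index) for w in text.lower().split()] for text in texts]


texts = ["the cat sat", "the cat ran", "the dog sat"]
texts_test = ["the bird sat"]

vocab = fit_vocab(texts, min_count=2)  # fit on training data only
seq = encode(texts, vocab)
seq_test = encode(texts_test, vocab)   # reuse the same mapping for test data

# Persist the mapping so future data is prepared identically.
saved = json.dumps(vocab)
vocab_loaded = json.loads(saved)
seq_future = encode(["the cat flew"], vocab_loaded)
print(vocab, seq_test, seq_future)
```

Words below the frequency threshold and words never seen in training both map to the same out-of-vocabulary index, so the model sees a consistent input space at prediction time.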
obsa October 26, 2019 at 11:46 pm #

You share good concepts; how to apply deep learning in NLP is a hot topic! Thanks.
But how can I prepare large documents from my own corpus for text classification?

Reply
- Jason Brownlee October 27, 2019 at 5:45 am #
  
  Thanks.
  
  What problem are you having with large documents as opposed to small documents?
  
  Reply
Nikhil Rana January 1, 2020 at 12:39 am #

What is the role of .texts_to_sequences?

Reply
- Jason Brownlee January 1, 2020 at 6:33 am #
  
  Convert text to sequences of numbers. E.g. each word is assigned a unique number.
  
  Reply
Savi May 8, 2020 at 1:57 pm #

Hi Jason,
When I use a set of sentences of different sizes and apply text_to_word_sequence to each sentence, I end up with different vocab_lengths. Should I take the max or the average of these vocab_lengths, apply 130%, and use it for one hot encoding?

Reply
- Jason Brownlee May 8, 2020 at 3:56 pm #
  
  The vocab should be based on the entire training dataset.
  
  Reply
Savi May 8, 2020 at 5:15 pm #

Say X_train is the entire training set with, say, 1000 sentences/docs

X_unique_wordseq = [set(text_to_word_sequence(x)) for x in X_train]

word_seq_len = [len(x) for x in X_unique_wordseq]

vocab_size = ? # How do we get the vocab size?

[ vocab_size = np.max(word_seq_len) ] # Like this??

X_train_oneH = [one_hot(x, vocab_size) for x in X_train]

Reply
- Jason Brownlee May 9, 2020 at 6:10 am #
  
  The vocab size is the total number of unique words in the training dataset.
  
  The tutorials here will help to get you started:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
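One way to compute the vocabulary size described in the reply above is to pool the unique words from every training document into a single set, rather than taking the maximum or average per document. The sketch below is plain Python; `split_words` is a simplified, hypothetical stand-in for a word splitter and does not reproduce the exact filtering rules of text_to_word_sequence.

```python
import re


def split_words(text):
    """Lowercase the text and split it into words on non-alphanumeric characters."""
    return [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]


X_train = [
    "The quick brown fox.",
    "The lazy dog!",
    "A quick dog jumps over the lazy fox",
]

# Collect unique words across ALL training documents, not per document.
vocab = set()
for doc in X_train:
    vocab.update(split_words(doc))

vocab_size = len(vocab)
print(vocab_size)
```

If a hash-based encoder is used afterwards, choosing a size somewhat larger than this count may reduce collisions.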
mohammad June 23, 2020 at 6:44 pm #

Dear Jason,

Thanks for the post. Does the Tokenizer API provide a bag-of-words model or a word embedding?

Reply
- Jason Brownlee June 24, 2020 at 6:28 am #
  
  No, not quite. It does basic cleaning and assigns numbers to words.
  
  Reply
  - mohammad June 24, 2020 at 4:21 pm #
    
    I see, thanks. What about TF-IDF? Is it a bag-of-words model?
    
    Reply
    - Jason Brownlee June 25, 2020 at 6:12 am #
      
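      tf-idf is a word scoring scheme that can be used within a bag-of-words model: it replaces raw counts with scores that down-weight words that are common across documents.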
      
      Reply
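The difference between raw bag-of-words counts and tf-idf scores can be sketched in plain Python as below. The formula used here, count times log(N / document frequency), is one common textbook variant and is an assumption for this example; library implementations typically add smoothing and may produce different numbers.

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the dog barked"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: one count per vocabulary word, word order discarded.
counts = [[Counter(doc)[w] for w in vocab] for doc in tokenized]

# TF-IDF: scale each count by log(N / document frequency).
n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}
tfidf = [
    [tf * math.log(n_docs / doc_freq[w]) for tf, w in zip(row, vocab)]
    for row in counts
]
print(vocab)
print(counts)
print(tfidf)
```

Here "the" appears in every document, so its score drops to zero, while a rarer word such as "cat" scores higher than "sat" within the same document.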
Dipankar Porey August 20, 2020 at 12:25 am #

Hello Jason,
Is there any way to get the encoded string after using one_hot?

Reply
- Jason Brownlee August 20, 2020 at 6:46 am #
  
  Yes, you can encode a single string by calling:
  
  ...
  string = "abcd"
  result = encoder.transform([string])
  
  Reply
tajfar October 22, 2020 at 5:26 am #

Thank you, Jason I was looking for that, great explanation

Reply
- Jason Brownlee October 22, 2020 at 6:51 am #
  
  You’re welcome.
  
  Reply
Carol October 13, 2022 at 11:29 am #

Hi Jason, thank you very much for the tutorial!

I have a deep learning project to understand some behaviors in a particular database of comments, so I need to transform each word in my database into one specific number. When a word appears more than once, it should receive the same number, for example:

I love the musics that you placed in the game
1, 2, 3, 4, 5, 6, 7, 8, 3, 9

Can you please advise me whether Keras can do this? For the “one hot” sample that you placed, I didn’t understand whether it fits the use case I described.

Reply
- James Carmichael October 14, 2022 at 11:08 am #
  
  Hi Carol… Please clarify what you are trying to accomplish so that we may better assist you.
  
  Reply
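The encoding described in the question above, where a repeated word always receives the same integer, can be sketched in plain Python as follows. `integer_encode` is a hypothetical helper that numbers words in order of first appearance; a fitted Keras Tokenizer also gives repeated words one shared integer, but it orders indices by word frequency, so the actual numbers would differ.

```python
def integer_encode(text, word_index=None):
    """Assign each distinct word an integer in order of first appearance."""
    word_index = {} if word_index is None else word_index
    encoded = []
    for word in text.lower().split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1
        encoded.append(word_index[word])
    return encoded, word_index


encoded, word_index = integer_encode("I love the musics that you placed in the game")
print(encoded)  # [1, 2, 3, 4, 5, 6, 7, 8, 3, 9]
```

Passing the returned `word_index` back in for later sentences keeps the numbering consistent across the whole dataset.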
