How to Prepare Text Data for Machine Learning with scikit-learn

Text data requires special preparation before you can start using it for predictive modeling.

The text must be parsed to extract words, a process called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a process called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.

After completing this tutorial, you will know:

  • How to convert text to word count vectors with CountVectorizer.
  • How to convert text to word frequency vectors with TfidfVectorizer.
  • How to convert text to fixed-length vectors of hashed words with HashingVectorizer.

Let’s get started.

Photo by Martin Kelly, some rights reserved.

Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

The scikit-learn library provides 3 different schemes that we can use, and we will briefly look at each.


Word Counts with CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

  1. Create an instance of the CountVectorizer class.
  2. Call the fit() function in order to learn a vocabulary from one or more documents.
  3. Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length equal to the size of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. The scipy.sparse package provides an efficient way of handling sparse vectors in Python.

The vectors returned from a call to transform() will be sparse vectors, and you can convert them back to NumPy arrays to inspect them and better understand what is going on by calling the toarray() function.

Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
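As a minimal sketch (the sample sentence here is an assumption, chosen to be consistent with the 8-word vocabulary and the counts discussed next):

```python
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build the vocabulary
vectorizer.fit(text)
# summarize the learned vocabulary
print(vectorizer.vocabulary_)
# encode the document
vector = vectorizer.transform(text)
# summarize the encoded vector
print(vector.shape)   # (1, 8)
print(type(vector))   # a scipy.sparse matrix
print(vector.toarray())
```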

We can access the learned vocabulary to see exactly what was tokenized by reading the vocabulary_ attribute of the vectorizer.

We can see that all words were made lowercase by default and that the punctuation was ignored. These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.

Running the example first prints the vocabulary, then the shape of the encoded document. We can see that there are 8 words in the vocab, and therefore encoded vectors have a length of 8.

We can then see that the encoded vector is a sparse matrix. Finally, we can see an array version of the encoded vector showing a count of 1 occurrence for each word except "the" (index 7), which has an occurrence of 2.

Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

For example, below the same vectorizer is used to encode a document that contains one word in the vocabulary and one word that is not.
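A self-contained sketch of this (the out-of-vocabulary document "the puppy" is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build the vocabulary from the original document
vectorizer = CountVectorizer()
vectorizer.fit(["The quick brown fox jumped over the lazy dog."])
# encode another document: "the" is in the vocabulary, "puppy" is not
vector = vectorizer.transform(["the puppy"])
# only the known word "the" (index 7) is counted; "puppy" is ignored
print(vector.toarray())  # [[0 0 0 0 0 0 0 1]]
```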

Running this example prints the array version of the encoded sparse vector showing one occurrence of the one word in the vocab and the other word not in the vocab completely ignored.

The encoded vectors can then be used directly with a machine learning algorithm.

Word Frequencies with TfidfVectorizer

Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for "Term Frequency – Inverse Document Frequency", which are the components of the resulting scores assigned to each word.

  • Term Frequency: This summarizes how often a given word appears within a document.
  • Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
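A sketch of that alternate route (the three documents here are assumptions, consistent with the example discussed below):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# learn the vocabulary and compute raw counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(text)
# learn the inverse document frequencies from the counts
transformer = TfidfTransformer()
transformer.fit(counts)
# encode the first document's counts as TF-IDF scores
vector = transformer.transform(counts[0])
print(vector.toarray())
```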

The same create, fit, and transform process is used as with the CountVectorizer.

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.
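A minimal sketch (the three documents are assumptions, chosen to be consistent with the vocabulary and scores discussed next):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize, build the vocabulary, and learn the idf weightings
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode the first document
vector = vectorizer.transform([text[0]])
# summarize the encoded vector
print(vector.shape)  # (1, 8)
print(vector.toarray())
```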

A vocabulary of 8 words is learned from the documents and each word is assigned a unique integer index in the output vector.

The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: “the” at index 7.

Finally, the first document is encoded as an 8-element sparse array, and we can review the final scores of each word, with different values for "the", "fox", and "dog" than for the other words in the vocabulary.

The scores are normalized to values between 0 and 1 and the encoded document vectors can then be used directly with most machine learning algorithms.

Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrarily long fixed-length vector. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).

The HashingVectorizer class implements this approach; it can be used to consistently hash words, then tokenize and encode documents as needed.

The example below demonstrates the HashingVectorizer for encoding a single document.
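A minimal sketch (the sample sentence is an assumption carried over from the earlier examples):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform with a fixed 20-element feature space
vectorizer = HashingVectorizer(n_features=20)
# no fit needed: encode the document directly
vector = vectorizer.transform(text)
# summarize the encoded vector
print(vector.shape)  # (1, 20)
print(vector.toarray())
```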

An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. Remembering back to compsci classes, I believe there are heuristics that you can use to pick the hash length and probability of collision based on estimated vocabulary size.

Note that this vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.

Running the example encodes the sample document as a 20-element sparse array.

The values of the encoded document correspond to normalized word counts by default in the range of -1 to 1, but could be made simple integer counts by changing the default configuration.
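For example, a sketch of that configuration change: disabling the norm and the alternating-sign trick (the norm and alternate_sign constructor arguments in recent scikit-learn versions) yields raw, non-negative counts per hash bucket.

```python
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]
# disable normalization and sign alternation to get plain counts
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
vector = vectorizer.transform(text)
# the 9 tokens in the sentence are counted into 20 hash buckets
print(vector.toarray())
```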

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Natural Language Processing

scikit-learn

Class APIs

Summary

In this tutorial, you discovered how to prepare text documents for machine learning with scikit-learn.

We have only scratched the surface in these examples, and I want to highlight that these classes have many configuration options that influence the tokenizing of documents and are worth exploring.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




50 Responses to How to Prepare Text Data for Machine Learning with scikit-learn

  1. Kinjal September 29, 2017 at 2:28 pm #

    Can you tell how to proceed in R for machine learning and feature selection!

  2. Jarbas September 29, 2017 at 11:49 pm #

    Hi, Dr. Brownlee,

    Congratulations for this great article. Do you know any technique to parse HTML documents (DOM) in a smart way to work with ANNs? The guys in this paper do that (http://proceedings.mlr.press/v70/shi17a/shi17a.pdf) but they didn’t specify how.

    Thank you so much!

    • Jason Brownlee September 30, 2017 at 7:42 am #

      No sorry, it is not something I have worked on.

  3. John Stec October 2, 2017 at 3:48 am #

    Was running this code and encountered an error

    ^
    SyntaxError: invalid syntax

    Is this because I’m using Python 2.7?

    • Jason Brownlee October 2, 2017 at 9:39 am #

      The code was developed with Python 2.7 and should work in Python 3.5 as well.

      Confirm that you copied all of the code and preserved the indenting.

  4. Advait Vasavada October 4, 2017 at 3:43 am #

    Hello Sir,
    How can ML be used to carry out survey based research?

    • Jason Brownlee October 4, 2017 at 5:48 am #

      I don’t know. Perhaps used in the analysis in some way?

  5. Rahul October 28, 2017 at 2:13 am #

    Hello Sir,

    I am trying to classify a delivery address as residential and commercial based on some numerical features weight ,qty , some derived features ,along with text address data as input. Should I apply countvectorizer or tdif on the text address data for converting numerical features. Or any other methods. I planning to use decision tree classifier.

  6. Ahmed November 7, 2017 at 3:07 am #

    Many thanks to you. I am working on my project and I extract data from tags of html web pages.
    I need to assign word in each tag to be feature. for example play in title tag not the same with play in header tag or play in anchor tag. any idea ?

    • Jason Brownlee November 7, 2017 at 9:52 am #

      Perhaps start with a bag of words model and perhaps move on to word embedding + neural net to see if it can do better.

  7. Daniel November 26, 2017 at 3:41 am #

    Hi Dr.Jason,
    How can i adding new columns (or features) to the current vector ? For ex: adding vector of number of wrong spelling words from each document.

    Thanks

    • Jason Brownlee November 26, 2017 at 7:34 am #

      You might want to keep the document representation separate from that information and feed them as two separate inputs to a model (e.g. a multi-input model in Keras).

  8. Vivek November 26, 2017 at 9:12 pm #

    I have a query. I have a a cluster of text files containing some topics of similar interest.
    I want to input these docs as text files to sklearn tools using python.

    Can you please tell me the process

    • Jason Brownlee November 27, 2017 at 5:50 am #

      The process is the blog post above, does that not help?

  9. Aman December 19, 2017 at 2:26 pm #

    This is a great article. Helped me a lot in my project. I have a followup question: What if I want a vector for all the documents present? Is there a more efficient way other than a for loop like this:

    for j in range(len(docs)):
        vector = vectorizer.transform([text[j]])

    I hope to convert it into a massive dataset to present it to a ML algorithm

    • Jason Brownlee December 19, 2017 at 4:00 pm #

      The transform can take the entire document I believe.

  10. Russ Reinsch January 2, 2018 at 10:55 am #

    How come the words lazy and dog both scored 1.28768207, which is in between the low score of 1.0 that was assigned to the word “the” and the higher 1.69314718 score for the word fox; but

    after the scores are normalized to values between 0 and 1, the assigned scores do not follow the same pattern… the values for lazy and dog [0.276] are no longer in between the values for the words “fox” [0.363] and “the” [0.429]; the values for lazy and dog are at one end of the range of scores.

    For that matter, how did the encoder decide to assign different scores to dog and fox in the first place, when they both occur the same number of times?

    Thank you for sharing your knowledge the way you do.

    • Jason Brownlee January 2, 2018 at 4:00 pm #

      Good question Russ.

      I would recommend checking the references and reading up on the calculation of TF/IDF.

    • Tam September 9, 2018 at 9:39 pm #

      Russ –

      The words scored 1.28768207 are ‘dog’ and ‘fox’. The tokeniser starts counting from 0.

      The reason the pattern breaks in the encoded vector is that it is only for the first document, notice [text[0]]. Within that document, the word ‘the’ occurs twice unlike all other words, so even though it is common between documents and is penalised for that, it is also more common within that document so gets points for that. A couple of things to play around with to see how this is working… First, vary the index to see how the feature vector for a different document would work by doing [text[1]] for example. Second, put another ‘the’ in the first document, or take one away, and see what happens.

  11. Jesús Martínez January 26, 2018 at 3:54 am #

    Very good article. Do you think that stemming the vocabulary before applying any of these techniques would yield a better performance?

    • Jason Brownlee January 26, 2018 at 5:45 am #

      It may as it would reduce the vocab size, try it and see.

  12. punita February 8, 2018 at 9:05 pm #

    hello ….
    i need to ask u a question…
    i am thinking of working on “Tweet Sentiment Analysis for Cellular Network Service Providers using Machine Learning Algorithms”…could u please help me …is it possible to work on such data…. i have fetched 2000 tweets from twitter and
    i am facing problem in feature extraction n i am not deciding what could be the features for such data….please help if u could…..

  13. Ravi Shankar February 19, 2018 at 9:01 pm #

    The position of word in the text ? Is it not important? How is it taken care in the vector representation?

    Otherwise two texts with same words and frequency will result in same vector, if the order of words is different

    Since I am new to ML, please help me understand

    • Jason Brownlee February 21, 2018 at 6:26 am #

      It is for some models (like LSTMs) and not for others (like bag of words).

  14. Shiva March 9, 2018 at 2:54 am #

    Hi Jason,

    I have a question regarding HashingVectorizer(). How does one do online learning with it? Does it learn the vocabulary of the text like tfidf does. Also, what happens when a new text containing some unseen words come in. What do I do with that? I am suppose to wait for a bunch of data, and call transform() on HashingVectorizer() and feed it the new text samples. Any references/videos to online learning of text documents with “Multi-Label” output would be great.

    • Jason Brownlee March 9, 2018 at 6:25 am #

      Ideally the vocab would be defined up front.

      Otherwise, you could use it in an online manner, as long as you set an expectation on the size of the vocab, to ensure the hash function did not have too many collisions.

  15. Aykut Yararbas March 22, 2018 at 10:04 am #

    Very nice article. Thanks.

  16. Shahbaz Wasti March 25, 2018 at 10:43 pm #

    Hi Jason,

    First of all thank you for this great article to learn about TF/IDF practically. I have one question about it. How can I get the list of terms in the vocabulary with their relevant document frequency. In your example, “The” has appeared in all three documents so DF for “The” is 3, similarly DF is 2 for “Dog” and “Fox”.

    Best Regards

  17. Jack Smith April 5, 2018 at 12:52 pm #

    Love your articles Jason!

    I have a question related to turning a list of names into vectors and using them as a feature in a classifier. I am not sure which method to use, but I was thinking that Hashing would be appropriate. My issue is that with the number of possible names being high, wouldn’t this create very sparse vectors that my classifier would have difficulty learning from? Also is there another method I should consider for this case?

    • Jason Brownlee April 5, 2018 at 3:14 pm #

      Perhaps contrast a bag of words to a word2vec method to see what works best for your specific problem.

  18. Kamil May 13, 2018 at 11:24 pm #

    Hi Jason,

    First of all thank you for this great article. I have two questions:

    1. Supposing that we have a dataset similar to data from Kaggle: https://www.kaggle.com/aaron7sun/stocknews/data. How can we deal with some ‘N/A’ data?
    2. Second question is about stop words. I want to use NLTK to delete stop words from text, but unfortunatelly NLTK doesn’t has a polish words. How can I use my own dictionary?

    Regards,
    Kamil

  19. Ravi Shankar May 20, 2018 at 5:46 pm #

    For the example in HashVectorizer and countvectorizer, I am not able to understand the how the values in the vector are arrived at. I understand tf and idf. Can you illustrate atleast for one term, how the value is computed. In the example of hashvectorizer, there is one only document.

    • Jason Brownlee May 21, 2018 at 6:28 am #

      The hash method uses a hash function from string to int.

      The count uses an occurrence count in the document for each word.

  20. Nicolas June 18, 2018 at 12:37 pm #

    Great article! I do have a question. Let’s say I have a dataset with both numeric and text elements. I only want to apply TF-IDF (for example) to my text column, and then append it to my dataset so that I can train with my numerical and categorical data (that now it’s transformed) .

    Example:

    col 1 col 2
    this is a text 4.5
    this is also a text 7.5

    I only want to apply TF-IDF to my col 1, to that I can then use a ML Algorithm with both col 1 and col2.

    Result:

    col 1 col2
    (TF-IDFResult) 4.5
    (TF-IDFResult) 7.5

    How do you achieve this?
    Thanks!

    • Jason Brownlee June 18, 2018 at 3:12 pm #

      Perhaps try two models and ensemble their predictions.

      Perhaps try a neural net with two inputs.

  21. Nil July 5, 2018 at 3:25 pm #

    Hi DR. Jason,

    This is a very good post, I was looking for an explanation like this. I liked so much it cleared me many points.

    I have a doubt because I want to load text from many files to produce a Bag-of-Words, the doubt is:
    If I have many text files (with two classes or categories) to produce a single Bag-of-Words I should load them all together separately? or join all text in a single text file and load a single text file whit all text and then produce the Bag-of-Words?

    Best regards.

    • Jason Brownlee July 6, 2018 at 6:38 am #

      Thanks.

      Perhaps you load all files into memory then prepare the BoW model?

      • Nil July 7, 2018 at 2:18 am #

        Thank you I will try that.

  22. Oliver July 12, 2018 at 1:03 pm #

    Hello Jason, thank you for your great article first and I learnt a lot from that!

    Now I am dealing with a log analysis problem. In this problem, the order of words is a very important feature, for example log content like ‘No such file or directory’, some words always come together in some order.

    My question is, what kinds of feature extraction methods can I use to encode such order information?

    Thank you very much!

  23. Sun August 2, 2018 at 10:01 pm #

    How can we read a folder of text documents and apply the steps mentioned in the article to it, especially resumes?

    • Jason Brownlee August 3, 2018 at 6:02 am #

      Start by reading one file, and expand from there.

  24. Jun August 16, 2018 at 5:59 am #

    two question about the Bag of Words have obsessed me for a while.,

    first question is my source file has 2 columns, one is email content, which is text format, the other is country name(3 different countries) from where the email is sent, and I want to label if the email is Spam or not, here the assumption is the email sent from different countries also matters if email is spam or not. so besides the bag of words, I want to add a feature which is country, the question is that is there is way to implement it in sklearn.

    The other question is besides Bag of Words, what if I also want to consider the position of the words, for instance if word appears in first sentence, I want to lower its weight, if word appears in last sentence, I want to increase its weight, is there a way to implement it in sklearn.

    Thanks.

    • Jason Brownlee August 16, 2018 at 6:16 am #

      Country would be a separate input, perhaps one hot encoded. You could concat with the bow feature vector as part of preparing input for the model.

      A sequence prediction method can handle the position of words, e.g. LSTM.
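      The concatenation described above can be sketched as follows (all data and names here are hypothetical), using scipy.sparse.hstack to join the sparse bag-of-words block with a one hot encoded country column:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# hypothetical example data: email text plus the country it was sent from
emails = ["win a free prize now", "meeting agenda attached"]
countries = np.array([["US"], ["DE"]])

# bag-of-words features for the text
bow = CountVectorizer().fit_transform(emails)
# one hot encoding for the country column
onehot = OneHotEncoder().fit_transform(countries)

# concatenate the two sparse feature blocks column-wise
features = hstack([bow, onehot])
print(features.shape)
```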

Leave a Reply