How to Prepare Text Data for Machine Learning with scikit-learn

Text data requires special preparation before you can start using it for predictive modeling.

The text must be parsed to extract words, a process called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a process called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.

After completing this tutorial, you will know:

  • How to convert text to word count vectors with CountVectorizer.
  • How to convert text to word frequency vectors with TfidfVectorizer.
  • How to convert text to fixed-length vectors of hashed words with HashingVectorizer.

Let’s get started.

Photo by Martin Kelly, some rights reserved.

Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

The scikit-learn library provides 3 different schemes that we can use, and we will briefly look at each.


Word Counts with CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

  1. Create an instance of the CountVectorizer class.
  2. Call the fit() function in order to learn a vocabulary from one or more documents.
  3. Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length equal to the size of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. The scipy.sparse package provides an efficient way of handling sparse vectors in Python.

The vectors returned from a call to transform() will be sparse vectors, and you can convert them back to NumPy arrays to inspect them and better understand what is going on by calling the toarray() function.

Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
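As a minimal sketch (the sample sentence here is an assumption, chosen to be consistent with the 8-word vocabulary and the counts discussed next):

```python
from sklearn.feature_extraction.text import CountVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build the vocabulary
vectorizer.fit(text)
# summarize the learned vocabulary
print(vectorizer.vocabulary_)
# encode the document
vector = vectorizer.transform(text)
# summarize the encoded vector
print(vector.shape)   # (1, 8)
print(type(vector))   # a scipy.sparse matrix
print(vector.toarray())
```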

We can access the learned vocabulary to see exactly what was tokenized by reading the vocabulary_ attribute of the vectorizer.

We can see that all words were made lowercase by default and that the punctuation was ignored. These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.

Running the example first prints the vocabulary, then the shape of the encoded document. We can see that there are 8 words in the vocab, and therefore encoded vectors have a length of 8.

We can then see that the encoded vector is a sparse matrix. Finally, we can see an array version of the encoded vector showing a count of 1 occurrence for each word except "the" (index 7), which has an occurrence of 2.

Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

For example, below the same vectorizer is used to encode a document that contains one word in the vocabulary and one word that is not.
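A self-contained sketch of this (the out-of-vocabulary document "the puppy" is an assumption for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build the vocabulary from the original document
vectorizer = CountVectorizer()
vectorizer.fit(["The quick brown fox jumped over the lazy dog."])
# encode another document: "the" is in the vocabulary, "puppy" is not
vector = vectorizer.transform(["the puppy"])
# only the known word "the" (index 7) is counted; "puppy" is ignored
print(vector.toarray())  # [[0 0 0 0 0 0 0 1]]
```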

Running this example prints the array version of the encoded sparse vector showing one occurrence of the one word in the vocab and the other word not in the vocab completely ignored.

The encoded vectors can then be used directly with a machine learning algorithm.

Word Frequencies with TfidfVectorizer

Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for "Term Frequency – Inverse Document Frequency", which are the components of the resulting scores assigned to each word.

  • Term Frequency: This summarizes how often a given word appears within a document.
  • Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
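A sketch of that alternate route (the three documents here are assumptions, consistent with the example discussed below):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# learn the vocabulary and compute raw counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(text)
# learn the inverse document frequencies from the counts
transformer = TfidfTransformer()
transformer.fit(counts)
# encode the first document's counts as TF-IDF scores
vector = transformer.transform(counts[0])
print(vector.toarray())
```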

The same create, fit, and transform process is used as with the CountVectorizer.

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.
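A minimal sketch (the three documents are assumptions, chosen to be consistent with the vocabulary and scores discussed next):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize, build the vocabulary, and learn the idf weightings
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode the first document
vector = vectorizer.transform([text[0]])
# summarize the encoded vector
print(vector.shape)  # (1, 8)
print(vector.toarray())
```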

A vocabulary of 8 words is learned from the documents and each word is assigned a unique integer index in the output vector.

The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: “the” at index 7.

Finally, the first document is encoded as an 8-element sparse array, and we can review the final scores of each word, with different values for "the", "fox", and "dog" than for the other words in the vocabulary.

The scores are normalized to values between 0 and 1 and the encoded document vectors can then be used directly with most machine learning algorithms.

Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrarily long fixed-length vector. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).

The HashingVectorizer class implements this approach; it can be used to consistently hash words, then tokenize and encode documents as needed.

The example below demonstrates the HashingVectorizer for encoding a single document.
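A minimal sketch (the sample sentence is an assumption carried over from the earlier examples):

```python
from sklearn.feature_extraction.text import HashingVectorizer

# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform with a fixed 20-element feature space
vectorizer = HashingVectorizer(n_features=20)
# no fit needed: encode the document directly
vector = vectorizer.transform(text)
# summarize the encoded vector
print(vector.shape)  # (1, 20)
print(vector.toarray())
```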

An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. Remembering back to compsci classes, I believe there are heuristics that you can use to pick the hash length and probability of collision based on estimated vocabulary size.

Note that this vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.

Running the example encodes the sample document as a 20-element sparse array.

The values of the encoded document correspond to normalized word counts by default in the range of -1 to 1, but could be made simple integer counts by changing the default configuration.
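For example, a sketch of that configuration change: disabling the norm and the alternating-sign trick (the norm and alternate_sign constructor arguments in recent scikit-learn versions) yields raw, non-negative counts per hash bucket.

```python
from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]
# disable normalization and sign alternation to get plain counts
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
vector = vectorizer.transform(text)
# the 9 tokens in the sentence are counted into 20 hash buckets
print(vector.toarray())
```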

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Natural Language Processing

scikit-learn

Class APIs

Summary

In this tutorial, you discovered how to prepare text documents for machine learning with scikit-learn.

We have only scratched the surface in these examples, and I want to highlight that these classes have many configuration options that influence the tokenizing of documents and are worth exploring.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




50 Responses to How to Prepare Text Data for Machine Learning with scikit-learn

  1. Kinjal September 29, 2017 at 2:28 pm #

    Can you tell how to proceed in R for machine learning and feature selection!

  2. Jarbas September 29, 2017 at 11:49 pm #

    Hi, Dr. Brownlee,

    Congratulations for this great article. Do you know any technique to parse HTML documents (DOM) in a smart way to work with ANNs? The guys in this paper do that (http://proceedings.mlr.press/v70/shi17a/shi17a.pdf) but they didn’t specify how.

    Thank you so much!

    • Jason Brownlee September 30, 2017 at 7:42 am #

      No sorry, it is not something I have worked on.

  3. John Stec October 2, 2017 at 3:48 am #

    Was running this code and encountered an error

    ^
    SyntaxError: invalid syntax

    Is this because I’m using Python 2.7?

    • Jason Brownlee October 2, 2017 at 9:39 am #

      The code was developed with Python 2.7 and should work in Python 3.5 as well.

      Confirm that you copied all of the code and preserved the indenting.

  4. Advait Vasavada October 4, 2017 at 3:43 am #

    Hello Sir,
    How can ML be used to carry out survey based research?

    • Jason Brownlee October 4, 2017 at 5:48 am #

      I don’t know. Perhaps used in the analysis in some way?

  5. Rahul October 28, 2017 at 2:13 am #

    Hello Sir,

    I am trying to classify a delivery address as residential and commercial based on some numerical features weight ,qty , some derived features ,along with text address data as input. Should I apply countvectorizer or tdif on the text address data for converting numerical features. Or any other methods. I planning to use decision tree classifier.

  6. Ahmed November 7, 2017 at 3:07 am #

    Many thanks to you. I am working on my project and I extract data from tags of html web pages.
    I need to assign word in each tag to be feature. for example play in title tag not the same with play in header tag or play in anchor tag. any idea ?

    • Jason Brownlee November 7, 2017 at 9:52 am #

      Perhaps start with a bag of words model and perhaps move on to word embedding + neural net to see if it can do better.

  7. Daniel November 26, 2017 at 3:41 am #

    Hi Dr.Jason,
    How can i adding new columns (or features) to the current vector ? For ex: adding vector of number of wrong spelling words from each document.

    Thanks

    • Jason Brownlee November 26, 2017 at 7:34 am #

      You might want to keep the document representation separate from that information and feed them as two separate inputs to a model (e.g. a multi-input model in Keras).

  8. Vivek November 26, 2017 at 9:12 pm #

    I have a query. I have a a cluster of text files containing some topics of similar interest.
    I want to input these docs as text files to sklearn tools using python.

    Can you please tell me the process

    • Jason Brownlee November 27, 2017 at 5:50 am #

      The process is the blog post above, does that not help?

  9. Aman December 19, 2017 at 2:26 pm #

    This is a great article. Helped me a lot in my project. I have a followup question: What if I want a vector for all the documents present? Is there a more efficient way other than a for loop like this:

    for j in range(len(docs)):
        vector = vectorizer.transform([text[j]])

    I hope to convert it into a massive dataset to present it to a ML algorithm

    • Jason Brownlee December 19, 2017 at 4:00 pm #

      The transform can take the entire document I believe.

  10. Russ Reinsch January 2, 2018 at 10:55 am #

    How come the words lazy and dog both scored 1.28768207, which is in between the low score of 1.0 that was assigned to the word “the” and the higher 1.69314718 score for the word fox; but

    after the scores are normalized to values between 0 and 1, the assigned scores do not follow the same pattern… the values for lazy and dog [0.276] are no longer in between the values for the words “fox” [0.363] and “the” [0.429]; the values for lazy and dog are at one end of the range of scores.

    For that matter, how did the encoder decide to assign different scores to dog and fox in the first place, when they both occur the same number of times?

    Thank you for sharing your knowledge the way you do.

    • Jason Brownlee January 2, 2018 at 4:00 pm #

      Good question Russ.

      I would recommend checking the references and reading up on the calculation of TF/IDF.

    • Tam September 9, 2018 at 9:39 pm #

      Russ –

      The words scored 1.28768207 are ‘dog’ and ‘fox’. The tokeniser starts counting from 0.

      The reason the pattern breaks in the encoded vector is that it is only for the first document, notice [text[0]]. Within that document, the word ‘the’ occurs twice unlike all other words, so even though it is common between documents and is penalised for that, it is also more common within that document so gets points for that. A couple of things to play around with to see how this is working… First, vary the index to see how the feature vector for a different document would work by doing [text[1]] for example. Second, put another ‘the’ in the first document, or take one away, and see what happens.

  11. Jesús Martínez January 26, 2018 at 3:54 am #

    Very good article. Do you think that stemming the vocabulary before applying any of these techniques would yield a better performance?

    • Jason Brownlee January 26, 2018 at 5:45 am #

      It may as it would reduce the vocab size, try it and see.

  12. punita February 8, 2018 at 9:05 pm #

    hello ….
    i need to ask u a question…
    i am thinking of working on “Tweet Sentiment Analysis for Cellular Network Service Providers using Machine Learning Algorithms”…could u please help me …is it possible to work on such data…. i have fetched 2000 tweets from twitter and
    i am facing problem in feature extraction n i am not deciding what could be the features for such data….please help if u could…..

  13. Ravi Shankar February 19, 2018 at 9:01 pm #

    The position of word in the text ? Is it not important? How is it taken care in the vector representation?

    Otherwise two texts with same words and frequency will result in same vector, if the order of words is different

    Since I am new to ML, please help me understand

    • Jason Brownlee February 21, 2018 at 6:26 am #

      It is for some models (like LSTMs) and not for others (like bag of words).

  14. Shiva March 9, 2018 at 2:54 am #

    Hi Jason,

    I have a question regarding HashingVectorizer(). How does one do online learning with it? Does it learn the vocabulary of the text like tfidf does. Also, what happens when a new text containing some unseen words come in. What do I do with that? I am suppose to wait for a bunch of data, and call transform() on HashingVectorizer() and feed it the new text samples. Any references/videos to online learning of text documents with “Multi-Label” output would be great.

    • Jason Brownlee March 9, 2018 at 6:25 am #

      Ideally the vocab would be defined up front.

      Otherwise, you could use it in an online manner, as long as you set an expectation on the size of the vocab, to ensure the hash function did not have too many collisions.

  15. Aykut Yararbas March 22, 2018 at 10:04 am #

    Very nice article. Thanks.

  16. Shahbaz Wasti March 25, 2018 at 10:43 pm #

    Hi Jason,

    First of all thank you for this great article to learn about TF/IDF practically. I have one question about it. How can I get the list of terms in the vocabulary with their relevant document frequency. In your example, “The” has appeared in all three documents so DF for “The” is 3, similarly DF is 2 for “Dog” and “Fox”.

    Best Regards

  17. Jack Smith April 5, 2018 at 12:52 pm #

    Love your articles Jason!

    I have a question related to turning a list of names into vectors and using them as a feature in a classifier. I am not sure which method to use, but I was thinking that Hashing would be appropriate. My issue is that with the number of possible names being high, wouldn’t this create very sparse vectors that my classifier would have difficulty learning from? Also is there another method I should consider for this case?

    • Jason Brownlee April 5, 2018 at 3:14 pm #

      Perhaps contrast a bag of words to a word2vec method to see what works best for your specific problem.

  18. Kamil May 13, 2018 at 11:24 pm #

    Hi Jason,

    First of all thank you for this great article. I have two questions:

    1. Supposing that we have a dataset similar to data from Kaggle: https://www.kaggle.com/aaron7sun/stocknews/data. How can we deal with some ‘N/A’ data?
    2. Second question is about stop words. I want to use NLTK to delete stop words from text, but unfortunatelly NLTK doesn’t has a polish words. How can I use my own dictionary?

    Regards,
    Kamil

  19. Ravi Shankar May 20, 2018 at 5:46 pm #

    For the example in HashVectorizer and countvectorizer, I am not able to understand the how the values in the vector are arrived at. I understand tf and idf. Can you illustrate atleast for one term, how the value is computed. In the example of hashvectorizer, there is one only document.

    • Jason Brownlee May 21, 2018 at 6:28 am #

      The hash method uses a hash function from string to int.

      The count uses an occurrence count in the document for each word.

  20. Nicolas June 18, 2018 at 12:37 pm #

    Great article! I do have a question. Let’s say I have a dataset with both numeric and text elements. I only want to apply TF-IDF (for example) to my text column, and then append it to my dataset so that I can train with my numerical and categorical data (that now it’s transformed) .

    Example:

    col 1 col 2
    this is a text 4.5
    this is also a text 7.5

    I only want to apply TF-IDF to my col 1, to that I can then use a ML Algorithm with both col 1 and col2.

    Result:

    col 1 col2
    (TF-IDFResult) 4.5
    (TF-IDFResult) 7.5

    How do you achieve this?
    Thanks!

    • Jason Brownlee June 18, 2018 at 3:12 pm #

      Perhaps try two models and ensemble their predictions.

      Perhaps try a neural net with two inputs.

  21. Nil July 5, 2018 at 3:25 pm #

    Hi DR. Jason,

    This is a very good post, I was looking for an explanation like this. I liked so much it cleared me many points.

    I have a doubt because I want to load text from many files to produce a Bag-of-Words, the doubt is:
    If I have many text files (with two classes or categories) to produce a single Bag-of-Words I should load them all together separately? or join all text in a single text file and load a single text file whit all text and then produce the Bag-of-Words?

    Best regards.

    • Jason Brownlee July 6, 2018 at 6:38 am #

      Thanks.

      Perhaps you load all files into memory then prepare the BoW model?

      • Nil July 7, 2018 at 2:18 am #

        Thank you I will try that.

  22. Oliver July 12, 2018 at 1:03 pm #

    Hello Jason, thank you for your great article first and I learnt a lot from that!

    Now I am dealing with a log analysis problem. In this problem, the order of words is a very important feature, for example log content like ‘No such file or directory’, some words always come together in some order.

    My question is, what kinds of feature extraction methods can I use to encode such order information?

    Thank you very much!

  23. Sun August 2, 2018 at 10:01 pm #

    How can we read a folder of text documents and apply the steps mentioned in the article to it, especially resumes?

    • Jason Brownlee August 3, 2018 at 6:02 am #

      Start by reading one file, and expand from there.

  24. Jun August 16, 2018 at 5:59 am #

    two question about the Bag of Words have obsessed me for a while.,

    first question is my source file has 2 columns, one is email content, which is text format, the other is country name(3 different countries) from where the email is sent, and I want to label if the email is Spam or not, here the assumption is the email sent from different countries also matters if email is spam or not. so besides the bag of words, I want to add a feature which is country, the question is that is there is way to implement it in sklearn.

    The other question is besides Bag of Words, what if I also want to consider the position of the words, for instance if word appears in first sentence, I want to lower its weight, if word appears in last sentence, I want to increase its weight, is there a way to implement it in sklearn.

    Thanks.

    • Jason Brownlee August 16, 2018 at 6:16 am #

      Country would be a separate input, perhaps one hot encoded. You could concat with the bow feature vector as part of preparing input for the model.

      A sequence prediction method can handle the position of words, e.g. LSTM.
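      The concatenation described above can be sketched as follows (all data and names here are hypothetical), using scipy.sparse.hstack to join the sparse bag-of-words block with a one hot encoded country column:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# hypothetical example data: email text plus the country it was sent from
emails = ["win a free prize now", "meeting agenda attached"]
countries = np.array([["US"], ["DE"]])

# bag-of-words features for the text
bow = CountVectorizer().fit_transform(emails)
# one hot encoding for the country column
onehot = OneHotEncoder().fit_transform(countries)

# concatenate the two sparse feature blocks column-wise
features = hstack([bow, onehot])
print(features.shape)
```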

Leave a Reply