Text data requires special preparation before you can start using it for predictive modeling.
First, the text must be parsed to extract words, a step called tokenization. Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization).
The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.
In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.
After completing this tutorial, you will know:
- How to convert text to word count vectors with CountVectorizer.
- How to convert text to word frequency vectors with TfidfVectorizer.
- How to convert text to unique integers with HashingVectorizer.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
Bag-of-Words Model
We cannot work with text directly when using machine learning algorithms.
Instead, we need to convert the text to numbers.
We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.
A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.
The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.
This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.
This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
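To make the encoding concrete, here is a minimal pure-Python sketch of the idea (the tiny vocabulary, the document, and the encode() helper are made up for illustration):

# a minimal bag-of-words sketch: assign each known word an index,
# then encode a document as a fixed-length vector of word counts
vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}

def encode(document, vocab):
    vector = [0] * len(vocab)
    for word in document.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

print(encode("The quick quick fox", vocab))  # [1, 2, 0, 1]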
For more on the bag of words model, see the tutorial:
There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.
The scikit-learn library provides 3 different schemes that we can use, and we will briefly look at each.
Need help with Deep Learning for Text Data?
Take my free 7-day email crash course now (with code).
Click to sign-up and also get a free PDF Ebook version of the course.
Word Counts with CountVectorizer
The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.
You can use it as follows:
- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector.
An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.
Because these vectors will contain a lot of zeros, we call them sparse. Python provides an efficient way of handling sparse vectors in the scipy.sparse package.
The vectors returned from a call to transform() will be sparse vectors, and you can transform them back to numpy arrays to look at and better understand what is going on by calling the toarray() function.
Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())
Above, you can see that we access the vocabulary to see what exactly was tokenized by calling:
print(vectorizer.vocabulary_)
We can see that all words were made lowercase by default and that the punctuation was ignored. These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.
Running the example first prints the vocabulary, then the shape of the encoded document. We can see that there are 8 words in the vocab, and therefore encoded vectors have a length of 8.
We can then see that the encoded vector is a sparse matrix. Finally, we can see an array version of the encoded vector showing a count of 1 for each word except “the” (index 7), which has a count of 2.
{'dog': 1, 'fox': 2, 'over': 5, 'brown': 0, 'quick': 6, 'the': 7, 'lazy': 4, 'jumped': 3}
(1, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[1 1 1 1 1 1 1 2]]
Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.
Below is an example of using the vectorizer above to encode a document with one word that is in the vocab and one word that is not.
# encode another document
text2 = ["the puppy"]
vector = vectorizer.transform(text2)
print(vector.toarray())
Running this example prints the array version of the encoded sparse vector showing one occurrence of the one word in the vocab and the other word not in the vocab completely ignored.
[[0 0 0 0 0 0 0 1]]
The encoded vectors can then be used directly with a machine learning algorithm.
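For example, a sparse count matrix can be passed straight to a scikit-learn estimator. The sketch below uses a tiny made-up labeled corpus and logistic regression purely for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# tiny illustrative labeled corpus
docs = ["the quick brown fox", "the lazy dog", "quick fox jumps", "lazy old dog"]
labels = [0, 1, 0, 1]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse count matrix
model = LogisticRegression()
model.fit(X, labels)                 # trains directly on the sparse vectors
print(model.predict(vectorizer.transform(["a very lazy dog"])))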
Word Frequencies with TfidfVectorizer
Word counts are a good starting point, but are very basic.
One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.
An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, the two components of the score assigned to each word.
- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.
Without going into the math, TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.
The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
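As a rough sketch of that second path (the variable names are just for illustration), a CountVectorizer can be chained with a TfidfTransformer:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# learn the vocabulary and produce raw counts
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(text)
# learn the idf weights from the counts, then encode
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(counts)
print(tfidf.shape)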
The same create, fit, and transform process is used as with the CountVectorizer.
Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.
from sklearn.feature_extraction.text import TfidfVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]
# create the transform
vectorizer = TfidfVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
# summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
# encode document
vector = vectorizer.transform([text[0]])
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
A vocabulary of 8 words is learned from the documents and each word is assigned a unique integer index in the output vector.
The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: “the” at index 7.
Finally, the first document is encoded as an 8-element sparse array and we can review the final scores for each word, with different values for “the“, “fox“, and “dog” than for the other words in the vocabulary.
{'fox': 2, 'lazy': 4, 'dog': 1, 'quick': 6, 'the': 7, 'over': 5, 'brown': 0, 'jumped': 3}
[ 1.69314718  1.28768207  1.28768207  1.69314718  1.69314718  1.69314718
  1.69314718  1.        ]
(1, 8)
[[ 0.36388646  0.27674503  0.27674503  0.36388646  0.36388646  0.36388646
   0.36388646  0.42983441]]
The scores are normalized to values between 0 and 1 and the encoded document vectors can then be used directly with most machine learning algorithms.
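If you want to check the numbers by hand, the idf values printed above can be reproduced with the smoothed formula that scikit-learn uses by default; the final vector is then the word counts multiplied by these idf values and L2-normalized. The short sketch below is only a sanity check of the idf part:

from math import log

n_docs = 3
# smoothed idf, the scikit-learn default: idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf = lambda df: log((1 + n_docs) / (1 + df)) + 1

print(idf(3))  # 'the' appears in all 3 documents -> 1.0
print(idf(2))  # 'dog' and 'fox' appear in 2      -> 1.28768207...
print(idf(1))  # every other word appears in 1    -> 1.69314718...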
Hashing with HashingVectorizer
Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.
This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.
A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrarily long fixed-length vector. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).
The HashingVectorizer class implements this approach. It can be used to consistently hash words, then tokenize and encode documents as needed.
The example below demonstrates the HashingVectorizer for encoding a single document.
An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. Remembering back to compsci classes, I believe there are heuristics that you can use to pick the hash length and probability of collision based on estimated vocabulary size.
Note that this vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.
from sklearn.feature_extraction.text import HashingVectorizer
# list of text documents
text = ["The quick brown fox jumped over the lazy dog."]
# create the transform
vectorizer = HashingVectorizer(n_features=20)
# encode document
vector = vectorizer.transform(text)
# summarize encoded vector
print(vector.shape)
print(vector.toarray())
Running the example encodes the sample document as a 20-element sparse array.
The values of the encoded document correspond to normalized word counts by default in the range of -1 to 1, but could be made simple integer counts by changing the default configuration.
(1, 20)
[[ 0.          0.          0.          0.          0.          0.33333333
   0.         -0.33333333  0.33333333  0.          0.          0.33333333
   0.          0.          0.         -0.33333333  0.          0.
  -0.66666667  0.        ]]
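As a sketch of that configuration change, recent versions of scikit-learn expose norm and alternate_sign parameters that can be switched off to get plain non-negative counts (check the parameter names against your installed version):

from sklearn.feature_extraction.text import HashingVectorizer

text = ["The quick brown fox jumped over the lazy dog."]
# no normalization and no alternating sign -> raw hashed word counts
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
vector = vectorizer.transform(text)
print(vector.toarray())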
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Natural Language Processing
scikit-learn
- Section 4.2. Feature extraction, scikit-learn User Guide
- scikit-learn Feature Extraction API
- Working With Text Data, scikit-learn Tutorial
Class APIs
- CountVectorizer scikit-learn API
- TfidfVectorizer scikit-learn API
- TfidfTransformer scikit-learn API
- HashingVectorizer scikit-learn API
Summary
In this tutorial, you discovered how to prepare text documents for machine learning with scikit-learn.
We have only scratched the surface in these examples and I want to highlight that there are many configuration details for these classes to influence the tokenizing of documents that are worth exploring.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Can you tell how to proceed in R for machine learning and feature selection!
See this post:
https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
Hi, Dr. Brownlee,
Congratulations on this great article. Do you know of any technique to parse HTML documents (DOM) in a smart way to work with ANNs? The authors of this paper do that (http://proceedings.mlr.press/v70/shi17a/shi17a.pdf) but they didn’t specify how.
Thank you so much!
No sorry, it is not something I have worked on.
Hi, did you get to know how to parse the DOM to work with an ANN?
Was running this code and encountered an error
^
SyntaxError: invalid syntax
Is this because I’m using Python 2.7?
The code was developed with Python 2.7 and should work in Python 3.5 as well.
Confirm that you copied all of the code and preserved the indenting.
I should be paying my tuition to you, thank you for all your great contents.
Thanks Eric!
Please, I have an assignment to create a search engine from news feeds. After parsing the RSS feeds, how do I extract words from those links?
Perhaps this will help:
https://machinelearningmastery.com/clean-text-machine-learning-python/
Hello Sir,
How can ML be used to carry out survey based research?
I don’t know. Perhaps used in the analysis in some way?
Hello Sir,
I am trying to classify a delivery address as residential or commercial based on some numerical features (weight, qty, some derived features) along with the text address data as input. Should I apply CountVectorizer or TF-IDF on the text address data to convert it into numerical features? Or any other methods? I am planning to use a decision tree classifier.
Test many approaches and see what works best.
Many thanks to you. I am working on my project and I extract data from tags of html web pages.
I need to assign the word in each tag as a feature. For example, “play” in a title tag is not the same as “play” in a header tag or “play” in an anchor tag. Any idea?
Perhaps start with a bag of words model and perhaps move on to word embedding + neural net to see if it can do better.
Hi sir can u please explain why we use this line(from sklearn. modelselection traindata….,) with a good explanation?
Which line exactly?
Hi Dr.Jason,
How can I add new columns (or features) to the current vector? For example: adding a vector of the number of misspelled words in each document.
Thanks
You might want to keep the document representation separate from that information and feed them as two separate inputs to a model (e.g. a multi-input model in Keras).
I have a query. I have a cluster of text files containing some topics of similar interest.
I want to input these docs as text files to sklearn tools using python.
Can you please tell me the process
The process is the blog post above, does that not help?
This is a great article. Helped me a lot in my project. I have a followup question: What if I want a vector for all the documents present? Is there a more efficient way other than a for loop like this:
for j in range(len(text)):
    vector = vectorizer.transform([text[j]])
I hope to convert it into a massive dataset to present it to a ML algorithm
The transform() function can take the entire list of documents, I believe.
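For example, something like this encodes every document in one call (a sketch with a made-up list of documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["first document", "second document", "third one"]  # illustrative
vectorizer = CountVectorizer().fit(docs)
vectors = vectorizer.transform(docs)   # one call encodes all documents
print(vectors.shape)                   # (number_of_documents, vocabulary_size)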
How come the words lazy and dog both scored 1.28768207, which is in between the low score of 1.0 that was assigned to the word “the” and the higher 1.69314718 score for the word fox; but
after the scores are normalized to values between 0 and 1, the assigned scores do not follow the same pattern… the values for lazy and dog [0.276] are no longer in between the values for the words “fox” [0.363] and “the” [0.429]; the values for lazy and dog are at one end of the range of scores.
For that matter, how did the encoder decide to assign different scores to dog and fox in the first place, when they both occur the same number of times?
Thank you for sharing your knowledge the way you do.
Good question Russ.
I would recommend checking the references and reading up on the calculation of TF/IDF.
Russ –
The words scored 1.28768207 are ‘dog’ and ‘fox’. The tokeniser starts counting from 0.
The reason the pattern breaks in the encoded vector is that it is only for the first document, notice [text[0]]. Within that document, the word ‘the’ occurs twice unlike all other words, so even though it is common between documents and is penalised for that, it is also more common within that document so gets points for that. A couple of things to play around with to see how this is working… First, vary the index to see how the feature vector for a different document would work by doing [text[1]] for example. Second, put another ‘the’ in the first document, or take one away, and see what happens.
Very good article. Do you think that stemming the vocabulary before applying any of these techniques would yield a better performance?
It may as it would reduce the vocab size, try it and see.
hello ….
i need to ask u a question…
I am thinking of working on “Tweet Sentiment Analysis for Cellular Network Service Providers using Machine Learning Algorithms”. Could you please help me? Is it possible to work on such data? I have fetched 2000 tweets from Twitter and
I am facing a problem with feature extraction; I cannot decide what the features for such data could be. Please help if you can.
This post might help as a template:
https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
The position of a word in the text? Is it not important? How is it taken care of in the vector representation?
Otherwise two texts with the same words and frequencies will result in the same vector, even if the order of the words is different.
Since I am new to ML, please help me understand.
It is for some models (like LSTMs) and not for others (like bag of words).
Hi Jason,
I have a question regarding HashingVectorizer(). How does one do online learning with it? Does it learn the vocabulary of the text like tf-idf does? Also, what happens when a new text containing some unseen words comes in? What do I do with that? Am I supposed to wait for a bunch of data and then call transform() on the HashingVectorizer(), feeding it the new text samples? Any references/videos on online learning of text documents with “multi-label” output would be great.
Ideally the vocab would be defined up front.
Otherwise, you could use it in an online manner, as long as you set an expectation on the size of the vocab, to ensure the hash function did not have too many collisions.
Very nice article. Thanks.
Thanks, I’m glad it helped.
Hi Jason,
First of all, thank you for this great article for learning about TF-IDF practically. I have one question about it. How can I get the list of terms in the vocabulary with their relevant document frequency? In your example, “the” has appeared in all three documents so the DF for “the” is 3; similarly the DF is 2 for “dog” and “fox”.
Best Regards
Check the attributes on the object, see here:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Love your articles Jason!
I have a question related to turning a list of names into vectors and using them as a feature in a classifier. I am not sure which method to use, but I was thinking that Hashing would be appropriate. My issue is that with the number of possible names being high, wouldn’t this create very sparse vectors that my classifier would have difficulty learning from? Also is there another method I should consider for this case?
Perhaps contrast a bag of words to a word2vec method to see what works best for your specific problem.
Hi Jason,
First of all thank you for this great article. I have two questions:
1. Supposing that we have a dataset similar to data from Kaggle: https://www.kaggle.com/aaron7sun/stocknews/data. How can we deal with some ‘N/A’ data?
2. The second question is about stop words. I want to use NLTK to delete stop words from the text, but unfortunately NLTK doesn’t have Polish stop words. How can I use my own dictionary?
Regards,
Kamil
I have advice on how to handle missing data here:
https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-missing-data
You can use your own, perhaps NLTK does have stop words in other languages. It would be worth checking.
For the examples with HashingVectorizer and CountVectorizer, I am not able to understand how the values in the vector are arrived at. I understand tf and idf. Can you illustrate, at least for one term, how the value is computed? In the HashingVectorizer example, there is only one document.
The hash method uses a hash function from string to int.
The count uses an occurrence count in the document for each word.
Great article! I do have a question. Let’s say I have a dataset with both numeric and text elements. I only want to apply TF-IDF (for example) to my text column, and then append it to my dataset so that I can train with my numerical and categorical data (that now it’s transformed) .
Example:
col 1 col 2
this is a text 4.5
this is also a text 7.5
I only want to apply TF-IDF to my col 1, to that I can then use a ML Algorithm with both col 1 and col2.
Result:
col 1 col2
(TF-IDFResult) 4.5
(TF-IDFResult) 7.5
How do you achieve this?
Thanks!
Perhaps try two models and ensemble their predictions.
Perhaps try a neural net with two inputs.
I have this same question.
I was wondering if you could separate the data (apply the TF-IDF in col 1) and then concat the result with col 2 as part of preparing input for the model
Sure, try it.
Hi DR. Jason,
This is a very good post, I was looking for an explanation like this. I liked so much it cleared me many points.
I have a doubt because I want to load text from many files to produce a Bag-of-Words, the doubt is:
If I have many text files (with two classes or categories), to produce a single bag-of-words should I load them all separately? Or should I join all the text in a single text file, load that single file with all the text, and then produce the bag-of-words?
Best regards.
Thanks.
Perhaps you load all files into memory then prepare the BoW model?
Thank you I will try that.
Hello Jason, thank you for your great article first and I learnt a lot from that!
Now I am dealing with a log analysis problem. In this problem, the order of words is a very important feature, for example log content like ‘No such file or directory’, some words always come together in some order.
My question is, what kinds of feature extraction methods can I use to encode such order information?
Thank you very much!
I recommend a word embedding as a representation that is distributed and allows you to preserve word order.
https://machinelearningmastery.com/what-are-word-embeddings/
How can we read a folder of text documents and apply the steps mentioned in the article to it, especially resumes?
Start by reading one file, and expand from there.
Two questions about the bag of words have obsessed me for a while.
The first question: my source file has 2 columns. One is the email content, which is in text format; the other is the country name (3 different countries) from which the email was sent. I want to label whether the email is spam or not, and the assumption is that the country an email is sent from also matters for whether it is spam. So besides the bag of words, I want to add a country feature. Is there a way to implement this in sklearn?
The other question: besides the bag of words, what if I also want to consider the position of the words? For instance, if a word appears in the first sentence I want to lower its weight, and if it appears in the last sentence I want to increase its weight. Is there a way to implement that in sklearn?
Thanks.
Country would be a separate input, perhaps one hot encoded. You could concat with the bow feature vector as part of preparing input for the model.
A sequence prediction method can handle the position of words, e.g. LSTM.
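A minimal sketch of that concatenation of a one hot encoded country with the bag-of-words vector, assuming a recent scikit-learn where OneHotEncoder accepts string categories (the data here is made up):

from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

emails = ["cheap pills now", "meeting at noon"]
countries = [["US"], ["DE"]]                        # one country per email

bow = CountVectorizer().fit_transform(emails)       # bag-of-words features
country = OneHotEncoder().fit_transform(countries)  # one hot encoded country
X = hstack([bow, country])                          # combined sparse feature matrix
print(X.shape)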
Hello Dr. Jason,
I have a set of PDF documents. I would like to read the contents of the PDF documents and use certain paragraphs of the PDFs to provide answers through an interactive bot. How can I make the bot learn about the text contents? Please help me here.
Sounds like a very challenging research problem, not something trivial I can answer in a blog comment, sorry.
nice work sir,
How can I extract specific text like names, dates, and values from a PDF file?
Sounds like you’re interested in named entity recognition:
https://en.wikipedia.org/wiki/Named-entity_recognition
Good explanation! When you refer to documents here, can they be a data frame/list containing bags of words (sentences/tweets)?
Also, across all your books, where should one start, especially if we have Python programming knowledge?
Can we start with Master Machine Learning Algorithms or Machine Learning with Python, or Linear Algebra (least preferred), as we may not get time?
A good place to start with ML using Python is here:
https://machinelearningmastery.com/start-here/#python
If you want to get started with deep learning in python for text, you can start here:
https://machinelearningmastery.com/start-here/#nlp
Hello Jason,
Can you help me? I have a big dataset, about 4 million texts, and I cannot use plain tf-idf. So, first, I use hashing. But I also need to use tf-idf. How can I combine the two methods? Is it possible to give a sparse matrix to tf-idf instead of a corpus?
Perhaps you can use progressive loading and manually build the tf-idf from your dataset?
Hi, I am using a data set from Yahoo and it is in CSV format. I have tokenized the data set, and stemming and stop word removal are done. I have generated vector values using the skip-gram model. Now how do I convert a variable-cardinality vector to a fixed-length vector? Please help me with this.
What do you mean exactly?
Hello Sir,
I want to extract the emotions associated with tweets. For that I used TfidfVectorizer to transform each word into a number, but tweets vary in length, so the length of each vector array will also vary. How can I apply that to my model, say to an SVM?
Thank you
If you use a vectorized representation, you will have a fixed sized vocab, but the length of the tweets can vary no problem.
Hello. How do I train and test text data?
I want a sample code
You can get started here:
https://machinelearningmastery.com/start-here/#nlp
Hello Jason,
thank you for the nice tutorial.
I’m doing a text classification task to classify 10 different categories. I have a training data set that contains text documents for each category. I also have a list of predefined features (words/terms) for each category, and I want to add these features to the training dataset so that they are the top features during the training of the ML model.
Do you have any idea how can I do that?
Thank you,
Yes, you can remove all other words except the ones you want to include, then model the classification problem.
Hi Jason,
What is the difference between a covariance matrix and a co-occurrence matrix? How can we obtain a co-occurrence matrix from a text corpus in Python?
Thanks
Good question, I don’t have an example of a co-occurrence matrix, sorry.
After applying the vectorization to the train and test samples using TfidfVectorizer, the train and test samples will have a different number of features. So when you apply the trained model to make predictions on the validation sample, it will complain that the validation sample doesn’t have enough features. What is your advice to solve this issue? Thanks!
You must use the same vocab to prepare both datasets, e.g. fit on train and transform train and test.
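A minimal sketch of that pattern, with made-up train and test lists:

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the quick brown fox", "the lazy dog"]  # illustrative
test_docs = ["a quick dog", "a completely unseen aardvark"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)                  # learn the vocab and idf from training data only
X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)    # same number of columns as X_train
print(X_train.shape, X_test.shape)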
Your tutorials are very helpful.
Thanks, I’m happy to hear that.
Thank you so much Jason, your tutorial helped me a lot and I wish that you will keep sharing more and thank you so much again
Thanks, I’m happy to hear that.
Thank you so much for the help
I wanted to ask two questions based on my work. I am currently trying to pre-process Java source code with known labels and feed the output to my model for a classification problem.
Can I use these techniques for pre-processing Java source code, or is there something more suitable for my work?
And if it is possible, on what basis should I select a proper vectorizer for my research?
again thank you so much for the article
I don’t know much about pre-processing code for modeling, sorry.
Perhaps check papers on the topic to see what is common.
Your explanation is very clear; I liked it. I have a question: how can I apply this to a text file containing an Arabic dataset? Can you help me?
Good question, I don’t have examples of working with Arabic text, sorry.
thanks alot bro good work
I’m glad it helped.
I am a beginner in data science (machine learning). In my recent course I used Anaconda Python and a few machine learning algorithms.
Currently I am a full-stack developer (.NET, Angular, SQL, Firebase, etc.).
I want to upgrade myself for data science and machine learning.
Can you please suggest any tutorials for a beginner?
Yes, you can get started right here:
https://machinelearningmastery.com/start-here/#getstarted
Really help, thanks.
My question is how can you insert a large corpus instead of the small text documents you used as an example.
You can use the same code, and load your large corpus.
Perhaps I don’t follow your question?
Sir, how can we load a large corpus for CountVectorizer?
The same as a small corpus. Perhaps I don’t understand your problem?
Thank you very much. Can TF-IDF be applied to documents that have had the hash trick applied, and why are there negative numbers in the feature vectors?
No idea, sorry.
I have a question with regard to creating an instance of the CountVectorizer class by using:
vectorizer = CountVectorizer(tokenizer=word_tokenize)
Could you please clarify the meaning of “tokenizer=word_tokenize” .
What is the difference between vectorizer = CountVectorizer and vectorizer = CountVectorizer(tokenizer=word_tokenize)
I’m so grateful for your advice.
According to the API, it is a function that you can specify to perform the tokenization:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
Hello, another great article.
I want to ask, is it possible to change the size of vocabulary in TfidfVectorizer example?
Thank you in advance
Thanks.
Yes, it takes the size of the vocab from the training data.
Hi Dr.
In line with what you said on your website, for text classification we should use a way to represent our text as a vector so it is ready to use as input for machine learning.
I know that we should use approaches such as BoW or word2vec, but my question is: how should we represent our features as a vector? Say I want to do sentiment analysis and have a feature such as an emoticon dictionary; how should I represent it as a vector?
With word2vec?
Could you please help me?
Each emoticon would be a “word” that could appear in text. Represented like other words in your dataset.
Hi, nice article
I have a doubt:
What is the exact use of fit() function ?
Fits the model.
Specifically, it runs an optimization algorithm to find model parameters that best capture the relationship between inputs and outputs in your dataset.
So in this case vectorizer.fit(text) is creating a dictionary, right?
How does that help the transform() function?
Yes, it defines the vocab. Anytime transform() is performed it is limited to the vocab seen when fit was called.
Thank you for your practical blog. I’m somewhat confused. please help!
In line with what you said: ….Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).
I know that one popular way is BoW, with which we can have a numeric representation, right?
Now my question is: imagine I want to do some text classification and one of my features is the number of emoticon icons. Obviously, with some pieces of code, I can determine every sentence that has emoticon icons. The problem is how can I deploy this feature? In other words, by what method can I represent this feature numerically so it is suitable to feed to an ML algorithm?
Yes, you can have these static features alongside the numeric representations of the text, e.g. just more features. The model cannot tell the difference.
Hi!
Great post. Is there a way to use CountVectorizer on a corpus which is a list and not a string?
Hmm, what do you mean exactly? Can you elaborate?
Hey Jason, Great article.
I was wondering if you could demonstrate how to combine NLP with time series analysis (ARIMA model) as you presented in this article? https://machinelearningmastery.com/make-sample-forecasts-arima-python/
Not many resources online that discuss this and many errors come from using CountVectorizer with ARIMA
Thanks for the suggestion.
In the CountVectorizer section, the codes includes “print(vectorizer.vocabulary_)”, which returns a dictionary comprised of the tokens and their respective indices in the array. See below.
{‘dog’: 1, ‘fox’: 2, ‘over’: 5, ‘brown’: 0, ‘quick’: 6, ‘the’: 7, ‘lazy’: 4, ‘jumped’: 3}
It seems that the tokens are assigned an index based on alphanumeric order. Is there any other significance to the order?
Not that I’m aware.
How can we pass two columns as the ‘X’ when preparing the model?
Perhaps prepare each variable separately then concat the results prior to modeling?
Hi Jason! I am trying to perform tfidf vectorization on my train_x dataframe (4064, 1). But after calling fit_transform(train_x).toarray(), the resulting matrix has size (1,1) which doesn’t make sense. Why is this happening?
Maybe your input matrix is in the wrong shape? Transposed?