How to Encode Text Data for Machine Learning with scikit-learn

Text data requires special preparation before you can start using it for predictive modeling.

The text must be parsed to split it into words, called tokenization. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

In this tutorial, you will discover exactly how you can prepare your text data for predictive modeling in Python with scikit-learn.

After completing this tutorial, you will know:

  • How to convert text to word count vectors with CountVectorizer.
  • How to convert text to word frequency vectors with TfidfVectorizer.
  • How to convert text to unique integers with HashingVectorizer.

Let’s get started.

Bag-of-Words Model

We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input; therefore, we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.

For more on the bag of words model, see the tutorial:

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

The scikit-learn library provides 3 different schemes that we can use, and we will briefly look at each.

Word Counts with CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

  1. Create an instance of the CountVectorizer class.
  2. Call the fit() function in order to learn a vocabulary from one or more documents.
  3. Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the scipy.sparse package.

The vectors returned from a call to transform() are sparse. To inspect them and better understand what is going on, you can convert them to NumPy arrays by calling the toarray() function.

Below is an example of using the CountVectorizer to tokenize, build a vocabulary, and then encode a document.
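The listing for this example is not reproduced in this copy of the post, so here is a minimal sketch of the steps described above. The sample sentence is an assumption, chosen because it yields the 8-word vocabulary discussed in the text that follows.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One sample document (an assumed sentence, chosen to give an 8-word vocabulary)
text = ["The quick brown fox jumped over the lazy dog."]

# Create the transform
vectorizer = CountVectorizer()

# Tokenize the text and build the vocabulary
vectorizer.fit(text)

# Summarize the learned vocabulary (token -> column index)
print(vectorizer.vocabulary_)

# Encode the document as a sparse vector of word counts
vector = vectorizer.transform(text)

# Summarize the encoded vector
print(vector.shape)
print(type(vector))
print(vector.toarray())
```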

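The vocabulary learned by fit() is exposed through the vectorizer's vocabulary_ attribute, a dictionary that maps each token to its index in the output vector. Printing it shows exactly what was tokenized.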

We can see that all words were made lowercase by default and that the punctuation was ignored. These and other aspects of tokenizing can be configured and I encourage you to review all of the options in the API documentation.

Running the example first prints the vocabulary, then the shape of the encoded document. We can see that there are 8 words in the vocab, and therefore encoded vectors have a length of 8.

We can then see that the encoded vector is a sparse matrix. Finally, we can see an array version of the encoded vector showing a count of 1 for each word except “the” (index 7), which has a count of 2.

Importantly, the same vectorizer can be used on documents that contain words not included in the vocabulary. These words are ignored and no count is given in the resulting vector.

For example, the vectorizer above can be used to encode a document with one word in the vocab and one word that is not.
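A minimal sketch of this case follows. The fitted sentence and the two-word test document are assumptions used for illustration: “the” is in the learned vocabulary and “puppy” is not.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from an assumed sample sentence
vectorizer = CountVectorizer()
vectorizer.fit(["The quick brown fox jumped over the lazy dog."])

# Encode a new document: "the" is in the vocab, "puppy" is not
text2 = ["the puppy"]
vector = vectorizer.transform(text2)

# Only the column for "the" is non-zero; "puppy" is silently dropped
print(vector.toarray())
```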

Running this example prints the array version of the encoded sparse vector showing one occurrence of the one word in the vocab and the other word not in the vocab completely ignored.

The encoded vectors can then be used directly with a machine learning algorithm.

Word Frequencies with TfidfVectorizer

Word counts are a good starting point, but are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

  • Term Frequency: This summarizes how often a given word appears within a document.
  • Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.
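The two routes can be compared with a short sketch. The three small documents are assumptions for illustration. With default settings, counting first and then applying a TfidfTransformer should give the same scores as using a TfidfVectorizer directly.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Three small documents (assumed for illustration)
text = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

# Route 1: learn counts first, then reweight them with TF-IDF
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(text)
tfidf_transformer = TfidfTransformer()
tfidf_from_counts = tfidf_transformer.fit_transform(counts)

# Route 2: do both steps at once
tfidf_direct = TfidfVectorizer().fit_transform(text)

# Check whether the two routes agree
same = np.allclose(tfidf_from_counts.toarray(), tfidf_direct.toarray())
print(same)
```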

The same create, fit, and transform process is used as with the CountVectorizer.

Below is an example of using the TfidfVectorizer to learn vocabulary and inverse document frequencies across 3 small documents and then encode one of those documents.
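The listing is not reproduced in this copy of the post, so the following is a minimal sketch. The three documents are assumptions, chosen so that they produce an 8-word vocabulary in which “the” appears in every document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three small documents (assumed for illustration)
text = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

# Create the transform
vectorizer = TfidfVectorizer()

# Tokenize, build the vocabulary, and learn the IDF weightings
vectorizer.fit(text)

# Summarize the vocabulary and the learned inverse document frequencies
print(vectorizer.vocabulary_)
print(vectorizer.idf_)

# Encode only the first document
vector = vectorizer.transform([text[0]])

# Summarize the encoded vector
print(vector.shape)
print(vector.toarray())
```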

A vocabulary of 8 words is learned from the documents and each word is assigned a unique integer index in the output vector.

The inverse document frequencies are calculated for each word in the vocabulary, assigning the lowest score of 1.0 to the most frequently observed word: “the” at index 7.

Finally, the first document is encoded as an 8-element sparse array and we can review the final scorings of each word, with different values for “the”, “fox”, and “dog” from the other words in the vocabulary.

By default, the scores for each document are scaled so that the vector has unit length (L2 norm), which puts each value between 0 and 1. The encoded document vectors can then be used directly with most machine learning algorithms.
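For readers who do want to see the arithmetic, here is a hand calculation using NumPy only. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The document frequencies and counts are assumed values for a 3-document corpus in which “the” appears in every document and “dog” and “fox” in two each.

```python
import numpy as np

# Assumed vocabulary order: brown, dog, fox, jumped, lazy, over, quick, the
n_docs = 3
doc_freq = np.array([1, 2, 2, 1, 1, 1, 1, 3])  # documents containing each word

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(idf)

# Raw counts for a first document in which "the" occurs twice
counts = np.array([1, 1, 1, 1, 1, 1, 1, 2])

# Term frequency times IDF, then scale the vector to unit length (L2 norm)
tfidf = counts * idf
tfidf_normalized = tfidf / np.linalg.norm(tfidf)
print(tfidf_normalized)
```

This shows why “the” can end up with the largest value in the encoded document even though it has the lowest IDF: it occurs twice in that document, so its term frequency offsets the penalty.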

Hashing with HashingVectorizer

Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrarily long fixed-length vector. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).

The HashingVectorizer class implements this approach. It tokenizes documents and consistently hashes each word to a position in the output vector, encoding documents as needed.

The example below demonstrates the HashingVectorizer for encoding a single document.
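The listing is not reproduced in this copy of the post, so here is a minimal sketch. The sample sentence is an assumption, and n_features=20 matches the vector size discussed in the text.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One sample document (an assumed sentence)
text = ["The quick brown fox jumped over the lazy dog."]

# Create the transform with a fixed output length of 20
vectorizer = HashingVectorizer(n_features=20)

# Encode the document directly; no call to fit() is needed
vector = vectorizer.transform(text)

# Summarize the encoded vector
print(vector.shape)
print(vector.toarray())
```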

An arbitrary fixed-length vector size of 20 was chosen. This corresponds to the range of the hash function, where small values (like 20) may result in hash collisions. Remembering back to compsci classes, I believe there are heuristics that you can use to pick the hash length and probability of collision based on estimated vocabulary size.

Note that this vectorizer does not require a call to fit on the training data documents. Instead, after instantiation, it can be used directly to start encoding documents.

Running the example encodes the sample document as a 20-element sparse array.

The values of the encoded document correspond to normalized word counts by default in the range of -1 to 1, but could be made simple integer counts by changing the default configuration.
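As a sketch of that configuration change, setting norm=None turns off the normalization and alternate_sign=False turns off the sign flipping that produces negative values. The sample sentence is again an assumption.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# One sample document (an assumed sentence with 9 word tokens)
text = ["The quick brown fox jumped over the lazy dog."]

# Disable normalization and sign alternation to get plain counts per hash bucket
vectorizer = HashingVectorizer(n_features=20, norm=None, alternate_sign=False)
vector = vectorizer.transform(text)

# Each position holds the number of tokens that hashed to it
counts = vector.toarray()[0]
print(counts)
```

Words that collide in the same bucket have their counts added together, so the values sum to the number of tokens in the document.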

Summary

In this tutorial, you discovered how to prepare text documents for machine learning with scikit-learn.

We have only scratched the surface in these examples and I want to highlight that there are many configuration details for these classes to influence the tokenizing of documents that are worth exploring.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

119 Responses to How to Encode Text Data for Machine Learning with scikit-learn

  1. Avatar
    Kinjal September 29, 2017 at 2:28 pm #

    Can you tell how to proceed in R for machine learning and feature selection!

  2. Avatar
    Jarbas September 29, 2017 at 11:49 pm #

    Hi, Dr. Brownlee,

    Congratulations for this great article. Do you know any technique to parse in a smart way HTML documents (DOM) to work with ANNs? This guys in this paper do that (http://proceedings.mlr.press/v70/shi17a/shi17a.pdf) but they didn’t specify how.

    Thank you so much!

    • Avatar
      Jason Brownlee September 30, 2017 at 7:42 am #

      No sorry, it is not something I have worked on.

    • Avatar
      Nandish April 24, 2019 at 6:24 pm #

      Hi did you get to know how to parse DOM to work with ANN

  3. Avatar
    John Stec October 2, 2017 at 3:48 am #

    Was running this code and encountered an error

    ^
    SyntaxError: invalid syntax

    Is this because I’m using Python 2.7?

  4. Avatar
    Advait Vasavada October 4, 2017 at 3:43 am #

    Hello Sir,
    How can ML be used to carry out survey based research?

    • Avatar
      Jason Brownlee October 4, 2017 at 5:48 am #

      I don’t know. Perhaps used in the analysis in some way?

  5. Avatar
    Rahul October 28, 2017 at 2:13 am #

    Hello Sir,

    I am trying to classify a delivery address as residential and commercial based on some numerical features weight ,qty , some derived features ,along with text address data as input. Should I apply countvectorizer or tdif on the text address data for converting numerical features. Or any other methods. I planning to use decision tree classifier.

  6. Avatar
    Ahmed November 7, 2017 at 3:07 am #

    Many thanks to you. I am working on my project and I extract data from tags of html web pages.
    I need to assign word in each tag to be feature. for example play in title tag not the same with play in header tag or play in anchor tag. any idea ?

    • Avatar
      Jason Brownlee November 7, 2017 at 9:52 am #

      Perhaps start with a bag of words model and perhaps move on to word embedding + neural net to see if it can do better.

      • Avatar
        Santhosh January 6, 2020 at 9:01 pm #

        Hi sir can u please explain why we use this line(from sklearn. modelselection traindata….,) with a good explanation?

  7. Avatar
    Daniel November 26, 2017 at 3:41 am #

    Hi Dr.Jason,
    How can i adding new columns (or features) to the current vector ? For ex: adding vector of number of wrong spelling words from each document.

    Thanks

    • Avatar
      Jason Brownlee November 26, 2017 at 7:34 am #

      You might want to keep the document representation separate from that information and feed them as two separate inputs to a model (e.g. a multi-input model in Keras).

  8. Avatar
    Vivek November 26, 2017 at 9:12 pm #

    I have a query. I have a a cluster of text files containing some topics of similar interest.
    I want to input these docs as text files to sklearn tools using python.

    Can you please tell me the process

    • Avatar
      Jason Brownlee November 27, 2017 at 5:50 am #

      The process is the blog post above, does that not help?

  9. Avatar
    Aman December 19, 2017 at 2:26 pm #

    This is a great article. Helped me a lot in my project. I have a followup question: What if I want a vector for all the documents present? Is there a more efficient way other than a for loop like this:

    for j in range(len(docs)):
        vector = vectorizer.transform([docs[j]])

    I hope to convert it into a massive dataset to present it to a ML algorithm

    • Avatar
      Jason Brownlee December 19, 2017 at 4:00 pm #

      The transform() function can take the entire list of documents in a single call, I believe.
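A minimal sketch of encoding every document in one call, without a loop. The docs list is an assumed example; the result is a sparse matrix with one row per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# A small list of documents (assumed for illustration)
docs = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

vectorizer = CountVectorizer()
vectorizer.fit(docs)

# transform() accepts the whole list and returns one row per document
matrix = vectorizer.transform(docs)
print(matrix.shape)
print(matrix.toarray())
```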

  10. Avatar
    Russ Reinsch January 2, 2018 at 10:55 am #

    How come the words lazy and dog both scored 1.28768207, which is in between the low score of 1.0 that was assigned to the word “the” and the higher 1.69314718 score for the word fox; but

    after the scores are normalized to values between 0 and 1, the assigned scores do not follow the same pattern… the values for lazy and dog [0.276] are no longer in between the values for the words “fox” [0.363] and “the” [0.429]; the values for lazy and dog are at one end of the range of scores.

    For that matter, how did the encoder decide to assign different scores to dog and fox in the first place, when they both occur the same number of times?

    Thank you for sharing your knowledge the way you do.

    • Avatar
      Jason Brownlee January 2, 2018 at 4:00 pm #

      Good question Russ.

      I would recommend checking the references and reading up on the calculation of TF/IDF.

    • Avatar
      Tam September 9, 2018 at 9:39 pm #

      Russ –

      The words scored 1.28768207 are ‘dog’ and ‘fox’. The tokeniser starts counting from 0.

      The reason the pattern breaks in the encoded vector is that it is only for the first document, notice [text[0]]. Within that document, the word ‘the’ occurs twice unlike all other words, so even though it is common between documents and is penalised for that, it is also more common within that document so gets points for that. A couple of things to play around with to see how this is working… First, vary the index to see how the feature vector for a different document would work by doing [text[1]] for example. Second, put another ‘the’ in the first document, or take one away, and see what happens.

  11. Avatar
    Jesús Martínez January 26, 2018 at 3:54 am #

    Very good article. Do you think that stemming the vocabulary before applying any of these techniques would yield a better performance?

    • Avatar
      Jason Brownlee January 26, 2018 at 5:45 am #

      It may as it would reduce the vocab size, try it and see.

  12. Avatar
    punita February 8, 2018 at 9:05 pm #

    hello ….
    i need to ask u a question…
    i am thinking of working on “Tweet Sentiment Analysis for Cellular Network Service Providers using Machine Learning Algorithms”…could u please help me …is it possible to work on such data…. i have fetched 2000 tweets from twitter and
    i am facing problem in feature extraction n i am not deciding what could be the features for such data….please help if u could…..

  13. Avatar
    Ravi Shankar February 19, 2018 at 9:01 pm #

    The position of word in the text ? Is it not important? How is it taken care in the vector representation?

    Otherwise two texts with same words and frequency will result in same vector, if the order of words is different

    Since I am new to ML, please help me understand

    • Avatar
      Jason Brownlee February 21, 2018 at 6:26 am #

      It is for some models (like LSTMs) and not for others (like bag of words).

  14. Avatar
    Shiva March 9, 2018 at 2:54 am #

    Hi Jason,

    I have a question regarding HashingVectorizer(). How does one do online learning with it? Does it learn the vocabulary of the text like tfidf does. Also, what happens when a new text containing some unseen words come in. What do I do with that? I am suppose to wait for a bunch of data, and call transform() on HashingVectorizer() and feed it the new text samples. Any references/videos to online learning of text documents with “Multi-Label” output would be great.

    • Avatar
      Jason Brownlee March 9, 2018 at 6:25 am #

      Ideally the vocab would be defined up front.

      Otherwise, you could use it in an online manner, as long as you set an expectation on the size of the vocab, to ensure the hash function did not have too many collisions.

  15. Avatar
    Aykut Yararbas March 22, 2018 at 10:04 am #

    Very nice article. Thanks.

  16. Avatar
    Shahbaz Wasti March 25, 2018 at 10:43 pm #

    Hi Jason,

    First of all thank you for this great article to learn about TF/IDF practically. I have one question about it. How can I get the list of terms in the vocabulary with their relevant document frequency. In your example, “The” has appeared in all three documents so DF for “The” is 3, similarly DF is 2 for “Dog” and “Fox”.

    Best Regards

  17. Avatar
    Jack Smith April 5, 2018 at 12:52 pm #

    Love your articles Jason!

    I have a question related to turning a list of names into vectors and using them as a feature in a classifier. I am not sure which method to use, but I was thinking that Hashing would be appropriate. My issue is that with the number of possible names being high, wouldn’t this create very sparse vectors that my classifier would have difficulty learning from? Also is there another method I should consider for this case?

    • Avatar
      Jason Brownlee April 5, 2018 at 3:14 pm #

      Perhaps contrast a bag of words to a word2vec method to see what works best for your specific problem.

  18. Avatar
    Kamil May 13, 2018 at 11:24 pm #

    Hi Jason,

    First of all thank you for this great article. I have two questions:

    1. Supposing that we have a dataset similar to data from Kaggle: https://www.kaggle.com/aaron7sun/stocknews/data. How can we deal with some ‘N/A’ data?
    2. Second question is about stop words. I want to use NLTK to delete stop words from text, but unfortunatelly NLTK doesn’t has a polish words. How can I use my own dictionary?

    Regards,
    Kamil

  19. Avatar
    Ravi Shankar May 20, 2018 at 5:46 pm #

    For the example in HashVectorizer and countvectorizer, I am not able to understand the how the values in the vector are arrived at. I understand tf and idf. Can you illustrate atleast for one term, how the value is computed. In the example of hashvectorizer, there is one only document.

    • Avatar
      Jason Brownlee May 21, 2018 at 6:28 am #

      The hash method uses a hash function from string to int.

      The count uses an occurrence count in the document for each word.

  20. Avatar
    Nicolas June 18, 2018 at 12:37 pm #

    Great article! I do have a question. Let’s say I have a dataset with both numeric and text elements. I only want to apply TF-IDF (for example) to my text column, and then append it to my dataset so that I can train with my numerical and categorical data (that now it’s transformed) .

    Example:

    col 1 col 2
    this is a text 4.5
    this is also a text 7.5

    I only want to apply TF-IDF to my col 1, to that I can then use a ML Algorithm with both col 1 and col2.

    Result:

    col 1 col2
    (TF-IDFResult) 4.5
    (TF-IDFResult) 7.5

    How do you achieve this?
    Thanks!

    • Avatar
      Jason Brownlee June 18, 2018 at 3:12 pm #

      Perhaps try two models and ensemble their predictions.

      Perhaps try a neural net with two inputs.

    • Avatar
      kessia June 29, 2021 at 4:43 am #

      I have this same question.
      I was wondering if you could separate the data (apply the TF-IDF in col 1) and then concat the result with col 2 as part of preparing input for the model
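One way to sketch that concatenation is with scipy.sparse.hstack, which joins the TF-IDF matrix and the numeric column side by side. The two rows are taken from the example in the question above; the variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# col 1 (text) and col 2 (numeric) from the example in the question
texts = ["this is a text", "this is also a text"]
numeric = np.array([[4.5], [7.5]])

# Apply TF-IDF to the text column only
vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(texts)

# Append the numeric column to the sparse TF-IDF features
combined = hstack([text_features, csr_matrix(numeric)]).tocsr()
print(combined.shape)
print(combined.toarray())
```

The combined matrix can then be passed to a model as a single input. The default tokenizer drops one-character tokens such as “a”, so the text contributes 4 columns here.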

  21. Avatar
    Nil July 5, 2018 at 3:25 pm #

    Hi DR. Jason,

    This is a very good post, I was looking for an explanation like this. I liked so much it cleared me many points.

    I have a doubt because I want to load text from many files to produce a Bag-of-Words, the doubt is:
    If I have many text files (with two classes or categories) to produce a single Bag-of-Words I should load them all together separately? or join all text in a single text file and load a single text file whit all text and then produce the Bag-of-Words?

    Best regards.

    • Avatar
      Jason Brownlee July 6, 2018 at 6:38 am #

      Thanks.

      Perhaps you load all files into memory then prepare the BoW model?

      • Avatar
        Nil July 7, 2018 at 2:18 am #

        Thank you I will try that.

  22. Avatar
    Oliver July 12, 2018 at 1:03 pm #

    Hello Jason, thank you for your great article first and I learnt a lot from that!

    Now I am dealing with a log analysis problem. In this problem, the order of words is a very important feature, for example log content like ‘No such file or directory’, some words always come together in some order.

    My question is, what kinds of feature extraction methods can I use to encode such order information?

    Thank you very much!

  23. Avatar
    Sun August 2, 2018 at 10:01 pm #

    How can we read a folder of text documents and apply the steps mentioned in the article to it, especially resumes?

    • Avatar
      Jason Brownlee August 3, 2018 at 6:02 am #

      Start by reading one file, and expand from there.

  24. Avatar
    Jun August 16, 2018 at 5:59 am #

    two question about the Bag of Words have obsessed me for a while.,

    first question is my source file has 2 columns, one is email content, which is text format, the other is country name(3 different countries) from where the email is sent, and I want to label if the email is Spam or not, here the assumption is the email sent from different countries also matters if email is spam or not. so besides the bag of words, I want to add a feature which is country, the question is that is there is way to implement it in sklearn.

    The other question is besides Bag of Words, what if I also want to consider the position of the words, for instance if word appears in first sentence, I want to lower its weight, if word appears in last sentence, I want to increase its weight, is there a way to implement it in sklearn.

    Thanks.

    • Avatar
      Jason Brownlee August 16, 2018 at 6:16 am #

      Country would be a separate input, perhaps one hot encoded. You could concat with the bow feature vector as part of preparing input for the model.

      A sequence prediction method can handle the position of words, e.g. LSTM.

  25. Avatar
    ravi October 4, 2018 at 7:49 pm #

    Hello Dr. Jason,
    I have set of PDF documents, I would like to read the contents of PDF documents and use certain paragraph of the PDF to provide answer through interactive bot. How Can I make bot learn about the text contents? Please let help me here.

    • Avatar
      Jason Brownlee October 5, 2018 at 5:32 am #

      Sounds like a very challenging research problem, not something trivial I can answer in a blog comment, sorry.

  26. Avatar
    vengama naidu October 15, 2018 at 5:42 am #

    nice work sir,
    how can I extract specific text like name,date,value from pdf file

  27. Avatar
    Nishad Tupe October 31, 2018 at 1:54 pm #

    Good Explanation! Are you referring to documents here is can be data frame/list containing bag of words (sentences /tweets).
    Also in all your books, where one should start especially if we have Python programming knowledge.
    Can we start with Master Machine Learning Algorithms or Machine Learning with Python or Linear Algebra (least preferred) as we may not get time?

  28. Avatar
    krenovut November 4, 2018 at 3:52 am #

    Hello Jason,
    Can you help me? I have a big dataset, about 4 millions texts. And I can not use clear tf-idf. So, firstly, i use hashing. But, I also need to use tf-idf. How can I concat 2 methods? It’s possible to give sparse matrix in tf-idf instead of corpus?

    • Avatar
      Jason Brownlee November 4, 2018 at 6:29 am #

      Perhaps you can use progressive loading and manually build the tf-idf from your dataset?

  29. Avatar
    ankitha November 14, 2018 at 1:40 am #

    Hi, I am using data set from yahoo and it is in csv format. I have tokenized the data set and stemming, stop words are removed. I have generated vector values using skip gram model. Now how to convert variable cardinality vector to fixed length vector? Please help me in this

  30. Avatar
    Fataleagle January 20, 2019 at 7:15 am #

    Hello Sir,

    I want to extract emotions associated with tweets for that I used Tfid Vectorizer to transform each word into a number but Tweets are going to vary in length so the length of each vector array also will vary so how can I apply that to my model let say suppose to SVM?

    Thank you

    • Avatar
      Jason Brownlee January 21, 2019 at 5:28 am #

      If you use a vectorized representation, you will have a fixed sized vocab, but the length of the tweets can vary no problem.

  31. Avatar
    sky January 20, 2019 at 7:14 pm #

    hello. how to do train and test text data??
    I want a sample code

  32. Avatar
    Donni February 1, 2019 at 6:21 pm #

    Hello Jason,
    thank you for the nice tutorial.

    I’m doing a text classification task to classify 10 different categories. I have a training data set that contains text documents for each category. I have in hand list of predefined features (words/terms) for each category, I want to add these features into the training dataset to be the top features during the training of ML model.

    Do you have any idea how can I do that?

    Thank you,

    • Avatar
      Jason Brownlee February 2, 2019 at 6:12 am #

      Yes, you can remove all other words except the ones you want to include, then model the classification problem.

  33. Avatar
    Ravi February 2, 2019 at 8:07 pm #

    Hi Jason,

    What is Difference between Co-variance and Co-occurrence matrix, How can we obtain Co-occurrence matrix from the text corpus/text data in python

    Thanks

    • Avatar
      Jason Brownlee February 3, 2019 at 6:16 am #

      Good question, I don’t have an example of a co-occurrence matrix, sorry.

  34. Avatar
    Jane April 12, 2019 at 9:41 am #

    After applying the vectorization on train and test sample using TfidfVectorizer, the train and test samples will have different number of features. So when you apply the trained model to make predictions on the validation sample, it will complain the validation sample doesn’t have enough features. What is your advice to solve this issue? thanks!!!!

    • Avatar
      Jason Brownlee April 12, 2019 at 2:41 pm #

      You must use the same vocab to prepare both datasets, e.g. fit on train and transform train and test.
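A minimal sketch of that pattern, with assumed train and test documents: the vectorizer is fit on the training data only, and the same fitted object transforms both sets, so both end up with the same number of features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed example documents
train_docs = ["The quick brown fox jumped over the lazy dog.", "The dog."]
test_docs = ["The fox and the cat."]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training data only
X_train = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer; unseen words ("and", "cat") are ignored
X_test = vectorizer.transform(test_docs)

print(X_train.shape, X_test.shape)
```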

  35. Avatar
    tahere April 13, 2019 at 2:44 pm #

    your tutorials is very helpful

  36. Avatar
    Sehaba95 April 15, 2019 at 6:06 am #

    Thank you so much Jason, your tutorial helped me a lot and I wish that you will keep sharing more and thank you so much again

  37. Avatar
    Mostafa wagih eltazy May 7, 2019 at 9:16 am #

    Thank you so much for the help

    i wanted to ask two questions based on my work, I am currently trying to pre-process java source code with known labels and feed the output to my model for a classification problem,
    can i use this techniques for pre-processing on java source code or is it there something more suitable for my work
    and if it’s possible on which basis should i select a proper vectorizer for my reserch

    again thank you so much for the article

    • Avatar
      Jason Brownlee May 7, 2019 at 2:28 pm #

      I don’t know much about pre-process code for modeling, sorry.

      Perhaps check papers on the topic to see what is common.

  38. Avatar
    Maysoon alkhair May 21, 2019 at 8:56 pm #

    Your explanation is very clear I liked it. I have a question how I can apply this on a text file include an Arabic dataset? Can you help me?

    • Avatar
      Jason Brownlee May 22, 2019 at 8:04 am #

      Good question, I don’t have examples of working with Arabic text, sorry.

  39. Avatar
    shahzaib September 4, 2019 at 1:16 am #

    thanks alot bro good work

  40. Avatar
    Soheb December 6, 2019 at 4:12 pm #

    I am beginner to Data Science (machine learning). In my recent course i used anaconda-paython and used few algo for machine learning.

    Currently i am Fullstack developer (.Net, Angular, sql, firebase etc.)

    I want to upgrade myself for data science-machine learning.

    Can you please suggest any tutorial for the beginner?

  41. Avatar
    Meenah December 26, 2019 at 9:26 am #

    Really help, thanks.

    My question is how can you insert a large corpus instead of the small text documents you used as an example.

    • Avatar
      Jason Brownlee December 27, 2019 at 6:28 am #

      You can use the same code, and load your large corpus.

      Perhaps I don’t follow your question?

      • Avatar
        uzma June 15, 2020 at 6:45 pm #

        sir how we can load large corpus for countvectorizer?

        • Avatar
          Jason Brownlee June 16, 2020 at 5:36 am #

          The same as a small corpus. Perhaps I don’t understand your problem?

  42. Avatar
    Riad January 12, 2020 at 4:56 pm #

    Thank you very much, can TFIDF be applied to documents that have been applied to Hash trick and why are there negative numbers in feature Vectors

  43. Avatar
    TRAN January 21, 2020 at 3:00 pm #

    I have a question with regard to creating an instance of the CountVectorizer class by using:

    vectorizer = CountVectorizer(tokenizer=word_tokenize)

    Could you please clarify the meaning of “tokenizer=word_tokenize” .
    What is the difference between vectorizer = CountVectorizer and vectorizer = CountVectorizer(tokenizer=word_tokenize)

    I’m so grateful for your advice.

  44. Avatar
    Despina M January 22, 2020 at 7:39 am #

    Hello, another great article.

    I want to ask, is it possible to change the size of vocabulary in TfidfVectorizer example?

    Thank you in advance

    • Avatar
      Jason Brownlee January 22, 2020 at 1:53 pm #

      Thanks.

      Yes, it takes the size of the vocab from the training data.
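If the goal is to cap the vocabulary rather than accept whatever the training data produces, one option is the max_features argument, which keeps only the most frequent terms across the corpus. A minimal sketch with assumed documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three small documents (assumed for illustration)
text = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox",
]

# Keep only the 3 most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=3)
vectorizer.fit(text)

print(sorted(vectorizer.vocabulary_))
print(vectorizer.transform([text[0]]).shape)
```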

  45. Avatar
    zahrabk January 30, 2020 at 5:30 am #

    Hi Dr.
    In line with what you said on your website, for any text classification task we need a way to represent our text as a vector so that it can be used as input to a machine learning algorithm.
    I know that we can use methods such as BoW or Word2vec, but my question is how should we represent our features as vectors? Say I want to do sentiment analysis and have a feature such as an emoticon dictionary; how should I represent it as a vector?
    With word2vec?
    Could you please help me?

    • Avatar
      Jason Brownlee January 30, 2020 at 6:59 am #

      Each emoticon would be a “word” that could appear in text. Represented like other words in your dataset.
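
      One caveat: the default `token_pattern` in CountVectorizer only keeps tokens of two or more alphanumeric characters, so emoticons such as ":)" are dropped. A minimal sketch of one workaround, assuming that splitting on whitespace is acceptable for your data (the example documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents containing emoticons.
docs = ["great movie :)", "bad movie :("]

# Treat every run of non-whitespace characters as a token,
# so ":)" and ":(" survive tokenization as ordinary "words".
vectorizer = CountVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)
print(X.toarray())
```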

  46. Avatar
    Monarch119 February 21, 2020 at 2:58 am #

    Hi, nice article.
    I have a doubt:
    What is the exact use of the fit() function?

    • Avatar
      Jason Brownlee February 21, 2020 at 8:26 am #

      Fits the model.

      Specifically, it runs an optimization algorithm to find model parameters that best capture the relationship between inputs and outputs in your dataset.

      • Avatar
        Monarch119 February 21, 2020 at 6:26 pm #

        So in this case, vectorizer.fit(text) is creating a dictionary, right?
        How does that help the transform() function?

        • Avatar
          Jason Brownlee February 22, 2020 at 6:20 am #

          Yes, it defines the vocab. Anytime transform() is performed it is limited to the vocab seen when fit was called.
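
          A small sketch of that behaviour, using the tutorial's example sentence for fitting and an invented sentence for transforming:

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["The quick brown fox jumped over the lazy dog."]
new_docs = ["the puppy chased the fox"]  # "puppy" and "chased" were never seen

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)  # learns the vocabulary: token -> column index

# transform() only counts tokens that are in the learned vocabulary;
# unseen tokens are silently ignored.
encoded = vectorizer.transform(new_docs).toarray()

print(vectorizer.vocabulary_)
print(encoded)
```

          The encoded vector has one column per vocabulary entry, so only "the" and "fox" contribute counts here.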

  47. Avatar
    zahrabk February 25, 2020 at 11:34 pm #

    Thank you for your practical blog. I’m somewhat confused. Please help!
    In line with what you said: “Then the words need to be encoded as integers or floating-point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).”
    I know that one popular way is BoW, with which we can have a numeric representation, right?
    Now my question is: imagine I want to do some text classification and one of my features is the number of emoticons. Obviously, with some code, I can find every sentence that has emoticons; the problem is how can I use this feature? In other words, by what method can I represent this feature numerically so that it is suitable to feed to an ML algorithm?

    • Avatar
      Jason Brownlee February 26, 2020 at 8:21 am #

      Yes, you can have these static features alongside the numeric representations of the text. They are just more features. The model cannot tell the difference.
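
      As a rough sketch of what this could look like, the extra feature can be appended as one more column next to the bag-of-words columns. The documents and the `EMOTICONS` tuple below are hypothetical, and `scipy.sparse.hstack` is just one way to join the columns.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents and emoticon list, for illustration only.
docs = ["great movie :) :)", "bad movie :(", "ok movie"]
EMOTICONS = (":)", ":(")

# Static feature: number of emoticons in each document (one column).
emoticon_counts = np.array(
    [[sum(doc.count(e) for e in EMOTICONS)] for doc in docs]
)

# Bag-of-words features for the same documents.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(docs)

# Place the static feature next to the text features.
X = hstack([X_text, csr_matrix(emoticon_counts)]).tocsr()

print(X.shape)
print(X.toarray())
```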

  48. Avatar
    Al May 4, 2020 at 11:48 am #

    Hi!

    Great post. Is there a way to use CountVectorizer on a corpus which is a list and not a string?

  49. Avatar
    Dan May 29, 2020 at 10:38 pm #

    Hey Jason, Great article.

    I was wondering if you could demonstrate how to combine NLP with time series analysis (ARIMA model) as you presented in this article? https://machinelearningmastery.com/make-sample-forecasts-arima-python/

    There are not many resources online that discuss this, and many errors come from using CountVectorizer with ARIMA.

  50. Avatar
    Dr. Brownlee's #1 Fan July 7, 2020 at 7:50 am #

    In the CountVectorizer section, the code includes “print(vectorizer.vocabulary_)”, which returns a dictionary of the tokens and their respective indices in the array. See below.

    {‘dog’: 1, ‘fox’: 2, ‘over’: 5, ‘brown’: 0, ‘quick’: 6, ‘the’: 7, ‘lazy’: 4, ‘jumped’: 3}

    It seems that the tokens are assigned an index in alphanumeric order. Is there any other significance to the order?

  51. Avatar
    Alan Jose Tom February 10, 2021 at 8:28 pm #

    How can we pass two columns as the ‘X’ when preparing the model?

    • Avatar
      Jason Brownlee February 11, 2021 at 5:54 am #

      Perhaps prepare each variable separately then concat the results prior to modeling?
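
      A minimal sketch of that idea, assuming both columns contain text; the column names `titles` and `bodies` and their contents are invented for illustration:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical text columns describing the same two rows.
titles = ["quick fox", "lazy dog"]
bodies = ["the quick brown fox jumped", "the dog slept all day"]

# Fit one vectorizer per column so each keeps its own vocabulary.
title_vec = TfidfVectorizer()
body_vec = TfidfVectorizer()
X_title = title_vec.fit_transform(titles)
X_body = body_vec.fit_transform(bodies)

# Concatenate column-wise: each row keeps the features from both columns.
X = hstack([X_title, X_body]).tocsr()

print(X_title.shape, X_body.shape, X.shape)
```

      The same two fitted vectorizers would then be used with `transform()` on any new data, joined in the same order.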

  52. Avatar
    Shrey Jain August 23, 2021 at 7:53 pm #

    Hi Jason! I am trying to perform tfidf vectorization on my train_x dataframe (4064, 1). But after calling fit_transform(train_x).toarray(), the resulting matrix has size (1,1) which doesn’t make sense. Why is this happening?

    • Avatar
      Adrian Tam August 24, 2021 at 8:32 am #

      Maybe your input matrix is in the wrong shape? Transposed?
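
      One likely cause, sketched below with a made-up three-row dataframe: iterating over a pandas DataFrame yields its column labels rather than its rows, so the vectorizer sees a single "document" (the column name). Passing the text column itself gives one row per document. The column name `text` is an assumption.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for a (n_rows, 1) dataframe of text.
train_x = pd.DataFrame(
    {"text": ["the quick brown fox", "the lazy dog", "the fox"]}
)

# Passing the whole DataFrame: the vectorizer iterates over column labels,
# so the only "document" it sees is the string "text".
wrong = TfidfVectorizer().fit_transform(train_x)

# Passing the column (a Series of strings): one document per row.
right = TfidfVectorizer().fit_transform(train_x["text"])

print(wrong.shape)
print(right.shape)
```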
