How to Develop Word Embeddings in Python with Gensim

Word embeddings are a modern approach for representing text in natural language processing.

Embedding algorithms like word2vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.

In this tutorial, you will discover how to train and load word embedding models for natural language processing applications in Python using Gensim.

After completing this tutorial, you will know:

  • How to train your own word2vec word embedding model on text data.
  • How to visualize a trained word embedding model using Principal Component Analysis.
  • How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.

Let’s get started.

How to Develop Word Embeddings in Python with Gensim. Photo by dilettantiquity, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. Word Embeddings
  2. Gensim Library
  3. Develop Word2Vec Embedding
  4. Visualize Word Embedding
  5. Load Google’s Word2Vec Embedding
  6. Load Stanford’s GloVe Embedding


Word Embeddings

A word embedding is an approach to providing a dense vector representation of words that captures something about their meaning.

Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which produce large, sparse vectors (mostly zero values) that describe documents but not the meaning of the words.

Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.

It is defining a word by the company that it keeps that allows the word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.

The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.

In this tutorial, we are going to look at how to use two different word embedding methods called word2vec by researchers at Google and GloVe by researchers at Stanford.

Gensim Python Library

Gensim is an open source Python library for natural language processing, with a focus on topic modeling.

It is billed as:

topic modelling for humans

Gensim was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies.

It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. Most notably for this tutorial, it supports an implementation of the Word2Vec word embedding for learning new word vectors from text.

It also provides tools for loading pre-trained word embeddings in a few formats and for making use and querying a loaded embedding.

We will use the Gensim library in this tutorial.

If you do not have a Python environment set up, you can use this tutorial:

Gensim can be installed easily using pip or easy_install.

For example, you can install Gensim with pip by typing the following on your command line:
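pip install gensim

Depending on your setup, you may need to use pip3 or install into a virtual environment instead.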

If you need help installing Gensim on your system, you can see the Gensim Installation Instructions.

Develop Word2Vec Embedding

Word2vec is one algorithm for learning a word embedding from a text corpus.

There are two main training algorithms that can be used to learn the embedding from text; they are continuous bag-of-words (CBOW) and skip-gram.

We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.

Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text.

Gensim provides the Word2Vec class for working with a Word2Vec model.

Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance. For example:
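A minimal sketch, using the Gensim 3.x-era API assumed throughout this tutorial (the toy sentences and the min_count=1 setting are only there to keep the snippet runnable on a tiny corpus):

from gensim.models import Word2Vec
# a list of tokenized sentences (normally loaded and prepared from your corpus)
sentences = [['this', 'is', 'a', 'sentence'], ['another', 'tokenized', 'sentence']]
# train a word2vec model on the sentences
model = Word2Vec(sentences, min_count=1)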

Specifically, each sentence must be tokenized, meaning divided into words and prepared (e.g. perhaps pre-filtered and perhaps converted to a preferred case).

The sentences could be text loaded into memory, or an iterator that progressively loads text, required for very large text corpora.

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

  • size: (default 100) The number of dimensions of the embedding, e.g. the length of the dense vector to represent each token (word).
  • window: (default 5) The maximum distance between a target word and words around the target word.
  • min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored.
  • workers: (default 3) The number of threads to use while training.
  • sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

The defaults are often good enough when just getting started. If you have a lot of cores, as most modern computers do, I strongly encourage you to increase workers to match the number of cores (e.g. 8).
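For example, a configured call might look as follows. The argument names are those of Gensim 3.x; in Gensim 4 and later, size was renamed vector_size.

from gensim.models import Word2Vec
# sentences: the list of tokenized sentences prepared as above
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=8, sg=0)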

After the model is trained, it is accessible via the “wv” attribute. This is the actual word vector model in which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:
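A minimal sketch, assuming the Gensim 3.x attribute names (in Gensim 4+ the vocabulary is available as model.wv.key_to_index):

# list the learned vocabulary of tokens
words = list(model.wv.vocab)
print(words)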

You can review the embedded vector for a specific token as follows:
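For example, to print the raw vector for the word 'sentence' (assuming it appears in your training data, as it does in the toy example below):

# access the learned vector for a single word
print(model.wv['sentence'])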

Finally, the vectors from a trained model can be saved to a file by calling the save_word2vec_format() function on the word vector model.

The vectors can be saved in a binary format to save space. For example:
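A minimal sketch, using a placeholder filename:

# export the word vectors in binary word2vec format
model.wv.save_word2vec_format('model.bin', binary=True)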

When getting started, you can save the learned model in ASCII format and review the contents.

You can do this by setting binary=False when calling the save_word2vec_format() function, for example:
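For example, again with a placeholder filename:

# export the word vectors in plain-text (ASCII) word2vec format
model.wv.save_word2vec_format('model.txt', binary=False)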

A file saved in this word2vec format can be loaded again by calling the KeyedVectors.load_word2vec_format() function; a full model saved with the save() function is instead reloaded with Word2Vec.load(). For example:
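A minimal sketch, assuming the placeholder filenames used above:

from gensim.models import Word2Vec, KeyedVectors
# reload vectors exported with save_word2vec_format()
word_vectors = KeyedVectors.load_word2vec_format('model.bin', binary=True)
# a full model saved earlier with model.save('model_full.bin') would instead be reloaded with:
# model = Word2Vec.load('model_full.bin')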

We can tie all of this together with a worked example.

Rather than loading a large text document or corpus from file, we will work with a small, in-memory list of pre-tokenized sentences. The model is trained and the minimum count for words is set to 1 so that no words are ignored.

After the model is learned, we summarize, print the vocabulary, then print a single vector for the word ‘sentence‘.

Finally, the model is saved to a file in binary format, loaded, and then summarized.
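A sketch of the complete example is below. The toy sentences and filenames are illustrative, and the vocabulary listing assumes the Gensim 3.x API (use model.wv.key_to_index in Gensim 4+):

from gensim.models import Word2Vec
# define a small, in-memory corpus of pre-tokenized sentences
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train the model; min_count=1 so that no words are ignored
model = Word2Vec(sentences, min_count=1)
# summarize the trained model
print(model)
# summarize the learned vocabulary
words = list(model.wv.vocab)
print(words)
# print the raw vector for one word
print(model.wv['sentence'])
# save the full model in binary format
model.save('model.bin')
# load the model back and summarize it
new_model = Word2Vec.load('model.bin')
print(new_model)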

Running the example prints a summary of the trained model, the learned vocabulary, the 100-element vector for the word ‘sentence‘, and then a summary of the reloaded model.

You can see that with a little work to prepare your text document, you can create your own word embedding very easily with Gensim.

Visualize Word Embedding

After you learn word embedding for your text data, it can be nice to explore it with visualization.

You can use classical projection methods to reduce the high-dimensional word vectors to two dimensions and plot them on a graph.

The visualizations can provide a qualitative diagnostic for your learned model.

We can retrieve all of the vectors from a trained model as follows:
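A minimal sketch, assuming the Gensim 3.x vocabulary attribute:

# retrieve the vectors for every word in the vocabulary as a 2D array
words = list(model.wv.vocab)
X = model.wv[words]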

We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then use matplotlib to plot the projection as a scatter plot.

Let’s look at an example with Principal Component Analysis or PCA.

Plot Word Vectors Using PCA

We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows.
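A minimal sketch:

from sklearn.decomposition import PCA
# fit a 2-component PCA model to the word vectors
pca = PCA(n_components=2)
result = pca.fit_transform(X)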

The resulting projection can be plotted using matplotlib as follows, pulling out the two dimensions as x and y coordinates.
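For example:

from matplotlib import pyplot
# scatter plot of the 2D projection of the word vectors
pyplot.scatter(result[:, 0], result[:, 1])
pyplot.show()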

We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows.
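A crude sketch, without any offsets for the labels:

from matplotlib import pyplot
# scatter plot of the projection, annotated with the words themselves
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()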

Putting this all together with the model from the previous section, the complete example is listed below.
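A sketch of the complete example, under the same Gensim 3.x assumptions as before:

from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define a small, in-memory corpus of pre-tokenized sentences
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]
# train the model; min_count=1 so that no words are ignored
model = Word2Vec(sentences, min_count=1)
# retrieve all of the word vectors
words = list(model.wv.vocab)
X = model.wv[words]
# project the vectors down to 2 dimensions with PCA
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection, annotated with the words
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()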

Running the example creates a scatter plot with the dots annotated with the words.

It is hard to pull much meaning out of the graph given such a tiny corpus was used to fit the model.

Scatter Plot of PCA Projection of Word2Vec Model

Load Google’s Word2Vec Embedding

Training your own word vectors may be the best approach for a given NLP problem.

But it can take a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in finessing the input data and training algorithm.

An alternative is to simply use an existing pre-trained word embedding.

Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the Word2Vec Google Code Project.

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors.

It is a 1.53 gigabyte file. You can download it from here:

Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 Gigabytes.

The Gensim library provides tools to load this file. Specifically, you can call the KeyedVectors.load_word2vec_format() function to load this model into memory, for example:
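A minimal sketch, assuming the unzipped binary file is in the current working directory:

from gensim.models import KeyedVectors
# load the pre-trained Google News vectors (this takes a while and a lot of RAM)
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)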

On my modern workstation, it takes about 43 seconds to load.

Another interesting thing you can do is a little linear algebra arithmetic with words.

For example, a popular example described in lectures and introduction papers is: (king – man) + woman = queen.

That is, the word queen is the closest word given the subtraction of the notion of man from king and the addition of the word woman. The “man-ness” in king is replaced with “woman-ness” to give us queen. A very cool concept.

Gensim provides an interface for performing these types of operations in the most_similar() function on the trained or loaded model.

For example:
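A minimal sketch, using the model loaded above:

# find the single word closest to (king - man) + woman
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)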

We can put all of this together as follows.
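A sketch of the complete example, assuming the downloaded file is in the current working directory:

from gensim.models import KeyedVectors
# load the pre-trained Google News word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)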

Running the example loads the Google pre-trained word2vec model and then calculates the (king – man) + woman = ? operation on the word vectors for those words.

The answer, as we would expect, is queen.

See some of the posts in the further reading section for more interesting arithmetic examples that you can explore.

Load Stanford’s GloVe Embedding

Stanford researchers also have their own word embedding algorithm, similar to word2vec, called Global Vectors for Word Representation, or GloVe for short.

I won’t get into the details of the differences between word2vec and GloVe here, but generally, NLP practitioners seem to prefer GloVe at the moment based on results.

As with word2vec, the GloVe researchers also provide pre-trained word vectors; in this case, there is a great selection to choose from.

You can download the GloVe pre-trained word vectors and load them easily with gensim.

The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:
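A minimal sketch with placeholder filenames. Note that newer Gensim releases can also read GloVe files directly via a no_header option on load_word2vec_format(), so check your version's documentation.

from gensim.scripts.glove2word2vec import glove2word2vec
# convert a GloVe text file into word2vec text format by prepending a header line
glove_input_file = 'glove.txt'                  # path to the downloaded GloVe file (placeholder)
word2vec_output_file = 'glove.txt.word2vec'     # path for the converted output (placeholder)
glove2word2vec(glove_input_file, word2vec_output_file)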

Once converted, the file can be loaded just like the word2vec file above.

Let’s make this concrete with an example.

You can download the smallest GloVe pre-trained model from the GloVe website. It is an 822 megabyte zip file with 4 different models (50-, 100-, 200- and 300-dimensional vectors) trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary.

The direct download link is here:

Working with the 100-dimensional version of the model, we can convert the file to word2vec format as follows:
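For example, assuming the unzipped glove.6B.100d.txt file is in the current working directory:

from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)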

You now have a copy of the GloVe model in word2vec format with the filename glove.6B.100d.txt.word2vec.

Now we can load it and perform the same (king – man) + woman = ? test as in the previous section. The complete code listing is provided below. Note that the converted file is ASCII format, not binary, so we set binary=False when loading.
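A sketch of the complete example, assuming the converted file from the previous step:

from gensim.models import KeyedVectors
# load the converted GloVe vectors (text format, so binary=False)
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)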

Running the example prints the same result of ‘queen’.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Gensim

Posts

Summary

In this tutorial, you discovered how to develop and load word embedding models in Python using Gensim.

Specifically, you learned:

  • How to train your own word2vec word embedding model on text data.
  • How to visualize a trained word embedding model using Principal Component Analysis.
  • How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




65 Responses to How to Develop Word Embeddings in Python with Gensim

  1. hana rashied October 6, 2017 at 4:21 pm #

    thank you for your wonderful tutorial, please sir can I download the pdf of these tutorials

    regards

    • HGeo October 6, 2017 at 8:37 pm #

      Try some web2pdf plug-ins.

    • Jason Brownlee October 7, 2017 at 5:48 am #

      I will release a book on this topic soon.

  2. Anirban October 6, 2017 at 7:26 pm #

    Ran the example; however, I'm not sure what the 25-by-4 vector represents or how the plot should be read.

    • Jason Brownlee October 7, 2017 at 5:53 am #

      The vectors are the learned representation for the words.

      The plot may provide insight into the “natural” grouping of the words; perhaps it would be clearer with a larger dataset.

  3. Alexander October 7, 2017 at 12:11 am #

    Thank you, Jason. Very clear, interesting and igniting.

  4. Rose October 7, 2017 at 5:30 pm #

    Thanks for the great tutorial.
    I have a question: was ‘GoogleNews-vectors-negative300.bin.gz’ trained with the skip-gram or CBOW algorithm?

  5. Klaas October 8, 2017 at 4:33 am #

    Very well structured and easy even for a beginner to follow. Thanks a lot. I highly appreciate your work!

  6. Quang October 10, 2017 at 2:01 pm #

    Hi Jason,

    Thanks for your detailed explanation!

    For the “Scatter Plot of PCA Projection of Word2Vec Model” image, I got a different result compared with yours. I am not sure if there is anything wrong with my code. Actually, I copied the code from yours.

    This is mine
    http://prntscr.com/gvg5jz

    Could you please take a look and give your comment ?

    Thanks a lot !

  7. Natallia Lundqvist October 16, 2017 at 12:36 am #

    Great tutorial, thank you a lot! A question: are there embeddings similar to GloVe but based on languages other than English?

    • Jason Brownlee October 16, 2017 at 5:43 am #

      There may be, I am not aware of them, sorry.

      You can train your own vector in minutes – it is a very fast algorithm.

  8. Athif Shaffy October 17, 2017 at 5:02 am #

    Thanks for the simple explanation 🙂

  9. Rob Hamilton-smith October 23, 2017 at 12:29 am #

    hi,

    I am trying to combine my own corpus with the above Glove embeddings. I haven’t really found a solution/example where I can leverage the GloVe 6b for the known embeddings and then ‘extend’ or train on my own Out of Vocab tokens (These tend to be non language words or machine generated).

    Any help is appreciated.

    Thanks

    • Jason Brownlee October 23, 2017 at 5:48 am #

      Hi Rob, one way would be to load the embeddings into an Embedding layer and tune them with your data while fitting the network.

      Perhaps Gensim lets you update existing vectors, but I have not seen or tried to do that, sorry.

  10. Janne October 27, 2017 at 2:39 am #

    Thanks for the great tutorial (as usual). Are you going to make a post about word embeddings for complete documents using doc2vec and/or Fasttext?
    I am particularly interested in using pre-trained word embeddings to represent documents (~500 words). This allows leveraging huge corpora. However, it’s well known that simple averaging is only as good as (or even worse than) classic BOW in classification tasks. Apparently you could do better by first PCA-transforming the word2vec dictionary (paper “In Defense of Word Embedding for Generic Text Representation”), but so far I haven’t seen anyone else using that trick…

    • Jason Brownlee October 27, 2017 at 5:25 am #

      I hope to cover Fasttext in the future.

      When model skill is the goal, I recommend testing every trick you can find! Let me know how you go.

  11. Guy November 15, 2017 at 6:13 pm #

    This tutorial helped me a lot !!
    Thank you very much!

  12. Deepu November 24, 2017 at 12:39 am #

    Another excellent article. I have been struggling for a long time to get a knack of word2vec & this helped me a lot.

  13. Hritwik Dutta December 1, 2017 at 6:33 am #

    Dear Sir,

    1) I have an 8GB RAM and i5 processor system. How long should it take for the google news corpus to be trained using the model?

    2) In the demo example you described, the dataset was one you entered yourself. If I want to train the model using an existing corpus, like the Brown corpus, how do I process it?
    Or if I have an arbitrary corpus, how do I process it so as to feed it to the word2vec model in a suitable form?

    • Jason Brownlee December 1, 2017 at 7:46 am #

      I don’t know how long it takes to train on the google corpus sorry, perhaps ask the google researchers?

      You should be able to load the learned google word vector in a minute, given sufficient resources.

      To learn your own corpus, tokenize the text the way you need for your application and use the Gensim code above.

  14. dilip December 9, 2017 at 7:00 am #

    Hats off to Jason for a clear and precise explanation of a very complicated topic.

  15. Denzil Sequeira December 30, 2017 at 5:25 am #

    Hey Jason,

    How do I incorporate n-grams into my vocabulary? Does gensim provide a function to do so?

    • Jason Brownlee December 30, 2017 at 5:31 am #

      You can extract them from your dataset manually with a line or two of Python.

      There may be tools in Gensim for this, I’m not aware of them though.

  16. Gagan Vij January 11, 2018 at 9:00 pm #

    Thank you so much for such a detailed explanation

  17. Kishore January 13, 2018 at 10:04 pm #

    Hi Jason, could you give some idea about language modelling for a question-and-answer based model? Thanks in advance.

    • Jason Brownlee January 14, 2018 at 6:36 am #

      Thanks for the suggestion, I hope to cover it in the future.

  18. Gagan Vij January 17, 2018 at 4:16 am #

    I have a small question: the GloVe-based word vectors don’t provide the model.score functionality.

  19. Jagadeesh January 28, 2018 at 8:33 pm #

    How do I convert a sentence into a vector using word2vec (the Google pre-trained model)?

  20. Yazid Bounab January 29, 2018 at 10:31 am #

    Hello, my name is Yazid. I just want to know how to update the GoogleNews model with new words.
    Best regards

    • Jason Brownlee January 30, 2018 at 9:45 am #

      You could train vectors with your new words and take the union of the vectors for the words that interest you.

  21. Vladimir January 31, 2018 at 5:33 pm #

    Thank you, such a great explanation.

    Jason, a quick question please:
    Suppose we are pretraining embeddings ourselves (without glove/google).
    Which of the following would be better?
    1) pretrain using gensim, and feed into the keras embedding layer, trainable=False.
    2) train as part of the neural network in embedding layer?
    Kindly advise.

    • Jason Brownlee February 1, 2018 at 7:16 am #

      These would be pre-trained word embeddings.

      • Vladimir February 4, 2018 at 10:27 am #

        Sorry to tell, but gensim embedding pretraining proved to be actually worse. (In case of a dataset of 1.5 M small texts.)
        So, those randomly initialized weights in case of Embedding layer as part of a NN show much better results.
        My justification would be:
        – It makes sense to use pretrained word embeddings only if using GloVe/Google or such.
        – It makes sense to use pre-trained word embeddings only on dataset with relatively big texts. (i.e. not on tweets or small messages)

        What would be your opinion please?
        Cheers!

  22. Tsuki February 6, 2018 at 3:32 am #

    Very nice tutorial. Could you please let me know if there is a way of getting the sentences that were used in training a model? I have a doc2vec model M and I tried to fetch the list of sentences with M.documents, like one would use M.vector_size to get the size of the vectors.

    Also, having a doc2vec model and wanting to infer new vectors, is there a way to use tagged sentences? I see on gensim page it says:

    infer_vector(doc_words, alpha=0.1, min_alpha=0.0001, steps=5)
    Infer a vector for given post-bulk training document.

    Document should be a list of (word) tokens.

    But I would like to use tagged sentences. Thank you very much!

    • Jason Brownlee February 6, 2018 at 9:20 am #

      I’m not sure I follow, sorry.

      You will have the training data for the model that you can access directly?

      • Tsuki February 6, 2018 at 7:28 pm #

        I downloaded a doc2vec model trained on a wikipedia dump. I was wondering if the model stores the sentences too and if yes, how can I access them. Thank you very much.

  23. Naveen Kumar March 1, 2018 at 4:35 am #

    How to generate word embeddings for a complete review sentence? I mean in word2vec we will get 300dim vector for each word.

    Other than computing the average of vectors of all words in a sentence, any good technique to achieve good representation of vector for sentence.

    • Jason Brownlee March 1, 2018 at 6:16 am #

      I believe there are methods for this. Sorry, I don’t have examples at this stage.

  24. Sourav Maharana March 2, 2018 at 5:15 am #

    Hi Jason,

    For the past few days I had been facing issues with word embeddings using word2vec and GloVe.
    I went through a lot of posts, but it was messy.
    Yesterday I came across this post, and wow! It helped me a lot!
    I also came to know that you have a book, “Deep Learning for Natural Language Processing”. Looks great!
    I am interested in purchasing it!
    But I wanted to connect with you directly to understand whether this book fits my requirements.
    Tasks that I work on include Word2Vec, Doc2Vec, mapping one document to another using embeddings for recommendation, etc.
    I just need a fundamental understanding so that I can take the next step and create my own ideas/analyses.

    • Jason Brownlee March 2, 2018 at 5:39 am #

      The book covers word2vec models and how to use them in deep learning. It does not cover doc2vec.

  25. Ayush March 29, 2018 at 3:58 am #

    I always refer to your site whenever I start a new topic.
    Thank You

  26. Reza M March 29, 2018 at 6:14 pm #

    Can I build that model for the Indonesian language?

  27. M R Abhishek March 30, 2018 at 10:27 pm #

    Nice post. Very easy to understand and very informative.

  28. Angelus April 8, 2018 at 7:20 am #

    I have a question regarding similarity. If I apply word2vec to convert words to vectors, how can I find similarities between words?

    I know I can use the most_similar() function to find the similarity for a specific word, but how can I achieve that among all the words in the documents?

    • Jason Brownlee April 9, 2018 at 5:59 am #

      Perhaps iterate over all words and calculate similarity manually to all other words in the vocab (e.g. write two for-loops).

      Why would you want this?

  29. Angelus April 8, 2018 at 8:13 am #

    One more question.

    Can I apply word2vec to convert unigrams and bigrams to vectors?

    • Jason Brownlee April 9, 2018 at 5:59 am #

      It might be easier to learn a bigram model and a unigram model separately, and if still needed, learn a mapping between them (e.g. a model).

Leave a Reply