How to Develop Word Embeddings in Python with Gensim

Word embeddings are a modern approach for representing text in natural language processing.

Embedding algorithms like word2vec and GloVe are key to the state-of-the-art results achieved by neural network models on natural language processing problems like machine translation.

In this tutorial, you will discover how to train and load word embedding models for natural language processing applications in Python using Gensim.

After completing this tutorial, you will know:

  • How to train your own word2vec word embedding model on text data.
  • How to visualize a trained word embedding model using Principal Component Analysis.
  • How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.

Let’s get started.

Photo by dilettantiquity, some rights reserved.

Tutorial Overview

This tutorial is divided into 6 parts; they are:

  1. Word Embeddings
  2. Gensim Library
  3. Develop Word2Vec Embedding
  4. Visualize Word Embedding
  5. Load Google’s Word2Vec Embedding
  6. Load Stanford’s GloVe Embedding


Word Embeddings

A word embedding is an approach for providing dense vector representations of words that capture something about their meaning.

Word embeddings are an improvement over simpler bag-of-words encoding schemes, such as word counts and frequencies, which result in large, sparse vectors (mostly zero values) that describe documents but say nothing about the meaning of individual words.

Word embeddings work by using an algorithm to train a set of fixed-length dense and continuous-valued vectors based on a large corpus of text. Each word is represented by a point in the embedding space and these points are learned and moved around based on the words that surround the target word.

It is defining a word by the company that it keeps that allows the word embedding to learn something about the meaning of words. The vector space representation of the words provides a projection where words with similar meanings are locally clustered within the space.

The use of word embeddings over other text representations is one of the key methods that has led to breakthrough performance with deep neural networks on problems like machine translation.

In this tutorial, we are going to look at how to use two different word embedding methods called word2vec by researchers at Google and GloVe by researchers at Stanford.

Gensim Python Library

Gensim is an open source Python library for natural language processing, with a focus on topic modeling.

It is billed as:

topic modelling for humans

Gensim was developed and is maintained by the Czech natural language processing researcher Radim Řehůřek and his company RaRe Technologies.

It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling. Most notably for this tutorial, it supports an implementation of the Word2Vec word embedding for learning new word vectors from text.

It also provides tools for loading pre-trained word embeddings in a few formats, as well as for using and querying a loaded embedding.

We will use the Gensim library in this tutorial.

If you do not have a Python environment set up, you can use this tutorial:

Gensim can be installed easily using pip or easy_install.

For example, you can install Gensim with pip by typing the following on your command line:
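The exact invocation depends on your setup (you may need pip3, sudo, or a virtual environment), but a typical command looks like this:

    pip install --upgrade gensim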

If you need help installing Gensim on your system, you can see the Gensim Installation Instructions.

Develop Word2Vec Embedding

Word2vec is one algorithm for learning a word embedding from a text corpus.

There are two main training algorithms that can be used to learn the embedding from text; they are continuous bag-of-words (CBOW) and skip-gram.

We will not get into the algorithms other than to say that they generally look at a window of words for each target word to provide context and in turn meaning for words. The approach was developed by Tomas Mikolov, formerly at Google and currently at Facebook.

Word2Vec models require a lot of text, e.g. the entire Wikipedia corpus. Nevertheless, we will demonstrate the principles using a small in-memory example of text.

Gensim provides the Word2Vec class for working with a Word2Vec model.

Learning a word embedding from text involves loading and organizing the text into sentences and providing them to the constructor of a new Word2Vec() instance. For example:
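A minimal sketch of this step, using a couple of illustrative pre-tokenized sentences and the Gensim API described in this tutorial (later Gensim releases rename some arguments, e.g. size became vector_size):

    from gensim.models import Word2Vec
    # define pre-tokenized training data: a list of sentences, each a list of words
    sentences = [['this', 'is', 'the', 'first', 'sentence'],
                 ['this', 'is', 'the', 'second', 'sentence']]
    # train a new word2vec model on the sentences
    model = Word2Vec(sentences, min_count=1)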

Specifically, each sentence must be tokenized, meaning divided into words and prepared (e.g. perhaps pre-filtered and perhaps converted to a preferred case).

The sentences could be text loaded into memory, or an iterator that progressively loads text, required for very large text corpora.

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

  • size: (default 100) The number of dimensions of the embedding, e.g. the length of the dense vector to represent each token (word).
  • window: (default 5) The maximum distance between a target word and words around the target word.
  • min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored.
  • workers: (default 3) The number of threads to use while training.
  • sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).

The defaults are often good enough when just getting started. If you have a lot of cores, as most modern computers do, I strongly encourage you to increase workers to match the number of cores (e.g. 8).

After the model is trained, it is accessible via the “wv” attribute. This is the actual word vector model, against which queries can be made.

For example, you can print the learned vocabulary of tokens (words) as follows:
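For example (this assumes the Gensim 3.x-style vocab attribute; newer releases expose the vocabulary as model.wv.key_to_index instead):

    # summarize the learned vocabulary
    words = list(model.wv.vocab)
    print(words)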

You can review the embedded vector for a specific token as follows:
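For example, via the wv attribute mentioned above (the word 'sentence' is assumed to be in the training data):

    # access the learned vector for a single token
    print(model.wv['sentence'])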

Finally, a trained model can then be saved to file by calling the save_word2vec_format() function on the word vector model.

The model can be saved in a binary format to save space. For example:
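A sketch with an illustrative filename:

    # save only the word vectors, in binary word2vec format
    model.wv.save_word2vec_format('model.bin', binary=True)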

When getting started, you can save the learned model in ASCII format and review the contents.

You can do this by setting binary=False when calling the save_word2vec_format() function, for example:
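Again with an illustrative filename:

    # save the word vectors in plain-text (ASCII) word2vec format
    model.wv.save_word2vec_format('model.txt', binary=False)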

A model saved with Gensim's native save() function can be loaded again by calling the Word2Vec.load() function, while a file written with save_word2vec_format() is reloaded with KeyedVectors.load_word2vec_format(). For example:
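A sketch of both round trips, with illustrative filenames:

    from gensim.models import Word2Vec, KeyedVectors
    # native Gensim format: save the full model and load it again
    model.save('full_model.bin')
    new_model = Word2Vec.load('full_model.bin')
    # word2vec text format: reload vectors written by save_word2vec_format()
    word_vectors = KeyedVectors.load_word2vec_format('model.txt', binary=False)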

We can tie all of this together with a worked example.

Rather than loading a large text document or corpus from file, we will work with a small, in-memory list of pre-tokenized sentences. The model is trained and the minimum count for words is set to 1 so that no words are ignored.

After the model is learned, we summarize, print the vocabulary, then print a single vector for the word ‘sentence’.

Finally, the model is saved to a file in binary format, loaded, and then summarized.
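The listing below is a minimal reconstruction of that example (the sentences are illustrative; your learned vectors will differ from run to run because training is stochastic):

    from gensim.models import Word2Vec
    # define pre-tokenized training data
    sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
                 ['this', 'is', 'the', 'second', 'sentence'],
                 ['yet', 'another', 'sentence'],
                 ['one', 'more', 'sentence'],
                 ['and', 'the', 'final', 'sentence']]
    # train the model; min_count=1 so that no words are ignored
    model = Word2Vec(sentences, min_count=1)
    # summarize the trained model
    print(model)
    # summarize the vocabulary
    words = list(model.wv.vocab)
    print(words)
    # access the vector for one word
    print(model.wv['sentence'])
    # save the model in Gensim's native binary format, reload it, and summarize
    model.save('model.bin')
    new_model = Word2Vec.load('model.bin')
    print(new_model)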

Running the example prints a summary of the trained model, the learned vocabulary, the 100-element vector for the word ‘sentence’, and a summary of the reloaded model. Note that your specific vector values will differ from run to run because training is stochastic.

You can see that with a little work to prepare your text document, you can create your own word embedding very easily with Gensim.

Visualize Word Embedding

After you learn a word embedding for your text data, it can be nice to explore it with visualization.

You can use classical projection methods to reduce the high-dimensional word vectors to two-dimensional plots and plot them on a graph.

The visualizations can provide a qualitative diagnostic for your learned model.

We can retrieve all of the vectors from a trained model as follows:
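For example, using the indexing style from the worked example above (newer Gensim releases expose the same array directly as model.wv.vectors):

    # retrieve all of the learned vectors as a single NumPy array
    X = model[model.wv.vocab]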

We can then train a projection method on the vectors, such as those methods offered in scikit-learn, then use matplotlib to plot the projection as a scatter plot.

Let’s look at an example with Principal Component Analysis or PCA.

Plot Word Vectors Using PCA

We can create a 2-dimensional PCA model of the word vectors using the scikit-learn PCA class as follows.
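A sketch, assuming X holds the word vectors retrieved above:

    from sklearn.decomposition import PCA
    # fit a 2-component PCA projection of the word vectors
    pca = PCA(n_components=2)
    result = pca.fit_transform(X)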

The resulting projection can be plotted using matplotlib as follows, pulling out the two dimensions as x and y coordinates.
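For example (result is the PCA projection from the previous snippet):

    from matplotlib import pyplot
    # plot the 2D projection as a scatter plot
    pyplot.scatter(result[:, 0], result[:, 1])
    pyplot.show()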

We can go one step further and annotate the points on the graph with the words themselves. A crude version without any nice offsets looks as follows.
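A sketch; it relies on the words appearing in the same order used when the vectors were retrieved:

    # annotate each point with its word
    words = list(model.wv.vocab)
    for i, word in enumerate(words):
        pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))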

Putting this all together with the model from the previous section, the complete example is listed below.
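A reconstruction of the complete example under the same assumptions (illustrative sentences, Gensim 3.x-style API):

    from gensim.models import Word2Vec
    from sklearn.decomposition import PCA
    from matplotlib import pyplot
    # define pre-tokenized training data
    sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
                 ['this', 'is', 'the', 'second', 'sentence'],
                 ['yet', 'another', 'sentence'],
                 ['one', 'more', 'sentence'],
                 ['and', 'the', 'final', 'sentence']]
    # train the model
    model = Word2Vec(sentences, min_count=1)
    # retrieve all of the vectors from the trained model
    X = model[model.wv.vocab]
    # project the vectors down to 2 dimensions with PCA
    pca = PCA(n_components=2)
    result = pca.fit_transform(X)
    # scatter plot of the projection, annotated with the words
    pyplot.scatter(result[:, 0], result[:, 1])
    words = list(model.wv.vocab)
    for i, word in enumerate(words):
        pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
    pyplot.show()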

Running the example creates a scatter plot with the dots annotated with the words.

It is hard to pull much meaning out of the graph given that such a tiny corpus was used to fit the model.

Scatter Plot of PCA Projection of Word2Vec Model

Load Google’s Word2Vec Embedding

Training your own word vectors may be the best approach for a given NLP problem.

But it can take a long time, a fast computer with a lot of RAM and disk space, and perhaps some expertise in finessing the input data and training algorithm.

An alternative is to simply use an existing pre-trained word embedding.

Along with the paper and code for word2vec, Google also published a pre-trained word2vec model on the Word2Vec Google Code Project.

A pre-trained model is nothing more than a file containing tokens and their associated word vectors. The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors.

It is a 1.53 Gigabyte file. You can download it from here:

Unzipped, the binary file (GoogleNews-vectors-negative300.bin) is 3.4 Gigabytes.

The Gensim library provides tools to load this file. Specifically, you can call the KeyedVectors.load_word2vec_format() function to load this model into memory, for example:
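A sketch, assuming the file has been downloaded and unzipped into the working directory:

    from gensim.models import KeyedVectors
    # load the pre-trained Google News vectors (binary word2vec format)
    filename = 'GoogleNews-vectors-negative300.bin'
    model = KeyedVectors.load_word2vec_format(filename, binary=True)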

On my modern workstation, it takes about 43 seconds to load.

Another interesting thing that you can do is a little linear algebra arithmetic with words.

For example, a popular example described in lectures and introduction papers is:
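(king – man) + woman = queen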

That is, the word queen is the closest word when the notion of man is subtracted from king and the word woman is added. The “man-ness” in king is replaced with “woman-ness” to give us queen. A very cool concept.

Gensim provides an interface for performing these types of operations in the most_similar() function on the trained or loaded model.

For example:
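A sketch of the (king – man) + woman query using the standard most_similar() arguments:

    # which word is most similar to king - man + woman?
    result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    print(result)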

We can put all of this together as follows.
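A reconstruction under the same assumptions (the model file sits in the working directory):

    from gensim.models import KeyedVectors
    # load the pre-trained Google word2vec model (this can take a minute or so)
    filename = 'GoogleNews-vectors-negative300.bin'
    model = KeyedVectors.load_word2vec_format(filename, binary=True)
    # calculate: (king - man) + woman = ?
    result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    print(result)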

Running the example loads the Google pre-trained word2vec model and then calculates the (king – man) + woman = ? operation on the word vectors for those words.

The answer, as we would expect, is queen.

See some of the posts in the further reading section for more interesting arithmetic examples that you can explore.

Load Stanford’s GloVe Embedding

Stanford researchers also have their own word embedding algorithm, similar to word2vec, called Global Vectors for Word Representation, or GloVe for short.

I won’t get into the details of the differences between word2vec and GloVe here, but generally, NLP practitioners seem to prefer GloVe at the moment based on results.

As with word2vec, the GloVe researchers also provide pre-trained word vectors, in this case with a great selection to choose from.

You can download the GloVe pre-trained word vectors and load them easily with gensim.

The first step is to convert the GloVe file format to the word2vec file format. The only difference is the addition of a small header line. This can be done by calling the glove2word2vec() function. For example:
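A sketch using the glove2word2vec helper shipped with the Gensim version this tutorial targets (the filenames are placeholders):

    from gensim.scripts.glove2word2vec import glove2word2vec
    # add the word2vec header line to a GloVe-format text file
    glove_input_file = 'glove.txt'
    word2vec_output_file = 'glove.txt.word2vec'
    glove2word2vec(glove_input_file, word2vec_output_file)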

Once converted, the file can be loaded just like the word2vec file above.

Let’s make this concrete with an example.

You can download the smallest GloVe pre-trained model from the GloVe website. It is an 822 Megabyte zip file with 4 different models (50, 100, 200 and 300-dimensional vectors) trained on Wikipedia data with 6 billion tokens and a 400,000-word vocabulary.

The direct download link is here:

Working with the 100-dimensional version of the model, we can convert the file to word2vec format as follows:
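For example (assuming glove.6B.100d.txt has been unzipped into the working directory):

    from gensim.scripts.glove2word2vec import glove2word2vec
    # convert the 100-dimensional GloVe vectors to word2vec format
    glove_input_file = 'glove.6B.100d.txt'
    word2vec_output_file = 'glove.6B.100d.txt.word2vec'
    glove2word2vec(glove_input_file, word2vec_output_file)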

You now have a copy of the GloVe model in word2vec format with the filename glove.6B.100d.txt.word2vec.

Now we can load it and perform the same (king – man) + woman = ? test as in the previous section. The complete code listing is provided below. Note that the converted file is ASCII format, not binary, so we set binary=False when loading.
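A reconstruction of the listing under those assumptions:

    from gensim.models import KeyedVectors
    # load the converted Stanford GloVe model (plain-text word2vec format)
    filename = 'glove.6B.100d.txt.word2vec'
    model = KeyedVectors.load_word2vec_format(filename, binary=False)
    # calculate: (king - man) + woman = ?
    result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
    print(result)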

Running the example prints the same result of ‘queen’.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Gensim

Posts

Summary

In this tutorial, you discovered how to develop and load word embedding models in Python using Gensim.

Specifically, you learned:

  • How to train your own word2vec word embedding model on text data.
  • How to visualize a trained word embedding model using Principal Component Analysis.
  • How to load pre-trained word2vec and GloVe word embedding models from Google and Stanford.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.




97 Responses to How to Develop Word Embeddings in Python with Gensim

  1. hana rashied October 6, 2017 at 4:21 pm #

    thank you for your wonderful tutorial, please sir can I download the pdf of these tutorials

    regards

    • HGeo October 6, 2017 at 8:37 pm #

      Try some web2pdf plug-ins.

    • Jason Brownlee October 7, 2017 at 5:48 am #

      I will release a book on this topic soon.

  2. Anirban October 6, 2017 at 7:26 pm #

    Ran the example, however not sure what the 25 by 4 Vector represents or how the plot should be read.

    • Jason Brownlee October 7, 2017 at 5:53 am #

      The vectors are the learned representation for the words.

      The plot may provide insight to the “natural” grouping of the words, perhaps it would be more clear with a larger dataset.

  3. Alexander October 7, 2017 at 12:11 am #

    Thank you, Jason. Very clear, interest and igniting.

  4. Rose October 7, 2017 at 5:30 pm #

    Thanks for great tutorial
    I’ve question. ‘GoogleNews-vectors-negative300.bin.gz’ implemented by which algorithm skip-gram or CBOW?

  5. Klaas October 8, 2017 at 4:33 am #

    Very structured and even for a beginner to follow. Thanks a lot. I highly appreciate your work!

  6. Quang October 10, 2017 at 2:01 pm #

    Hi Jason,

    Thanks for your detail explanation !

    For the “Scatter Plot of PCA Projection of Word2Vec Model” image, I had different result comparing with yours. I am not sure if there is any wrong with my code. Actually, I copied code from yours.

    This is mine
    http://prntscr.com/gvg5jz

    Could you please take a look and give your comment ?

    Thanks a lot !

  7. Natallia Lundqvist October 16, 2017 at 12:36 am #

    Great tutorial, thank you a lot! A question, is there similar to GloVe embeddings but based on other languages then English?

    • Jason Brownlee October 16, 2017 at 5:43 am #

      There may be, I am not aware of them, sorry.

      You can train your own vector in minutes – it is a very fast algorithm.

  8. Athif Shaffy October 17, 2017 at 5:02 am #

    Thanks for the simple explanation 🙂

  9. Rob Hamilton-smith October 23, 2017 at 12:29 am #

    hi,

    I am trying to combine my own corpus with the above Glove embeddings. I haven’t really found a solution/example where I can leverage the GloVe 6b for the known embeddings and then ‘extend’ or train on my own Out of Vocab tokens (These tend to be non language words or machine generated).

    Any help is appreciated.

    Thanks

    • Jason Brownlee October 23, 2017 at 5:48 am #

      Hi Rob, one way would be to load the embeddings into an Embedding layer and tune them with your data while fitting the network.

      Perhaps Gensim lets you update existing vectors, but I have not seen or tried to do that, sorry.

  10. Janne October 27, 2017 at 2:39 am #

    Thanks for the great tutorial (as usual). Are you going to make a post about word embeddings for complete documents using doc2vec and/or Fasttext?
    I am particularly interested in using pre-trained word embeddings to represent documents (~500 words). This allows leveraging from huge corpuses. However, it’s well known that simple averaging is only as good (or even worse) than classic BOW in classification tasks. Apparently you could do better by first PCA-transforming word2vec dictionary (paper “In Defense of Word Embedding for Generic Text Representation”), but so far I haven’t seen anyone else using that trick…

    • Jason Brownlee October 27, 2017 at 5:25 am #

      I hope to cover Fasttext in the future.

      When model skill is the goal, I recommend testing every trick you can find! Let me know how you go.

  11. Guy November 15, 2017 at 6:13 pm #

    This tutorial helped me a lot !!
    Thank you very much!

  12. Deepu November 24, 2017 at 12:39 am #

    Another excellent article. I have been struggling for a long time to get a knack of word2vec & this helped me a lot.

  13. Hritwik Dutta December 1, 2017 at 6:33 am #

    Dear Sir,

    1) I have an 8GB RAM and i5 processor system. How long should it take for the google news corpus to be trained using the model?

    2) In the demo example you described, your dataset used was in the form you inputted on your own. If I want to train the model using any corpus, how do I process the corpus? Like the Brown corpus?
    Or if I have any arbitrary corpus, how do I process it so as to feed the corpus to the word2vec model in a suitable form for processing?

    • Jason Brownlee December 1, 2017 at 7:46 am #

      I don’t know how long it takes to train on the google corpus sorry, perhaps ask the google researchers?

      You should be able to load the learned google word vector in a minute, given sufficient resources.

      To learn your own corpus, tokenize the text the way you need for your application and use the Gensim code above.

  14. dilip December 9, 2017 at 7:00 am #

    hats off to Jason for clear and pricise explanation of a very complicated topic

  15. Denzil Sequeira December 30, 2017 at 5:25 am #

    Hey Jason,

    How do I incorporate ngrams into my vocabulary. Does gensim provide a function to do so ?

    • Jason Brownlee December 30, 2017 at 5:31 am #

      You can extract them from your dataset manually with a line or two of Python.

      There may be tools in Gensim for this, I’m not aware of them though.

  16. Gagan Vij January 11, 2018 at 9:00 pm #

    Thank you so much for such detailed explaination

  17. Kishore January 13, 2018 at 10:04 pm #

    Hi Jason, Could you give some idea about language modelling for Question , answer based model, Thanks in advance..

    • Jason Brownlee January 14, 2018 at 6:36 am #

      Thanks for the suggestion, I hope to cover it in the future.

  18. Gagan Vij January 17, 2018 at 4:16 am #

    I have small question , te glove based word to vector don’t provide model.score functionaloty

  19. Jagadeesh January 28, 2018 at 8:33 pm #

    how to convert a sentence into vector by using word2vec (google pre-trained).

  20. Yazid Bounab January 29, 2018 at 10:31 am #

    Hello my name is Yazid a just wana know how to update googleNews model with new words
    Best Regard

    • Jason Brownlee January 30, 2018 at 9:45 am #

      You could train vectors with your new works and take the union of the vectors for the words that interest you.

  21. Vladimir January 31, 2018 at 5:33 pm #

    Thank you, such a great explanation.

    Jason, a quick question please:
    Suppose we are pretraining embeddings ourselves (without glove/google).
    Which of the following would be better?
    1) pretrain using gensim, and feed into the keras embedding layer, trainable=False.
    2) train as part of the neural network in embedding layer?
    Kindly advise.

    • Jason Brownlee February 1, 2018 at 7:16 am #

      These would be pre-trained word embeddings.

      • Vladimir February 4, 2018 at 10:27 am #

        Sorry to tell, but gensim embedding pretraining proved to be actually worse. (In case of a dataset of 1.5 M small texts.)
        So, those randomly initialized weights in case of Embedding layer as part of a NN show much better results.
        My justification would be:
        – It makes sense to use pretrained word embeddings only if using GloVe/Google or such.
        – It makes sense to use pre-trained word embeddings only on dataset with relatively big texts. (i.e. not on tweets or small messages)

        What would be your opinion please?
        Cheers!

  22. Tsuki February 6, 2018 at 3:32 am #

    Very nice tutorial. Could you please let me know if there is a way of getting the sentences that were used in training a model? I have a doc2vec model M and I tried to fetch the list of sentences with M.documents, like one would use M.vector_size to get the size of the vectors.

    Also, having a doc2vec model and wanting to infer new vectors, is there a way to use tagged sentences? I see on gensim page it says:

infer_vector(doc_words, alpha=0.1, min_alpha=0.0001, steps=5)
    Infer a vector for given post-bulk training document.

    Document should be a list of (word) tokens.

    But I would like to use tagged sentences. Thank you very much!

    • Jason Brownlee February 6, 2018 at 9:20 am #

      I’m not sure I follow, sorry.

      You will have the training data for the model that you can access directly?

      • Tsuki February 6, 2018 at 7:28 pm #

        I downloaded a doc2vec model trained on a wikipedia dump. I was wondering if the model stores the sentences too and if yes, how can I access them. Thank you very much.

  23. Naveen Kumar March 1, 2018 at 4:35 am #

    How to generate word embeddings for a complete review sentence? I mean in word2vec we will get 300dim vector for each word.

    Other than computing the average of vectors of all words in a sentence, any good technique to achieve good representation of vector for sentence.

    • Jason Brownlee March 1, 2018 at 6:16 am #

      I believe there are methods for this. Sorry, I don’t have examples at this stage.

  24. Sourav Maharana March 2, 2018 at 5:15 am #

    Hi Jason,

    Since past few days i had been facing issue regarding word embedding using word2vec and glove.
    I went through alot of post, but its messy.
    Yesterday i came across this post, and wow! it helped me alot!
    I also came to know that you have a book “Deep Learning for Natural Language Processing”.Looks great!
    I am interested to purchase it!
    But i wanted to connect to you directly to understand whether this book fits my requirement.
    Tasks that I work on like Word2Vec, Doc2Vec, mapping one document to another using embedding for recommendation,etc
    I just need fundamental understanding so that i can take the next step and create by my own idea/analysis.

    • Jason Brownlee March 2, 2018 at 5:39 am #

      The book covers word2vec models and how to use them in deep learning. It does not cover doc2vec.

  25. Ayush March 29, 2018 at 3:58 am #

    I always refer to your site when ever i start new topic.
    Thank You

  26. Reza M March 29, 2018 at 6:14 pm #

    Can I build that model by Indonesian language?

  27. M R Abhishek March 30, 2018 at 10:27 pm #

    Nice post. Very easy to understand and very informative.

  28. Angelus April 8, 2018 at 7:20 am #

    I have questions regards similarity. If I apply w2vec to convert words to vectors. How I can find similarities between words.

    I know I can use most. Similar() function to find similarity for specific word but How I can achieve that among all words in documents.

    • Jason Brownlee April 9, 2018 at 5:59 am #

      Perhaps iterate over all words and calculate similarity manually to all other words in the vocab (e.g. write two for-loops).

Why would you want this?

  29. Angelus April 8, 2018 at 8:13 am #

    One more questions.

    Can I apply w2vec to convert unigrams and bigrams to vectors ?

    • Jason Brownlee April 9, 2018 at 5:59 am #

      It might be easier to learn a bigram model and a unigram model separately, and if still needed, learn a mapping between them (e.g. a model).

  30. jawad May 20, 2018 at 5:02 pm #

    Hello, Jason i have a question about the selection of dimensions of word vectors.let say 300-dimensions they will be same for all words? On which criteria we select these dimensions?

  31. Aiza May 24, 2018 at 2:44 am #

    Hi,
    it might be silly question but the sentences you pass to Word2Vec to train on ,they consist of all the sentences (i.e. train,validate as well as test) right?
    Also when you create an embedding matrix/dictionary for your sentences,it should also contain the words of test sentences?
    Thirdly when evaluating the model,test sentences would also be converted to sequences or integers?
    Thanks

    • Jason Brownlee May 24, 2018 at 8:19 am #

      Yes, sentences are split, then encoded for use in modeling.

      The training dataset should be representative of the broader domain. This applies generally, not just in NLP.

  32. Emna June 6, 2018 at 6:25 pm #

    Thank you for this nice tutorial. is there a way to revert the word embedding transformation ? I am feeding the embedded matrix ‘ X = model[model.wv.vocab]’ to an autoencoder model. I will get also as result a matrix. I want to interpret that matrix by applying the inverse word2vec transformation so I can compare the input to the output results. Any ideas ?

    • Jason Brownlee June 7, 2018 at 6:25 am #

      Sure, you could search for the closest one or set of vectors for a given vector.

  33. Maryam June 15, 2018 at 1:29 am #

    Hi Jason,
    the tutorial was awesome as the others, but tell you the truth when I applied your commands :
    model = Word2Vec(sentences, min_count=1)
    words = list(model.wv.vocab)
    print(words)
    it works fine when I use your dataset, but when I apply my own dataset which structure is such as this: a folder which name is diseases, in this folder I have 2 sub-folder which are blood cancer and breast cancer. in each subfolder such as blood cancer, there are many txt files which include at least 20 sentences, unfortunately when I apply your commands model = Word2Vec(vocab_dic, size=100, window=5, workers=8, min_count=1) and words = list(model.wv.vocab)
    print (‘words’ , words)

    the printed words are as follows: words [‘b’, ‘t’, ‘f’, ‘r’, ‘u’, ‘l’, ‘y’, ‘w’, ‘m’, ‘d’, ‘p’, ‘g’, ‘j’, ‘v’, ‘k’, ‘s’, ‘e’, ‘q’, ‘c’, ‘x’, ‘h’, ‘i’, ‘n’, ‘z’, ‘a’, ‘o’]
    whereas I expect to have words instead of charachters.

    I know you do not have the capacity to write custom code for me or check everyone’s commands but please look at my few commands to find out the problem as I am a beginner ans I have already tried to find it but I could not.

    I use keras in ubuntu 17.10 to write commnads

    breast= glob.glob(‘/home/mary/.config/spyder-py3/BinaryClassClassification/breastcancer/*.txt’)
    blood=glob.glob(‘/home/mary/.config/spyder-py3/BinaryClassClassification/bloodcancer/*.txt’)

    breast_samples_text = [load_file(file) for file in breast]
    bloodـsamples_text= [load_file(file) for file in blood]

    vocab_dic = breast_samples_text + blood_samples_text
    model = Word2Vec(vocab_dic, size=100, window=5, workers=8, min_count=1)
    words = list(model.wv.vocab)
    printed words are as these: words [‘b’, ‘t’, ‘f’, ‘r’, ‘u’, ‘l’, ‘y’, ‘w’, ‘m’, ‘d’, ‘p’, ‘g’, ‘j’, ‘v’, ‘k’, ‘s’, ‘e’, ‘q’, ‘c’, ‘x’, ‘h’, ‘i’, ‘n’, ‘z’, ‘a’, ‘o’]

    waiting for your answer as I need it necessarily.
    Best Regards
    Maryam

    • Jason Brownlee June 15, 2018 at 6:45 am #

      I believe your data is in a different format. Perhaps focus on data loading and confirm the data in memory has the same structure as the data in the tutorial first?

    • Nish July 10, 2018 at 2:05 pm #

      import re
      from nltk.tokenize import TweetTokenizer, sent_tokenize
      tokenizer_words = TweetTokenizer()
      tokens_sentences = [tokenizer_words.tokenize(t) for t in
      nltk.sent_tokenize(text)]

      Use this piece of code and you’ll get the output you wanted. I had the same issue as you

  34. Maryam June 16, 2018 at 2:54 am #

    Hi Jason,
    I have modified it with this command and gave me a correct output::[

    def load_file(file_name):
    cleaned_txt = re.sub(“[^a-zA-Z]+”, ” “, open(file_name, ‘r’, encoding=”utf8”).read()).lower()
    return cleaned_txt

    def gen_vocab_dic(all_text):
    voc = set()
    for record in all_text:
    for word in record.split():
    voc.add(word)
    voc_dic = {}
    index = 1 # we start from 1
    for i in voc:
    voc_dic[i] = index
    index = index + 1
    return voc_dic,voc

    breast= glob.glob(‘/home/mary/.config/spyder-py3/Dataset#2_BinaryClassClassification breasrcancer_disease/*.txt’)
    bloodcancer=glob.glob(‘/home/mary/.config/spyder-py3/Dataset#2_BinaryClassClassification/bloodcancer_disease/*.txt’)

    breast_samples_text = [load_file(file) for file in breast]
    bloodـsamples_text= [load_file(file) for file in blood]

    vocab_dic = gen_vocab_dic(breast_samples_text + bloodـsamples_text)

    model = Word2Vec(vocab_dic, size=100, window=5, workers=8, min_count=1)

    words_list = list(model.wv.vocab)
    print (‘words_list’ , words_list)== ‘subcategory’, ‘disproportionally’, ‘alen’ and etc.

    but I do not know how i can refer the label of each class to each word? in other words How to tag any sample which belongs to each class?
    I have already provided x_train or x_ test but I do not know how I should provide y_dataset ??
    Jason, please help me with this kind of dataset as I have never seen a tutorial to teach word2vec for this kind of structure for a dataset. the structure of my dataset is as follows: a folder which name is diseases, in this folder I have 2 sub-folder which are blood cancer and breast cancer. in each subfolder such as blood cancer, there are many txt files which include at least 20 sentences, as I mentioned in the latter post.
    waiting for the reply as I need it and sorry to ask you the request.
    Best
    Maryam

    • Jason Brownlee June 16, 2018 at 7:30 am #

      Sorry, I don’t know about your prediction problem. Perhaps you could summarize it for me?

  35. Gonzalo Moreno June 16, 2018 at 4:05 am #

    How do you decide how many negative words to use? which is the criteria?

  36. sarah June 16, 2018 at 11:26 pm #

    Hi Jason,
    thank you for offering us one of the best tutorials about word2vec. but I think there is a limitation in your tutorial as follows:
    when you create a model via word2vec such this:
    model = Word2Vec(sentences, size=100, window=5, workers=8, min_count=1)
    you should have explained how we are able to apply vectors of words after embedding them by word2vec model. for example, when we use embedding keras layer,
    z = Embedding(vocab_dic_size, 100, input_length=seq_length, name=”embedding”)
    we can apply in this way. but I do not know how i can use the model which you created by word2vec in order to embed words and giving them weights??
    sorry I got confused so if you write an instance about using the model = Word2Vec( ) like the example I wrote above by keras embedding layer, It will be a great guidance as I have already searched about it but I did not understand.
    I am sorry to write a big comment such as this but plz consider that I a beginner and found your tutorial as the best one in clearness.

  37. Ahmed Mohamed June 25, 2018 at 3:51 am #

    Hi Jason,

    I love your articles. I built the following comparison for different document representations. Your feedback will be appreciated.

    https://github.com/ahmed-mohamed-sn/DocumentRepresentations

    • Jason Brownlee June 25, 2018 at 6:23 am #

      I don’t have the capacity to review and debug your code sorry.

  38. Amit M July 6, 2018 at 10:49 pm #

    Hey,

    The article is a great help. Thanks.

    I require to train a RNN for classification of Name, Address, Age, DOB etc. from raw OCR output of an identity card. I want the classifier to be generic and work on most kind of id’s.

    As names, address etc. won’t be available in a pre-trained model, I decided to train my own word embedding. I have around 50 variants of ID cards and will be generating training data from those by altering details in it.

    I find out someone did similar –
    “We hash the text of the word into a binary vector of size 2^18 which is embedded in a trainable 500 dimensional distributed representation using an embedding layer”

    So I understand the “Word Hashing” part, but how they embedded those binary feature vectors in a 500 dims., I mean how will your convert those 2^18 dims. features to 500 dims.

    Your help would be really appreciated. Please suggest some other alternative if you have.

  39. Hitesh Nankani July 9, 2018 at 1:00 pm #

    I have got caught up in a really results of gensim Word2Vec. I have formatted the question and asked here: https://stackoverflow.com/questions/51233632/word2vec-gensim-multiple-languagess . would it be possible to train gensim on multiple languages and still get the right similarity result? My Result says no.

    Also I want to ask that the random weights when we train Word2Vec from gensim, do they remain same or they change every time we train the model ?

    • Jason Brownlee July 10, 2018 at 6:38 am #

      I don’t know about cross-language models. Sounds like deeper thought is required.

      Weight are learned via training on the provided dataset.

  40. Chitrak July 14, 2018 at 7:08 pm #

    what is the “alpha” in those code?
    tell me more about that.
    “alpha” is initialized “0.025” . what is the use of that ?

  41. Pranav July 23, 2018 at 8:20 pm #

    I am building a domain-specific word2vec model using more or less the same steps. Given a word, I know what the similar words should be. I also don’t have a dataset. What I am doing is trying to overfit the model based on very small variations. When I plot thegraph, I see that the words I need are grouped together. For eg, words a,b and c are grouped together in a single cluster. So I assume when I input a, I should get b and c as the similar words. However, words farther away from the cluster in question have a better score and are displayed first. How can I change this behaviour? Any help is appreciated. Thanks

    • Jason Brownlee July 24, 2018 at 6:15 am #

      The model has parameters, try tuning them and see if they have a desired effect?

      Perhaps you need new/different training data?

      Perhaps your assumptions or scoring have flaws?

  42. jawad September 18, 2018 at 1:21 pm #

    Hi Jason,
    Thank you for your awesome tutorials.
    I have a question about dimensions of vectors. Can we visualize the dimensions. Kindly guide me.

    • Jason Brownlee September 18, 2018 at 2:26 pm #

What do you mean exactly? All vectors have the same dimensions.

      If you mean visualize the word vectors, the above tutorial shows you how.

  43. Shubham Nagalwade October 6, 2018 at 5:04 pm #

    Hi,
    word2vec or Glove algorithm its usefull for non-english word/language (i.e.Russian)
