# How to Use Word Embedding Layers for Deep Learning with Keras

Last Updated on February 2, 2021

Word embeddings provide a dense representation of words and their relative meanings.

They are an improvement over the sparse representations used in simpler bag-of-words models.

Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.

In this tutorial, you will discover how to use word embeddings for deep learning in Python with Keras.

After completing this tutorial, you will know:

• About word embeddings and that Keras supports word embeddings via the Embedding layer.
• How to learn a word embedding while fitting a neural network.
• How to use a pre-trained word embedding in a neural network.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

• Updated Feb/2018: Fixed a bug due to a change in the underlying APIs.
• Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.

Photo by thisguy, some rights reserved.

## Tutorial Overview

This tutorial is divided into 4 parts; they are:

1. Word Embedding
2. Keras Embedding Layer
3. Example of Learning an Embedding
4. Example of Using Pre-Trained GloVe Embedding

### Need help with Deep Learning for Text Data?

Take my free 7-day email crash course now (with code).

Click to sign-up and also get a free PDF Ebook version of the course.

## 1. Word Embedding

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words model encoding schemes, where large sparse vectors were used to represent each word, or to score each word within a vector in order to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.

Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.

The position of a word in the learned vector space is referred to as its embedding.

Two popular examples of methods of learning word embeddings from text include:

• Word2Vec.
• GloVe.

In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.

## 2. Keras Embedding Layer

Keras offers an Embedding layer that can be used for neural networks on text data.

It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used in a variety of ways, such as:

• It can be used alone to learn a word embedding that can be saved and used in another model later.
• It can be used as part of a deep learning model where the embedding is learned along with the model itself.
• It can be used to load a pre-trained word embedding model, a type of transfer learning.

The Embedding layer is defined as the first hidden layer of a network.

It must specify 3 arguments:

• input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
• output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
• input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.
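Since the listing itself is not reproduced here, that layer definition can be sketched as follows, assuming the TensorFlow 2 import path. Note that recent Keras versions infer the sequence length from the data and no longer accept `input_length`, so it appears only as a comment:

```python
from tensorflow.keras.layers import Embedding

# Vocabulary of 200 words (integers 0 to 199), embedded into 32 dimensions.
# Under Keras 2.x you would also pass input_length=50 for 50-word documents;
# recent versions infer the sequence length from the input instead.
embedding_layer = Embedding(input_dim=200, output_dim=32)
```

For a batch of 50-word documents, this layer outputs one 32-element vector per word, i.e. a (batch, 50, 32) tensor.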

The Embedding layer has weights that are learned. If you save your model to file, this will include weights for the Embedding layer.

The output of the Embedding layer is a 2D matrix with one embedding for each word in the input sequence of words (input document).

If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.

Now, let’s see how we can use an Embedding layer in practice.

## 3. Example of Learning an Embedding

In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.

We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.

First, we will define the documents and their class labels.
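The original listing is not shown here; a sketch of this step, with the documents reconstructed from the 14-word vocabulary quoted in the comments below, might look like:

```python
import numpy as np

# 10 short documents: 5 positive comments followed by 5 negative ones
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

# class labels: 1 = positive, 0 = negative
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```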

Next, we can integer encode each document. This means that the Embedding layer will take sequences of integers as input. We could experiment with other, more sophisticated bag-of-words encodings such as counts or TF-IDF.

Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding. We will estimate a vocabulary size of 50, which is much larger than needed, to reduce the probability of collisions from the hash function.

The sequences have different lengths, and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to a length of 4. Again, we can do this with a built-in Keras function, in this case the pad_sequences() function.
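A sketch of the encoding and padding steps, using the `tensorflow.keras` import paths (the exact integers produced depend on the hash function, so your output will differ):

```python
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']

# hash each word to an integer in [1, vocab_size); 50 is deliberately
# larger than the real vocabulary to reduce hash collisions
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

# pad every sequence to length 4 with trailing zeros
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
```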

We are now ready to define our Embedding layer as part of our neural network model.

The Embedding layer has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to a single 32-element vector to pass on to the Dense output layer.

Finally, we can fit and evaluate the classification model.

The complete code listing is provided below.

Running the example first prints the integer encoded documents.

Then the padded versions of each document are printed, making them all uniform length.

After the network is defined, a summary of the layers is printed. We can see that as expected, the output of the Embedding layer is a 4×8 matrix and this is squashed to a 32-element vector by the Flatten layer.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Finally, the accuracy of the trained model is printed, showing that it learned the training dataset perfectly (which is not surprising).

You could save the learned weights from the Embedding layer to file for later use in other models.

You could also use this model generally to classify other documents that have the same kind of vocabulary seen in the training dataset.

## 4. Example of Using Pre-Trained GloVe Embedding

The Keras Embedding layer can also use a word embedding learned elsewhere.

It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license.

The smallest package of embeddings is 822MB, called “glove.6B.zip“. It was trained on a dataset of 6 billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.

You can download this collection of embeddings and seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.

This example is inspired by an example in the Keras project: pretrained_word_embeddings.py.

After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.

If you peek inside the file, you will see a token (word) followed by its weights (100 numbers) on each line. For example, the first line of the embedding ASCII text file shows the embedding for “the“.

As in the previous section, the first step is to define the examples, encode them as integers, then pad the sequences to be the same length.

In this case, we need to be able to map words to integers as well as integers to words.

Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method on the Tokenizer class, and provides access to the dictionary mapping of words to integers in a word_index attribute.
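A sketch of that preparation, assuming the same ten documents (which contain 14 unique words, giving a vocabulary size of 15 once index 0 is reserved for padding):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']

# fit the tokenizer on the documents, then encode and pad them
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1   # +1 because word indices start at 1
encoded_docs = t.texts_to_sequences(docs)
padded_docs = pad_sequences(encoded_docs, maxlen=4, padding='post')
print(vocab_size)
```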

Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.

This is pretty slow. It might be better to filter the embedding for the unique words in your training data.
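A sketch of the loading step; the `load_embedding` helper name is introduced here for illustration, and reading the full 400K-word file is what makes this slow:

```python
import numpy as np

def load_embedding(path):
    """Parse a GloVe-format text file into a dict of word -> numpy array."""
    embeddings_index = {}
    with open(path, encoding='utf8') as f:
        for line in f:
            values = line.split()
            # the first token is the word, the rest are its coefficients
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings_index

# embeddings_index = load_embedding('glove.6B.100d.txt')
# print('Loaded %s word vectors.' % len(embeddings_index))
```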

Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.

The result is a matrix of weights only for words we will see during training.
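This step can be sketched as follows; `build_embedding_matrix` is a helper name introduced here, taking the Tokenizer's word_index and the dictionary loaded from the GloVe file. Row 0 is left as zeros for the padding index, as are rows for words missing from GloVe:

```python
import numpy as np

def build_embedding_matrix(word_index, embeddings_index, dim=100):
    """One row per integer word id; words not found in GloVe stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))   # row 0 = padding
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector
    return matrix
```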

Now we can define our model, fit, and evaluate it as before.

The key difference is that the Embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, therefore we will set the trainable attribute of the Embedding layer to False.

The complete worked example is listed below.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example may take a bit longer, but then demonstrates that it is just as capable of fitting this simple problem.

In practice, I would encourage you to experiment with learning an embedding from scratch, with using a fixed pre-trained embedding, and with continuing to train on top of a pre-trained embedding.

See what works best for your specific problem.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

## Summary

In this tutorial, you discovered how to use word embeddings for deep learning in Python with Keras.

Specifically, you learned:

• About word embeddings and that Keras supports word embeddings via the Embedding layer.
• How to learn a word embedding while fitting a neural network.
• How to use a pre-trained word embedding in a neural network.

Do you have any questions?

## Develop Deep Learning models for Text Data Today!

#### Develop Your Own Text models in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:
Deep Learning for Natural Language Processing

It provides self-study tutorials on topics like:
Bag-of-Words, Word Embedding, Language Models, Caption Generation, Text Translation and much more...

### 644 Responses to How to Use Word Embedding Layers for Deep Learning with Keras

1. Mohammad October 4, 2017 at 7:58 am #

Thank you Jason,
I am excited to read more NLP posts.

• Jason Brownlee October 4, 2017 at 8:03 am #

Thanks.

• Ajit March 15, 2020 at 12:40 am #

Thanks man, It was really helpful.

• sherry July 22, 2019 at 7:22 pm #

after embedding,have to have a “Flatten()” layer? In my project, I used a dense layer directly after embedding. is it ok?

• Peter Nduru October 8, 2019 at 6:06 am #

I appreciate how well updated you keep these tutorials. the first thing I always look at, when I start reading is the update date. thank you very much.

• Jason Brownlee October 8, 2019 at 8:10 am #

You’re welcome.

I require all of the code to work and keep working!

• Martin October 11, 2019 at 4:09 am #

Hi, Jason:

when one_hot encoding is used, why is padding necessary? Doesn’t one_hot encoding already create an input of equal length?

• Jason Brownlee October 11, 2019 at 6:25 am #

The one hot encoding is for one variable at one time step, e.g. features.

Padding is needed to make all sequences have the same number of time steps.

2. shiv October 5, 2017 at 10:07 am #

I split my data into 80-20 test-train and I’m still getting 100% accuracy. Any idea why? It is ~99% on epoch 1 and the rest its 100%.

3. Sandy October 6, 2017 at 2:44 pm #

Thank you Jason. I always find things easier when reading your post.
I have a question about the vector of each word after training. For example, the word “done” in sentence “Well done!” will be represented in different vector from that word in sentence “Could have done better!”. Is that right? I mean the presentation of each word will depend on the context of each sentence?

• Jason Brownlee October 7, 2017 at 5:48 am #

No, each word in the dictionary is represented differently, but the same word in different contexts will have the same representation.

It is the word in its different contexts that is used to define the representation of the word.

Does that help?

• Sandy October 7, 2017 at 5:37 pm #

Yes, thank you. But I still have a question. We will train each context separately, then after training the first context, in this case is “Well done!”, we will have a vector representation of the word “done”. After training the second context, “Could have done better”, we have another vector representation of the word “done”. So, which vector will we choose to be the representation of the word “done”?
I might misunderstand the procedure of training. Thank you for clarifying it for me.

• Jason Brownlee October 8, 2017 at 8:32 am #

No. All examples where a word is used are used as part of the training of the representation of the word. There is only one representation for each word during and after training.

• Sandy October 8, 2017 at 2:46 pm #

I got it. Thank you, Jason.

4. Chiedu October 7, 2017 at 5:36 pm #

Hi Jason,
any ideas on how to “filter the embedding for the unique words in your training data” as mentioned in the tutorial?

• Jason Brownlee October 8, 2017 at 8:32 am #

The mapping of word to vector dictionary is built into Gensim, you can access it directly to retrieve the representations for the words you want: model.wv.vocab

• mahna April 28, 2018 at 2:31 am #

HI Jason,
I am really appreciated the time U spend to write this tutorial and also replying.
My question is about “model.wv.vocab” you wrote. is it an address site?
It does not work actually.

• Jason Brownlee April 28, 2018 at 5:33 am #

No, it is an attribute on the model.

5. Abbey October 8, 2017 at 2:19 am #

Hi, Jason

Good day.

I just need your suggestion and example. I have two different dataset, where one is structured and the other is unstructured. The goal is to use the structured to construct a representation for the unstructured, so apply use word embedding on the two input data but how can I find the average of the two embedding and flatten it to one before feeding the layer into CNN and LSTM.

Regards
Abbey

• Jason Brownlee October 8, 2017 at 8:40 am #

If your question was if this is a good approach, my advice is to try it and see.

• Abiodun Modupe October 9, 2017 at 7:46 pm #

Hi, Jason
How can I find the average of the word embedding from the two input?
Regards
Abbey

• Jason Brownlee October 10, 2017 at 7:43 am #

Perhaps you could retrieve the vectors for each word and take their average?

Perhaps you can use the Gensim API to achieve this result?

• Rafael Sá June 17, 2019 at 2:47 am #

Hi Jason,

I have a set of documents(1200 text of movie Scripts) and i want to use pretrained embeddings. But i want to update the vocabulary and train again adding the words of my corpus. Is that possible ?

• Jason Brownlee June 17, 2019 at 8:24 am #

Sure.

Load the pre-trained vectors. Add new random vectors for the new words in the vocab and train the whole lot together.

6. Vinu October 9, 2017 at 5:54 pm #

Hi Jason…Could you also help us with R codes for using Pre-Trained GloVe Embedding

• Jason Brownlee October 10, 2017 at 7:43 am #

Sorry, I don’t have R code for word embeddings.

7. Hao October 12, 2017 at 5:49 pm #

Hi Jason, really appreciate that you answered all the replies! I am planning to try both CNN and RNN (maybe LSTM & GRU) on text classification. Most of my documents are less than 100 words long, but about 5 % are longer than 500 words. How do you suggest to set the max length when using RNN?If I set it to be 1000, will it degrade the learning result? Should I just use 100? Will it be different in the case of CNN?
Thank you!

• Jason Brownlee October 13, 2017 at 5:45 am #

I would recommend experimenting with different configurations and see how the impact model skill.

• ammara May 10, 2018 at 2:47 am #

Dear Hao,
Did you try RNN(LSTM or GRU) on text classification?If yes then can you plz provide me the code??

8. Michael October 13, 2017 at 10:22 am #

I’d like to thank you for this post. I’ve been struggling to understand this precise way of using keras for a week now and this is the only post I’ve found that actually explains what each step in the process is doing – and provides code that self-documents what the data looks like as the model is constructed and trained. This makes it so much easier to adapt to my particular requirements.

• Jason Brownlee October 13, 2017 at 2:55 pm #

9. Azim October 17, 2017 at 5:47 pm #

In the above Keras example, how can we predict a list of context words given a word? Lets say i have a word named ‘sudoku’ and want to predict the sourrounding words. how can we use word2vec from keras to do that?

• Jason Brownlee October 18, 2017 at 5:32 am #

It sounds like you are describing a language model. We can use LSTMs to learn these relationships.

• Azim October 21, 2017 at 9:34 pm #

No, what i meant was for word2vec skip-gram model predicts a context word given the center word. So if i train a word2vec skip-gram model, how can i predict the list of context words if my center word is ‘sudoku’?

Regards,

Azim

• Jason Brownlee October 22, 2017 at 5:19 am #

I don’t know Azim.

• Kevin Toms November 16, 2018 at 7:20 pm #

You can get the cosine distance between the words, and the one that is having the least distance would surround it.. here is the link:
https://github.com/Hvass-Labs/TensorFlow-Tutorials

Go to Natural Language Processing and you can find a cosine function there, use them to find yours..

10. Willie October 21, 2017 at 5:55 pm #

Hi Jason,

Thanks for your useful blog I have learned a lots.

I am wondering if I already have pretrained word embedding, is that possible to set keras embedding layer trainable to be true? If it is workable, will I get a better result, when I only use small size of data to pretrain the word embedding model. Many thanks!

• Jason Brownlee October 22, 2017 at 5:16 am #

You can. It is hard to know whether it will give better results. Try it.

11. cam October 28, 2017 at 5:34 am #

Hey Jason,

Is it possible to perform probability calculations on the label? I am looking at a case where it is not simply +/- but that a given data entry could be both but more likely one and not the other.

• Jason Brownlee October 29, 2017 at 5:48 am #

Yes, a neural network with a sigmoid or softmax output can predict a probability-like score for each class.

• David Stancu November 3, 2017 at 6:10 am #

I’m doing something like this except with my own feature vectors — but to the point of the labels — I do ternary classification using categorical_crossentropy and a softmax output. I get back an answer of the probability of each label.

12. Ravil November 3, 2017 at 5:43 am #

Hey Jason!

Thanks for a wonderful and detailed explanation of the post. It helped me a lot.

However, I’m struggling to understand how the model predicts a sentence as positive or negative.
i understand that each word in the document is converted into a word embedding, so how does our model evaluate the entire sentence as positive or negative? Does it take the sum of all the word vectors? Perhaps average of them? I’ve not been able to figure this part out.

• Jason Brownlee November 3, 2017 at 2:13 pm #

Great question!

The model interprets all of the words in the sequence and learns to associate specific patterns (of encoded words) with positive or negative sentiment

• Ken April 2, 2018 at 8:37 am #

Hi Jason,

Thanks a lot for your amazing posts. I have the same question as Ravil. Can you elaborate a bit more on “learns to associate specific patterns?”

• Jason Brownlee April 2, 2018 at 2:49 pm #

Good question Ken, perhaps this post will make it clearer how ml algorithms work (a functional mapping):
http://machinelearningmastery.com/how-machine-learning-algorithms-work/

Does that help?

• Ken April 2, 2018 at 10:40 pm #

Thanks for your reply. But I was trying to ask is that how does keras manage to produce a document level representation by having the vectors of each word? I don’t seem to find how was this being done in the code.

Cheers.

• Jason Brownlee April 3, 2018 at 6:34 am #

The model such as the LSTM or CNN will put this together.

https://machinelearningmastery.com/start-here/#lstm

Does that help?

• Alexi September 27, 2018 at 1:48 am #

Hi Jason,

First, thanks for all your really useful posts.

If I understand well your post and answers to Ken and Ravil, the neural network you build in fact reduces the sequence of embedding vectors corresponding to all the words of a document to a one-dimensional vector with the Flatten layer, and you just train this flattening, as well as the embedding, to get the best classification on your training set, isn’t it?

• Jason Brownlee September 27, 2018 at 6:04 am #

Sort of.

words => integers => embedding

The embedding has a vector per word which the network will use as a representation for the word.

We have a sequence of vectors for the words, so we flatten this sequence to one long vector for the Dense model to read. Alternately, we could wrap the dense in a timedistributed layer.

• Alexi September 27, 2018 at 5:49 pm #

Aaah! So nothing tricky is done when flattening, more or less just concatenating the fixed number of embedding vectors that is the output of the embedding layer, and this is why the number of words per document has to be fixed as a setting of this layer. If this is correct, I think I’m finally understanding how all this works.

I’m sorry to bother you more, but how does the network works if a document much shorter than the longest document (the number of its words being set as the number of words per document to the embedding layer) is given to the network as training or testing? It just fills the embedding vectors of this non-appearing words as 0? I’ve been looking for ways to convert all the word embeddings of a text to some sort of document embedding, and this just seems a solution too simple to work, or that may work but for short documents (as well as other options like averaging the word vectors or taking the element-wise maximum of minimum).

I’m trying to do sentiment analysis for spanish news, and I have news with like 1000 or more words, and wanted to use pre-trained word embeddings of 300 dimensions each. Wouldn’t it be a size way too huge per document for the network to train properly, or fast enough? I imagine you do not have a precise answer, but I’d like to know if you have tried the above method with long documents, or know that someone has.

Thank you again, I’m sorry for such a long question.

• Jason Brownlee September 28, 2018 at 6:07 am #

Yes.

We can use padding for small documents and a Masking input layer to ignore padded values. More here:
https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/

Try different sized embeddings and use results to guide the configuration.

• Alexi October 1, 2018 at 5:26 pm #

Okay, thank you very much! I will give it a try.

13. chengcheng November 9, 2017 at 2:56 am #

the chinese word how to vector sequence

• Jason Brownlee November 9, 2017 at 10:03 am #

Sorry?

• lstmbot December 16, 2017 at 10:30 pm #

me bot trying interact comments born with lstm

14. Hilmi Jauffer November 16, 2017 at 4:30 pm #

Hi Jason,
I have successfully trained a model using the word embedding and Keras. The accuracy was at 100%.

I saved the trained model and the word tokens for predictions.
MODEL.save('model.h5', True)

TOKENIZER = Tokenizer(num_words=MAX_NB_WORDS)
TOKENIZER.fit_on_texts(TEXT_SAMPLES)
pickle.dump(TOKENIZER, open('tokens', 'wb'))

When predicting:
– Then predict the category of the new data.

I am not sure the prediction logic is correct, since I am not seeing the expected category from the prediction.

The source code is in Github: https://github.com/hilmij/keras-test/blob/master/predict.py

Appreciate if you can have a look and let me know what I am missing.

Best regards,
Hilmi.

• Jason Brownlee November 17, 2017 at 9:20 am #

What was the problem exactly?

• Tony July 11, 2018 at 8:00 am #

Thank you, Jason! Your examples are very helpful. I hope to get your attention with my question. At training, you prepare the tokenizer by doing:

t = text.Tokenizer();
t.fit_on_texts(docs)

Which creates a dictionary of words:numbers. What do we do if we have a new doc with lost of new words at prediction time? Will all these words go the unknown token? If so, is there a solution for this, like can we fit the tokenizer on all the words in the English vocab?

• Jason Brownlee July 11, 2018 at 2:52 pm #

You must know the words you want to support at training time. Even if you have to guess.

To support new words, you will need a new model.

15. Fabrício Melo November 17, 2017 at 7:35 am #

Hello Jason!

In Example of Using Pre-Trained GloVe Embedding, do you use the word embedding vectors as weights of the embedding layer?

16. Alex November 21, 2017 at 11:15 pm #

Very nice set of Blogs of NLP and Keras – thanks for writing them.

As a quick note for others

When I tried to load the glove file with the line:
f = open('../glove_data/glove.6B/glove.6B.100d.txt')

I got the error
UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 2776: character maps to

f = open('../glove_data/glove.6B/glove.6B.100d.txt', encoding='utf8')

This issue may have been caused by using Windows.

• Jason Brownlee November 22, 2017 at 11:12 am #

Thanks for the tip Alex.

• Liliana November 26, 2017 at 12:00 pm #

Hi Jason,

Wonderful tutorials!

I have a question. Why do we have to one-hot vectorize the labels? Also, if I have a pad sequence of ex. [2,4,0] what the one hot will be? I’m trying to understand better one hot vectorzer.

17. Wassim November 28, 2017 at 1:06 am #

Hi Jason,
Thank you for your excellent tutorial. Do you know if there is a way to build a network for a classification using both text embedded data and categorical data ?
Thank you

18. ashish December 2, 2017 at 8:29 pm #

19. Stuart December 7, 2017 at 12:11 am #

Fantastic explanation, thanks so much. I’m just amazed at how much easier this has become since the last time I looked at it.

• Jason Brownlee December 7, 2017 at 7:59 am #

I’m glad the post helped Stuart!

20. Stuart December 11, 2017 at 3:30 pm #

Hi Jason …the 14 word vocab from your docs is “well done good work great effort nice excellent weak poor not could have better” For a vocab_size of 14, this one_shot encodes to [13 8 7 6 10 13 3 6 10 4 9 2 10 12]. Why does 10 appear 3 times, for “great”, “weak” and “have”?

• Jason Brownlee December 11, 2017 at 4:54 pm #

21. Stuart December 11, 2017 at 9:34 pm #

Hi Jason, the encodings that I provided in the example above came from kerasR with a vocab_size of 14. So let me ask the same question about the uniqueness of encodings using your Part 3 one_hot example above with a vocab_size of 50.
Here different encodings are produced for different new kernels (using Spyder3/Python 3.4):
[[31, 33], [27, 33], [48, 41], [34, 33], [32], [5], [14, 41], [43, 27], [14, 33], [22, 26, 33, 26]]
[[6, 21], [48, 44], [7, 26], [46, 44], [45], [45], [10, 26], [45, 48], [10, 44], [47, 3, 21, 27]]
[[7, 8], [16, 42], [24, 13], [45, 42], [23], [17], [34, 13], [13, 16], [34, 42], [17, 31, 8, 19]]

Pleas note that in the first line, “33” encodes for the words “done”, “work”, “work”, “work” & “done”. In the second line “45” encodes for the words “excellent” & “weak” & “not”. In the third line, “13” encodes “effort”, “effort” & “not”.

So I’m wondering why the encodings are not unique? Secondly, if vocab_size must be much larger then the actual size of the vocabulary?

Thanks

22. Nadav December 25, 2017 at 8:28 am #

Great article Jason.
How do you convert back from an embedding to a one-hot? For example if you have a seq2seq model, and you feed the inputs as word embeddings, in your decoder you need to convert back from the embedding to a one-hot representing the dictionary. If you do it by using matrix multiplication that can be quite a large matrix (e.g embedding size 300, and vocab of 400k).

• Jason Brownlee December 26, 2017 at 5:12 am #

The output layer can predict integers directly that you can map to words in your vocabulary. There would be no embedding layer on the output.

23. Hitkul January 9, 2018 at 9:05 pm #

Hi,
I have created word2vec matrix of a sentence using gensim and pre-trained Google News vector. Can I just flatten this matrix to a vector and use that as a input to my neural network.
For example:
each sentence is of length 140 and I am using a pre-trained model of 100 dimensions, therefore:- I have a 140*100 matrix representing the sentence, can i just flatten it to a 14000 length vector and feed it to my input layer?

• Jason Brownlee January 10, 2018 at 5:25 am #

It depends on what you’re trying to model.

24. Paul January 12, 2018 at 6:40 pm #

Great article, could you shed some light on how do Param # of 400 and 1500 in two neural networks come from? Thanks

• Paul Lo January 12, 2018 at 9:55 pm #

Oh! Is it just vocab_size * # of dimension of embedding space?
1. 50 * 8 = 400
2. 15* 100 = 1500

• Jason Brownlee January 13, 2018 at 5:31 am #

What do you mean exactly?

25. Andy Brown January 14, 2018 at 2:02 pm #

Great post! I’m working with my own corpus. How would I save the weight vector of the embedding layer in a text file like the glove data set?

My thinking is it would be easier for me to apply the vector representations to new data sets and/or machine learning platforms (mxnet etc) and make the output human readable (since the word is associated with the vector).

• Jason Brownlee January 15, 2018 at 6:57 am #

You could use get_weights() in the Keras API to retrieve the vectors and save directly as a CSV file.

• Elizabeth October 27, 2019 at 9:49 am #

get_weights() for what exactly? does it need a loop?

26. jae January 17, 2018 at 8:10 am #

27. Murali Manohar January 17, 2018 at 4:58 pm #

Hello Jason,
I have a dataset with which I’ve attained 0.87 fscore by 5 fold cross validation using SVM.Maximum context window is 20 and one hot encoded.

Now, I’ve done whatever has been mentioned and am getting an accuracy of 13-15 percent for RNN models, where each one has one LSTM cell with 3, 20, 150, or 300 hidden units. The dimension of my pre-trained embeddings is 300.

Loss is decreasing and even reaching negative values, but no change in accuracy.

I’ve tried the same with the CNN and basic ANN models you’ve mentioned for text classification.

28. Carsten January 17, 2018 at 8:35 pm #

When I copy the code of the first box I get the error:

AttributeError: ‘int’ object has no attribute ‘ndim’

in the line :

Where is the problem?

• Jason Brownlee January 18, 2018 at 10:07 am #

Copy the code from the “complete example”.

• Thiziri February 8, 2018 at 12:45 am #

Hi Jason,
I’ve got the same error, also while running the “complete example”.
What can be the cause?

• Gokul February 8, 2018 at 6:49 pm #

Try casting the labels to numpy arrays.

• soren February 9, 2018 at 6:31 am #

i get the same!

• Jason Brownlee February 9, 2018 at 9:22 am #

I have fixed and updated the examples.

• ademyanchuk February 8, 2018 at 3:34 pm #

Carsten, you need labels to be numpy.array not just list.

29. Willie January 17, 2018 at 9:18 pm #

Hi Jason,

If I have unknown words in the training set, how can I assign the same randomly initialized vector to all of the unknown words when using a pre-trained vector model like GloVe or w2v? Thanks!!!

• Jason Brownlee January 18, 2018 at 10:08 am #

Why would you want to do that?

• Willie January 18, 2018 at 1:05 pm #

If my data is in a specific domain and I still want to leverage a general word embedding model (e.g. glove.6B.100d trained from wiki), then there must be some OOV words in the domain data. So, whether at training time or inference time, some unknown words will probably appear.

• Jason Brownlee January 19, 2018 at 6:27 am #

It may.

You could ignore these words.

You could create a new embedding, set vectors from existing embedding and learn the new words.

30. Vladimir January 21, 2018 at 1:46 pm #

Amazing Dr. Jason!
Thanks for a great walkthrough.

On the step of encoding each word to an integer you said: “We could experiment with other more sophisticated bag of word model encoding like counts or TF-IDF”. Could you kindly elaborate on how that can be implemented, as TF-IDF encodes tokens with floats? And how to tie it with Keras, passing it to an Embedding layer? I’m keen to experiment with it and hope it could yield better results.

Another question is about input docs. Suppose I’ve preprocessed text by means of nltk up to lemmatization, so each sample is a list of tokens. What is the best approach to pass it to the Keras embedding layer in this case?

• Jason Brownlee January 22, 2018 at 4:42 am #

https://machinelearningmastery.com/?s=bag+of+words&submit=Search

You can encode your tokens as integers manually or use the Keras Tokenizer.

• Vladimir January 22, 2018 at 11:54 pm #

Well, the Keras Tokenizer can accept only texts or sequences. It seems the only way is to glue tokens together using ‘ ‘.join(token_list) and then pass that to the Tokenizer.
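That glue step might look like this (a sketch; the token lists stand in for nltk output):

```python
# Token lists as produced by an nltk-style preprocessing pipeline (illustrative data).
docs_tokens = [["good", "work"], ["nice", "work"]]

# Join each token list back into a string so the usual Keras flow applies:
# t = Tokenizer(); t.fit_on_texts(docs); encoded = t.texts_to_sequences(docs)
docs = [" ".join(tokens) for tokens in docs_tokens]
```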

As for the BOW articles, I’ve walked through them and they are very valuable. Thank you!

Using BOW differs a lot from using Embeddings: BOW introduces a huge sparse array of features for each sample, while Embeddings represent those features (tokens) very densely, in up to a few hundred items.

So, BOW in the other article gives incredibly good results with a very simple NN architecture (1 layer of 50 or 100 neurons), while I struggled to get good results using Embeddings along with convolutional layers…
From your experience, would you please advise on that? Are Embeddings actually viable, and is it just a matter of finding a correct architecture?

• Jason Brownlee January 23, 2018 at 8:01 am #

Nice! And well done for working through the tutorials. I love to see that and few people actually “do the work”.

Embeddings make more sense on larger/hard problems generally – e.g. big vocab, complex language model on the front end, etc.

• Vladimir January 23, 2018 at 10:15 am #

I see, thank you.

31. joseph January 31, 2018 at 9:20 pm #

Thanks jason for another great tutorial.

I have some questions :

Isn’t the one hot definition a binary one, a vector of 0’s and 1’s?
So [[1,2]] would be encoded to [[0,1,0],[0,0,1]].

How is the embedding algorithm done in Keras: word2vec/GloVe or simply a dense layer (or something else)?
thanks
joseph

• Jason Brownlee February 1, 2018 at 7:20 am #

32. Anna February 4, 2018 at 9:40 pm #

Amazing Dr. Jason!
Thanks for a great walkthrough.

The dimension for each word vector like above example e.g. 8, is set randomly?

Thank you

• Jason Brownlee February 5, 2018 at 7:45 am #

The dimensionality is fixed and specified.

In the first example it is 8, in the second it is 100.

• Anna February 5, 2018 at 2:17 pm #

Thank you Dr. Jason for your quick feedback!

Ok, I see that the pre-trained word embedding is set to 100 dimensionality because the original file “glove.6B.100d.txt” contained a fixed number of 100 weights for each line of ASCII.

However, the first example as you mentioned in here, “The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.”

You chose 8 dimensions for the first example. Does that mean it can be set to any number other than 8? I’ve tried changing the dimension to 12. It didn’t raise any errors, but the accuracy drops from 100% to 89%.

_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 4, 12) 600
_________________________________________________________________
flatten_1 (Flatten) (None, 48) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 49
=================================================================
Total params: 649
Trainable params: 649
Non-trainable params: 0

Accuracy: 89.999998

So, how should the dimensionality be set? Do the dimensions affect the accuracy?

Sorry I am trying to grasp the basic concept in understanding NLP stuff. Much appreciated for your help Dr. Jason.

Thank you

• Jason Brownlee February 5, 2018 at 2:54 pm #

Yes, you can choose any dimensionality you like. Larger means more expressive, required for larger vocabs.

Does that help Anna?

• Anna February 5, 2018 at 4:40 pm #

Yes indeed, Dr. Now I can see that the dimensionality is set depending on the size of the vocab.

Thank you again Dr Jason for enlightening me! 🙂

• Jason Brownlee February 6, 2018 at 9:11 am #

33. Miroslav February 5, 2018 at 7:09 am #

Hi Jason,
thanks for amazing tutorial.

I have a question. I am trying to do semantic role labeling with context window in Keras. How can I implement context window with embedding layer?

Thank you

34. Gabriel February 6, 2018 at 4:08 am #

Hi, great website! I’ve been learning a lot from all the tutorials. Thank you for providing all these easy to understand information.

How would I go about using other data for the CNN model? At the moment, I am using just textual data for my model using the word embeddings. From what I understand, the first layer of the model has to be the Embeddings, so how would I use other input data such as integers along with the Embeddings?

Thank you!

35. Aditya February 6, 2018 at 4:59 am #

Hi Jason, this tutorial is simple and easy to understand. Thanks.

However, I have a question. While using the pre-trained embedding weights such as Glove or word2vec, what if there exists few words in my dataset, which weren’t present in the dataset on which word2vec or Glove was trained. How does the model represent such words?

My understanding is that in your second section (Using Pre-Trained GloVe Embedding), you are mapping the words from the loaded weights to the words present in your dataset, hence the question above.

Correct me, if it’s not the way I think it is.

• Jason Brownlee February 6, 2018 at 9:22 am #

You can ignore them, or assign them to zero vector, or train a new model that includes them.
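The zero-vector option can be sketched like this (the tiny `embeddings_index` dict stands in for a loaded GloVe file, and `word_index` for a fitted Tokenizer's mapping):

```python
import numpy as np

# Stand-in for vectors loaded from a GloVe file (word -> coefficient array).
embeddings_index = {"good": np.array([0.1, 0.2]), "work": np.array([0.3, 0.4])}
word_index = {"good": 1, "work": 2, "xyzzy": 3}  # tokenizer.word_index; "xyzzy" is OOV
vocab_size = len(word_index) + 1

embedding_matrix = np.zeros((vocab_size, 2))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:          # words missing from GloVe keep their all-zero row
        embedding_matrix[i] = vector
```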

• Serhiy February 7, 2019 at 10:52 pm #

Hi Jason. Thanks for this and other VERY clear and informative articles.

“mapping the words from the loaded weights to the words present in your dataset”

how does the mapping work? does it use matrix index == word number from (padded_docs or)?

I am asking because: what if I pass embedding_matrix in the original order, but shuffle padded_docs before model.fit?

• Jason Brownlee February 8, 2019 at 7:49 am #

Words must be assigned unique integers that remain consistent across all data and embeddings.

36. Han February 20, 2018 at 4:03 pm #

Hi Jason,

I am trying to train a Keras LSTM model on some text sentiment data. I am also using GridSearchCV in sklearn to find the best parameters. I am not quite sure what went wrong but the classification report from sklearn says:

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.

Below is what the classification report looks like:

precision recall f1-score support

negative 0.00 0.00 0.00 98
positive 0.70 1.00 0.83 232

avg / total 0.49 0.70 0.58 330

Do you know what the problem is?

• Jason Brownlee February 21, 2018 at 6:35 am #

Perhaps you are trying to use keras metrics with sklearn? Watch the keywords you use when specifying the keras model vs the sklearn evaluation (CV).

37. Alberto Nogales Moyano March 2, 2018 at 2:10 am #

Hi Jason,
your blog is really, really interesting. I have a question: what is the difference between using word2vec and texts_to_sequences from the Tokenizer in Keras? I mean in the way the texts are represented.
Is any of the two options better than the other?
Thanks a lot.
Kind regards.

• Jason Brownlee March 2, 2018 at 5:33 am #

word2vec encodes words (integers) to vectors. texts_to_sequences encodes words to integers. It is a step before word2vec or a step before bag of words.

38. Souraj Adhikary March 2, 2018 at 5:39 pm #

Hi Jason,

I have a dataframe which contains texts and corresponding labels. I have used gensim module and used word2vec to make a model from the text. Now I want to use that model for input into Conv1D layers. Can you please tell me how to load the word2vec model in Keras Embedding layer? Do I need to pre-process the model in some way before loading? Thanks in advance.

• Jason Brownlee March 3, 2018 at 8:07 am #

Yes, load the weights into an Embedding layer, then define the rest of your network.

The tutorial above will help a lot.

39. Ankush Chandna March 13, 2018 at 1:57 am #

This is really helpful. You make us awesome at what we do. Thanks!!

40. R.L. March 20, 2018 at 4:13 pm #

Thank you for this extremely helpful blog post. I have a question regarding to interpreting the model. Is there a way to know / visualize the word importance after the model is trained? I am looking for a way to do so. For instance, is there a way to find like the top 10 words that would trigger the model to classify a text as negative and vice versa? Thanks a lot for your help in advance

• Jason Brownlee March 21, 2018 at 6:31 am #

There may be methods, but I am not across them. Generally, neural nets are opaque, and even weight activations in the first/last layers might be misleading if used as importance scores.

41. Mohit March 30, 2018 at 9:35 pm #

Hi Jason ,

Can you please tell me the logic behind this:

vocab_size = len(t.word_index) + 1

• Jason Brownlee March 31, 2018 at 6:36 am #

So that the word indexes are 1-offset, and 0 is reserved for padding / no data.
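A tiny illustration of the offset (the word_index dict mimics what a fitted Tokenizer produces):

```python
# Tokenizer assigns integers starting at 1 (most frequent word first); 0 is never
# used for a word, so the Embedding's input_dim must cover index len(word_index).
word_index = {"work": 1, "good": 2, "done": 3}  # what t.word_index might look like
vocab_size = len(word_index) + 1  # 4 rows: row 0 for padding, rows 1..3 for words
```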

42. Ryan March 31, 2018 at 9:59 pm #

Hi Jason,

If I want to use this model to predict next word, can I just change the output layer to Dense(100, activation = ‘linear’) and change the loss function to MSE?

Many thanks,

Ray

43. Coach April 17, 2018 at 10:23 pm #

Thanks for this tutorial! Really clear and useful!

44. Maryam April 28, 2018 at 4:53 am #

Hi Jason,
You are the best at Keras tutorials and also at replying to questions. I am really grateful.
Although I have understood the content and the code you have written above, I am not able to understand what you mean by this sentence: [It might be better to filter the embedding for the unique words in your training data.].
What does “to filter the embedding” mean??

• Jason Brownlee April 28, 2018 at 5:34 am #

It means, only have the words in the embedding that you know exist in your dataset.

• Maryam April 28, 2018 at 11:45 pm #

Hi Jason,
Thank you for replying, but as I am not a native English speaker, I am not sure whether I got it or not. Do you mean to remove all the words which exist in GloVe but do not exist in my own dataset, in order to raise the speed of implementation?
I am sorry to ask it again as I did not understand clearly.

45. Aiza May 9, 2018 at 6:14 am #

Hi,
This post is great. I am new to machine learning, so I have a question which might be basic. From what I understand, the model takes the embedding matrix and text along with the labels at once. What I am trying to do is concatenate a POS tag embedding with each pre-trained word embedding, but the POS tag can be different for the same word depending upon the context. It essentially means that I can’t alter the embedding matrix added to the network’s embedding layer. I want to take each sentence, find its embedding, concatenate that with the POS tag embedding, and then feed it into the neural network. Is there a way to do the training sentence by sentence or something? Thanks

• Jason Brownlee May 9, 2018 at 6:32 am #

You might be able to use the embedding and pre-calculate the vector representations for each sentence.

• Aiza May 9, 2018 at 7:06 am #

Sorry, but I didn’t quite understand. Can you please elaborate a little?

• Jason Brownlee May 9, 2018 at 2:54 pm #

Sorry, I mean that you can prepare a word2vec model standalone. Then pass each sentence through it to build up a list of vectors. Concat the vectors together and you have a distributed sentence representation.
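A rough sketch of that idea (the `wv` dict stands in for a trained gensim word2vec lookup, with 2-d vectors for brevity):

```python
import numpy as np

# Stand-in for a trained word2vec model's word -> vector lookup.
wv = {"i": np.array([0.1, 0.0]), "am": np.array([0.0, 0.2]), "sad": np.array([0.3, 0.1])}

sentence = ["i", "am", "sad"]
# Concatenate the per-word vectors into one fixed-length sentence representation.
sentence_vector = np.concatenate([wv[w] for w in sentence])
```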

• Aiza May 9, 2018 at 10:16 pm #

Thanks a lot! One more thing: is it possible to pass other information to the embedding layer than just weights? For example, what if I don’t change the embedding matrix at all and create a separate matrix of POS tags for the whole training data, which is also passed to the embedding layer, which then concatenates both sequentially?

• Jason Brownlee May 10, 2018 at 6:32 am #

You could develop a model that has multiple inputs, for example see this post for examples:
https://machinelearningmastery.com/keras-functional-api-deep-learning/

46. Aiza May 15, 2018 at 8:46 am #

Thanks, I saw this post. Your model has separate inputs but they get merged after flattening. In my case I want to pass the embeddings to the first convolutional layer only after they are concatenated. Up till now, what I did was create another integerized sequence of my data according to POS tags (embedding_pos) to pass as another input, and another embedding matrix that contains the embeddings of all the POS tags.
e=(Embedding(vocab_size, 50, input_length=23, weights=[embedding_matrix], trainable=False))
e1=(Embedding(38, 38, input_length=23, weights=[embedding_matrix_pos], trainable=False))
merged_input = concatenate([e,e1], axis=0)
model_embed = Sequential()
model_embed.fit(data,embedding_pos, final_labels, epochs=50, verbose=0)

I know this is wrong, but I am not sure how to concat both sequences; if you can direct me in the right direction, it would be great. The error is:
‘Layer concatenate_6 was called with an input that isn’t a symbolic tensor. Received type: . Full input: [, ]. All inputs to the layer should be tensors.’

• Jason Brownlee May 15, 2018 at 2:42 pm #

Perhaps you could experiment and compare the performance of models with different merge layers for combining the inputs.

47. Franco May 16, 2018 at 6:51 pm #

Hi Jason, awesome post as usual!

Your last sentence is tricky though. You write:

“In practice, I would encourage you to experiment with learning a word embedding using a pre-trained embedding that is fixed and trying to perform learning on top of a pre-trained embedding.”

Without the original corpus, I would argue, that’s impossible.

In Google’s case, the original corpus of around 100 billion words is not publicly available. Solution? I believe you’re suggesting “Transfer Learning for NLP.” In this case, the only solution I see is to add words manually.

E.g. you need ‘dolares’, which is not in Google’s Word2Vec, and you want it to have a vector similar to ‘money’. In this case, you add ‘dolares’ plus the 300-dimensional vector from ‘money’. Very painful, I know. But it’s the only way I see to do “Transfer Learning with NLP”.

If you have a better solution, I’d love your input.

Cheerio, a big fan

• Jason Brownlee May 17, 2018 at 6:30 am #

Not impossible; you can use an embedding trained on another corpus and ignore the difference, or fine tune the embedding while fitting your model.

You can also add missing words to the embedding and learn those.

Remember, we want a vector for each word that best captures its usage; some inconsistencies do not result in a useless model, it is not a binary useful/useless case.

• Franco May 17, 2018 at 4:42 pm #

Thank you very much for the detailed answer!

48. Ashar May 31, 2018 at 6:11 am #

The link for the Tokenizer API is this same webpage. Can you update it please?

49. Andreas Papandreou May 31, 2018 at 10:55 am #

Hi Jason, great post!
I have successfully trained my model using the word embedding and Keras. I saved the trained model and the word tokens.Now in order to make some predictions, do i have to use the same tokenizer with one that i used in the training?

• Jason Brownlee May 31, 2018 at 2:12 pm #

Correct.

• Andreas Papandreou May 31, 2018 at 4:08 pm #

Thank you very much!

50. zedom June 2, 2018 at 5:15 pm #

Hi Jason, when i was looking for how to use pre-trained word embedding,
I found your article along with this one:
https://jovianlin.io/embeddings-in-keras/
They have many similarities.

51. Jack June 13, 2018 at 10:58 pm #

Hey jason,

I am trying to do this, but sometimes Keras gives the same integer to different words. Would it be better to use a scikit-learn encoder that converts words to integers?

• Jason Brownlee June 14, 2018 at 6:08 am #

This might happen if you are using a hash encoder, as Keras does, but calls it a one hot encoder.

Perhaps try a different encoding scheme of words to integers

52. Meysam June 18, 2018 at 5:13 am #

Hi Jason,
I have implemented the above tutorial and the code works fine with GloVe. I am so grateful for the tutorial, Jason.

File “/home/mary/anaconda3/envs/virenv/lib/python3.5/site-packages/gensim/models/keyedvectors.py”, line 171, in __getitem__
return vstack([self.get_vector(entity) for entity in entities])

TypeError: ‘int’ object is not iterable.

I wrote the code as the same as your code which you wrote for loading glove but with a little change.

for line in model:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
model.close()
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_matrix = zeros((vocab_dic_size, 300))
for word in vocab_dic.keys():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[vocab_dic[word]] = embedding_vector

I saw you wrote a tutorial about creating word2vec by yourself in this link “https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/”,
but I have not seen a tutorial about applying a pre-trained word2vec like GloVe.
Please guide me to solve the error and to apply the GoogleNews-vectors-negative300.bin pre-trained word2vec.
I am so sorry to write a lot, as I wanted to explain in detail to be clear.
any guidance will be appreciated.
Best
Meysam

• Jason Brownlee June 18, 2018 at 6:46 am #

Perhaps try the text version as in the above tutorial?

• Meysam June 19, 2018 at 12:28 am #

Hi Jason
thank you very much for replying, but as I am weak in English, the meaning of this sentence is not clear. What do you mean by “try the text version”??
In fact, GloVe comes as txt files and I implemented that correctly, but when I want to run a program with GoogleNews-vectors-negative300.bin, which is a pre-trained word2vec embedding, it gives me the error; also, this file is a binary one and there is no pre-trained word2vec embedding file with a txt extension.
can you help me though I know you are busy?
Best
Meysam

53. Kavit Gangar June 18, 2018 at 7:17 pm #

How can we use pre-trained word embedding on mobile?

• Jason Brownlee June 19, 2018 at 6:29 am #

I don’t see why not, other than disk/ram size issues.

54. NewToDeepNLP June 28, 2018 at 12:17 pm #

Great post! What changes are necessary if the labels are more than binary such as with 4 classes:
labels = array([2,1,1,1,2,0,-1,0,-1,0])
?
E.g. instead of ‘binary_crossentropy’ perhaps ‘categorical_crossentropy’?
And how should the Dense layer change?
If I use model.add(Dense(4, activation=’sigmoid’)), I get an error:

ValueError: Error when checking target: expected dense_1 to have shape (None, 4) but got array with shape (10, 1)

• Jason Brownlee June 28, 2018 at 2:12 pm #

• NewToDeepNLP June 28, 2018 at 3:14 pm #

thanks! also using keras’s to_categorical to discretize the labels was necessary.
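The label change amounts to this (a pure NumPy stand-in for keras.utils.to_categorical; the labels are illustrative):

```python
import numpy as np

labels = np.array([2, 1, 1, 0, 3])  # integer class labels, 4 classes (0..3)
num_classes = 4

# Equivalent of to_categorical(labels): one row per sample, one column per class.
one_hot = np.eye(num_classes)[labels]
# The model then ends with Dense(4, activation='softmax') and 'categorical_crossentropy'.
```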

• NewToDeepNLP June 28, 2018 at 7:19 pm #

one more question: is there a simple way to create the Tokenizer() instance, fit it, save it, and then extend it on new documents? Specifically, so that t.fit_on_texts( ) can be updated on new data.

• Jason Brownlee June 29, 2018 at 5:53 am #

I’m not so sure that you can.

It might be easier to manage the encoding/mapping yourself so that you can extend it at will.

• Jason Brownlee June 29, 2018 at 5:50 am #

Nice.

55. James June 28, 2018 at 9:58 pm #

Hi Jason,

For starters, thanks for this post. Ideal to get things going quickly. I have a couple of questions if you don’t mind:

1) I don’t think that one-hot encoding the string vectors is ideal. Even with the recommended vocab size (50), I still got collisions which defeats the purpose even in a toy example such as this. Even the documentation states that uniqueness is not guaranteed. Keras’ Tokenizer(), which you used in the pre-trained example is a more reliable choice in that no two words will share the same integer mapping. How come you proposed one-hot encoding when Tokenizer() does the same job better?

2) Getting Tokenizer()’s word_index property, returns the full dictionary. I expected the vocab_size to be equal to len(t.word_index) but you increment that value by one. This is in fact necessary because otherwise fitting the model fails. But I cannot get the intuition of that. Why is the input dimension size equal to vocab_size + 1?

3) I created a model that expects a BoW vector representation of each “document”. Naturally, the vectors were larger and sparser [ (10,14) ], which means more parameters to learn, no? However, in your document you refer to this encoding or tf-idf as “more sophisticated”. Why do you believe so? With that encoding don’t you lose the word order, which is important for learning word embeddings? For the record, this encoding worked well too, but that’s probably due to the nature of this little experiment.

• Jason Brownlee June 29, 2018 at 6:05 am #

The keras one hot encoding method really just takes a hash. It is better to use a true one hot encoding when needed.

I do prefer the Tokenizer class in practice.

The words are 1-offset, leaving room for 0 for “unknown” word.

tf-idf gives some idea of the frequency over the simple presence/absence of a word in BoW.

Hope that helps.

56. abbas July 12, 2018 at 3:27 am #

where can I find the file “../glove_data/glove.6B/glove.6B.100d.txt”?? Because I get the following error.
File “”, line 36
f = open(‘../glove_data/glove.6B/glove.6B.100d.txt’)
^
SyntaxError: invalid character in identifier

• Jason Brownlee July 12, 2018 at 6:28 am #

57. Harry July 17, 2018 at 1:02 pm #

Excellent work! This is quite helpful to novice.
And I wonder, is this useful for languages other than English? Since I am Chinese, I wonder whether I can apply this to the Chinese language and vocabulary.
Thanks again for your time devoted!

58. sree harsha July 18, 2018 at 6:02 am #

Hi,

can you explain how can the word embeddings be given as hidden state input to LSTM?

• Jason Brownlee July 18, 2018 at 6:39 am #

Word embeddings don’t have hidden state. They don’t have any state.

• sree harsha July 18, 2018 at 6:57 pm #

for example, I have a word and its 50 (dimensional) embeddings. How can I give these embeddings as hidden state input to an LSTM layer?

• Jason Brownlee July 19, 2018 at 7:47 am #

Why would you want to give them as the hidden state instead of using them as input to the LSTM?

59. zbk July 20, 2018 at 8:39 am #

hi, what a practical post!
I have a question. I work on a sentiment analysis project with word2vec as the embedding model in Keras. My problem is that when I want to predict on a new sentence as input, I face this error:

ValueError: Error when checking input: expected conv1d_1_input to have shape (15, 512) but got array with shape (3, 512)

consider that I want to enter a simple sentence like “I’m really sad”, with length 3, while my input shape has length 15. I don’t know how to reshape it or what to do to get rid of this error.

and this is the related part of my code:

model = Sequential()

• Jason Brownlee July 21, 2018 at 6:27 am #

You must prepare the new sentence in exactly the same way as the training data, including length and integer encoding.
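That preparation can be sketched without Keras (the `word_index` and `max_length` below are illustrative; in practice they come from the fitted Tokenizer and the training setup):

```python
# The same word -> integer mapping used at training time (from the fitted Tokenizer).
word_index = {"i": 1, "really": 2, "sad": 3, "movie": 4}
max_length = 5  # the padded length the network was trained with

sentence = "i really sad".split()
encoded = [word_index.get(w, 0) for w in sentence]    # unseen words map to 0
padded = encoded + [0] * (max_length - len(encoded))  # pad to the training length
```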

60. zbk July 20, 2018 at 6:09 pm #

Would you at least mind sharing some suitable sources for me to solve this problem, please?
I hope you answer my question as you always do. Thanks

• Jason Brownlee July 21, 2018 at 6:32 am #

I’m eager to help and answer specific questions, but I don’t have the capacity to write code for you sorry.

61. Alalehrz July 28, 2018 at 5:13 am #

Hi Jason,

I have two questions. First, for new instances, if the length is greater than this model’s input, shall we truncate the sentence?
Also, since the embedding input covers just the seen training set words, what happens in the prediction-to-word process? I assume it just returns words similar to the training words, not considering all the dictionary words. Of course, I am talking about a language model with GloVe.

• Jason Brownlee July 28, 2018 at 6:40 am #

Yes, truncate.

Unseen words are mapped to nothing or 0.

The training dataset MUST be representative of the problem you are solving.

• az July 31, 2018 at 3:30 am #

From what I understood from this comment, it is about prediction on test data. Let’s assume there are 50 words in the vocabulary, which means sentences will have unique integers up to 50. Now, since test data must be tokenized with the same instance of the tokenizer, if it has some new words, would it have integers 51, 52 and so on? In this case, would the model automatically use 0 for the word embeddings, or could it raise an out-of-bounds type exception? Thanks

• Jason Brownlee July 31, 2018 at 6:11 am #

You would encode unseen words as 0.

• Yuedong Wu August 30, 2020 at 4:25 pm #

All 50 vocabulary words should use indexes 1 to 50, while leaving 0 for words unseen in the vocabulary. Am I right?

• Jason Brownlee August 31, 2020 at 6:07 am #

Correct.

62. Jaskaran July 29, 2018 at 11:21 pm #

can this be described as transfer learning?

• Jason Brownlee July 30, 2018 at 5:50 am #

Perhaps.

• Hayden Stephan December 3, 2021 at 10:44 am #

We need to know for our homework pls help !!

• Adrian Tam December 8, 2021 at 6:49 am #

63. eden July 31, 2018 at 1:07 am #

Hi,
I have trained and tested my own network. During my work, when I integerized the sentences and created a corresponding word embedding matrix, it included embeddings for the train, validation and test data as well.
Now, if I want to reload my model to test on some other similar data, I am confused about how the words from this new data would relate to the embedding matrix.
You should have embeddings for the test data as well, right? Or when you create the embedding matrix, do you exclude the test data? Thanks

• Jason Brownlee July 31, 2018 at 6:09 am #

The embedding is created from the training dataset.

It should be sufficiently rich/representative enough to cover all data you expect to in the future.

New data must have the same integer encoding as the training data prior to being mapped onto the embedding when making a prediction.

Does that help?

• eden July 31, 2018 at 8:04 am #

Yes, I understand that I should be using the same tokenizer object for encoding both train and test data, but I am not sure how the embedding layer would behave for a word or index which isn’t part of the embedding matrix. Obviously the test data would have similar words, but there are bound to be some new words. Would you say it is right to include the test data too while creating the embedding matrix for the model? If I want to predict using some pre-trained model, how can I deal with this issue? A small example would be really helpful. Thanks a lot for all the help and time!

• Jason Brownlee July 31, 2018 at 2:55 pm #

It won’t. The encoding will set unknown words to 0.

It really depends on the goal of what you want to evaluate.

64. Jaskaran August 1, 2018 at 5:04 pm #

I want to train my model to predict the target word given a 5-word sequence. How can I represent my target word?

• Jason Brownlee August 2, 2018 at 5:56 am #

Probably using a one hot encoding.
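Sketched in NumPy (the vocab size and index are illustrative), the target for a softmax over the vocabulary would be:

```python
import numpy as np

vocab_size = 10
target_index = 7  # integer assigned to the word to be predicted

y = np.zeros(vocab_size)
y[target_index] = 1.0  # one hot target, matching a Dense(vocab_size, activation='softmax')
```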

65. Bharath August 3, 2018 at 12:48 pm #

Hello Jason,

This is regarding the output shape of the first embedding layer : (None,4,8).
Am I correct in understanding that the 4 represents the input size which is 4 words and the 8 is the number of features it has generated using those words?

66. Sreenivasa August 10, 2018 at 7:33 pm #

Hi Jason,
My task is to classify a set of documents into different categories (I have a training set of 100 documents and, say, 10 categories).
The idea is to extract the top M words (say 20) from the first few lines of each doc, convert the words to word embeddings and use them as feature vectors for the neural network.

Question: Since I take the top M words from each document, they may not be in the “right” order each time, meaning there can be different words at a given position in the input layer (unlike the bag of words model). Won’t this approach prevent the neural network from converging?

Regards,
Srini

• Jason Brownlee August 11, 2018 at 6:08 am #

The key is to assign the same integer value to each word prior to feeding the data into the embedding.

You must use the same text tokenizer for all documents.

67. Fatemeh August 17, 2018 at 7:15 am #

Hi Jason,
Thank you for your great explanation. I have used the pre-trained Google embedding matrix in my seq2seq project using an encoder-decoder, but in my test I have a problem: I don’t know how to reverse my embedding matrix. Do you have a sample project? My solution is: when my decoder predicts a vector, I search for it in my pre-trained embedding matrix, find its index, and from that recover the related word. Am I right?

• Jason Brownlee August 17, 2018 at 7:40 am #

Why would you need to reverse it?

• Vivien September 29, 2018 at 1:45 pm #

Hi Jason

Thanks for an excellent tutorial. Using your methods, I’ve converted text into word index and applied word embeddings.

Like Fatemeh, I’m wondering if it’s possible to reverse the process, and convert embedding vectors back into text? This could be useful for applications such as text summarising.

Thank you.

• Jason Brownlee September 30, 2018 at 6:00 am #

Yes, each vector has an int value known by the embedding and each int has a mapping to a word via the tokenizer.

Random vectors in the space do not; you will have to find the closest vector in the embedding using Euclidean distance.
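That nearest-vector search can be sketched with NumPy (the toy embedding and word list below are illustrative, not real trained vectors):

```python
import numpy as np

# Toy embedding: one row per word, plus a word list aligned with the rows.
words = ["unknown", "good", "bad", "great"]
embedding = np.array([[0.0, 0.0],
                      [0.9, 0.1],
                      [-0.8, 0.2],
                      [0.85, 0.15]])

def nearest_word(vector):
    # Euclidean distance from the query vector to every embedding row.
    distances = np.linalg.norm(embedding - vector, axis=1)
    return words[int(np.argmin(distances))]

print(nearest_word(np.array([0.88, 0.12])))  # closest row is "good"
```

The same idea scales to a real embedding matrix: index of the closest row, then the tokenizer's index-to-word mapping gives the word.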

68. fatma August 18, 2018 at 10:09 am #

Dear Dr. Jason,

I get Accuracy: 89.999998 on my laptop. Can the result differ from one computer to another?

69. Ivan September 1, 2018 at 2:00 pm #

Hi,

So many thanks for this tutorial!

I’ve been trying to train a network that consists of an Embedding layer, an LSTM and a softmax as the output layer. However, it seems that the loss and accuracy get stuck at some point.

70. Aramis September 28, 2018 at 10:28 am #

Thank you so much,
It helped me a lot in learning how to use pre-trained embeddings in neural nets.

71. Rafael sa October 24, 2018 at 5:28 am #

Hi Jason, thank you for the great material.

I have one doubt. I want to make embeddings of a list of 1200 documents to use as input to a classification model that predicts movie box office based on the movie script text.
My question is: if I train the embedding on the vocabulary of the real dataset, how can I afterwards classify the rest of the dataset that was not trained on? Can I use the embeddings learned during training as input to the classification model?

• Jason Brownlee October 24, 2018 at 6:33 am #

Good question.

You must ensure that the training dataset is representative of the broader problem. Any words unseen during training will be marked zero (unknown) by the tokenizer unless you update your model.

• Rafael SA November 1, 2018 at 9:06 pm #

Thank You Jason. As soon as I get the results I’ll try to share it here.
I’d like to thank you too about your great platform, it is being very helpful to me.

72. Dimitris October 26, 2018 at 2:06 am #

Nice post once again! It seems that in each batch all embeddings are updated, which I think should not happen. Do you have any idea how to update only the ones that are passed in each batch? I ask for computational reasons, or for other reasons related to the problem definition.

• Jason Brownlee October 26, 2018 at 5:38 am #

I’m not sure what you mean exactly, can you elaborate?

73. Mahdi November 18, 2018 at 1:57 am #

Hello Jason, I would like to thank you for this post; it’s really interesting and understandable.

I’ve reused the script, but instead of using the “docs” and “labels” lists I used the IMDB movie reviews dataset. The problem is that I can’t reach more than 50% accuracy, and the loss is stuck at 0.6932 across all epochs.

What do you think about that ?

74. Mehran November 18, 2018 at 12:55 pm #

Thanks for the article. Could you also provide an example of how to train a model with only one Embedding layer? I’m trying to do the same with Keras but the problem is that the fit method asks for labels which I don’t have. I mean I only have a bunch of text files that I’m trying to come up with the mapping for.

• Jason Brownlee November 19, 2018 at 6:42 am #

Models typically only have one embedding layer. What do you mean exactly?

75. Vishal November 22, 2018 at 4:07 pm #

Hello,

Thank you for the excellent explanation!

I have a few questions related to unknown words.

Some pretrained word embeddings like the GoogleNews embeddings have an embedding vector for a token called ‘UNKNOWN’ as well.

1. Can I use this vector to represent words that are not present in the training set instead of the vector of all zeros? If so, how should I go about loading this vector into the Keras Embedding layer? Should it be loaded at the 0th index in the embedding matrix?

2. Also, can I use the Tokenizer API to help me convert all unknown words (words not in the training set) to ‘UNKNOWN’?

Thank you.

• Jason Brownlee November 23, 2018 at 7:43 am #

Yes, find the integer value for the unknown word and use that to assign to all words not in the vocab.
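One way this might be wired up, sketched with NumPy (the word list and vectors are placeholders, not the real GoogleNews values): put the pre-trained ‘UNKNOWN’ vector at row 0 of the embedding matrix, so the tokenizer’s default index 0 picks it up for out-of-vocab words.

```python
import numpy as np

embedding_dim = 4
word_index = {"good": 1, "work": 2}          # 1-based, as the Keras Tokenizer produces
pretrained = {"good": np.ones(embedding_dim),
              "work": np.full(embedding_dim, 2.0),
              "UNKNOWN": np.full(embedding_dim, -1.0)}

vocab_size = len(word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
embedding_matrix[0] = pretrained["UNKNOWN"]  # row 0 catches unseen words
for word, i in word_index.items():
    embedding_matrix[i] = pretrained[word]

# An unseen word is encoded as 0 and so receives the UNKNOWN vector.
print(embedding_matrix[0])
```

This matrix can then be loaded into the Embedding layer exactly as in the GloVe example in the tutorial.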

76. Saurabh November 27, 2018 at 4:46 pm #

Hi,
If the word embedding doesn’t contain a word we input to the model, how do we address this issue?
1) Is it possible to load additional words (besides those in our vocabulary) into the embedding matrix?
Or is there another, more elegant way you would like to suggest?

• Jason Brownlee November 28, 2018 at 7:38 am #

It is marked as “unknown”.

77. mohammad November 28, 2018 at 11:15 pm #

Hi, thanks a lot for your post. I’m new to Python and deep learning!

I have a 240,000-tweet training set (50% male and 50% female classes) and a 120,000-tweet test set (50% male and 50% female). I want to use an LSTM in Python, but I get the following error at the “fit” method:

ValueError: Error when checking input: expected lstm_16_input to have 3 dimensions, but got array with shape (120000, 400)

can you help me?

• Jason Brownlee November 29, 2018 at 7:40 am #

It looks like a mismatch between your data and the model, you can change the data or change the model.

78. Mohamed December 5, 2018 at 3:46 am #

I am getting this error

TypeError: ‘OneHotEncoder’ object is not callable

How do I overcome it?

Thanks

79. Rushi December 6, 2018 at 5:40 pm #

Hi, I have 2 models with embedding layers; how do I merge those models?

Thanks

• Jason Brownlee December 7, 2018 at 5:17 am #

What do you mean exactly? An ensemble model?

80. Utkarsh Rai December 7, 2018 at 12:29 am #

Hi Jason, great tutorial. I am very new to all this. I have a query: you are using GloVe for the embedding layer, but during fitting you are directly using padded_docs. The vectors in padded_docs have no correlation to GloVe. I am sure that I am missing something; please enlighten me.

• Jason Brownlee December 7, 2018 at 5:23 am #

The padding just adds ‘0’ to ensure the sequences are the same length. It does not affect the encoding.

81. Eduardo Andrade December 7, 2018 at 3:48 pm #

Hi, Jason. Considering the “3. Example of Learning an Embedding”, I’m adding “model.add(LSTM(32, return_sequences=True))” after the embedding layer and I would like to understand what happens. The number of parameters returned for this LSTM layer is “5248” and I don’t know how to calculate it. Thank you.

• Jason Brownlee December 8, 2018 at 6:58 am #

Each unit in the LSTM will take the entire embedding as input, therefore must have one weight for each dimension in the embedding.
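The 5,248 figure can be checked by hand: an LSTM has four gates, each with input weights (one per embedding dimension), recurrent weights (one per unit), and a bias.

```python
embedding_dim = 8   # input to the LSTM: one embedded word per time step
units = 32          # LSTM(32, ...)

# Each of the 4 gates has input weights, recurrent weights, and biases.
params = 4 * (embedding_dim * units + units * units + units)
print(params)  # 5248
```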

82. Vic December 18, 2018 at 10:12 am #

Hi Jason,

Do you have any example showing how we can use a bi-directional LSTM on text (i.e., using word embeddings)?

83. Matt December 24, 2018 at 9:57 pm #

I am interested in using the predict function to predict new docs. For instance, ‘Struck out!’ My understanding is that if one or more words in the doc you want to predict weren’t involved in training, then the model can’t predict it. Is the solution to simply train on enough docs to make sure the vocabulary is extensive enough to make new predictions in this way?

• Jason Brownlee December 25, 2018 at 7:21 am #

Yes, or mark new words as “unknown” at prediction time.

84. Ajay December 26, 2018 at 11:53 pm #

Hello Jason, Is there any reason that the output of the Embedding layer is a 4×8 matrix?

• Jason Brownlee December 27, 2018 at 5:43 am #

No, it is just an example to demonstrate the concept.

85. Zeyu December 30, 2018 at 7:09 am #

Hi, Jason. Thanks a lot for this excellent tutorial. I have a quick question about the Keras Embedding layer.

vocab_size = len(t.word_index) + 1

t.word_index starts from 1 and ends with 15; therefore, there are 15 words in total in the vocabulary. Then why do we need to add 1 here?

Thanks a lot for the help!

• Jason Brownlee December 31, 2018 at 6:02 am #

The words are 1-offset and index 0 is left for “unknown” words.

• Niccola March 3, 2021 at 7:49 am #

Would you mind elaborating on this answer? It is still not quite clear to me.

• Jason Brownlee March 3, 2021 at 8:09 am #

Sure, which part is not clear?

• Niccola March 3, 2021 at 1:49 pm #

What exactly does 1-offset mean in this context?

And if “index 0 is left for unknown words”, wouldn’t that imply that you could ignore them?

And if I use this instead:

vocab_size = len(t.word_counts.keys()) + 1

Do I also have to add the 1?

• Jason Brownlee March 3, 2021 at 1:59 pm #

Words not in the vocab are assigned the value of 0 which maps to the index of the first vector and represents unknown words.

The first word in the vocab is mapped to the vector at index 1, the second word maps to the vector at index 2, and so on until the total number of vectors is handled.

Does that help?

• Niccola March 3, 2021 at 2:09 pm #

Thank you for your response, but I am not quite getting this. My understanding is, when I work with t.word_counts instead of t.word_index (not sure if it makes a difference) that I get something like this:

OrderedDict([('word1', 35), ('word2', 232)])

Then, if I use len(t.word_counts), it gives me 2 in this case. Why am I then adding 1?

That is, if I use t.word_counts I don’t see any unknown word when I print it out.

• Jason Brownlee March 4, 2021 at 5:44 am #

We are not adding 1 to the counts. We are not touching the ordered dict of words.

We are just adding one more word to our vocabulary called “unknown” that maps to the first vector in the embedding.

• Niccola March 3, 2021 at 5:29 pm #

The tensorflow documentation also says to add 1, but I am really not sure why:

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

• Jason Brownlee March 4, 2021 at 5:46 am #

To make room for “unknown” – words not in the vocab.

Adding 1 means adding one word to the vocab, the first word mapped to index 0 – word number zero.

• Niccola March 4, 2021 at 8:05 am #

Are you aware of any resource that might explain why we have to add 1? I understand it is because of unknown words, so we increase the given vocab size by 1. But I don’t understand why that is.

• Jason Brownlee March 4, 2021 at 8:24 am #

We don’t have to do it, we could simply ignore words that are not in the vocab.
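The whole exchange can be compressed into a few lines: the Tokenizer’s integer ids start at 1 and are used directly as row indices into the embedding matrix, so the matrix needs one extra row to make the largest id a valid index (the toy word_index is illustrative):

```python
import numpy as np

word_index = {"well": 1, "done": 2, "good": 3}  # as the Keras Tokenizer builds it: 1-based

vocab_size = len(word_index) + 1   # +1 so that id 3 is a valid row index
embedding = np.zeros((vocab_size, 8))

# Without the +1, the largest id would index past the end of the matrix,
# and row 0 would have no role; with it, row 0 is free for unknown words.
print(embedding[word_index["good"]].shape)  # (8,)
```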

86. Anna January 3, 2019 at 4:13 pm #

Hi Jason,

If I have three columns of (string type) multivariate data, where one column is categorical and the other two are not, is it OK if I integer encode them using LabelEncoder(), and then scale the encoded data using a feature scaling method like MinMaxScaler or StandardScaler before feeding it into an anomaly detection model? The ROC shows an impressive result, but is it valid to pre-process text data like that?

Thank you.

• Jason Brownlee January 4, 2019 at 6:26 am #

Perhaps try it and compare performance.

• Anna January 4, 2019 at 2:49 pm #

I have tried it and it shows nearly 100% ROC. What I mean is: is it correct to pre-process the text data like that? Because when I checked your post on pre-processing text data, there was no feature scaling (MinMax, StandardScaler, etc.) on text data after encoding it to integers. I’m afraid the way I pre-process data is not accurate.

• Jason Brownlee January 5, 2019 at 6:47 am #

Generally text is encoded as integers then either one hot encoded or mapped to a word embedding. Other data preparation (e.g. scaling) is not required.

87. Kafeel Basha January 6, 2019 at 7:46 pm #

Hello

I was trying to do multi class classification of text data using Keras in R and Python.

In Python I was able to get predicted labels from encoded class values using the inverse_transform() method. But when I try to do the same in R using the CatEncoders library, I get some of the labels as NAs. Any reason for that?

• Jason Brownlee January 7, 2019 at 6:28 am #

No need to transform the prediction, the model can make a class or probability prediction directly.

88. Li Xiaohong January 7, 2019 at 12:49 am #

Hi Jason,

Thanks for sharing! I have a question on word embeddings. Correct me if I am wrong: I noticed the word embedding created here only contains words in the training/test set. I would think a word embedding including all vocab in the GloVe file would be better. For example, if in production we encounter a new word that was not in the training/test set but is part of the GloVe vocab, we could still capture its meaning even though we didn’t see it during training. I think this would benefit sentiment classification problems with smaller training sets?

Thanks!
Regards
Xiaohong

• Jason Brownlee January 7, 2019 at 6:36 am #

Generally, you carefully choose your vocab. If you want to maintain a larger vocab than is required of the project “just in case”, go for it.

89. Rahul Sangole January 17, 2019 at 5:12 am #

Jason,

Are there non-text applications of embeddings? For example – I have large sets of categorical variables, each with very large number of levels, which go into a classification model. Could I use embeddings in such a case?

Rahul

• Jason Brownlee January 17, 2019 at 5:30 am #

Yes, embeddings are fantastic for categorical data!

90. Dipawesh Pawar January 18, 2019 at 2:11 am #

Hi Jason…

Thanks a lot for such a nice post. It enriched my knowledge a lot.

I have one doubt about the text_to_sequence and one_hot methods provided by Keras. Both of them give the same encoded docs in your example. If they give the same output, when should we use text_to_sequence and when should we go for one_hot?

Again, thanks a lot for such a nice post.

• Jason Brownlee January 18, 2019 at 5:46 am #

Use the approach that you prefer.

91. Ritesh January 21, 2019 at 6:18 pm #

Hi Jason,

Your post are really superb. Thanks for writing such great post .

I have one query: why do people use an Embedding layer when we have already got the vector representation of a word from word2vec or GloVe? Using these two pre-trained models we already get a same-size vector representation of each word, and if a word is not found we can assign a random value of the same size. After getting the vector representations, why are we passing them to the Embedding layer?

Thanks

• Jason Brownlee January 22, 2019 at 6:21 am #

Often the learned embedding in the neural net performs better because it is specific to the model and prediction task.

92. Ritesh January 22, 2019 at 4:54 pm #

Hi Jason,

What if I set trainable = False? Is the Embedding layer still needed when I already have a vector representation of each word in the sequence from word2vec or GloVe?

Thanks

• Jason Brownlee January 23, 2019 at 8:45 am #

Yes, you can keep the embedding fixed during training if you like. Results are often not as good in my experience.

93. hayj February 1, 2019 at 10:08 pm #

Hello Jason, thank you for this very good tutorial. I have a question: I trained your model on the IMDB sentiment analysis dataset. The model has 80% accuracy, but I have very bad embeddings.

First, a short detail: I used one_hot, which uses a hashing trick, but I used md5 because the default hash function is not consistent across runs; this is mentioned in the Keras docs (so for saving the model and predicting new documents it is not good, am I right?).

But the important thing is that I have very bad embeddings. I created a dict which maps lower-cased words to embeddings (of size 8), following this: https://stackoverflow.com/questions/51235118/how-to-get-word-vectors-from-keras-embedding-layer I didn’t use GloVe vectors for now.

I tested searching for the most similar words and I got random words for “good” (holt, christmas, stodgy, artistic, szabo, mandatory…). I set the vocab size to 100000. Of course, due to the hashing trick, 2 words can have the same index, so I don’t take into account similarities of 1.0. I think the bad embedding vectors are due to the fact that we train embeddings on the entire document rather than on a context window like word2vec. What do you think?

• Jason Brownlee February 2, 2019 at 6:17 am #

Generally, the embeddings learned by a model work better than pre-fit embeddings, at least in my experience.

94. Despoina February 7, 2019 at 6:47 pm #

Hello, such a great tutorial. All your tutorials are very helpful! Thank you.

I want to find embeddings of three different Greek sentences (for classification), then merge them pairwise and fit my model.
I have read your tutorial ‘How to Use the Keras Functional API for Deep Learning’, which is very helpful for the merge.
My question is: is there any way to calculate the embeddings beforehand to use as input to my model? Must I have three different models to calculate the embeddings?

• Jason Brownlee February 8, 2019 at 7:46 am #

Yes, you could prepare the input manually if you wish: each word is mapped to an integer, and the integer will be an index in the embedding. From that you can retrieve the vector for each word and use that as input to a model.
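As a sketch of this manual preparation (the word_index and the random matrix standing in for trained weights are illustrative), the lookup is just row indexing:

```python
import numpy as np

word_index = {"the": 1, "film": 2, "was": 3, "great": 4}
embedding_matrix = np.random.rand(len(word_index) + 1, 8)  # stand-in for trained weights

# Map each word to its integer, then use the integer as a row index.
sentence = "the film was great"
vectors = np.array([embedding_matrix[word_index[w]] for w in sentence.split()])
print(vectors.shape)  # (4, 8) — one 8-dimensional vector per word
```

The resulting array of vectors can then be fed to a downstream model directly, with no Embedding layer required.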

• Despoina February 8, 2019 at 8:56 pm #

Thank you!!!

95. Mohit February 10, 2019 at 12:19 am #

Hi Jason,

Thank you so much for this wonderful explanation. After reading many other resources I understand the embedding layers only after reading this. I have few questions and I’d really appreciate if you could take out the time and answer them.

1) In your code, you used a Flatten layer after the Embedding layer and before the Dense layer. In a few other places I noticed that a GlobalAveragePooling1D() layer is used in place of the Flatten. Can you explain what global average pooling does and why it’s used for text classification?

2) You explained in one of the comments that each word will have only one vector representation before and after the training. So just to confirm: when a word x is input to the embedding layer, does training always update the same vector that represents x? For example, for vocab size 500 and embedding dimension 10 ([500, 10] weight shape), if word x maps to the first vector ([0, 10]), then every time word x is input the first vector ([0, 10]) will be updated, and not updated if the word is not present?

3) What’s the intuition behind choosing the size of the embedding dimension?

Thank you again Jason. Will be waiting for your response.
Mohit

• Jason Brownlee February 10, 2019 at 9:43 am #

Either approach can be used, both do the same thing. Perhaps try both and see what works best for your specific dataset.

The vectors for a given word are learned. After the model is finished training, the vectors are fixed for each word. If a word is not present during training, the model will not be able to support it.

Embedding must be large enough to capture the relationship between words. Often 100 to 300 is more than enough for a large vocab.

96. vamsi February 15, 2019 at 4:27 pm #

Very useful post for beginners. But I have a doubt regarding the size of the embedding vector. As you mentioned in the post:
“The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.
The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. ”

I did not understand why the output from the embedding layer should be 4 vectors. If it is related to the input length, please explain how. I also did not understand the phrase “one for each word”.

• Jason Brownlee February 16, 2019 at 6:15 am #

A good rule of thumb is to start with a length of 50, and try increasing to 100 or 200 and see if it impacts model performance.

97. Ratha February 21, 2019 at 9:14 pm #

Hi Jason, I’m facing trouble while using bag of words as features in a CNN. Do you have any idea how to implement a BoW-based CNN?

• Jason Brownlee February 22, 2019 at 6:18 am #

It is not clear to me why/how you would pass a BoW encoded doc to a CNN – as all spatial information would be lost. Sounds like a bad idea.

98. Alexa February 21, 2019 at 10:17 pm #

Hi! Thanks for a good tutorial. I have a question regarding embeddings. How does the performance compare when training embeddings with an Embedding layer vs. using pre-trained embeddings? Is it faster, and does the model require less training time when using a pre-trained embedding?

• Jason Brownlee February 22, 2019 at 6:19 am #

Embeddings trained as part of the model seem to perform better in my experience.

99. Despoina February 24, 2019 at 8:16 pm #

Hello, great post!

I want to know more about vocabulary size in Keras Embedding Layer. I am working with Greek and Italian languages. Do you have any scientific paper to suggest?

Thank you very much!

• Jason Brownlee February 25, 2019 at 6:40 am #

Perhaps test a few different vocab sizes for your dataset and see how it impacts model performance?

• Despoina February 25, 2019 at 10:45 pm #

Ok, thank you very much!

100. Yoni Pick February 25, 2019 at 6:48 am #

Hi Jason

What needs to change when using an unsupervised model on a pre-trained embedding matrix?

Also, what if you want to use weak supervision?

101. ark March 5, 2019 at 9:58 pm #

Good

102. Hadij March 10, 2019 at 7:47 am #

Hi Jason,
For the one-hot encoder, if we are given a new test set, how can we use the one_hot function to get the same matrix space as we had for the training set? Since we cannot have a separate one-hot encoder for the test set.
Thank you very much.

• Jason Brownlee March 10, 2019 at 8:19 am #

You can keep the encoder objects used to prepare the data and use them on new data.

103. Miguel Won March 31, 2019 at 1:01 am #

Hi,
I often see implementations like yours, where the embedding layer is built from a word_index that contains all words, built e.g. with the Keras preprocessing Tokenizer API. But if I fit my corpus with the Tokenizer using a vocabulary size limit (with num_words), why should I need an embedding layer the size of the total number of unique words? Wouldn’t that be a waste of space? Is there any issue with building an embedding layer with a size suited to my vocabulary size limit?

• Jason Brownlee March 31, 2019 at 9:31 am #

Not really, the space is small.

It can be good motivation to ensure your vocab only contains words required to make accurate predictions.

104. Leena Agrawal April 15, 2019 at 9:47 pm #

Hi Mr Jason,

Excellent tutorial indeed!

e = Embedding(200, 32, input_length=50)

How do we decide the size of output_dim, which is 32 here? Is there any specific reason for this value?

Leena

• Jason Brownlee April 16, 2019 at 6:48 am #

You can use trial and error and evaluate how different sized output dimensions impact model performance on your specific dataset.

50 or 100 is a good starting point if you are unsure.

105. Mohammad April 16, 2019 at 3:28 pm #

Hi,
I trained my model using a word embedding with GloVe, but kindly let me know how to prepare the test data for predicting results with the trained weights. I still have not found any post which follows the whole process, especially for word embeddings with GloVe.

• Jason Brownlee April 17, 2019 at 6:51 am #

The word embedding cannot predict anything. I don’t understand, perhaps you can elaborate?

106. Alex April 19, 2019 at 12:20 pm #

Hello, Jason,

Thanks for the post. I have a question about the embedding data you actually fit. I printed padded_docs after the model compile. It seems to me that the printed matrix is not an embedding matrix; it’s an integer matrix. So I think what you fit to the CNN is not an embedding but the integer matrix you define. Could you please help explain it? Thanks a lot.

• Jason Brownlee April 19, 2019 at 3:05 pm #

Yes, padded_docs contains integers that are fed to the embedding layer, which maps each integer to an 8-element vector.

The values of these vectors are then defined by training the network.
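In other words, the Embedding layer acts as a lookup table. A NumPy sketch of what happens to one padded document (toy values):

```python
import numpy as np

embedding = np.random.rand(50, 8)     # vocab of 50, 8 dimensions, as in the example
padded_doc = np.array([6, 2, 0, 0])   # a 2-word doc padded with zeros to length 4

output = embedding[padded_doc]        # one 8-dim vector per integer
print(output.shape)  # (4, 8)
```

Both padding zeros pick up the same row (index 0); during training, the rows of the real embedding matrix are adjusted by backpropagation.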

107. Oscar April 21, 2019 at 4:25 am #

Hi Jason,

I am working on character embedding. My dataset consists of raw HTTP traffic both normal and malicious. I have used the Tokenizer API to integer encode my data with each character having an index assigned to it.

Please let me know if I understood this correctly:

My data is integer encoded to values between 1-55, therefore my input_dim is 55.

I will start with an output_dim of 32 and modify this value as needed.

Now for the input_length, I am a bit confused about how to set this value.
I have different lengths for the numerical strings in my dataset; the longest is 666. Do I set input_length to 666? And if I do, what will happen to the shorter sequences?

• Oscar April 21, 2019 at 6:39 am #

Also, should I set the input dim to a value higher than 55?

• Jason Brownlee April 21, 2019 at 8:27 am #

Do you mean word embedding instead of char embedding?

I don’t have any examples of embedding char’s – I’m not sure it would be effective.

• Oscar April 21, 2019 at 5:34 pm #

I meant character embedding. I used the Tokenizer and set the character level to True.
I am not sure how to use word embeddings for the query strings of HTTP traffic when they are not made of real words, just strings of characters.
I am designing a character-level neural network for detecting parameter injection in HTTP requests.
The result would be in a binary format: 0 if the request is normal and 1 if it’s malicious.
So you don’t think character embedding is helpful here?

• Jason Brownlee April 22, 2019 at 6:21 am #

Sorry, I don’t have an example of a character embedding.

Nevertheless, you should be able to provide strings of integer-encoded chars to the embedding in your model. It will look much like an embedding for words, just with lower cardinality (<100 chars perhaps). Also, I don’t expect good results.

What problem are you having exactly?

108. Mario May 2, 2019 at 6:09 pm #

I have found vocab_size = len(t.word_index) + 1 to be wrong. This index not only ignores the Tokenizer(num_words=X) parameter, but also stores more words than are actually ever going to be encoded.

I fit my text without a word limit, then encode the same text using the tokenizer, and the length of the word_index is larger than max(max(encoded_texts)).

• Jason Brownlee May 3, 2019 at 6:16 am #

That is very surprising!

Are you sure there is no bug in your code?

• Marshal Ma May 15, 2019 at 9:25 am #

Hi Mario,

Yea. I just noticed the num_words issue with keras Tokenizer today. I think there are already multiple issues regarding it logged on GitHub:

https://github.com/keras-team/keras/issues/8092

109. lamesa May 17, 2019 at 6:46 pm #

Hello Jason, how are you?
I am doing my masters thesis on text summarization using word embeddings, and now I am in the middle of many questions: how could I use these features, and which neural network algorithm is best? Please could you give some guidance…

110. lamesa May 20, 2019 at 9:36 pm #

111. Siddhartha June 9, 2019 at 6:06 pm #

Hi Jason,

This entire article is very useful. It helped me in writing my initial implementation.
I have one question related to Input() and Embedding() in Keras.
If I already have a pre-trained word embedding, should I use Embedding or Input in that case?

• Jason Brownlee June 10, 2019 at 7:36 am #

Yes, the embedding vectors are loaded into the Embedding layer and the layer may be marked as not trainable.
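A minimal sketch of loading fixed vectors into an Embedding layer (the random matrix stands in for loaded GloVe vectors; this uses set_weights rather than the weights=[...] constructor argument shown in the tutorial, which achieves the same thing):

```python
import numpy as np
from tensorflow.keras.layers import Embedding

embedding_matrix = np.random.rand(100, 50)  # stand-in for loaded GloVe vectors

layer = Embedding(input_dim=100, output_dim=50, trainable=False)
_ = layer(np.array([[0]]))              # call once so the weights are created
layer.set_weights([embedding_matrix])   # load the pre-trained vectors

out = np.asarray(layer(np.array([[4, 7, 0]])))
print(out.shape)  # (1, 3, 50)
```

With trainable=False the vectors stay fixed while the rest of the model trains; set it to True to fine-tune them.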

112. alphabeta June 23, 2019 at 7:44 pm #

What if at test time we have new words which were not in the training text? They will not proceed past the embedding layer, correct?

• Jason Brownlee June 24, 2019 at 6:23 am #

They will be mapped to 0, or “unknown word”.

113. Zeinab June 29, 2019 at 10:53 pm #

Hi, Jason
I want to ask how I can save my learned word embedding.

114. Zeinab July 2, 2019 at 4:03 am #

Can I construct a network for learning the embedding matrix only?

115. zeinab July 10, 2019 at 6:28 pm #

Hi,
I have a text similarity application where I measure Pearson correlation coefficient as keras metrics.
In many epochs, I noticed that the correlation value is nan.
Is this normal, or is there a problem in the model?

• Jason Brownlee July 11, 2019 at 9:46 am #

You may have to debug the model to see what the cause of the NAN is.

• Zeinab July 12, 2019 at 2:55 pm #

Do you mean that I have to adjust the activation function?

I use elu activation function and Adam optimization function,

Do you mean that I have to change any of them and see the results?

• Jason Brownlee July 13, 2019 at 6:52 am #

Perhaps.

Try relu.
Try batch norm.
Try smaller learning rate.

• Zeinab July 13, 2019 at 10:19 pm #

Can I ask what you mean by debugging the model?

• Jason Brownlee July 14, 2019 at 8:10 am #

Yes, here are some ideas:

– Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
– Consider cutting the problem back to just one or a few simple examples.
– Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
– Consider posting your question and code to StackOverflow.

116. Leon July 20, 2019 at 12:19 pm #

vocab_size = len(t.word_index) + 1

why do we need to increase the vocabulary size by 1???

• Jason Brownlee July 21, 2019 at 6:23 am #

This is so that we can reserve 0 for unknown words and start known words at 1.

117. Zineb_Morocco July 26, 2019 at 5:48 am #

Thanks a lot for this wonderful work. We’re in July 2019 and still taking advantage of it.

• Jason Brownlee July 26, 2019 at 8:33 am #

Thanks. I try to write evergreen tutorials that remain useful.

118. Christiane July 29, 2019 at 5:24 am #

Dear Jason,

thanks a lot for your detailed explanations and the code examples.

What I am still wondering about, however, is how I would combine embeddings with other variables, i.e. having a few numeric or categorical variables and then one or two text variables. As I see it, you input the padded docs (only text) into model.fit. But how would I add the other variables? It doesn’t seem realistic to always have only one text variable as input.

• Jason Brownlee July 29, 2019 at 6:22 am #

Good question. You can have a separate input to the model for each embedding, and one for other numeric static vars.

This is called a multi-input model.
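A sketch of such a multi-input model with the Keras functional API (the layer sizes are illustrative):

```python
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model

# Input 1: a document of 4 word ids, passed through an embedding.
text_in = Input(shape=(4,))
text_features = Flatten()(Embedding(input_dim=50, output_dim=8)(text_in))

# Input 2: a few ordinary numeric features, used as-is.
numeric_in = Input(shape=(3,))

# Merge both branches and classify.
merged = Concatenate()([text_features, numeric_in])
output = Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[text_in, numeric_in], outputs=output)
print(model.output_shape)  # (None, 1)
```

Fitting then takes a list of arrays, one per input: model.fit([padded_docs, numeric_data], labels).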

119. Zineb_Morocco July 30, 2019 at 11:43 pm #

Thank you Jason for this wonderful work and examples. That really help.

120. Youssef MELLAH August 4, 2019 at 6:19 am #

Thank you Jason Brownlee, that’s very interesting and clear.

And what if I have two inputs?

For example, I am working on a Text-to-SQL task, and that requires 2 inputs: the user question and the table schema (column names).

How should I proceed? How do I do the embedding? With 2 embedding layers?

Thank u for help.

121. Youssef MELLAH August 4, 2019 at 7:30 am #

Ah okay, that’s interesting too, thanks!!

Can you please confirm the architecture below for encoding both user questions and the table schema in the same model?

(1) userQuestion ==> One-hot-encoding ==> Embedding(GloVe) ==> LSTM
(2) tableSchema ==> One-hot-encoding ==> Embedding(GloVe) ==> LSTM

(1) concatenate (merge) (2) ==> rest of model layers…

Thanks Jason.

• Jason Brownlee August 5, 2019 at 6:42 am #

No need for the one hot encoding, the text can be encoded as integers, and the integers mapped to vectors via the embedding.

• Youssef Mellah August 5, 2019 at 7:58 am #

ok that’s clear, thanks.

Can the attention mechanism only be applied after the merge?

• Jason Brownlee August 5, 2019 at 1:59 pm #

No, you can use attention on each input if you wish.

• Youssef MELLAH August 5, 2019 at 6:42 pm #

Should I apply attention to both inputs (user question & table schema) separately, or can I do it after merging the 2 inputs?

• Jason Brownlee August 6, 2019 at 6:31 am #

Test many different model types and see what works well for your specific dataset.

• Youssef MELLAH August 15, 2019 at 12:41 am #

okay thank you Jason!!

• Jason Brownlee August 15, 2019 at 8:12 am #

No problem.

122. Elizabeth August 5, 2019 at 12:06 am #

I wonder about pre-trained word2vec; there is no good tutorial for that. I am looking to implement a pre-trained word2vec embedding, but I do not know whether I should follow the same steps as for GloVe or look for another source.
Thanks Mr. Jason, I am very inspired by you in machine learning.

123. Dean August 7, 2019 at 1:06 am #

Why don’t you use mask_zero=True in your embedding layers? It seems necessary since you are padding sentences with 0’s.

• Jason Brownlee August 7, 2019 at 8:01 am #

Great suggestion, not sure if that argument existed back when I wrote these tutorials. I was using masking input layers instead.

124. Nathalia August 9, 2019 at 8:04 am #

hi, can you help me with a question?

I’m working with a dataset that has a city column as a feature, with a lot of different cities. So I created an embedding layer for this feature. First, I used these commands:
data[‘city’] = data[‘city’].astype(‘category’)
data[‘city’] = data[‘city’].cat.codes
After that, a value was assigned to each different city, starting at 0.

So I’m confused about how this embedding layer works when the test data has an input that was not in training. I saw that you said that when this occurs, we have to put 0 as the input, but 0 is already related to some city. Should I start assigning these values to the cities from 1?

• Jason Brownlee August 9, 2019 at 8:20 am #

Excellent question.

Typically, unseen categories are assigned 0 for “unknown”.

0 should be reserved and real numbering should start at 1.
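A sketch of that scheme with made-up data: shift the pandas category codes up by one so 0 stays free for cities never seen in training.

```python
import pandas as pd

data = pd.DataFrame({'city': ['paris', 'rome', 'paris', 'lyon']})

# cat.codes start at 0, so add 1 to reserve 0 for "unknown"
data['city'] = data['city'].astype('category')
codes = data['city'].cat.codes + 1
mapping = dict(zip(data['city'], codes))

def encode_city(city):
    return mapping.get(city, 0)  # unseen city at test time -> 0

print(encode_city('rome'))   # 3 (alphabetical codes: lyon=1, paris=2, rome=3)
print(encode_city('tokyo'))  # 0, the reserved "unknown" index
```

The embedding layer's input_dim then needs to be the number of cities plus one, to leave room for index 0.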

• Nathalia August 9, 2019 at 8:25 am #

thank you, you always help me a lot!

• Jason Brownlee August 9, 2019 at 2:17 pm #

You’re very welcome Nathalia!

125. ravi August 20, 2019 at 1:38 pm #

Hi Jason..

Thanks for such a great tutorial. I am confused: when we talk about learned word embeddings, do we consider the weights of the embedding layer or the output of the embedding layer?

Let me ask in another way as well: when we use a pretrained embedding, let us say “glove.6B.50d.txt”, are those word embeddings the weights or the output of the layer?

• Jason Brownlee August 20, 2019 at 2:14 pm #

They are the same thing. Integers are mapped to vectors, those vectors are the output of the embedding.

126. Ralph August 23, 2019 at 7:39 pm #

Hi Jason,

I am new to ML, trying out different things, and your posts are the most helpful I encountered, it helps me a lot to understand, thank you!

Here I think I understood the procedure, but I still have a deeper question on the point of embeddings. If I understand correctly, this embedding maps a set of words as points onto another dimensional space. The surprising fact in your example is that we pass from a space of dimension 4 to a space of dimension 8, so it might not be seen as an improvement at first.

Still, I imagine that the embedding makes it so that points in the new space are more equally placed, am I right? Then I don’t understand several things:
– How does the context in which a word appears come into play? Will words that often appear close by also be represented by closer points in the new space?
– Why does it have to be integers? And why is it mostly applied to word encodings? I mean, we could imagine the same process being helpful for images as well. Or is it just a dimension reduction technique tailored for word documents?

Thank you for your insights anyway

• Jason Brownlee August 24, 2019 at 7:50 am #

Not equally spaced, but spaced in a way that preserves or best captures their relationships.

Context defines the relationships captured in the embedding, e.g. what words appear with what other words. Their closeness.

Each word gets one vector. The simplest way is to map words to integers and integers to the index of vectors in a matrix. No other reason.

Great questions!

127. Anand August 24, 2019 at 4:43 pm #

Jason,Thank you so much for your time and effort!

My question is related to the line-
“e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)”

Here you are using weights=[embedding_matrix], but there is no relation telling which vector belongs to which word. It then produces one 4*100 matrix for each document (for example for [6 2 0 0]). How will it extract the vectors related to 6, 2, 0, 0 accurately?

• Jason Brownlee August 25, 2019 at 6:34 am #

The vectors are ordered; with an array index you can retrieve the vector for each word, e.g. word 0, word 1, word 2, word 3, etc.
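In other words, the integer-encoded document simply indexes rows of the weight matrix. A numpy sketch with assumed sizes:

```python
import numpy as np

vocab_size, dim = 50, 100
embedding_matrix = np.random.rand(vocab_size, dim)  # one row per word index

doc = [6, 2, 0, 0]               # an integer-encoded, padded document
vectors = embedding_matrix[doc]  # rows 6, 2, 0, 0 -> shape (4, 100)
print(vectors.shape)             # (4, 100)
```

This is exactly why the order of rows in the matrix must match the integer encoding of the words.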

128. Youssef MELLAH September 2, 2019 at 12:36 am #

Hello Jason,

I am searching for a good course on Python (numpy, pandas, and other tools for ML, DL and NLP) and a course on Keras!

129. Emre Calisir September 3, 2019 at 6:32 pm #

Thanks for the article; I will run it on Italian-language documents. Is there any GoogleNews-style pretrained word2vec covering an Italian vocabulary?

• Jason Brownlee September 4, 2019 at 5:56 am #

Good question, I’m not sure off the cuff sorry.

130. Muhammad Usgan September 7, 2019 at 5:59 pm #

Hello Jason, can I put Dropout into this model?

131. ASHWARYA ASHIYA September 19, 2019 at 6:49 pm #

e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)

What does the parameter – weights=[embedding_matrix] – stand for ? weights or inputs for the Embedding Layer ?

• Jason Brownlee September 20, 2019 at 5:37 am #

The weights are the vectors, e.g. one vector for each word in your vocab.

132. Munisha October 3, 2019 at 7:30 pm #

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________

Could you please clarify the number of weights learned for the embedding layer. We have 10 documents, and the embedding layer maps length-4 sequences to 8-dimensional vectors. How many weight parameters will actually be learned here? My understanding was that there should only be (4+1)*8 = 40 weights to be learned, including the bias term. Why is it learning weights for all the documents separately (10*(4+1)*8 = 400)?

• Jason Brownlee October 4, 2019 at 5:40 am #

The number of weights in an embedding is vector length times number of vectors (words).
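This can be checked directly in Keras. With the tutorial's sizes (a vocab of 50 words and 8-dimensional vectors), the layer holds 50 * 8 = 400 weights, independent of the number of documents, and with no bias term:

```python
from tensorflow.keras.layers import Embedding

emb = Embedding(input_dim=50, output_dim=8)
emb.build((None, 4))       # build for input sequences of length 4
print(emb.count_params())  # 50 * 8 = 400
```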

133. martin October 12, 2019 at 4:29 pm #

Hi, Jason:

In this example, what type of neural architecture is it? It is not an LSTM, not a CNN. Is it a Multi-Layer Perceptron model?

• Jason Brownlee October 13, 2019 at 8:26 am #

We are just working with embeddings. I guess with an output layer, you can call it an MLP.

134. Zafari October 13, 2019 at 5:52 am #

Hi, thanks for this excellent article. I tried to use a pre-trained word embedding instead of random numbers in a Keras-based classifier. However, after constructing the embedding matrix and adding it to the embedding layer as follows, all of the accuracy values during the training epochs are the same and no learning happens. However, after removing “weights=[embedding_matrix]” it works well and reaches an accuracy of 90%.

layers.Embedding(input_dim=vocab_size, weights=[embedding_matrix],
output_dim=embedding_dim,
input_length=max_len,trainable=True)

What could be the reason for this strange behavior?
Thanks

• Jason Brownlee October 13, 2019 at 8:35 am #

An embedding specialized to your own data is often better than something generic.

• Hussain July 28, 2020 at 8:46 am #

Quick question: while using pre-trained embeddings (MUSE) in an Embedding layer, is it okay to set trainable=True?

Note: the model doesn’t overfit when I set trainable=True. The model doesn’t predict well if I set trainable=False.

• Jason Brownlee July 28, 2020 at 8:52 am #

Yes, although you might want to use a small learning rate to ensure you don’t wash away the weights.

• Hussain July 28, 2020 at 9:24 am #

Thank you very much for your reply. Currently I am using a learning rate of 0.0001.

• Jason Brownlee July 28, 2020 at 10:54 am #

Perhaps use SGD instead of Adam as Adam will change the learning rate for each model parameter and could get quite large.

Or at least compare results with adam vs sgd.

135. Xuesong Wang October 13, 2019 at 8:32 pm #

Hi Jason,
Thank you for your post. I have an issue: the data I use includes both categorical and numerical features. Say some features are cost and time, while others are post codes. How should I write the code? Build separate models and concatenate them together? Thank you

• Jason Brownlee October 14, 2019 at 8:07 am #

Great question!

The features must be prepared separately then aggregated.

The two ways I like to do this is:

1. Manually. Prepare each feature type separately, then concatenate into input vectors.
2. Multi-Input Model. Prepare features separately and feed different features into different inputs of the model, and let the model concatenate the features.

Does that help?

• martin October 15, 2019 at 4:39 pm #

Should the categorical be converted into numerical using one-hot encoding?

• Jason Brownlee October 16, 2019 at 7:57 am #

It can be, or an embedding can be used.

• Franz Götz-Hahn October 28, 2019 at 9:04 pm #

If you have multivariate time series which you know are meaningfully connected (say trajectories in x and y), does it make sense to put a Conv layer before feeding them into the embedding?

Could you explain what you mean by “preparing” the features?

• Jason Brownlee October 29, 2019 at 5:22 am #

No, the embedding is typically the first step, e.g. an interpretation of the input.

Prepared means transformed in whatever way you want, such as encoding or scaling.

• Franz Götz-Hahn October 29, 2019 at 5:03 pm #

Thanks for the answer! If I may ask a follow up: is embedding of multivariate numerical data uncommon? I have seen fairly little work that uses it.

• Jason Brownlee October 30, 2019 at 5:57 am #

Embedding is used for categorical data, not numerical data.

136. Zineb_Morocco October 16, 2019 at 1:28 am #

hi Jason,

I use an example where I put the vocabulary size = 200 and the training sample contains about 20 different words.
When I check the embeddings (the vectors) using layers[0].get_weights()[0], I obtain an array with 200 rows.

1/ How can I know the vector corresponding to each word (from the 20 words I’ve got)?
2/ Where do the 180 (200 – 20) vectors come from, since I use only 20 words?

• Jason Brownlee October 16, 2019 at 8:08 am #

The vocab size and number of words are the same thing.

I think you might be confusing the size of the embedding and the vocab?

Each word is assigned a number (0, 1, 2, etc), the index of vectors maps to the words, vector 0 is word 0, etc.

137. Zineb_Morocco October 16, 2019 at 11:38 pm #

I’ll clarify my question:
the vocab size is 200, which means the number of words is 200.
But effectively I’m working with only 20 words (the words of my training sample): let’s say word[0] to word[19].
So, after the embedding, vector[0] corresponds to word[0] and so on, but vector[20].. vector[30]… what do they match?
I have no word[20] or word[30].

• Jason Brownlee October 17, 2019 at 6:37 am #

If you define the vocab with 200 words but only have 20 in the training set, then the words not in the training set will have random vectors.

• Zineb_Morocco October 18, 2019 at 12:22 am #

Ok. Thank you.

138. Elizabeth October 27, 2019 at 7:28 am #

I want to save my own pretrained model the same way GloVe saved their model: as a txt file with each word followed by its vector. How would I do that?
Thank you

• Jason Brownlee October 28, 2019 at 6:00 am #

You could extract the weights from the embedding layer via layer.get_weights(), then enumerate the vectors and save them to a file in the format you prefer.
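A sketch of that loop, with a made-up vocab and random weights standing in for the real output of layer.get_weights()[0]: write one line per word, the word followed by its vector values, the same layout the GloVe text files use.

```python
import numpy as np

vocab = ['the', 'cat', 'sat']            # hypothetical vocabulary, in index order
weights = np.random.rand(len(vocab), 5)  # stand-in for layer.get_weights()[0]

# one "word v1 v2 ..." line per word, like the GloVe .txt files
with open('my_embedding.txt', 'w') as f:
    for i, word in enumerate(vocab):
        values = ' '.join('%.6f' % v for v in weights[i])
        f.write('%s %s\n' % (word, values))
```

The resulting file can then be loaded back with the same parsing code this tutorial uses for the GloVe file.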

139. Elizabeth October 28, 2019 at 7:47 am #

As a beginner in Python, I did not understand what you mean by enumerating, and which layer should I get the weights from?

• Jason Brownlee October 28, 2019 at 1:16 pm #

You can get the vectors from the embedding layer.

You can either hold a reference to the embedding layer from when you constructed the model, or retrieve the layer by index (e.g. model.get_layers()[0]) or by name, if you name it.

Enumerating means looping.

140. Michael November 19, 2019 at 4:47 am #

Hello, Jason!

Thanks for the article!
I have been wondering about the input_dim of the learnable embedding layer.
You set it to vocab_size, which in your case is 50 (the hashing trick upper limit), which is much larger than the actual vocabulary size of 15.

The documentation of Embedding in keras says:
“Size of the vocabulary, i.e. maximum integer index + 1.”
Which is ambiguous.

I have experimented with some numbers for vocab_size, and cannot see any systematic difference.

Would it actually matter for more realistically sized examples?

Could you say a couple of words about it?
Thanks again

• Jason Brownlee November 19, 2019 at 7:50 am #

A smaller vocab means you will have fewer words/word vectors and in turn a simpler model that is faster/easier to learn. The cost is that it might perform worse.

This is the trade-off: large/slow but good vs. small/fast but less good.

• Michael November 19, 2019 at 7:48 pm #

Thanks, Jason!

I may have not explained myself properly:
The *actual* number of words in the vocabulary is the same (14).
The difference is the value of input_dim to Embedding().

In the example, you chose 50 as high enough to prevent collisions in encoding, but also
used it as an input_dim in one of the cases.

Michael

• Jason Brownlee November 20, 2019 at 6:11 am #

I see.

• martin November 22, 2019 at 6:28 pm #

I thought the question was about “Size of the vocabulary, i.e. maximum integer index + 1.” Since there are 14 words in this example, why isn’t the vocab size 15, instead of 50?

• Jason Brownlee November 23, 2019 at 6:49 am #

There is the size of the vocab, and there is also the size of the embedding space. They are different, in case that is causing confusion.

We must have size(vocab) + 1 vectors in the embedding, to have space for “unknown”, e.g. the vector at index 0.

141. martin November 22, 2019 at 5:52 pm #

Jason: In this example, the ‘one_hot’ function is used instead of the ‘to_categorical’ function. The 2nd is the real one-hot representation, and the 1st simply creates an integer for each word. Why isn’t to_categorical used here? They are different, right?

142. moSaber November 24, 2019 at 11:11 am #

Thanks a lot Jason! In the “3. Example of Learning an Embedding” section, could you please elaborate on what the 400 params being trained in the embedding layer are? Thanks

• Jason Brownlee November 25, 2019 at 6:19 am #

Yes, each word is mapped to an 8-element vector, and the vocab is 50 words. Therefore 50 * 8 = 400.

• Mohit September 26, 2020 at 6:54 pm #

Jason, why is the output shape of the embedding layer (4, 8)?
It should be (50, 8), as the vocab size is 50 and we are creating embeddings for all the words in our vocabulary.

• Jason Brownlee September 27, 2020 at 6:51 am #

The vocab size is the total number of vectors in the layer, i.e. the number of words supported, not the output.

The output is the number of input words (4), where each word has the same vector length (8).

143. criz December 30, 2019 at 3:13 am #

Hi i need some help when running the file.

(wordembedding) C:\Users\Criz Lee\Desktop\Python Projects\wordembedding>wordembedding.py
Traceback (most recent call last):
File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\wordembedding.py”, line 1, in
from numpy import array
File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\__init__.py”, line 140, in
from . import _distributor_init
File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\_distributor_init.py”, line 34, in
from . import _mklinit
ImportError: DLL load failed: The specified module could not be found.

• Jason Brownlee December 30, 2019 at 6:02 am #

Looks like there is a problem with your development environment.

• criz December 31, 2019 at 3:20 am #

Hi Jason, I’ve tried the URL you provided but still didn’t manage to solve it.

Basically I typed:
1. conda create -n wordembedding
2. activate wordembedding
3. pip install numpy (installed ver 1.16)
4. ran wordembedding.py

error shows
File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\wordembedding.py”, line 2, in
from numpy import array
ModuleNotFoundError: No module named ‘numpy’

• Jason Brownlee December 31, 2019 at 7:35 am #

Sorry to hear that. I am not an expert in debugging workstations, perhaps try posting to stackoverflow.

144. martin December 30, 2019 at 8:23 am #

Hi, Jason: How can I encode a new document using the Tokenizer object fit on the training data? It seems there is no function to return an encoder from the tokenizer object.

• Jason Brownlee December 31, 2019 at 7:24 am #

You can save the tokenizer and use it to prepare new data in an identical manner as you did the training data after the tokenizer was fit.

145. congmin min December 31, 2019 at 11:59 am #

What do you mean by ‘save the tokenizer’? The Tokenizer is an object, not a model.

• Jason Brownlee January 1, 2020 at 6:30 am #

It is as important as the model, in that sense it is part of the model.

You can save Python objects to file using pickle.
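For example, assuming the Keras Tokenizer used elsewhere in this tutorial and made-up training documents:

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['well done', 'good work', 'nice work'])

# save the fitted tokenizer alongside the model
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# later: reload it and encode new documents identically to training
with open('tokenizer.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded.texts_to_sequences(['good work']))
```

The reloaded tokenizer carries the same word_index, so new text maps to the same integers the model was trained on.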

146. Sintayehu January 3, 2020 at 5:08 pm #

• Jason Brownlee January 4, 2020 at 8:27 am #

Great, does the above tutorial help?

147. Rishang January 25, 2020 at 4:56 pm #

Hello Sir,

I am not able to understand the significance of the vector space. You have given 8 for the first problem, and GloVe vectors have 100 dimensions for each word.
What is the idea behind these vector spaces, and what does each value of a dimension tell us?

Thankyou 🙂

• Jason Brownlee January 26, 2020 at 5:15 am #

The size of the vector space does not matter too much.

More importantly, the model learns a representation where similar words will have a similar representation (coordinate) in the vector space. We don’t have to specify these relationships; they are learned automatically.

• Rishang February 3, 2020 at 1:07 am #

Thank you Sir for your answer. Now I clearly understand what a vector space is.
I have one more question: if I declare the vocabulary size as 50 and there are more than 50 words in my training data, what happens to those extra words?

For the same reason I could not understand this line about the GloVe vectors:
“The smallest package of embeddings is 822Mb, called “glove.6B.zip“. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words.”
What about the 600 thousand words ?

• Jason Brownlee February 3, 2020 at 5:46 am #

Words not in the vocab are marked as 0 or unknown.
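With the Keras Tokenizer this can be made explicit via oov_token, which reserves an index for words outside the vocabulary (made-up documents below):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# reserve an index for out-of-vocabulary words
tokenizer = Tokenizer(num_words=50, oov_token='<unk>')
tokenizer.fit_on_texts(['the cat sat on the mat'])

# 'dog' was never seen, so it maps to the reserved '<unk>' index (1)
print(tokenizer.texts_to_sequences(['the dog sat']))
```

Any word not in the fitted vocabulary is encoded as the oov index instead of being silently dropped.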

148. asma January 31, 2020 at 4:22 pm #

Hi,
I have a list of words as my dataset/training data, so I ran your code for GloVe as follows: