Last Updated on February 2, 2021
Word embeddings provide a dense representation of words and their relative meanings.
They are an improvement over the sparse representations used in simpler bag-of-words models.
Word embeddings can be learned from text data and reused among projects. They can also be learned as part of fitting a neural network on text data.
In this tutorial, you will discover how to use word embeddings for deep learning in Python with Keras.
After completing this tutorial, you will know:
- About word embeddings and that Keras supports word embeddings via the Embedding layer.
- How to learn a word embedding while fitting a neural network.
- How to use a pre-trained word embedding in a neural network.
Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Feb/2018: Fixed a bug due to a change in the underlying APIs.
- Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.

How to Use Word Embedding Layers for Deep Learning with Keras
Photo by thisguy, some rights reserved.
Tutorial Overview
This tutorial is divided into 4 parts; they are:
- Word Embedding
- Keras Embedding Layer
- Example of Learning an Embedding
- Example of Using Pre-Trained GloVe Embedding
Need help with Deep Learning for Text Data?
Take my free 7-day email crash course now (with code).
Click to sign-up and also get a free PDF Ebook version of the course.
1. Word Embedding
A word embedding is a class of approaches for representing words and documents using a dense vector representation.
It is an improvement over the more traditional bag-of-words encoding schemes, where large sparse vectors were used to represent each word, or to score each word within a vector in order to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector composed mostly of zero values.
Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.
The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
The position of a word in the learned vector space is referred to as its embedding.
Two popular examples of methods of learning word embeddings from text include:
- Word2Vec.
- GloVe.
In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.
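As a concrete illustration of the standalone methods listed above, the sketch below trains a small Word2Vec model with the Gensim library on a toy, pre-tokenized corpus. Gensim is an assumption here for illustration only; it is not needed for the rest of this tutorial.

from gensim.models import Word2Vec

# a toy corpus: each document is a list of already-tokenized words
sentences = [['well', 'done'], ['good', 'work'], ['great', 'effort'], ['poor', 'work']]
# train a small Word2Vec model (min_count=1 keeps every word, even rare ones)
model = Word2Vec(sentences, min_count=1)
# look up the learned dense vector for a word
print(model.wv['work'])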
2. Keras Embedding Layer
Keras offers an Embedding layer that can be used for neural networks on text data.
It requires that the input data be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed using the Tokenizer API also provided with Keras.
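For example, here is a minimal sketch of that integer encoding step on two toy documents (the exact integers assigned can vary with word frequency and order):

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['nice work', 'poor work'])
# each word is mapped to a unique integer, e.g. {'work': 1, 'nice': 2, 'poor': 3}
print(tokenizer.texts_to_sequences(['nice work', 'poor work']))  # e.g. [[2, 1], [3, 1]]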
The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.
It is a flexible layer that can be used in a variety of ways, such as:
- It can be used alone to learn a word embedding that can be saved and used in another model later.
- It can be used as part of a deep learning model where the embedding is learned along with the model itself.
- It can be used to load a pre-trained word embedding model, a type of transfer learning.
The Embedding layer is defined as the first hidden layer of a network. It must specify 3 arguments:
- input_dim: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.
- output_dim: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.
- input_length: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.
For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.
e = Embedding(200, 32, input_length=50)
The Embedding layer has weights that are learned. If you save your model to file, this will include weights for the Embedding layer.
For each input sequence of words (input document), the output of the Embedding layer is a 2D matrix with one embedding vector per word.
If you wish to connect a Dense layer directly to an Embedding layer, you must first flatten the 2D output matrix to a 1D vector using the Flatten layer.
Now, let’s see how we can use an Embedding layer in practice.
3. Example of Learning an Embedding
In this section, we will look at how we can learn a word embedding while fitting a neural network on a text classification problem.
We will define a small problem where we have 10 text documents, each with a comment about a piece of work a student submitted. Each text document is classified as positive “1” or negative “0”. This is a simple sentiment analysis problem.
First, we will define the documents and their class labels.
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
Next, we can integer encode each document. This means that as input the Embedding layer will have sequences of integers. We could experiment with other, more sophisticated bag-of-words encodings, like counts or TF-IDF.
Keras provides the one_hot() function that creates a hash of each word as an efficient integer encoding. We will estimate the vocabulary size at 50, which is much larger than needed, in order to reduce the probability of collisions from the hash function.
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
The sequences have different lengths, and Keras prefers inputs to be vectorized and all inputs to have the same length. We will pad all input sequences to a length of 4. Again, we can do this with a built-in Keras function, in this case the pad_sequences() function.
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
We are now ready to define our Embedding layer as part of our neural network model.
The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.
The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. We flatten this to one 32-element vector to pass on to the Dense output layer.
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
Finally, we can fit and evaluate the classification model.
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
The complete code listing is provided below.
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Running the example first prints the integer encoded documents.
[[6, 16], [42, 24], [2, 17], [42, 24], [18], [17], [22, 17], [27, 42], [22, 24], [49, 46, 16, 34]]
Then the padded versions of each document are printed, making them all uniform length.
[[ 6 16  0  0]
 [42 24  0  0]
 [ 2 17  0  0]
 [42 24  0  0]
 [18  0  0  0]
 [17  0  0  0]
 [22 17  0  0]
 [27 42  0  0]
 [22 24  0  0]
 [49 46 16 34]]
After the network is defined, a summary of the layers is printed. We can see that as expected, the output of the Embedding layer is a 4×8 matrix and this is squashed to a 32-element vector by the Flatten layer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Finally, the accuracy of the trained model is printed, showing that it learned the training dataset perfectly (which is not surprising).
Accuracy: 100.000000
You could save the learned weights from the Embedding layer to file for later use in other models.
You could also use this model generally to classify other documents that have the same kind of vocabulary seen in the training dataset.
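For example, here is a minimal sketch of retrieving and saving the learned embedding weights; the filename is illustrative. Note that to write a GloVe-style file with one word per line you would also need a word-to-integer mapping (such as the Tokenizer word_index used in the next section), which the hash-based one_hot() encoding used here does not provide.

from numpy import savetxt

# the first layer is the Embedding layer; get_weights() returns a list holding one
# array of shape (vocab_size, 8), with one row per integer word index
embedding_weights = model.layers[0].get_weights()[0]
savetxt('embedding_weights.txt', embedding_weights)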
Next, let’s look at loading a pre-trained word embedding in Keras.
4. Example of Using Pre-Trained GloVe Embedding
The Keras Embedding layer can also use a word embedding learned elsewhere.
It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.
For example, the researchers behind the GloVe method provide a suite of pre-trained word embeddings on their website, released under a public domain license (see the GloVe link in the Further Reading section below).
The smallest package of embeddings is 822MB, called “glove.6B.zip“. It was trained on a dataset of 6 billion tokens (words) with a vocabulary of 400 thousand words. There are a few different embedding vector sizes, including 50, 100, 200 and 300 dimensions.
You can download this collection of embeddings and seed the Keras Embedding layer with weights from the pre-trained embedding for the words in your training dataset.
This example is inspired by an example in the Keras project: pretrained_word_embeddings.py.
After downloading and unzipping, you will see a few files, one of which is “glove.6B.100d.txt“, which contains a 100-dimensional version of the embedding.
If you peek inside the file, you will see a token (word) followed by the weights (100 numbers) on each line. For example, below is the first line of the embedding ASCII text file, showing the embedding for “the“.
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
As in the previous section, the first step is to define the examples, encode them as integers, then pad the sequences to be the same length.
In this case, we need to be able to map words to integers as well as integers to words.
Keras provides a Tokenizer class that can be fit on the training data, can convert text to sequences consistently by calling the texts_to_sequences() method, and provides access to the dictionary mapping of words to integers via its word_index attribute.
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
Next, we need to load the entire GloVe word embedding file into memory as a dictionary of word to embedding array.
# load the whole embedding into memory
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))
This is pretty slow. It might be better to filter the embedding for the unique words in your training data.
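For example, here is a minimal sketch of that filtering, reusing the Tokenizer t fitted above so that only vectors for words in the training data are kept in memory:

# load only the vectors for words that appear in the training data
embeddings_index = dict()
f = open('glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    if word in t.word_index:
        embeddings_index[word] = asarray(values[1:], dtype='float32')
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))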
Next, we need to create a matrix of one embedding for each word in the training dataset. We can do that by enumerating all unique words in the Tokenizer.word_index and locating the embedding weight vector from the loaded GloVe embedding.
The result is a matrix of weights only for words we will see during training.
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
Now we can define our model, fit, and evaluate it as before.
The key difference is that the Embedding layer can be seeded with the GloVe word embedding weights. We chose the 100-dimensional version, therefore the Embedding layer must be defined with output_dim set to 100. Finally, we do not want to update the learned word weights in this model, therefore we will set the trainable attribute of the Embedding layer to False.
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
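Depending on your Keras/TensorFlow version, the weights argument may be discouraged for new code; an equivalent approach (check what your installed version accepts) is to pass the matrix through an initializer:

from keras.initializers import Constant
e = Embedding(vocab_size, 100, embeddings_initializer=Constant(embedding_matrix), input_length=4, trainable=False)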
The complete worked example is listed below.
from numpy import array
from numpy import asarray
from numpy import zeros
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define documents
docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# load the whole embedding into memory
embeddings_index = dict()
f = open('../glove_data/glove.6B/glove.6B.100d.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))
# create a weight matrix for words in training docs
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
# define model
model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
Running the example may take a bit longer, but then demonstrates that it is just as capable of fitting this simple problem.
[[6, 2], [3, 1], [7, 4], [8, 1], [9], [10], [5, 4], [11, 3], [5, 1], [12, 13, 2, 14]]
[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]
Loaded 400000 word vectors.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 100)            1500
_________________________________________________________________
flatten_1 (Flatten)          (None, 400)               0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 401
=================================================================
Total params: 1,901
Trainable params: 401
Non-trainable params: 1,500
_________________________________________________________________
Accuracy: 100.000000
In practice, I would encourage you to experiment both with keeping the pre-trained embedding fixed and with allowing it to be fine-tuned during training on top of the pre-trained weights.
See what works best for your specific problem.
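One way to structure that experiment is to train first with the pre-trained embedding frozen, then unfreeze it and continue training for a few epochs. The sketch below assumes the model defined above; the epoch counts are arbitrary.

# phase 1: train the classifier with the pre-trained embedding frozen
model.fit(padded_docs, labels, epochs=50, verbose=0)
# phase 2: unfreeze the embedding and fine-tune it along with the rest of the model
model.layers[0].trainable = True
# recompile so the change to the trainable attribute takes effect
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_docs, labels, epochs=10, verbose=0)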
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
- Word Embedding on Wikipedia
- Keras Embedding Layer API
- Using pre-trained word embeddings in a Keras model, 2016
- Example of using a pre-trained GloVe Embedding in Keras
- GloVe Embedding
- An overview of word embeddings and their connection to distributional semantic models, 2016
- Deep Learning, NLP, and Representations, 2014
Summary
In this tutorial, you discovered how to use word embeddings for deep learning in Python with Keras.
Specifically, you learned:
- About word embeddings and that Keras supports word embeddings via the Embedding layer.
- How to learn a word embedding while fitting a neural network.
- How to use a pre-trained word embedding in a neural network.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Thank you Jason,
I am excited to read more NLP posts.
Thanks.
Thanks man, It was really helpful.
You’re welcome.
After embedding, do we have to have a “Flatten()” layer? In my project, I used a Dense layer directly after the embedding. Is it OK?
Try it and see.
I appreciate how well updated you keep these tutorials. the first thing I always look at, when I start reading is the update date. thank you very much.
You’re welcome.
I require all of the code to work and keep working!
Hi, Jason:
when one_hot encoding is used, why is padding necessary? Doesn’t one_hot encoding already create an input of equal length?
The one hot encoding is for one variable at one time step, e.g. features.
Padding is needed to make all sequences have the same number of time steps.
See this:
https://machinelearningmastery.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input
I split my data into 80-20 test-train and I’m still getting 100% accuracy. Any idea why? It is ~99% on epoch 1 and the rest its 100%.
Consider using the procedure in this post to evaluate your model:
https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
Use drop-out 20%, your model is overfit!!
Thank you Jason. I always find things easier when reading your post.
I have a question about the vector of each word after training. For example, the word “done” in sentence “Well done!” will be represented in different vector from that word in sentence “Could have done better!”. Is that right? I mean the presentation of each word will depend on the context of each sentence?
No, each word in the dictionary is represented differently, but the same word in different contexts will have the same representation.
It is the word in its different contexts that is used to define the representation of the word.
Does that help?
Yes, thank you. But I still have a question. We will train each context separately, then after training the first context, in this case is “Well done!”, we will have a vector representation of the word “done”. After training the second context, “Could have done better”, we have another vector representation of the word “done”. So, which vector will we choose to be the representation of the word “done”?
I might misunderstand the procedure of training. Thank you for clarifying it for me.
No. All examples where a word is used are used as part of the training of the representation of the word. There is only one representation for each word during and after training.
I got it. Thank you, Jason.
Hi Jason,
any ideas on how to “filter the embedding for the unique words in your training data” as mentioned in the tutorial?
The mapping of word to vector dictionary is built into Gensim, you can access it directly to retrieve the representations for the words you want: model.wv.vocab
HI Jason,
I am really appreciated the time U spend to write this tutorial and also replying.
My question is about “model.wv.vocab” you wrote. is it an address site?
It does not work actually.
No, it is an attribute on the model.
Hi, Jason
Good day.
I just need your suggestion and an example. I have two different datasets, where one is structured and the other is unstructured. The goal is to use the structured one to construct a representation for the unstructured one, so I apply word embeddings to the two inputs, but how can I find the average of the two embeddings and flatten it to one before feeding the layer into a CNN and LSTM?
Looking forward to your response.
Regards
Abbey
Sorry, what was your question?
If your question was if this is a good approach, my advice is to try it and see.
Hi, Jason
How can I find the average of the word embedding from the two input?
Regards
Abbey
Perhaps you could retrieve the vectors for each word and take their average?
Perhaps you can use the Gensim API to achieve this result?
Hi Jason,
I have a set of documents(1200 text of movie Scripts) and i want to use pretrained embeddings. But i want to update the vocabulary and train again adding the words of my corpus. Is that possible ?
Sure.
Load the pre-trained vectors. Add new random vectors for the new words in the vocab and train the whole lot together.
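A minimal sketch of that idea, building on the weight matrix code from the tutorial above; the random range is an arbitrary assumption:

from numpy.random import uniform

embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # word is covered by the pre-trained embedding
        embedding_matrix[i] = embedding_vector
    else:
        # new word: seed it with small random values to be learned during training
        embedding_matrix[i] = uniform(-0.05, 0.05, 100)
# trainable=True so both the pre-trained and the new vectors are updated
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=True)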
Hi Jason…Could you also help us with R codes for using Pre-Trained GloVe Embedding
Sorry, I don’t have R code for word embeddings.
Hi Jason, really appreciate that you answered all the replies! I am planning to try both CNN and RNN (maybe LSTM & GRU) on text classification. Most of my documents are less than 100 words long, but about 5% are longer than 500 words. How do you suggest setting the max length when using an RNN? If I set it to be 1000, will it degrade the learning result? Should I just use 100? Will it be different in the case of a CNN?
Thank you!
I would recommend experimenting with different configurations and see how the impact model skill.
Dear Hao,
Did you try RNN(LSTM or GRU) on text classification?If yes then can you plz provide me the code??
Here is an example:
https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
I’d like to thank you for this post. I’ve been struggling to understand this precise way of using keras for a week now and this is the only post I’ve found that actually explains what each step in the process is doing – and provides code that self-documents what the data looks like as the model is constructed and trained. This makes it so much easier to adapt to my particular requirements.
Thanks, I’m glad it helped.
In the above Keras example, how can we predict a list of context words given a word? Lets say i have a word named ‘sudoku’ and want to predict the sourrounding words. how can we use word2vec from keras to do that?
It sounds like you are describing a language model. We can use LSTMs to learn these relationships.
Here is an example:
https://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/
No, what i meant was for word2vec skip-gram model predicts a context word given the center word. So if i train a word2vec skip-gram model, how can i predict the list of context words if my center word is ‘sudoku’?
Regards,
Azim
I don’t know Azim.
You can get the cosine distance between the words, and the one that is having the least distance would surround it.. here is the link:
https://github.com/Hvass-Labs/TensorFlow-Tutorials
Go to Natural Language Processing and you can find a cosine function there, use them to find yours..
Hi Jason,
Thanks for your useful blog I have learned a lots.
I am wondering if I already have pretrained word embedding, is that possible to set keras embedding layer trainable to be true? If it is workable, will I get a better result, when I only use small size of data to pretrain the word embedding model. Many thanks!
You can. It is hard to know whether it will give better results. Try it.
Hey Jason,
Is it possible to perform probability calculations on the label? I am looking at a case where it is not simply +/- but that a given data entry could be both but more likely one and not the other.
Yes, a neural network with a sigmoid or softmax output can predict a probability-like score for each class.
I’m doing something like this except with my own feature vectors — but to the point of the labels — I do ternary classification using categorical_crossentropy and a softmax output. I get back an answer of the probability of each label.
Nice!
Hey Jason!
Thanks for a wonderful and detailed explanation of the post. It helped me a lot.
However, I’m struggling to understand how the model predicts a sentence as positive or negative.
i understand that each word in the document is converted into a word embedding, so how does our model evaluate the entire sentence as positive or negative? Does it take the sum of all the word vectors? Perhaps average of them? I’ve not been able to figure this part out.
Great question!
The model interprets all of the words in the sequence and learns to associate specific patterns (of encoded words) with positive or negative sentiment
Hi Jason,
Thanks a lot for your amazing posts. I have the same question as Ravil. Can you elaborate a bit more on “learns to associate specific patterns?”
Good question Ken, perhaps this post will make it clearer how ml algorithms work (a functional mapping):
https://machinelearningmastery.com/how-machine-learning-algorithms-work/
Does that help?
Thanks for your reply. But I was trying to ask is that how does keras manage to produce a document level representation by having the vectors of each word? I don’t seem to find how was this being done in the code.
Cheers.
The model such as the LSTM or CNN will put this together.
In the case of LSTMs, you can learn more here:
https://machinelearningmastery.com/start-here/#lstm
Does that help?
Hi Jason,
First, thanks for all your really useful posts.
If I understand well your post and answers to Ken and Ravil, the neural network you build in fact reduces the sequence of embedding vectors corresponding to all the words of a document to a one-dimensional vector with the Flatten layer, and you just train this flattening, as well as the embedding, to get the best classification on your training set, isn’t it?
Thank you in advance for your answer.
Sort of.
words => integers => embedding
The embedding has a vector per word which the network will use as a representation for the word.
We have a sequence of vectors for the words, so we flatten this sequence to one long vector for the Dense model to read. Alternately, we could wrap the dense in a timedistributed layer.
Aaah! So nothing tricky is done when flattening, more or less just concatenating the fixed number of embedding vectors that is the output of the embedding layer, and this is why the number of words per document has to be fixed as a setting of this layer. If this is correct, I think I’m finally understanding how all this works.
I’m sorry to bother you more, but how does the network works if a document much shorter than the longest document (the number of its words being set as the number of words per document to the embedding layer) is given to the network as training or testing? It just fills the embedding vectors of this non-appearing words as 0? I’ve been looking for ways to convert all the word embeddings of a text to some sort of document embedding, and this just seems a solution too simple to work, or that may work but for short documents (as well as other options like averaging the word vectors or taking the element-wise maximum of minimum).
I’m trying to do sentiment analysis for spanish news, and I have news with like 1000 or more words, and wanted to use pre-trained word embeddings of 300 dimensions each. Wouldn’t it be a size way too huge per document for the network to train properly, or fast enough? I imagine you do not have a precise answer, but I’d like to know if you have tried the above method with long documents, or know that someone has.
Thank you again, I’m sorry for such a long question.
Yes.
We can use padding for small documents and a Masking input layer to ignore padded values. More here:
https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/
Try different sized embeddings and use results to guide the configuration.
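A rough sketch of the masking idea mentioned above, assuming a recurrent model (the Flatten/Dense stack used in this tutorial does not consume the mask, so masking mainly pays off with layers such as LSTM); the sizes are illustrative:

from keras.layers import LSTM

model = Sequential()
# mask_zero=True tells mask-aware layers such as LSTM to skip zero-padded time steps
model.add(Embedding(vocab_size, 300, input_length=max_length, mask_zero=True))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])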
Okay, thank you very much! I will give it a try.
the chinese word how to vector sequence
Sorry?
me bot trying interact comments born with lstm
Hi Jason,
I have successfully trained a model using the word embedding and Keras. The accuracy was at 100%.
I saved the trained model and the word tokens for predictions.
MODEL.save('model.h5', True)
TOKENIZER = Tokenizer(num_words=MAX_NB_WORDS)
TOKENIZER.fit_on_texts(TEXT_SAMPLES)
pickle.dump(TOKENIZER, open('tokens', 'wb'))
When predicting:
– Load the saved model.
– Setup the tokenizer, by loading the saved word tokens.
– Then predict the category of the new data.
I am not sure the prediction logic is correct, since I am not seeing the expected category from the prediction.
The source code is in Github: https://github.com/hilmij/keras-test/blob/master/predict.py
Appreciate if you can have a look and let me know what I am missing.
Best regards,
Hilmi.
Your process sounds correct. I cannot review your code sorry.
What was the problem exactly?
Thank you, Jason! Your examples are very helpful. I hope to get your attention with my question. At training, you prepare the tokenizer by doing:
t = text.Tokenizer();
t.fit_on_texts(docs)
Which creates a dictionary of words:numbers. What do we do if we have a new doc with lost of new words at prediction time? Will all these words go the unknown token? If so, is there a solution for this, like can we fit the tokenizer on all the words in the English vocab?
You must know the words you want to support at training time. Even if you have to guess.
To support new words, you will need a new model.
Hello Jason!
In Example of Using Pre-Trained GloVe Embedding, do you use the word embedding vectors as weights of the embedding layer?
Yes.
Very nice set of Blogs of NLP and Keras – thanks for writing them.
As a quick note for others
When I tried to load the glove file with the line:
f = open('../glove_data/glove.6B/glove.6B.100d.txt')
I got the error
UnicodeDecodeError: ‘charmap’ codec can’t decode byte 0x9d in position 2776: character maps to
To fix I added:
f = open('../glove_data/glove.6B/glove.6B.100d.txt', encoding="utf8")
This issue may have been caused by using Windows.
Thanks for the tip Alex.
Hi Jason,
Wonderful tutorials!
I have a question. Why do we have to one-hot vectorize the labels? Also, if I have a pad sequence of ex. [2,4,0] what the one hot will be? I’m trying to understand better one hot vectorzer.
I appreciate your response!
We don’t one hot encode the labels, we one hot encode the words.
Perhaps this post will help you better understand one hot encoding:
https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
Hi Jason,
Thank you for your excellent tutorial. Do you know if there is a way to build a network for a classification using both text embedded data and categorical data ?
Thank you
Sure, you could have a network with two inputs:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
Thank you Jason
How to do sentence classification using CNN in keras ? please help
See this tutorial:
https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/
Fantastic explanation, thanks so much. I’m just amazed at how much easier this has become since the last time I looked at it.
I’m glad the post helped Stuart!
Hi Jason …the 14 word vocab from your docs is “well done good work great effort nice excellent weak poor not could have better”. For a vocab_size of 14, this one_hot encodes to [13 8 7 6 10 13 3 6 10 4 9 2 10 12]. Why does 10 appear 3 times, for “great”, “weak” and “have”?
Sorry, I don’t follow Stuart. Could you please restate the question?
Hi Jason, the encodings that I provided in the example above came from kerasR with a vocab_size of 14. So let me ask the same question about the uniqueness of encodings using your Part 3 one_hot example above with a vocab_size of 50.
Here different encodings are produced for different new kernels (using Spyder3/Python 3.4):
[[31, 33], [27, 33], [48, 41], [34, 33], [32], [5], [14, 41], [43, 27], [14, 33], [22, 26, 33, 26]]
[[6, 21], [48, 44], [7, 26], [46, 44], [45], [45], [10, 26], [45, 48], [10, 44], [47, 3, 21, 27]]
[[7, 8], [16, 42], [24, 13], [45, 42], [23], [17], [34, 13], [13, 16], [34, 42], [17, 31, 8, 19]]
Pleas note that in the first line, “33” encodes for the words “done”, “work”, “work”, “work” & “done”. In the second line “45” encodes for the words “excellent” & “weak” & “not”. In the third line, “13” encodes “effort”, “effort” & “not”.
So I’m wondering why the encodings are not unique? Secondly, if vocab_size must be much larger then the actual size of the vocabulary?
Thanks
The one_hot() function does not map words to unique integers, it uses a hash function that can have collisions, learn more here
https://keras.io/preprocessing/text/#one_hot
Thanks Jason, In your Part 4 example, the Tokenizer approach always gives the same encodings and these appear to be unique.
Yes, I recommend using Tokenizer, see this post for more:
https://machinelearningmastery.com/prepare-text-data-deep-learning-keras/
Great article Jason.
How do you convert back from an embedding to a one-hot? For example if you have a seq2seq model, and you feed the inputs as word embeddings, in your decoder you need to convert back from the embedding to a one-hot representing the dictionary. If you do it by using matrix multiplication that can be quite a large matrix (e.g embedding size 300, and vocab of 400k).
The output layer can predict integers directly that you can map to words in your vocabulary. There would be no embedding layer on the output.
Hi,
Very helpful article.
I have created word2vec matrix of a sentence using gensim and pre-trained Google News vector. Can I just flatten this matrix to a vector and use that as a input to my neural network.
For example:
each sentence is of length 140 and I am using a pre-trained model of 100 dimensions, therefore:- I have a 140*100 matrix representing the sentence, can i just flatten it to a 14000 length vector and feed it to my input layer?
It depends on what you’re trying to model.
Great article, could you shed some light on how do Param # of 400 and 1500 in two neural networks come from? Thanks
Oh! Is it just vocab_size * # of dimension of embedding space?
1. 50 * 8 = 400
2. 15* 100 = 1500
What do you mean exactly?
Great post! I’m working with my own corpus. How would I save the weight vector of the embedding layer in a text file like the glove data set?
My thinking is it would be easier for me to apply the vector representations to new data sets and/or machine learning platforms (mxnet etc) and make the output human readable (since the word is associated with the vector).
You could use get_weights() in the Keras API to retrieve the vectors and save directly as a CSV file.
get_weights() for what exactly? does it need a loop?
get_weights is a function that will return the weights for a model or layers, depending on what you call it on exactly:
https://keras.io/layers/about-keras-layers/
Clear Short Good reading, always thank you for your work!
Thanks.
Hello Jason,
I have a dataset with which I’ve attained 0.87 fscore by 5 fold cross validation using SVM. Maximum context window is 20 and one hot encoded.
Now, I’ve done whatever has been mentioned and getting an accuracy of 13-15 percent for RNN models where each one has one LSTM cell with 3,20,150,300 hidden units. Dimensions of my pre-trained embeddings is 300.
Loss is decreasing and even reaching negative values, but no change in accuracy.
I’ve tried the same with your CNN,basic ANN models you’ve mentioned for text classification .
Would you please suggest some solution. Thanks in advance.
I have some ideas here:
https://machinelearningmastery.com/improve-deep-learning-performance/
When I copy the code of the first box I get the error:
AttributeError: ‘int’ object has no attribute ‘ndim’
in the line :
model.fit(padded_docs, labels, epochs=50, verbose=0)
Where is the problem?
Copy the the code from the “complete example”.
Hi Jason,
I’ve got the same error, also will running the “compete example”.
What can be the cause?
Try casting the labels to numpy arrays.
i get the same!
I have fixed and updated the examples.
Carsten, you need labels to be numpy.array not just list.
Hi Jason,
If I have unkown words in training set, how can I assign the same random initialize vector to all of the unkown words when using pretrained vector model like glove or w2v. thanks!!!
Why would you want to do that?
If my data is in specific domain and I still want to leverage general word embedding model(e.g. glove.6b.100d trained from wiki), then it must has some OOV in domain data, so. no no mather in training time or inference time it propably may appear some unkown words.
It may.
You could ignore these words.
You could create a new embedding, set vectors from existing embedding and learn the new words.
Amazing Dr. Jason!
Thanks for a great walkthrough.
Kindly advice on the following.
On the step of encoding each word to integer you said: “We could experiment with other more sophisticated bag of word model encoding like counts or TF-IDF”. Could you kindly elaborate on how can it be implemented, as tfidf encodes tokens with floats. And how to tie it with Keras, passing it to an Embedding layer please? I’m keen to experiment with it, hope it could yield better results.
Another question is about input docs. Suppose I’ve preprocessed text by the means of nltk up to lemmatization, thus, each sample is a list of tokens. What is the best approach to pass it to Keras embedding layer in this case then?
I have most posts on BoW, learn more here:
https://machinelearningmastery.com/?s=bag+of+words&submit=Search
You can encode your tokens as integers manually or use the Keras Tokenizer.
Well, Keras Tokenizer can accept only texts or sequences. Seems the only way is to glue tokens together using ‘ ‘.join(token_list) and then pass onto the Tokenizer.
As for the BOW articles, I’ve walked through it theys are so very valuable. Thank you!
Using BOW differs so much from using Embeddings. As BOW would introduce huge sparse array of features for each sample, while Embeddings aim to represent those features (tokens) very densely up to a few hundreds items.
So, BOW in the other article gives incredibly good results with just very simple NN architecture (1 layer of 50 or 100 neurons). While I struggled to get good results using Embeddings along with convolutional layers…
From your experience, would you please advice on that please? Are Embeddings actually viable and it is just a matter of finding a correct architecture?
Nice! And well done for working through the tutorials. I love to see that and few people actually “do the work”.
Embeddings make more sense on larger/hard problems generally – e.g. big vocab, complex language model on the front end, etc.
I see, thank you.
Thanks jason for another great tutorial.
I have some questions :
Isn’t one hot definition is binary one, vector of 0’s and 1?
so [[1,2]] would be encoded to [[0,1,0],[0,0,1]]
how embedding algorithm is done on keras word2vec/globe or simply dense layer(or something else)
thanks
joseph
Sorry, I don’t follow your question. Perhaps you could rephrase it?
Amazing Dr. Jason!
Thanks for a great walkthrough.
The dimension for each word vector like above example e.g. 8, is set randomly?
Thank you
The dimensionality is fixed and specified.
In the first example it is 8, in the second it is 100.
Thank you Dr. Jason for your quick feedback!
Ok, I see that the pre-trained word embedding is set to 100 dimensionality because the original file “glove.6B.100d.txt” contained a fixed number of 100 weights for each line of ASCII.
However, the first example as you mentioned in here, “The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.”
You choose 8 dimensions for the first example. Does it means it can be set to any numbers other than 8? I’ve tried to change the dimension to 12. It doesn’t appeared any errors but the accuracy drops from 100% to 89%
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 4, 12) 600
_________________________________________________________________
flatten_1 (Flatten) (None, 48) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 49
=================================================================
Total params: 649
Trainable params: 649
Non-trainable params: 0
Accuracy: 89.999998
So, how dimensionality is set? Does the dimensions effects the accuracy performance?
Sorry I am trying to grasp the basic concept in understanding NLP stuff. Much appreciated for your help Dr. Jason.
Thank you
No problem, please ask more questions.
Yes, you can choose any dimensionality you like. Larger means more expressive, required for larger vocabs.
Does that help Anna?
Yes indeed Dr. Now I can see that the dimensionality is set depends on the number of vocabs.
Thank you again Dr Jason for enlighten me! 🙂
You asked good questions.
Hi Jason,
thanks for amazing tutorial.
I have a question. I am trying to do semantic role labeling with context window in Keras. How can I implement context window with embedding layer?
Thank you
I don’t know, sorry.
Hi, great website! I’ve been learning a lot from all the tutorials. Thank you for providing all these easy to understand information.
How would I go about using other data for the CNN model? At the moment, I am using just textual data for my model using the word embeddings. From what I understand, the first layer of the model has to be the Embeddings, so how would I use other input data such as integers along with the Embeddings?
Thank you!
Great question!
You can use a multiple-input model, see examples here:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
Thank you for the fast reply!
Hi Jason, this tutorial is simple and easy to understand. Thanks.
However, I have a question. While using the pre-trained embedding weights such as Glove or word2vec, what if there exists few words in my dataset, which weren’t present in the dataset on which word2vec or Glove was trained. How does the model represent such words?
My understanding is that in your second section (Using Pre-Trained Glove Embeddin), you are mapping the words from the loaded weights to the words present in your dataset, hence the question above.
Correct me, if it’s not the way I think it is.
You can ignore them, or assign them to zero vector, or train a new model that includes them.
Hi Jason. Thanks for this and other VERY clear and informative articles.
Wanted to add one question to Aditya post:
“mapping the words from the loaded weights to the words present in your dataset”
how does it mapping? does it use matrix index == word number from (padded_docs or)?
I am asking because – what if I pass embedding_matrix with origin order, but will shuffle padded_docs before model.fit?
Words must be assigned unique integers that remain consistent across all data and embeddings.
Hi Jason,
I am trying to train a Keras LSTM model on some text sentiment data. I am also using GridSearchCV in sklearn to find the best parameters. I am not quite sure what went wrong but the classification report from sklearn says:
UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
Below is what the classification report looks like:
precision recall f1-score support
negative 0.00 0.00 0.00 98
positive 0.70 1.00 0.83 232
avg / total 0.49 0.70 0.58 330
Do you know what the problem is?
Perhaps you are trying to use keras metrics with sklearn? Watch the keywords you use when specifying the keras model vs the sklearn evaluation (CV).
Hi Jason,
your blog is really really interesting. I have a question: What is the difference between using word2vec and texts_to_sequences from the Tokenizer in Keras? I mean in the way the texts are represented.
Is any of the two options better than the other?
Thanks a lot.
Kind regards.
word2vec encodes words (integers) to vectors. texts_to_seqences encodes words to integers. It is a step before word2vec or a step before bag of words.
Hi Jason,
I have a dataframe which contains texts and corresponding labels. I have used gensim module and used word2vec to make a model from the text. Now I want to use that model for input into Conv1D layers. Can you please tell me how to load the word2vec model in Keras Embedding layer? Do I need to pre-process the model in some way before loading? Thanks in advance.
Yes, load the weights into an Embedding layer, then define the rest of your network.
The tutorial above will help a lot.
This is really helpful. You make us awesome at what we do. Thanks!!
I’m glad to hear that.
Thank you for this extremely helpful blog post. I have a question regarding to interpreting the model. Is there a way to know / visualize the word importance after the model is trained? I am looking for a way to do so. For instance, is there a way to find like the top 10 words that would trigger the model to classify a text as negative and vice versa? Thanks a lot for your help in advance
There may be methods, but I am not across them. Generally, neural nets are opaque, and even weight activations in the first/last layers might be misleading if used as importance scores.
Maybe look at Lime
https://github.com/marcotcr/lime
Thanks.
Hi Jason ,
Can you please tell me the logic behind this:
vocab_size = len(t.word_index) + 1
Why we added 1 here??
So that the word indexes are 1-offset, and 0 is reserved for padding / no data.
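A small illustration of why the +1 is needed (the exact integers assigned may differ):

from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(['Well done', 'Good work'])
print(t.word_index)           # e.g. {'well': 1, 'done': 2, 'good': 3, 'work': 4} -- indexes start at 1
print(len(t.word_index) + 1)  # 5: the Embedding input_dim must cover indexes 0..4, with 0 reserved for padding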
Hi Jason,
If I want to use this model to predict next word, can I just change the output layer to Dense(100, activation = ‘linear’) and change the loss function to MSE?
Many thanks,
Ray
Perhaps look at some of the posts on training a language model for word generation:
https://machinelearningmastery.com/?s=language+model&submit=Search
Thanks for this tutoriel ! Really clear and usefull !
You’re welcome.
Hi Jason,
U R the best in keras tutorials and also replying the questions. I am really grateful.
Although I have understood the content of the context and codes U have written above, I am not able to understand what you mean about this sentence:[It might be better to filter the embedding for the unique words in your training data.].
what does “to filter the embedding” mean??
Thank you for replying.
It means, only have the words in the embedding that you know exist in your dataset.
Hi Jason,
Thank U 4 replying but as I am not a native English speaker, I am not sure whether I got it or not. You mean to remove all the words which exist in the glove but do not exist in my own dataset?? in order to raise the speed of implementation?
I am sorry to ask it again as I did not understand clearly.
Thank U in advance Jason
Exactly.
Hi,
This post is great. I am new to machine learning, so I have a question which might be basic. From what I understand, the model takes the embedding matrix and text along with labels at once. What I am trying to do is concatenate a POS tag embedding with each pre-trained word embedding, but the POS tag can be different for the same word depending upon the context. It essentially means that I can't just alter the embedding matrix I add to the network's Embedding layer. I want to take each sentence, find its embedding, concatenate it with the POS tag embedding, and then feed it into the neural network. Is there a way to do the training sentence by sentence or something? Thanks
You might be able to use the embedding and pre-calculate the vector representations for each sentence.
Sorry but i didn’t quite understand.Can you please elaborate a little?
Sorry, I mean that you can prepare an word2vec model standalone. Then pass each sentence through it to build up a list of vectors. Concat the vectors together and you have a distributed sentence representation.
Thanks alot! One more thing, is it possible to pass other information to the embedding layer than just weights?For example i was thinking that what if i dont change the embedding matrix at all and create a separate matrix of POS tags for whole training data which is also passed to the embedding layer which concatenates these both sequentially?
You could develop a model that has multiple inputs, for example see this post for examples:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
Thanks. I saw this post. Your model has separate inputs but they get merged after flattening. In my case I want to pass the embeddings to the first convolutional layer, only after they are concatenated. Up till now, what I did was create another integerized sequence of my data according to POS tags (embedding_pos) to pass as another input, and another embedding matrix that contains the embeddings of all the POS tags.
e=(Embedding(vocab_size, 50, input_length=23, weights=[embedding_matrix], trainable=False))
e1=(Embedding(38, 38, input_length=23, weights=[embedding_matrix_pos], trainable=False))
merged_input = concatenate([e,e1], axis=0)
model_embed = Sequential()
model_embed.add(merged_input)
model_embed.fit(data,embedding_pos, final_labels, epochs=50, verbose=0)
I know this is wrong but i am not sure how to concat those both sequences and if you can direct me in right direction,it would be great.The error is
‘Layer concatenate_6 was called with an input that isn’t a symbolic tensor. Received type: . Full input: [, ]. All inputs to the layer should be tensors.’
Perhaps you could experiment and compare the performance of models with different merge layers for combining the inputs.
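For reference, a minimal functional-API sketch of the kind of concatenation discussed in this exchange; the input length (23) and POS vocabulary size (38) are assumptions taken from the question, the POS embedding size is arbitrary, and vocab_size / embedding_matrix refer to the variables defined in the tutorial above:

from keras.layers import Input, Embedding, Conv1D, Flatten, Dense, concatenate
from keras.models import Model

word_input = Input(shape=(23,))
pos_input = Input(shape=(23,))
# one embedding per input; the word embedding is seeded with pre-trained weights
word_embed = Embedding(vocab_size, 100, weights=[embedding_matrix], trainable=False)(word_input)
pos_embed = Embedding(38, 10)(pos_input)
# concatenate along the feature axis so each time step carries word + POS features
merged = concatenate([word_embed, pos_embed])  # shape (None, 23, 110)
conv = Conv1D(filters=32, kernel_size=3, activation='relu')(merged)
output = Dense(1, activation='sigmoid')(Flatten()(conv))
model = Model(inputs=[word_input, pos_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit([word_sequences, pos_sequences], labels, ...)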
Hi Jason, awesome post as usual!
Your last sentence is tricky though. You write:
“In practice, I would encourage you to experiment with learning a word embedding using a pre-trained embedding that is fixed and trying to perform learning on top of a pre-trained embedding.”
Without the original corpus, I would argue, that’s impossible.
In Google’s case, the original corpus of around 100 billion words is not publicly available. Solution? I believe you’re suggesting “Transfer Learning for NLP.” In this case, the only solution I see is to add manually words.
E.g. you need ‘dolares’ which is not in Google’s Word2Vec. You want to have similar vectors as ‘money’. In this case, you add ‘dolares’ + 300 vectors from money. Very painful, I know. But it’s the only way I see to do “Transfer Learning with NLP”.
If you have a better solution, I’d love your input.
Cheerio, a big fan
Not impossible, you can use an embedding trained on a other corpus and ignore the difference or fine tune the embedding while fitting your model.
You can also add missing words to the embedding and learn those.
Remember, we want a vector for each word that best captures their usage, some inconsistencies does not result in a useless model, it is not binary useful/useless case.
Thank you very much for the detailed answer!
The link for the Tokenizer API is this same webpage. Can you update it please?
Fixed, thanks.
Hi Jason, great post!
I have successfully trained my model using the word embedding and Keras. I saved the trained model and the word tokens.Now in order to make some predictions, do i have to use the same tokenizer with one that i used in the training?
Correct.
Thank you very much!
Hi Jason, when i was looking for how to use pre-trained word embedding,
I found your article along with this one:
https://jovianlin.io/embeddings-in-keras/
They have many similarities.
Glad to hear it.
Hey jason,
I am trying to do this but sometime keras gives the same integer to different words. Would it be better to use scikit encoder that converts words to integers?
This might happen if you are using a hash encoder, as Keras does, but calls it a one hot encoder.
Perhaps try a different encoding scheme of words to integers
Hi Jason,
I have implemented the above tutorial and the code works fine with GloVe. I am so grateful abot the tutorial Jason.
but when I download GoogleNews-vectors-negative300.bin which is a pre-trained embedding word2vec it gave me this error:
File “/home/mary/anaconda3/envs/virenv/lib/python3.5/site-packages/gensim/models/keyedvectors.py”, line 171, in __getitem__
return vstack([self.get_vector(entity) for entity in entities])
TypeError: ‘int’ object is not iterable.
I wrote the code as the same as your code which you wrote for loading glove but with a little change.
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
for line in model:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
model.close()
print('Loaded %s word vectors.' % len(embeddings_index))
embedding_matrix = zeros((vocab_dic_size, 300))
for word in vocab_dic.keys():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[vocab_dic[word]] = embedding_vector
I saw you wrote a tutorial about creating word2vec by yourself in this link “https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/”,
but I have not seen a tutorial about applying a pre-trained word2vec in the same way as GloVe.
Please guide me to solve the error: how do I apply the GoogleNews-vectors-negative300.bin pre-trained word2vec?
I am so sorry to write a lot as I wanted to explain in detail to be clear.
any guidance will be appreciated.
Best
Meysam
Perhaps try the text version as in the above tutorial?
Hi Jason
Thank you very much for replying, but as my English is weak, the meaning of this sentence is not clear to me. What do you mean by "try the text version"?
In fact, GloVe comes as .txt files and I implemented that correctly, but when I run the program with GoogleNews-vectors-negative300.bin, which is a pre-trained word2vec embedding, it gives me the error; that file is binary and there is no pre-trained word2vec file with a .txt extension.
Can you help me, though I know you are busy?
Best
Meysam
You are trying to parse the binary (.bin) file as though it were a text file. Try downloading and using a text-format embedding instead, such as GloVe.
You can get .txt versions here:
https://nlp.stanford.edu/projects/glove/
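Alternatively, if you want to keep the binary word2vec file, gensim can query it directly rather than parsing lines. A rough sketch (vocab_dic stands in for your own word-to-integer mapping, e.g. your Tokenizer's word_index):
from numpy import zeros
from gensim.models import KeyedVectors

# load the binary GoogleNews vectors with gensim instead of reading the file line by line
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

vocab_dic = {'work': 1, 'done': 2, 'good': 3}   # replace with your Tokenizer's word_index
embedding_matrix = zeros((len(vocab_dic) + 1, 300))
for word, i in vocab_dic.items():
    if word in w2v:                  # skip words missing from the pretrained vectors
        embedding_matrix[i] = w2v[word]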
How can we use pre-trained word embedding on mobile?
I don’t see why not, other than disk/ram size issues.
Great post! What changes are necessary if the labels are more than binary such as with 4 classes:
labels = array([2,1,1,1,2,0,-1,0,-1,0])
?
E.g. instead of ‘binary_crossentropy’ perhaps ‘categorical_crossentropy’?
And how should the Dense layer change?
If I use: model.add(Dense(4, activation=’sigmoid’)), I get an error:
ValueError: Error when checking target: expected dense_1 to have shape (None, 4) but got array with shape (10, 1)
thanks for your work!
I believe this will help:
https://machinelearningmastery.com/faq/single-faq/how-can-i-change-a-neural-network-from-regression-to-classification
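As a rough sketch of the changes involved for 4 classes (assuming the labels are remapped to the integers 0 to 3):
from numpy import array
from keras.utils import to_categorical

labels = array([2, 1, 1, 1, 2, 0, 3, 0, 3, 0])   # remap any negative labels into 0..3 first
y = to_categorical(labels, num_classes=4)        # one hot encoded, shape (10, 4)

# then change the output layer and the loss:
# model.add(Dense(4, activation='softmax'))
# model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(padded_docs, y, epochs=50, verbose=0)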
Thanks! Also, using Keras's to_categorical to one hot encode the labels was necessary.
One more question: is there a simple way to create the Tokenizer() instance, fit it, save it, and then extend it on new documents? Specifically, so that t.fit_on_texts() can be updated on new data.
I’m not so sure that you can.
It might be easier to manage the encoding/mapping yourself so that you can extend it at will.
Nice.
Hi Jason,
For starters, thanks for this post. Ideal to get things going quickly. I have a couple of questions if you don’t mind:
1) I don’t think that one-hot encoding the string vectors is ideal. Even with the recommended vocab size (50), I still got collisions which defeats the purpose even in a toy example such as this. Even the documentation states that uniqueness is not guaranteed. Keras’ Tokenizer(), which you used in the pre-trained example is a more reliable choice in that no two words will share the same integer mapping. How come you proposed one-hot encoding when Tokenizer() does the same job better?
2) Getting Tokenizer()’s word_index property, returns the full dictionary. I expected the vocab_size to be equal to len(t.word_index) but you increment that value by one. This is in fact necessary because otherwise fitting the model fails. But I cannot get the intuition of that. Why is the input dimension size equal to vocab_size + 1?
3) I created a model that expects a BoW vector representation of each "document". Naturally, the vectors were larger and sparser [(10, 14)], which means more parameters to learn, no? However, in your document you refer to this encoding or tf-idf as "more sophisticated". Why do you believe so? With that encoding don't you lose the word order, which is important to learn word embeddings? For the record, this encoding worked well too, but that's probably due to the nature of this little experiment.
Thank you in advance.
The Keras one_hot() method really just applies a hash. It is better to use a true one hot encoding when needed.
I do prefer the Tokenizer class in practice.
The words are 1-offset, leaving room for 0 for “unknown” word.
tf-idf gives some idea of the frequency over the simple presence/absence of a word in BoW.
Hope that helps.
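Here is a small illustrative sketch of the difference in practice (the sentences are made up):
from keras.preprocessing.text import one_hot, Tokenizer

docs = ['well done', 'good work', 'nice work', 'poor effort']

# one_hot() hashes words into [1, vocab_size) and can collide
print([one_hot(d, 10) for d in docs])

# the Tokenizer assigns each word its own unique integer
t = Tokenizer()
t.fit_on_texts(docs)
print(t.texts_to_sequences(docs))
print(t.word_index)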
Where can I find the file "../glove_data/glove.6B/glove.6B.100d.txt"? I get the following error:
File “”, line 36
f = open(‘../glove_data/glove.6B/glove.6B.100d.txt’)
^
SyntaxError: invalid character in identifier
You must download it and place it in your current working directory.
Perhaps re-read section 4.
I have placed the code and dataset in the same directory. What's wrong with the code?
f = open(‘glove.6B/glove.6B.100d.txt’)
I am facing the following error.
File “”, line 36
f = open(‘glove.6B/glove.6B.100d.txt’)
^
SyntaxError: invalid character in identifier
Perhaps this will help you when copying code from the tutorial:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
Excellent work! This is quite helpful to a novice.
And I wonder, is this useful for languages other than English? I am Chinese, and I wonder whether I can apply this to Chinese text and vocabulary.
Thanks again for your time devoted!
I don’t see why not.
Hi,
Can you explain how word embeddings can be given as hidden state input to an LSTM?
thanks in advance
Word embeddings don’t have hidden state. They don’t have any state.
For example, I have a word and its 50-dimensional embedding. How can I give this embedding as the hidden state input to an LSTM layer?
Why would you want to give them as the hidden state instead of using them as input to the LSTM?
hi, what a practical post!
I have a question. I work on a sentiment analysis project with word2vec as the embedding model with Keras. My problem is that when I want to predict a new sentence as input, I get this error:
ValueError: Error when checking input: expected conv1d_1_input to have shape (15, 512) but got array with shape (3, 512)
Consider that I want to enter a simple sentence like "I'm really sad" with length 3, while my input shape has length 15. I don't know how to reshape it or what to do to get rid of this error.
and this is the related part of my code:
model = Sequential()
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same', input_shape=(15, 512)))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=3, activation='elu', padding='same'))
model.add(Dropout(0.25))
model.add(Conv1D(32, kernel_size=2, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=2, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=2, activation='elu', padding='same'))
model.add(Conv1D(32, kernel_size=2, activation='elu', padding='same'))
model.add(Dropout(0.25))
model.add(Dense(256, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))
would you please help me to solve this problem?
You must prepare the new sentence in exactly the same way as the training data, including length and integer encoding.
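For example, a rough sketch of padding a 3-word sentence of 512-dimensional vectors up to the 15 timesteps the model was trained on (the variable names are illustrative):
import numpy as np

max_len, dim = 15, 512
sentence_vectors = np.random.rand(3, dim)        # stand-in for the 3 embedded words

padded = np.zeros((1, max_len, dim))             # a batch of one sample
padded[0, :sentence_vectors.shape[0], :] = sentence_vectors
# prediction = model.predict(padded)             # now matches input_shape=(15, 512)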
At least, would you mind sharing a suitable source for me to solve this problem, please?
I hope you answer my question as you always do. Thanks.
I’m eager to help and answer specific questions, but I don’t have the capacity to write code for you sorry.
Hi Jason,
I have two questions. First, for new instances, if the length is greater than the model input, shall we truncate the sentence?
Also, since the embedding input is just for the seen training set words, what happens to the predictions-to-word process? I assume it just returns some words similar to the training words not considering all the dictionary words. Of course I am talking about a language model with Glove.
Yes, truncate.
Unseen words are mapped to nothing or 0.
The training dataset MUST be representative of the problem you are solving.
From what I understood from this comment, it is about prediction on test data. Let's assume there are 50 words in the vocabulary, which means sentences will have unique integers up to 50. Now, since the test data must be tokenized with the same instance of the tokenizer, if it has some new words it would have integers 51, 52 and so on. In this case, would the model automatically use 0 for the word embeddings, or could it raise an out-of-bounds exception? Thanks
You would encode unseen words as 0.
All 50 vocabulary words should be indexed from 1 to 50, leaving 0 for words not in the vocabulary. Am I right?
Correct.
can this be described as transfer learning?
Perhaps.
We need to know for our homework pls help !!
Please see https://machinelearningmastery.com/faq/single-faq/can-you-help-me-with-my-homework-or-assignment/
Hi,
I have trained and tested my own network. During my work, when I integer-encoded the sentences and created a corresponding word embedding matrix, it included embeddings for the train, validation and test data as well.
Now, if I want to reload my model to test on some other similar data, I am confused about how the words from this new data would relate to the embedding matrix.
Should I have embeddings for the test data as well, or should I exclude the test data when creating the embedding matrix? Thanks
The embedding is created from the training dataset.
It should be sufficiently rich/representative to cover all the data you expect to see in the future.
New data must have the same integer encoding as the training data prior to being mapped onto the embedding when making a prediction.
Does that help?
Yes, I understand that I should use the same tokenizer object for encoding both the train and test data, but I am not sure how the embedding layer would behave for a word or index that isn't part of the embedding matrix. Obviously the test data will have similar words, but some words are bound to be new. Would you say it is right to include the test data too when creating the embedding matrix? And if you want to predict using some pre-trained model, how can I deal with this issue? A small example would be really helpful. Thanks a lot for all the help and time!
It won’t. The encoding will set unknown words to 0.
It really depends on the goal of what you want to evaluate.
I want to train my model to predict the target word given a 5-word sequence. How can I represent my target word?
Probably using a one hot encoding.
Hello Jason,
This is regarding the output shape of the first embedding layer : (None,4,8).
Am I correct in understanding that the 4 represents the input size which is 4 words and the 8 is the number of features it has generated using those words?
I believe so.
Hi Jason,
Thanks for sharing your knowledge.
My task is to classify a set of documents into different categories (I have a training set of 100 documents and, say, 10 categories).
The idea is to extract the top M words (say 20) from the first few lines of each doc, convert the words to word embeddings and use them as feature vectors for the neural network.
Question: since I take the top M words from the document, they may not be in the "right" order each time, meaning there can be different words at a given position in the input layer (unlike a bag-of-words model). Won't this approach prevent the neural network from converging?
Regards,
Srini
The key is to assign the same integer value to each word prior to feeding the data into the embedding.
You must use the same text tokenizer for all documents.
Hi Jason,
Thank you for your great explanation. I have used the pre-trained Google embedding matrix in my seq2seq project using an encoder-decoder, but in my test I have a problem: I don't know how to do a reverse lookup on my embedding matrix. Do you have a sample project? My solution is: when my decoder predicts a vector, I search for that vector in my pre-trained embedding matrix, find its index, and then look up its related word. Am I right?
Why would you need to reverse it?
Hi Jason
Thanks for an excellent tutorial. Using your methods, I’ve converted text into word index and applied word embeddings.
Like Fatemeh, I'm wondering if it's possible to reverse the process and convert embedding vectors back into text? This could be useful for applications such as text summarisation.
Thank you.
Yes, each vector has an int value known by the embedding and each int has a mapping to a word via the tokenizer.
Random vectors in the space do not; you will have to find the closest vector in the embedding using Euclidean distance.
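For example, a rough sketch of mapping an arbitrary point in the space back to its closest word (made-up documents, and it assumes a recent Keras where the Tokenizer exposes index_word):
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding
from keras.preprocessing.text import Tokenizer

docs = ['well done', 'good work', 'nice work', 'poor effort']
t = Tokenizer()
t.fit_on_texts(docs)

model = Sequential()
model.add(Embedding(len(t.word_index) + 1, 8, input_length=2))   # stands in for your trained model

vectors = model.layers[0].get_weights()[0]      # the (vocab_size, 8) embedding matrix
query = vectors[3]                              # any point in the embedding space

distances = np.linalg.norm(vectors[1:] - query, axis=1)   # skip index 0 (reserved for unknown)
closest = int(np.argmin(distances)) + 1
print(t.index_word[closest])                    # integer index back to its word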
Dear Dr. Jason,
I get Accuracy: 89.999998 on my laptop; does the result differ from computer to computer?
Well done!
Results are different each run, learn more here:
https://machinelearningmastery.com/randomness-in-machine-learning/
Hi,
So many thanks for this tutorial!
I’ve been trying to train a network that consists of an Embedding layer, an LSTM and a softmax as the output layer. However, it seems that the loss and accuracy get stuck at some point.
Do you have any advice?
Thanks in advance.
Yes, I have a ton of advice right here:
https://machinelearningmastery.com/improve-deep-learning-performance/
Thank you so much,
It helped me a lot in learning how to use pre-trained embeddings in neural nets.
I’m happy to hear that.
Hi Jason, thank you for the great material.
I have one doubt. I want to create embeddings for a list of 1,200 documents to use as input to a classification model that predicts movie box office from the movie script text.
My question is: if I train the embedding on the vocabulary of the real dataset, how can I afterwards classify the rest of the dataset that was not used in training? Can I use the embeddings learned during training as input to the classification model?
Good question.
You must ensure that the training dataset is representative of the broader problem. Any words unseen during training will be marked zero (unknown) by the tokenizer unless you update your model.
Thank You Jason. As soon as I get the results I’ll try to share it here.
I’d like to thank you too about your great platform, it is being very helpful to me.
You’re welcome.
Nice post once again! It seems that in each batch all embeddings are updated, which I think should not happen. Do you have any idea how to update only the ones that are passed each time? This is for computational reasons or other problem-definition related reasons.
I’m not sure what you mean exactly, can you elaborate?
Hello Jason, I would like to thank you for this post, it's really interesting and understandable.
I've reused the script, but instead of using the "docs" and "labels" lists I used the IMDB movie reviews dataset. The problem is that I can't reach more than 50% accuracy and the loss is stuck at 0.6932 across all epochs.
What do you think about that ?
I have some suggestions here:
https://machinelearningmastery.com/improve-deep-learning-performance/
Okay I’ll check it out, thank you Jason
Thanks for the article. Could you also provide an example of how to train a model with only one Embedding layer? I’m trying to do the same with Keras but the problem is that the fit method asks for labels which I don’t have. I mean I only have a bunch of text files that I’m trying to come up with the mapping for.
Models typically only have one embedding layer. What do you mean exactly?
Hello,
Thank you for the excellent explanation!
I have a few questions related to unknown words.
Some pretrained word embeddings like the GoogleNews embeddings have an embedding vector for a token called ‘UNKNOWN’ as well.
1. Can I use this vector to represent words that are not present in the training set instead of the vector of all zeros? If so, how should I go about loading this vector into the Keras Embedding layer? Should it be loaded at the 0th index in the embedding matrix?
2. Also, can I use the Tokenizer API to help me convert all unknown words (words not in the training set) to ‘UNKNOWN’?
Thank you.
Yes, find the integer value for the unknown word and use that to assign to all words not in the vocab.
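One way to sketch this with the Tokenizer's oov_token argument (the 'UNKNOWN' vector itself is assumed to come from your pretrained file):
from numpy import zeros
from keras.preprocessing.text import Tokenizer

docs = ['well done', 'good work', 'nice work', 'poor effort']
t = Tokenizer(oov_token='UNKNOWN')     # reserves an index for out-of-vocabulary words
t.fit_on_texts(docs)

vocab_size = len(t.word_index) + 1
embedding_matrix = zeros((vocab_size, 300))
# embedding_matrix[t.word_index['UNKNOWN']] = unknown_vector   # the pretrained 'UNKNOWN' vector, if available
# ...fill the remaining rows from the pretrained embedding as in the tutorial

print(t.texts_to_sequences(['good unseen word']))   # unseen words map to the oov index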
Hi,
If the word embedding doesn't contain a word we input to the model, how do we address this issue?
1) Is it possible to load additional words (besides those in our vocabulary) into the embedding matrix?
Or may be any other elegant way you would like to suggest?
It is marked as “unknown”.
Hi, thanks a lot for your post. I'm new to Python and deep learning!
I have a 240,000-tweet training set (50% male and 50% female classes) and a 120,000-tweet test set (50% male and 50% female). I want to use an LSTM in Python but I get the following error at the fit() method:
ValueError: Error when checking input: expected lstm_16_input to have 3 dimensions, but got array with shape (120000, 400)
can you help me?
It looks like a mismatch between your data and the model, you can change the data or change the model.
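For example, an LSTM expects 3D input of [samples, timesteps, features]; here is a rough sketch where an Embedding layer in front of the LSTM supplies the third dimension for integer-encoded sequences of length 400 (the sizes are assumptions):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, seq_len = 10000, 400

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=seq_len))   # (None, 400) -> (None, 400, 100)
model.add(LSTM(64))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()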
Hi Jason, Thanks for this article.
I am getting this error
TypeError: ‘OneHotEncoder’ object is not callable
How do I overcome it?
Thanks
I have some suggestions here:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
Hi, I have 2 models with these embedding layers; how do I merge those models?
Thanks
What do you mean exactly? An ensemble model?
Hi Jason, great tutorial. I am very new to all this. I have a query: you are using GloVe for the embedding layer, but during fitting you are directly using padded_docs. The values in padded_docs have no relation to GloVe. I am sure that I am missing something, please enlighten me.
The padding just adds 0s to ensure the sequences are the same length. It does not affect the encoding.
Hi, Jason. Considering the “3. Example of Learning an Embedding”, I’m adding “model.add(LSTM(32, return_sequences=True))” after the embedding layer and I would like to understand what happens. The number of parameters returned for this LSTM layer is “5248” and I don’t know how to calculate it. Thank you.
Each unit in the LSTM will take the entire embedding as input, therefore must have one weight for each dimension in the embedding.
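As a worked check of that number: a standard LSTM layer has 4 * ((input_dim + units) * units + units) parameters, one set of input weights, recurrent weights and biases for each of the four gates. With 32 units reading the 8-dimensional embedding, that is 4 * ((8 + 32) * 32 + 32) = 5,248.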
Hi Jason,
Do you have any example showing how we can use a bi-directional LSTM on text (i.e., using word embeddings)?
It is the same as using the LSTM except, the LSTM is wrapped in a Bidirectional Layer wrapper.
Here’s an example of how to use the Bidirectional wrapper:
https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/
I am interested in using the predict function to predict new docs. For instance, ‘Struck out!’ My understanding is that if one or more words in the doc you want to predict weren’t involved in training, then the model can’t predict it. Is the solution to simply train on enough docs to make sure the vocabulary is extensive enough to make new predictions in this way?
Yes, or mark new words as "unknown" at prediction time.
Hello Jason, Is there any reason that the output of the Embedding layer is a 4×8 matrix?
No, it is just an example to demonstrate the concept.
Hi, Jason. Thanks a lot for this excellent tutorial. I have a quick question about the Keras Embedding layer.
vocab_size = len(t.word_index) + 1
t.word_index starts from 1 and ends at 15. Therefore, there are 15 words in total in the vocabulary. Then why do we need to add 1 here, please?
Thanks a lot for the help!
The words are 1-offset and index 0 is left for “unknown” words.
Would you mind specifying this answer? This is still not quite clear to me
Sure, which part is not clear?
What exactly does 1-offset mean in this context?
And if “index 0 is left for unknown words”, wouldn’t that imply that you could ignore them?
And if I use this instead:
vocab_size = len(t.word_counts.keys()) + 1
Do I also have to add the 1?
Words not in the vocab are assigned the value of 0 which maps to the index of the first vector and represents unknown words.
The first word in the vocab is mapped to the vector at index 1. The second word maps to the vector at index 2, and so on until the total number of vectors is handled.
Does that help?
Thank you for your response, but I am not quite getting this. My understanding is that when I work with t.word_counts instead of t.word_index (not sure if it makes a difference), I get something like this:
OrderedDict([('word1', 35), ('word2', 232)])
Then, if I use len(t.word_counts) it gives me 2 in this case. Why am I then adding 1?
That is, if I use t.word_counts I don’t see any unknown word when I print it out.
We are not adding 1 to the counts. We are not touching the ordered dict of words.
We are just adding one more word to our vocabulary called “unknown” that maps to the first vector in the embedding.
The tensorflow documentation also says to add 1, but I am really not sure why:
https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
To make room for “unknown” – words not in the vocab.
Adding 1 means adding one word to the vocab, the first word mapped to index 0 – word number zero.
Are you aware of any resource that might explain why we have to add 1? I understand it is because of unknown words, so we increase the given vocab size by 1. But I don’t understand why that is.
We don’t have to do it, we could simply ignore words that are not in the vocab.
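A small sketch of where the +1 comes from: the Tokenizer numbers words from 1, so the largest integer that can reach the Embedding equals len(word_index), and input_dim must be the maximum index plus one:
from keras.preprocessing.text import Tokenizer

docs = ['well done', 'good work', 'nice work', 'poor effort']
t = Tokenizer()
t.fit_on_texts(docs)

print(t.word_index)                 # e.g. {'work': 1, 'well': 2, ...}: indices start at 1, not 0
print(len(t.word_index))            # 7
print(t.texts_to_sequences(docs))   # only integers 1..7 appear; 0 is never produced
# Embedding(input_dim=len(t.word_index) + 1, ...) leaves row 0 free for padding/unknown words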
Hi Jason,
If I have three columns of (string type) multivariate data, one column is categorical and the other two columns are not. Is it OK if I integer encode them using LabelEncoder(), and then scale the encoded data using a feature scaling method like MinMaxScaler, StandardScaler, etc. before feeding it into an anomaly detection model? The ROC shows an impressive result, but is it valid to pre-process text data like that?
Thank you.
Perhaps try it and compare performance.
I have tried it and it shows nearly 100% ROC. What I mean is: is it correct to pre-process the text data like that? When I checked your post on pre-processing text data, there was no feature scaling (MinMaxScaler, StandardScaler, etc.) on the text data after encoding it to integers. I'm afraid the way I pre-process the data is not correct.
Generally text is encoded as integers then either one hot encoded or mapped to a word embedding. Other data preparation (e.g. scaling) is not required.
Hello
I was trying to do multi class classification of text data using Keras in R and Python.
In Python I was able to get the predicted labels from the encoded class values using the inverse_transform() method. But when I try to do the same in R using the CatEncoders library, I get some of the labels as NAs. Any reason for that?
No need to transform the prediction, the model can make a class or probability prediction directly.
Hi Jason,
Thanks for sharing! I have a question on word embeddings. Correct me if I am wrong: I noticed the word embedding created here only contains words in the training/test set. I would think a word embedding including all of the vocab in the GloVe file would be better? For example, in production we might encounter a word that was not in the training/test set but is part of the GloVe vocab; in that case we could still capture its meaning even though we did not see it during training. I think this would benefit sentiment classification problems with smaller training sets?
Thanks!
Regards
Xiaohong
Generally, you carefully choose your vocab. If you want to maintain a larger vocab than is required of the project “just in case”, go for it.
Jason,
Are there non-text applications of embeddings? For example, I have large sets of categorical variables, each with a very large number of levels, which go into a classification model. Could I use embeddings in such a case?
Rahul
Yes, embeddings are fantastic for categorical data!
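A rough sketch of what that can look like for a single high-cardinality categorical feature (the sizes are made up):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

n_levels = 1000   # category levels, integer encoded as 1..999 with 0 reserved for unseen levels

model = Sequential()
model.add(Embedding(n_levels, 10, input_length=1))   # each level becomes a learned 10-dim vector
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()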
Hi Jason…
Thanks a lot for such a nice post. It enriched my knowledge a lot.
I have one doubt about the texts_to_sequences and one_hot methods provided by Keras. Both of them give the same encoded docs with your example. If they give the same output, when should we use texts_to_sequences and when should we go for one_hot?
Again, thanks a lot for such a nice post.
Use the approach that you prefer.
Hi Jason,
Your posts are really superb. Thanks for writing such great posts.
I have one query: why do people use an Embedding layer when we already have the vector representation of each word from word2vec or GloVe? Using these two pre-trained models we already get a same-size vector for each word, and if a word is not found we can assign a random vector of the same size. After getting the vector representation, why do we pass it to an Embedding layer?
Thanks
Often the learned embedding in the neural net performs better because it is specific to the model and prediction task.
Hi Jason,
Thanks for the reply .
What if I set trainable=False? Is the Embedding layer still needed when I already have the vector representation of each word in the sequence from word2vec or GloVe?
Thanks
Yes, you can keep the embedding fixed during training if you like. Results are often not as good in my experience.
Hello Jason, thank you for this very good tutorial. I have a question: I trained your model on the IMDB sentiment analysis dataset. The model has 80% accuracy, but I get very bad embeddings.
First, a short detail: I used one_hot, which uses a hashing trick, but I hashed with md5 because the default hash function is not consistent across runs, as mentioned in the Keras docs (so for saving the model and predicting new documents the default is not good, am I right?). But the important thing is that I have very bad embeddings. I created a dict which maps lowercased words to their embeddings (of size 8), following this: https://stackoverflow.com/questions/51235118/how-to-get-word-vectors-from-keras-embedding-layer I didn't use GloVe vectors for now.
I searched for the most similar words and got random words for "good" (holt, christmas, stodgy, artistic, szabo, mandatory...). I set the vocab size to 100,000. Of course, due to the hashing trick, two words can have the same index, so I don't take into account similarities of 1.0. I think the bad embedding vectors are due to the fact that we train the embeddings on the entire document and not on a context window like word2vec; what do you think?
Generally, the embeddings learned by a model work better than pre-fit embeddings, at least in my experience.
Hello, such a great tutorial. All your tutorials are very helpful! Thank you.
I want to find embeddings of three different Greek sentences (for classification). Then I want to merge them pairwise and fit my model.
I have read your tutorial ‘How to Use the Keras Functional API for Deep Learning’ which is very helpful for the merge.
My question is: is there any way to calculate the embeddings beforehand to use as input to my model? Must I have three different models to calculate the embeddings?
Thank you in Advance.
Yes, you could prepare the input manually if you wish: each word is mapped to an integer, and the integer will be an index in the embedding. From that you can retrieve the vector for each word and use that as input to a model.
Thank you!!!
Hi Jason,
Thank you so much for this wonderful explanation. After reading many other resources, I only understood embedding layers after reading this. I have a few questions and I'd really appreciate it if you could take the time to answer them.
1) In your code, you used a Flatten layer after the Embedding layer and before the Dense layer. In a few other places I noticed that a GlobalAveragePooling1D() layer is used in place of the Flatten. Can you explain what global average pooling does and why it's used for text classification?
2) You explained in one of the comments that each word will have only one vector representation before and after the training. So just to confirm: for vocab size 500 and embedding dimension 10 (a [500, 10] weight matrix), if word x maps to the first vector, then every time word x is input only that vector is updated, and the vectors of words that are not present are not updated?
3) What’s the intuition behind choosing the size of the embedding dimension?
Thank you again Jason. Will be waiting for your response.
Mohit
Either approach can be used to collapse the sequence of word vectors into a fixed-length input for the Dense layer: Flatten concatenates the vectors, global average pooling averages them. Perhaps try both and see what works best for your specific dataset.
The vectors for a given word are learned. After the model is finished training, the vectors are fixed for each word. If a word is not present during training, the model will not be able to support it.
Embedding must be large enough to capture the relationship between words. Often 100 to 300 is more than enough for a large vocab.
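A small illustrative comparison of the two (the layer sizes are arbitrary):
from keras.models import Sequential
from keras.layers import Embedding, Flatten, GlobalAveragePooling1D, Dense

def build(collapse):
    model = Sequential()
    model.add(Embedding(500, 10, input_length=4))   # output: (None, 4, 10)
    model.add(collapse)
    model.add(Dense(1, activation='sigmoid'))
    return model

build(Flatten()).summary()                   # Flatten output: (None, 40), word order preserved
build(GlobalAveragePooling1D()).summary()    # Pooling output: (None, 10), order averaged away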
Very useful post for beginners. But I have a doubt regarding the size of the embedding vectors. As you mentioned in the post:
“The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.
The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. ”
I did not understand why the output from the embedding layer should be 4 vectors. If it is related to the input length, please explain how. Also, I did not understand the phrase "one for each word".
A good rule of thumb is to start with a length of 50, and try increasing to 100 or 200 and see if it impacts model performance.
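A quick way to see the shape in question: with input_length=4 and an 8-dimensional embedding, the layer outputs one 8-element vector per input word.
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(50, 8, input_length=4))
print(model.output_shape)   # (None, 4, 8): 4 words in, 4 vectors of 8 values out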
Hi Jason, I'm facing trouble while using bag of words as features in a CNN. Do you have any idea how to implement a BoW-based CNN?
It is not clear to me why/how you would pass a BoW encoded doc to a CNN – as all spatial information would be lost. Sounds like a bad idea.
Hi! Thanks for a good tutorial. I have a question regarding embeddings. How does the performance compare when using an Embedding layer to train embeddings vs. using a pre-trained embedding? Is it faster, and does the model require less training time, when using a pre-trained embedding?
Thanks for answer.
Embeddings trained as part of the model seem to perform better in my experience.
Hello, great post!
I want to know more about vocabulary size in Keras Embedding Layer. I am working with Greek and Italian languages. Do you have any scientific paper to suggest?
Thank you very much!
Perhaps test a few different vocab sizes for your dataset and see how it impacts model performance?
Ok, thank you very much!
Hi Jason
What needs to change when using an unsupervised model on a pre-trained embedding matrix?
Also, what if you want to use weak supervision?
You can follow the example here:
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
Good
Thanks.
Hi Jason,
For the one-hot encoder, if we are given a new test set, how can we use the one_hot function to get the same matrix space as we had for the training set, since we cannot have a separate one_hot encoder for the test set?
Thank you very much.
You can keep the encoder objects used to prepare the data and use them on new data.
Hi,
I often see implementations like yours, where the embedding layer is built from a word_index that contains all the words, built e.g. with the Keras preprocessing Tokenizer API. But if I fit my corpus with the Tokenizer using a vocabulary size limit (num_words), why would I need an embedding layer sized to the total number of unique words? Wouldn't it be a waste of space? Is there any issue with building an embedding layer sized to the vocabulary limit I actually need?
Not really, the space is small.
It can be good motivation to ensure your vocab only contains words required to make accurate predictions.
Hi Mr Jason,
Excellent tutorial indeed!
I have a question regarding dimensions of embedding layer , could you please help:
e = Embedding(200, 32, input_length=50)
How do we decide the size of output_dim, which is 32 here? Is there any specific reason for this value?
Thanks in advance:
Leena
You can use trial and error and evaluate how different sized output dimensions impact model performance on your specific dataset.
50 or 100 is a good starting point if you are unsure.
Hi,
I trained my model using a word embedding with GloVe, but kindly let me know how to prepare the test data for predicting results with the trained weights. I still haven't found any post that follows the whole process, especially for word embeddings with GloVe.
The word embedding cannot predict anything. I don’t understand, perhaps you can elaborate?
Hello, Jason,
Thanks for the post. I have a question about the embedding data you actually fit. I printed padded_docs after the model compile. It seems to me that the printed matrix is not an embedding matrix; it's an integer matrix. So I think what you fit in the CNN is not the embedding but the integer matrix you define. Could you please help me explain it? Thanks a lot.
Yes, the padded_docs is integers that are fed to the embedding layer that maps each integer to an 8-element vector.
The values of these vectors are then defined by training the network.
Hi Jason,
I am working on character embedding. My dataset consists of raw HTTP traffic both normal and malicious. I have used the Tokenizer API to integer encode my data with each character having an index assigned to it.
Please let me know if I understood this correctly:
My data is integer encoded to values between 1-55, therefore my input_dim is 55.
I will start with an output_dim of 32 and modify this value as needed.
Now for the input_length, I am a bit confused how to set this value.
I have different lengths for the encoded strings in my dataset; the longest is 666. Do I set input_length to 666? And if I do, what happens to the shorter sequences?
Thank you for your help!
Also, should I set the input dim to a value higher than 55?
Do you mean word embedding instead of char embedding?
I don’t have any examples of embedding char’s – I’m not sure it would be effective.
I meant character embedding. I used the Tokenizer and set char_level to True.
I am not sure how to use word embedding for query strings of http traffic when they are not made of real words and just strings of characters.
I am designing a character level neural network for detecting parameter injection in http requests.
The result would be in a binary format 0 if request is normal and 1 if it’s malicious.
So you don’t think character embedding is helpful here?
Sorry, I don’t have an example of a character embedding.
Nevertheless, you should be able to provide strings of integer-encoded chars to the embedding in your model. It will look much like an embedding for words, just with lower cardinality (<100 chars perhaps). I don't expect good results, though.
What problem are you having exactly?
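In case it helps, here is a rough sketch of what a character-level pipeline could look like (the request strings are made up; results may well be modest):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

requests = ['GET /index.html?id=1', "GET /index.html?id=1' OR '1'='1"]

t = Tokenizer(char_level=True, filters='', lower=False)   # keep punctuation characters
t.fit_on_texts(requests)
padded = pad_sequences(t.texts_to_sequences(requests), maxlen=666, padding='post')

vocab_size = len(t.word_index) + 1    # number of distinct characters + 1
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=666))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(padded, labels, ...) with labels as a numpy array of 0/1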
I have found vocab_size = len(t.word_index) + 1 to be wrong. This index not only ignores the Tokenizer(num_words=X) parameter, but also stores more words than are actually ever going to be encoded. I fit my text without a word limit, then encode the same text using the tokenizer, and the length of word_index is larger than max(max(encoded_texts)).
That is very surprising!
Are you sure there is no bug in your code?
Hi Mario,
Yea. I just noticed the num_words issue with keras Tokenizer today. I think there are already multiple issues regarding it logged on GitHub:
https://github.com/keras-team/keras/issues/8092
Hello Jason, how are you?
I am doing my master's thesis on text summarization using word embeddings, and I am now in the middle of many questions: how could I use these features, and which neural network algorithm is best? Please could you give some guidance...
I believe this will help:
https://machinelearningmastery.com/encoder-decoder-models-text-summarization-keras/
thanks jason…it is really helpful..
I’m happy to hear that.
Hi Jason,
This entire article is very useful. It helped me in writing my initial implementation.
I have one question related to Input() and Embedding() in Keras.
If I already have a pre-trained word embedding, should I use Embedding or Input in that case?
Yes, the embedding vectors are loaded into the Embedding layer and the layer may be marked as not trainable.
What if at test time we have new words that were not in the training text?
They will not get past the embedding layer, correct?
They will be mapped to 0, the "unknown" word.
Hi, Jason
I want to ask how I can save my learned word embedding.
You can use get_weights() on the layer and save the vectors directly as numpy arrays if you like.
Or save the whole Keras model:
https://machinelearningmastery.com/save-load-keras-deep-learning-models/
Can I construct a network for learning the embedding matrix only?
Yes, it is called a word2vec model, here’s how:
https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Hi,
I have a text similarity application where I measure the Pearson correlation coefficient as a Keras metric.
In many epochs, I noticed that the correlation value is NaN.
Is this normal, or is there a problem in the model?
You may have to debug the model to see what the cause of the NAN is.
Perhaps an exploding gradient or vanishing gradient?
Do you mean that I have to adjust the activation function?
I use the ELU activation function and the Adam optimizer.
Do you mean that I have to change either of them and see the results?
Perhaps.
Try relu.
Try batch norm.
Try smaller learning rate.
…
Can I ask what you mean by debugging the model?
Yes, here are some ideas:
– Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
– Consider cutting the problem back to just one or a few simple examples.
– Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
– Consider posting your question and code to StackOverflow.
vocab_size = len(t.word_index) + 1
Why do we need to increase the vocabulary size by 1?
This is so that we can reserve 0 for unknown words and start known words at 1.
Thanks a lot for this wonderful work. We're in July 2019 and still taking advantage of it.
Thanks. I try to write evergreen tutorials that remain useful.
Dear Jason,
thanks a lot for your detailed explanations and the code examples.
What I am still wondering about, however, is how I would combine embeddings with other variables, i.e. having a few numeric or categorical variables and then one or two text variables. As I see you input the padded docs (only text) in model.fit. But how would I add the other variables? It doesn’t seem realistic to always only have one text variable as input.
Good question. You can have a separate input to the model for each embedding, and one for other numeric static vars.
This is called a multi-input model.
This will help:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
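A rough sketch of such a multi-input model with the functional API (the names and sizes are illustrative, not from the tutorial):
from keras.models import Model
from keras.layers import Input, Embedding, Flatten, Dense, concatenate

max_len, vocab_size, n_numeric = 4, 50, 3

text_in = Input(shape=(max_len,), name='text')               # padded integer-encoded text
text_branch = Flatten()(Embedding(vocab_size, 8)(text_in))

numeric_in = Input(shape=(n_numeric,), name='numeric')       # static numeric variables

merged = concatenate([text_branch, numeric_in])
hidden = Dense(16, activation='relu')(merged)
output = Dense(1, activation='sigmoid')(hidden)

model = Model(inputs=[text_in, numeric_in], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy')
# model.fit([padded_docs, numeric_features], labels, epochs=50)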
Thank you Jason for this wonderful work and examples. That really help.
My last comment is related to : https://machinelearningmastery.com/develop-word-embeddings-python-gensim/ !
You’re welcome, I’m glad it helped.
Thank u Jason Brownlee, that’s very interesting and clear.
And what if I have two inputs?
For example, I am working on a Text-to-SQL task, which requires 2 inputs: the user question and the table schema (column names).
How can I process them? How can I do the embedding? With 2 embedding layers?
Thank u for help.
You can design a multiple-input model; I give examples here:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
Ah okay, that’s interesting too, thanks!!
Can you please confirm this architecture for encoding both the user question and the table schema in the same model?
(1) userQuestion ==> one-hot encoding ==> Embedding (GloVe) ==> LSTM
(2) tableSchema ==> one-hot encoding ==> Embedding (GloVe) ==> LSTM
(1) concatenated (merged) with (2) ==> rest of the model layers...
Thanks Jason.
No need for the one hot encoding, the text can be encoded as integers, and the integers mapped to vectors via the embedding.
ok that’s clear, thanks.
Can the attention mechanism only be applied after the merge?
No, you can use attention on each input if you wish.
Should I apply attention to both inputs (user question and table schema) separately, or can I do it after merging the 2 inputs?
Test many different model types and see what works well for your specific dataset.
okay thank you Jason!!
No problem.
I wonder about pre-trained word2vec; there is no good tutorial for that. I am looking to use a pre-trained word2vec but I do not know whether I should follow the same steps as for GloVe or look for another source.
Thanks Mr. Jason, I am very inspired by you in machine learning.
I show how to fit a standalone word2vec here:
https://machinelearningmastery.com/develop-word-embeddings-python-gensim/
Why don't you use mask_zero=True in your embedding layers? It seems necessary since you are padding sentences with 0s.
Great suggestion, not sure if that argument existed back when I wrote these tutorials. I was using masking input layers instead.
hi, can you help me with a question?
I'm working with a dataset that has a city column as a feature, with a lot of different cities, so I created an embedding layer for this feature. First, I used these commands:
data['city'] = data['city'].astype('category')
data['city'] = data['city'].cat.codes
After that, each different city was assigned a value starting at 0.
So I'm confused about how this embedding layer works when the test data has an input that was not in training. I saw that you said that when this occurs we have to use 0 as the input, but 0 is already related to some city. Should I start assigning these values to the cities from 1?
Excellent question.
Typically, unseen categories are assigned 0 for “unknown”.
0 should be reserved and real numbering should start at 1.
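A small sketch of that shift (pandas assumed, city names made up):
import pandas as pd

data = pd.DataFrame({'city': ['paris', 'rome', 'paris', 'berlin']})
data['city'] = data['city'].astype('category').cat.codes + 1   # known cities become 1, 2, 3, ...

# at prediction time, map any city not seen during training to 0
known = {'berlin': 1, 'paris': 2, 'rome': 3}   # mapping saved from training
print(known.get('madrid', 0))                  # unseen city -> 0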
thank you, you always help me a lot!
You’re very welcome Nathalia!
Hi Jason..
Thanks for such a great tutorial. I am confused: when we talk about learned word embeddings, do we mean the weights of the embedding layer or the output of the embedding layer?
Let me ask it another way: when we use a pre-trained embedding, say "glove.6B.50d.txt", are those word embeddings the weights or the output of the layer?
They are the same values: the weights are the word vectors, and the output of the layer is simply the vectors looked up for the words in the input sequence.
Hi Jason,
I am new to ML, trying out different things, and your posts are the most helpful I encountered, it helps me a lot to understand, thank you!
Here I think I understood the procedure, but I still have a deeper question about the point of embeddings. If I understand correctly, this embedding maps a set of words as points onto another dimensional space. The surprising fact in your example is that we pass from a space of dimension 4 to a space of dimension 8, so it might not be seen as an improvement at first.
Still I imagine that the embedding makes it so that points in the new space are more equally placed, am I right? Then I don’t understand several things:
-How does the context where one words appear come into play? Other words which are often close by will also be represented by closer points in the new space?
-Why does it have to be integers? And why is it more applied to word encodings? I mean we could imagine the same process could be helpful for images as well. Or is it just a dimension reduction technique tailored for words documents?
Thank you for your insights anyway
Not equally spaced, but spaced in a way that preserves or best captures their relationships.
Context defines the relationships captured in the embedding, e.g. what words appear with what other words; their closeness.
Each word gets one vector. The simplest way is to map words to integers and integers to the index of vectors in a matrix. No other reason.
Great questions!
Jason,Thank you so much for your time and effort!
My question is related to the line-
“e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)”
Here you are using weights=[embedding_matrix], but there is nothing saying which vector belongs to which word. Then it produces one 4x100 matrix for each document (for example for [6 2 0 0]). How does it extract the vectors related to 6, 2, 0, 0 accurately?
The vectors are ordered; with an array index you can retrieve the vector for each word, e.g. word 0, word 1, word 2, word 3, etc.
Hello Jason,
I'm searching for good training material on Python (NumPy, pandas, ... and tools for ML, DL and NLP) and on Keras!
Some suggestions please?
This might be a good place to start if you are working on NLP:
https://machinelearningmastery.com/start-here/#nlp
Thanks again Jason !!
No problem.
Thanks for the article; I will run it on Italian-language documents. Is there any GoogleNews pre-trained word2vec covering Italian vocabulary?
Good question, I’m not sure off the cuff sorry.
Hello Jason, can I add Dropout to this model?
I don’t see why not.
e = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=4, trainable=False)
What does the parameter weights=[embedding_matrix] stand for: weights or inputs for the Embedding layer?
The weights are the vectors, e.g. one vector for each word in your vocab.
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
Could you please clarify the number of weights learned for the embedding layer? We have 10 documents and an embedding layer from 4 to 8. How many weight parameters will actually be learned here? My understanding was that there should only be (4+1)*8 = 40 weights to learn, including the bias term. Why is it learning weights for all the documents separately (10*(4+1)*8 = 400)?
The number of weights in an embedding is the vector length times the number of vectors (words): here 50 words * 8 = 400. There are no bias terms, and the count does not depend on the number of documents.
Hi, Jason:
In this example, what type of neural architecture is it? It is not an LSTM, not a CNN. Is it a Multi-Layer Perceptron model?
We are just working with embeddings. I guess with an output layer, you can call it an MLP.
Hi, thanks for this excellent article. I tried to use a pre-trained word embedding instead of random initialization in a Keras-based classifier. However, after constructing the embedding matrix and adding it to the embedding layer as follows, all of the accuracy values during the training epochs are the same and no learning happens. After removing weights=[embedding_matrix] it works well and reaches an accuracy of 90%.
layers.Embedding(input_dim=vocab_size, weights=[embedding_matrix],
output_dim=embedding_dim,
input_length=max_len,trainable=True)
What can be the reason of this strange behavior?
thanks
An embedding specialized on your own data is often better than something generic.
Quick question. While using pre-trained embeddings (MUSE) in an Embeddings layer, is it okay to set trainable=True?
Note: The model doesn’t overfit when i set trainable=True. The model doesn’t predict well if i set trainable=False.
Yes, although you might want to use a small learning rate to ensure you don’t wash away the weights.
Thank you very much for your reply. Currently I am using a learning rate of 0.0001.
“Adam(learning_rate=0.0001)”
Perhaps use SGD instead of Adam as Adam will change the learning rate for each model parameter and could get quite large.
Or at least compare results with adam vs sgd.
Hi Jason,
Thank you for your post. I have an issue. The data I use includes both categorical and numerical features; say, some features are cost and time while others are post codes. How should I write the code? Build separate models and concatenate them together? Thank you
Great question!
The features must be prepared separately then aggregated.
The two ways I like to do this is:
1. Manually. Prepare each feature type separately, then concatenate them into input vectors.
2. Multi-Input Model. Prepare the features separately and feed different features into different inputs of a model, and let the model concatenate the features.
Does that help?
Should the categorical be converted into numerical using one-hot encoding?
It can be, or an embedding can be used.
If you have multivariate time series of which you know they are meaningfully connected (say trajectories in x and y), does it make sense to put a Conv layer before feeding them into the embedding?
Could you explain what you mean by “preparing” the features?
No, the embedding is typically the first step, e.g. an interpretation of the input.
Prepared means transformed in whatever way you want, such as encoding or scaling.
Thanks for the answer! If I may ask a follow up: is embedding of multivariate numerical data uncommon? I have seen fairly little work that uses it.
Embedding is used for categorical data, not numerical data.
hi Jason,
I use an example where I set the vocabulary size to 200 and the training sample contains about 20 different words.
When I check the embeddings (the vectors) using layers[0].get_weights()[0], I obtain an array with 200 rows.
1) How can I know the vector corresponding to each word (of the 20 words I've got)?
2) Where do the 180 (200 - 20) other vectors come from, since I use only 20 words?
Thanks in advance.
The vocab size and number of words are the same thing.
I think you might be confusing the size of the embedding and the vocab?
Each word is assigned a number (0, 1, 2, etc), the index of vectors maps to the words, vector 0 is word 0, etc.
Thanks for your answer Jason,
I'll clarify my question:
the vocab size is 200, which means that the number of words is 200.
But effectively I'm working with only 20 words (the words of my training sample): let's say word[0] to word[19].
So, after the embedding, vector[0] corresponds to word[0] and so on, but vector[20] ... vector[30] ... what do they match?
I have no word[20] or word[30].
If you define the vocab with 200 words but only have 20 in the training set, then the words not in the training set will keep their random (untrained) vectors.
Ok. Thank you.
I want to save my own pre-trained embedding in the same way GloVe is saved: as a text file with each word followed by its vector. How would I do that?
thank you
You could extract the weights from the embedding layer via layer.get_weights(), then enumerate the vectors and save them to a file in the format you prefer.
As a beginner in Python, I did not understand what you mean by enumerating, and which layer should I get the weights from?
You can get the vectors from the embedding layer.
You can either hold a reference to the embedding layer from when you constructed the model, or retrieve the layer by index (e.g. model.layers[0]) or by name, if you named it.
Enumerating means looping.
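For example, a rough sketch of writing one "word v1 v2 ... vn" line per word, the same text format as the GloVe files (the small Tokenizer and untrained model just stand in for your own):
from keras.models import Sequential
from keras.layers import Embedding
from keras.preprocessing.text import Tokenizer

docs = ['well done', 'good work', 'nice work', 'poor effort']
t = Tokenizer()
t.fit_on_texts(docs)

model = Sequential()
model.add(Embedding(len(t.word_index) + 1, 8, input_length=2))   # stands in for your trained model

vectors = model.layers[0].get_weights()[0]   # the (vocab_size, embedding_dim) weight matrix

with open('my_embedding.txt', 'w') as f:
    for word, i in t.word_index.items():     # enumerate (loop over) the vocabulary
        f.write('%s %s\n' % (word, ' '.join(str(v) for v in vectors[i])))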
Hello, Jason!
Thanks for the article!
I have been wondering about the input_dim of the learnable embedding layer.
You set it to vocab_size, which in your case is 50 (the hashing trick upper limit), much larger than the actual vocabulary size of 15.
The documentation of Embedding in keras says:
“Size of the vocabulary, i.e. maximum integer index + 1.”
Which is ambiguous.
I have experimented with some numbers for vocab_size, and cannot see any systematic difference.
Would it actually matter for more realistically sized examples?
Could you say a couple of words about it?
Thanks again
Smaller vocabs means you will have fewer words/word vectors and in turn a simpler model which is faster/easier to learn. The cost is it might perform worse.
This is the trade-off large/slow but good, small/fast but less good.
Thanks, Jason!
I may have not explained myself properly:
The *actual* number of words in the vocabulary is the same (14).
The difference is the value of input_dim to Embedding().
In the example, you chose 50 as high enough to prevent collisions in encoding, but also
used it as an input_dim in one of the cases.
Michael
I see.
I thought the question was about "Size of the vocabulary, i.e. maximum integer index + 1." Since there are 14 words in this example, why isn't the vocab size 15, instead of 50?
There is the size of the vocab, and there is also the size of the embedding space. They are different, in case that is causing confusion.
We must have size(vocab) + 1 vectors in the embedding, to have space for "unknown", e.g. the vector at index 0.
Jason: In this example, ‘one_hot’ function instead of ‘to_categorical’ function is used. The 2nd is the real one-hot representation, and the 1st is simply creating an integer for each word. Why isn’t to_categorical used here? They are different, right?
The function is badly named, but it does integer encode the words:
https://keras.io/preprocessing/text/
Thanks a lot Jason! In the "3. Example of Learning an Embedding" section, could you please elaborate on the 400 params being trained in the embedding layer? Thanks
Yes, each word is mapped to an 8-element vector, and the vocab is 50 words. Therefore 50 * 8 = 400.
Jason, why is the output shape of the embedding layer (4, 8)?
It should be (50, 8), as the vocab size is 50 and we are creating embeddings for all words in our vocabulary.
The vocab size is the total number of vectors in the layer (the number of words supported), not the output shape.
The output is one vector per input word (4 words), where each word has the same vector length (8).
is it typo? The output is the number of input words (4) where each word has the same vector length (8).
Hi Sean…We do not believe it is a typo. What occurs when you execute the code?
Hi i need some help when running the file.
(wordembedding) C:\Users\Criz Lee\Desktop\Python Projects\wordembedding>wordembedding.py
Traceback (most recent call last):
File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\wordembedding.py”, line 1, in
from numpy import array
File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\__init__.py”, line 140, in
from . import _distributor_init
File “C:\Users\Criz Lee\Anaconda3\lib\site-packages\numpy\_distributor_init.py”, line 34, in
from . import _mklinit
ImportError: DLL load failed: The specified module could not be found.
pls advise.. thanks
Looks like there is a problem with your development environment.
This tutorial may help:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
Hi Jason, I've tried the URL you provided but still didn't manage to solve it.
Basically I typed:
1. conda create -n wordembedding
2. activate wordembedding
3. pip install numpy (installed ver 1.16)
4. ran wordembedding.py
error shows
File “C:\Users\Criz Lee\Desktop\Python Projects\wordembedding\wordembedding.py”, line 2, in
from numpy import array
ModuleNotFoundError: No module named ‘numpy’
pls advise.. thanks
Sorry to hear that. I am not an expert in debugging workstations, perhaps try posting to stackoverflow.
Hi, Jason: How to encode a new document using the Tokenizer object fit on training data? It seems there is no function to return an encoder from the tokenizer object.
You can save the tokenizer and use it to prepare new data in an identical manner as you did the training data after the tokenizer was fit.
What do you mean by ‘save the tokenizer’? Tokenizer is an object, not a model.
It is as important as the model, in that sense it is part of the model.
You can save Python objects to file using pickle.
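A small sketch of persisting the fitted Tokenizer with pickle alongside the saved model (the example docs are made up):
import pickle
from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(['well done', 'good work', 'nice work', 'poor effort'])

with open('tokenizer.pkl', 'wb') as f:      # save after training
    pickle.dump(t, f)

with open('tokenizer.pkl', 'rb') as f:      # load before preparing new data
    t = pickle.load(f)

print(t.texts_to_sequences(['good work']))  # identical encoding to training time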
Brief post… and am interested to read about pre-trained word embedding for sequence labeling task.
Great, does the above tutorial help?
Hello Sir,
I am not able to understand the significance of the vector space.
You used 8 dimensions for the first problem, while the GloVe vectors have 100 dimensions for each word.
What is the idea behind these vector spaces, and what does each value of a dimension tell us?
Thankyou 🙂
The size of the vector space does not matter too much.
More importantly, the model learns a representation where similar words will have a similar representation (coordinate) in the vector space. We don't have to specify these relationships; they are learned automatically.
Thank you, sir, for your answer. Now I clearly understand what the vector space is.
I have one more question- If I declare the vocabulary size as 50 and if there are more than 50 words in my training data, what happens to those extra words?
For the same reason I could not understand this line of glove vectors-
“The smallest package of embeddings is 822Mb, called “glove.6B.zip“. It was trained on a dataset of one billion tokens (words) with a vocabulary of 400 thousand words.”
What about the 600 thousand words ?
Words not in the vocab are marked as 0 or unknown.
Hi,
I have a list of words as my dataset/training data, so I ran your code for GloVe as follows:
—-Error is——
ValueError Traceback (most recent call last)
in ()
      8     values = line.split()
      9     words = values[0]
---> 10     coefs = asarray(values[1:], dtype='float32')
     11     embeddings_index[words] = coefs
     12 f.close()
/usr/local/lib/python3.6/dist-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
     83
     84 """
---> 85     return array(a, dtype, copy=False, order=order)
     86
     87
ValueError: could not convert string to float: '0.1076.097748'
————————————————————————————-
could you please help
This is a common question that I answer here:
https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
I looked up the FAQ; I did not find any question related to "ValueError: could not convert string to float:".
I recommend the instructions in the specific FAQ I linked to.
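For reference, a slightly more defensive version of the GloVe loading loop (just a sketch, assuming the 100-dimensional glove.6B.100d.txt file) that skips malformed lines instead of crashing:

from numpy import asarray

embeddings_index = {}
with open('glove.6B.100d.txt', mode='rt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if len(values) != 100:
            continue  # skip truncated or malformed lines
        try:
            embeddings_index[word] = asarray(values, dtype='float32')
        except ValueError:
            continue  # skip lines containing non-numeric values
print('Loaded %s word vectors.' % len(embeddings_index))

If many lines are skipped, the downloaded file itself is probably corrupted and re-downloading it is the real fix.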
Hello Jason Brownlee
I hope you are fine, and thanks for writing such an article. I would like to ask you a question about where I am stuck.
I created a Keras deep learning model with GloVe word embeddings on the IMDB movie set. I used 50 dimensions, but during validation there is a lot of gap between the accuracy on the training and validation data. I have done the pre-processing very carefully using a custom approach, but I am still getting overfitting.
During the prediction phase, instead of the test dataset I take real-time text for prediction, but if I do the prediction again and again for the same input data, the results vary. I am trying my best to work out why my results vary.
I have 8 classes which depict the score of each review, and the data is categorical. The softmax layer predicts a different result every time the same data is fed to the trained model.
One model will produce the same output for the same input.
But, if you fit the same model on the same data multiple times, you will get different predictions. This is to be expected:
https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
You can reduce the variance by creating an ensemble of final models and combining their predictions.
I have a dataset in which the training data and test data are in different files. How do I pass the test data to the model for evaluation?
Load them into memory separately then use one for train and one for test/val.
Perhaps I don’t understand the problem?
How do I give text input to a pre-trained LSTM model with an input layer of shape (200,)?
I want to know how to convert the text input and give it as an input to the model.
New text must be prepared in an identical way to data used to train the model.
Sir, could you please show me one example or link to code.
Yes, see the examples here:
https://machinelearningmastery.com/start-here/#nlp
Thank you so much for this detailed tutorial. I think that I missed something… When using a pre-trained GloVe embedding, in case you want to do a train/test split, when is the right time to perform it?
Should I need to do it after creating the weight matrix?
Sorry, I don’t follow your question – how are the embedding and train/test split related? Can you please elaborate?
Sure. I would like to use the embedding to create a model that will predict a class out of 30 different classes. The dataframe that I'm using contains a text column which I want to use to predict the class. In order to do so, I thought about using the pre-trained embedding (same as you did, but with more classes). Now, in order to test the model I want to do a split, so my question is when do you recommend doing it? I tried to create a weight matrix for all the data and then split it into train and test, but it gave me very poor results on the test set.
Any idea what I am doing wrong?
If you train your own embedding, it is prepared on the training dataset.
If you use a pre-trained embedding, no split is needed for the embedding itself, as it was already trained on different data.
Thanks, I think that I understand (sorry, I'm kind of a newbie). And assuming that I want to use the trained model to predict the labels on unseen data, what should the input be? Should it be padded docs with the same shape?
New input must be prepared in an identical way to the training data.
Hello, another great post!
I would like to ask, regarding the Example of Using Pre-Trained Embedding, is it possible to reduce the vocabulary size? I have a pre-trained embedding with 300 dimensions. My vocabulary size is 7000.
When I set vocabulary size=300 I get this error:
embedding_matrix[i] = embedding_vector
IndexError: index 300 is out of bounds for axis 0 with size 300
Thanks in advance
Yes, but you must set the vocab size in the Tokenizer.
Thank you for the quick response. I did it.
Nice work!
Thanks for the tutorials! I’m using the 300d embedding for my image to caption model, but filter out rare words in my vocabulary. Say I have 10k words, should I just:
tokenizer.word_index.get(word, vocab_size + 1) <= vocab_size
to filter out the words I don't want?
Also, do you think it's worth retraining the embedding weights at a later point in training to fine-tune? I'm thinking of it in the same context as freezing a pre-trained encoder, then gradually unfreezing layers as the decoder reaches a certain level of validity.
You can control the vocab size by setting the “num_words” argument when constructing the tokenizer:
https://keras.io/preprocessing/text/
Perhaps try with and without fine tuning and compare the results.
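For example, a minimal sketch of capping the vocabulary (the cap of 10,000 and the name train_docs are assumed):

from tensorflow.keras.preprocessing.text import Tokenizer

vocab_size = 10000
t = Tokenizer(num_words=vocab_size)   # only the most frequent words are kept when encoding
t.fit_on_texts(train_docs)            # train_docs: your training captions/documents
encoded = t.texts_to_sequences(train_docs)   # rarer words are simply dropped from the sequences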
Hi Jason,
I have a question regarding the code. You have taken the vocab_size to be 1 greater than the number of unique words. I am unable to understand why you did that; can you please tell me?
I am a total newbie and I’m not from a programming background so I’m sorry if this was a silly question.
Thank You
Yes, we use a 1-offset for words and save index 0 for “unknown” words.
Oh…Thanks jason
You’re welcome.
Hi Jason,
I work on a multivariate time series with categorical and numerical features, using 3 years of data: 20 stocks over 30 days as the training window and 20 stocks over 7 days as the target, so X_train.shape = (N_samples, 20*30, N_features) and y_train.shape = (N_samples, 20*7, N_features). My question is: how can I apply an embedding layer to the 3 categorical variables in these 3D arrays?
I tried to use this piece of code but it doesn't work:
cat1_input = Input(shape=(1,), name='cat1')
cat2_input = Input(shape=(1,), name='cat2')
cat3_input = Input(shape=(1,), name='cat3')
cat1_emb = Flatten()(Embedding(33, 1)(cat1_input))
cat2_emb = Flatten()(Embedding(18, 1)(cat2_input))
cat3_emb = Flatten()(Embedding(2, 1)(cat3_input))
See this tutorial:
https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
Thank you for your prompt reply, Jason. That tutorial covers embeddings for 2D arrays, but in my case I need to build a model that takes a 3D array (N_samples, time_steps, N_features) as input and (time_steps, N_stocks) as output.
Perhaps you can adapt the example for your needs.
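As a starting point, a minimal sketch of one way to adapt it (the window length, feature counts and cardinality of 33 are assumed): apply the Embedding to a whole sequence of integer codes so its output is already 3D, then concatenate it with the numeric features along the last axis.

from tensorflow.keras.layers import Input, Embedding, Concatenate, LSTM, Dense
from tensorflow.keras.models import Model

n_steps = 20 * 30      # assumed window length (20 stocks x 30 days)
n_numeric = 5          # assumed number of numeric features
n_categories = 33      # assumed cardinality of one categorical feature

cat_in = Input(shape=(n_steps,), dtype='int32', name='cat')   # one integer code per time step
num_in = Input(shape=(n_steps, n_numeric), name='num')

cat_emb = Embedding(input_dim=n_categories, output_dim=4)(cat_in)   # (batch, n_steps, 4)
merged = Concatenate(axis=-1)([cat_emb, num_in])                    # (batch, n_steps, 4 + n_numeric)

x = LSTM(64)(merged)
out = Dense(1)(x)
model = Model(inputs=[cat_in, num_in], outputs=out)
model.compile(optimizer='adam', loss='mse')

The same pattern repeats for each categorical column, each with its own Embedding layer.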
Hello again,
I am wondering this time if it is worthwhile to compare two embedding matrices from two different languages. I want to find the similarity between two matrices from different languages.
Thank you very much
Perhaps use vector distance measures?
https://machinelearningmastery.com/distance-measures-for-machine-learning/
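For example, a minimal cosine-similarity sketch for comparing two individual word vectors (comparing whole matrices from different languages would also require the two vocabularies to be aligned first):

from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a, b):
    # cosine of the angle between two embedding vectors; 1.0 means identical direction
    return dot(a, b) / (norm(a) * norm(b))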
Thank you, I will try it.
You’re welcome.
Hello, Jason. This post really helped me. I think the parameters have changed:
the weights parameter is now embeddings_initializer in the Keras documentation.
Thanks, I’m happy to hear that.
Agreed, thanks:
https://keras.io/layers/embeddings/#embedding
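For readers on newer Keras versions, an equivalent way to load the pre-trained matrix (a sketch, assuming vocab_size and embedding_matrix have been built as in the tutorial):

from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

e = Embedding(vocab_size, 100,
              embeddings_initializer=Constant(embedding_matrix),
              input_length=4,
              trainable=False)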
Hi Jason, I am facing a memory issue since my vocab_size is around 400,000. I have around 195,847 training examples and each output sentence has a max length of 5 words.
So when I try to create a one-hot encoding for the decoder outputs, it tries to create a (195847, 400000, 5) matrix and obviously runs out of memory.
How do i get around this problem?
Good question.
Try reducing the size of the vocab.
Try reducing the size of the dataset.
Try reducing the size of the model.
Try an alternate representation, such as hashing.
Try running on an ec2 instance with more RAM.
An additional point could be to run it in batch using fit_generator of keras.
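A rough sketch of that idea (all names are assumed): one-hot encode the decoder targets one batch at a time inside a generator, so the full (samples, steps, vocab) array is never materialised.

from numpy import zeros

def batch_generator(encoder_in, decoder_in, target_ids, vocab_size, batch_size=32):
    # target_ids: integer-encoded target sequences, shape (samples, maxlen)
    n = len(encoder_in)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            ids = target_ids[start:end]
            y = zeros((len(ids), ids.shape[1], vocab_size), dtype='float32')
            for i, seq in enumerate(ids):
                for t, token in enumerate(seq):
                    y[i, t, token] = 1.0
            yield [encoder_in[start:end], decoder_in[start:end]], y

# model.fit_generator(batch_generator(...), steps_per_epoch=..., epochs=...)
# (newer versions of Keras accept the generator directly in model.fit)

Another way to avoid the one-hot array entirely is to keep integer targets and use the sparse_categorical_crossentropy loss.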
There is one more query that I have, and I know it may sound silly, but:
Why do you need the inference model or decoding model for testing? Then how does the training model help? The only thing that I can see is that it helps in deciding the encoder inputs. Is there something I am missing? It seems to me that the entire purpose of training is lost.
I don’t understand. You train a model and then use it for inference. Without training it is useless. Without inference, you trained for no reason.
Perhaps you can elaborate?
Hi,
I would like to ask: if I have labels that are not only positive and negative but also neutral, how can I add that to my y training data?
That would be a 3 class classification problem.
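For example, a minimal sketch of preparing such labels (the integer coding 0=negative, 1=neutral, 2=positive is assumed):

from numpy import array
from tensorflow.keras.utils import to_categorical

labels = array([2, 0, 1, 2, 0])                 # one integer label per document
y_train = to_categorical(labels, num_classes=3)

# ...and the output layer becomes Dense(3, activation='softmax')
# with loss='categorical_crossentropy'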
Hi,
Thank you for this very well explained article.
I don’t understand how the embeddings are learnt in the Embedding layer. Is this common backpropagation ?
You’re welcome.
Yes, it uses backprop.
Dear Jason
Thanks for your wonderful and helpful guides. When I run your suggested code in part 4 of this page (Example of Using Pre-Trained GloVe Embedding), I get the following error, even though I have downloaded the glove.6B.100d file from the URL you provided above. Can you please help me find the cause of the error?
Traceback (most recent call last):
File "C:/Users/Dehkhargani/PycharmProjects/test2/test2.py", line 40, in
coefs = asarray(values[1:], dtype='float32')
File "C:\Users\Dehkhargani\Anaconda\lib\site-packages\numpy\core\_asarray.py", line 85, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: 'ng'
Sorry to hear that, this may help:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
Hi Jason, thanks for your helpful post.
Do pre-trained word embeddings like GloVe or Google News still work fine on non-English text?
Thanks
No, they are for English text.
Could anyone please tell me why we are actually adding 1 to the vocab size:
vocab_size = len(t.word_index) + 1
Please let me know. Thank you!!
To make space for “0” which is “unknown” words.
“The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word.”
How did you know the output will be 4 vectors?
We can inspect the shape of the model by calling model.summary()
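For example, with the settings used in that section (input_length=4 and 8 dimensions), the summary reports the Embedding output shape as (None, 4, 8): 4 words per padded document, each mapped to an 8-dimensional vector.

model.summary()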
Hi Jason:
Great tutorial! It is my first time approaching NLP, and particularly text classification/sentiment analysis, so I am a little confused about the technical definitions, procedures, what they mean, etc.
Let me summarize them for everybody, using my own ideas, in order to share them and get corrections/comments from you.
1) From a dataset consisting of a list of text docs that we are going to train for a text classification problem (e.g.), we read them and get a list of words for each doc (using e.g. the Keras API text_to_word_sequence()).
2) We can then apply some sort of word cleaning/curation process, using string object methods to keep only alphabetic words, strip punctuation, or drop very short or repeated words, etc.
Using a Keras Tokenizer object (e.g.), we can define a vocabulary (the set of different words that make up the whole dataset of docs), plus a word counter, through the tokenizer method fit_on_texts() (applied to all doc words).
3) Now that we have all the (cleaned) words for all docs, we must perform the most important step in machine learning: converting the words (tokens) to be trained by an ML model into numbers. How? By encoding them!
These numbers must be arranged in tensors (arrays), consisting of samples (tensor rows), columns (the features per doc, word counts or vocabulary-word scores), and possibly a third tensor dimension (a depth axis), by embedding them into a word vector space (of an arbitrary dimension).
To perform this encoding work (number conversion) we can use Keras tokenizer methods such as texts_to_matrix() or texts_to_sequences(), or other Keras functions such as one_hot(), always with the help of the Keras pad_sequences() method to give every doc the same feature length, using e.g. the max number of words per doc, or simply the full vocabulary size as the feature length for every doc.
As a result of this whole process we have a clean dataset input, an X tensor of numbers, with 2D shape (docs, features) or 3D shape (docs, features, an extra embedding dimension) if we use deep learning techniques (e.g. embedding).
4) We can also take advantage of a sort of "transfer learning" when we apply embedding DL layers, if we load previously trained weights (trained elsewhere) into the embedding layer, such as the GloVe dictionary (e.g. consisting of 400,000 different words with their own 100 coordinate weights), which we must map onto the vocabulary used in our own docs (e.g. our own arbitrarily selected 100 different words, with 100 coordinates as the word vector space for each of them).
5) So now that we have converted a list of docs (each a list of words) into a single tensor of numbers X, plus the associated Y labels, we must define the model with an input layer, plus Embedding(), Conv1D() and MaxPooling1D() layers (e.g.) if we are using DL techniques, plus a final Flatten(), in order to present the extracted word features to the fully connected head layers (comprising mainly Dense() layers) that will finally learn the number patterns (representing the set of words of each doc) associated with each document label.
This is my summed-up understanding of the NLP text classification issues. Thanks.
Not a bad summary, generally accurate from my skim.
Sorry for the extensive comment Jason!.
- I would also like to share my code results, based on your great tutorial, because I like the way you teach us: learning not theoretically but by doing and experimenting with final, working code! Thanks one more time!
- My first main concern is that, even though we have string methods, Tokenizer objects (classes), etc., we still have to write many lines of code. I can imagine that higher-level text processing and encoding APIs may already exist, to which you give the appropriate arguments (the docs to be read, the filters for cleaning up the doc words, the type of encoding you want to apply to your words, including whether you want to use pre-trained weights) in order to get the main output, the prepared X input tensor, instead of going through these tedious code lines!
- Anyway, I went through it and got the following results:
- I got 100% accuracy on evaluation using a basic ML model (not deep learning), encoding the words with the Keras texts_to_matrix() method and using a 2D X input (10 different doc samples as rows, with two different vocab_size options: 50 (yours) and even fewer (15), matching the smaller number of words actually used). I used a fully connected model comprising 2 Dense layers (the output layer plus a previous one of 10 nodes).
- I also got 100% accuracy using the DL Embedding layer with the one_hot() encoding (as you suggest) and without the GloVe pre-trained weights; in addition to the Embedding layer I added Conv1D, MaxPooling1D and Flatten layers (the CNN part!). I used a 2D X tensor with two options for the feature columns: a max_length of 4 and, later, the 50-word vocabulary (using the pad_sequences() method for it). I then trained the Embedding layer with the GloVe pre-trained weights using trainable=False and trainable=True, and the same fully connected head model (with 2 Dense layers).
- As a third main option I also got 100% accuracy, but using the DL Embedding layer with the Keras texts_to_sequences() encoding method, again for the two options (max_length and the 50-word vocab size, after applying pad_sequences()). I also trained the embedding with GloVe (trainable=False and trainable=True) and without the GloVe weights, with the same fully connected head model.
My conclusion is that the most stable and quickest way of reaching the maximum level is applying the Embedding layer with pre-trained GloVe (trainable=True) and using the texts_to_sequences() encoding. I judge it the best performer because I can reduce the 50 epochs, or use fewer units in my second Dense layer, and still get the same 100% accuracy.
Regards
I expect this could help, Jason!
Great write-up!
Yes, it is possible that text data preparation has come a long way over the last 3-4 years since I wrote this up.
Yes… I think spaCy is one of the new powerful ML/DL NLP libraries
https://spacy.io/
It sure is.
Hi Jason
Good read! I would like to have more clarity on this:
Why do we set trainable = False?
Should we not set it to 'True' if we want to use the GloVe 100-D weights present in the embedding matrix?
I don't quite understand the difference between setting it to 'True' and 'False'.
Thanks
Vivek
It means the weights in the embedding cannot change during training if set to 'False'. As in, they cannot be trained.
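For example, a small sketch reusing the tutorial's names (vocab_size and embedding_matrix are assumed to be built as in the GloVe example):

from tensorflow.keras.layers import Embedding

# frozen: the pre-trained GloVe vectors are used as-is and never updated
e_frozen = Embedding(vocab_size, 100, weights=[embedding_matrix],
                     input_length=4, trainable=False)

# fine-tuned: training starts from the GloVe vectors but may adjust them
e_tuned = Embedding(vocab_size, 100, weights=[embedding_matrix],
                    input_length=4, trainable=True)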
I think there is a typo here: "For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive),….." Because, since 0 (zero) is used for padding, words are encoded from 1 (one, not zero!) to 199, inclusive. Regards,
What is the typo, can you please elaborate?
For a multilabel classifier where each text could be linked to zero or more topics, what changes to the above example would be needed?
For example, say I have a list of topics like ["politics", "sports", "entertainment", "finance", "health"].
A document mentioning healthcare and legislation would be labeled [1,0,0,0,1].
Would it be enough to encode the “y” labels in a binary array format above?
For multi-label classification like this (each document can belong to several topics), yes: encode the y labels as a binary array as you describe, and use a sigmoid activation in the output layer (one node per topic) with a binary cross-entropy loss. For single-label multi-class classification we would instead integer encode the labels, then one hot encode them, and use a softmax activation function in the output layer.
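A minimal sketch of such a multi-label head (5 topics, labels like [1,0,0,0,1] as above, appended to a Sequential model that already contains the Embedding and Flatten layers):

from tensorflow.keras.layers import Dense

model.add(Dense(5, activation='sigmoid'))   # one independent 0/1 output per topic
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])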
In “Example of Learning an Embedding” section why 8 dimensions?
No reason, it is arbitrary.
Hi,
In the pretrained Glove embedding section, you have used vocab_size = len(t.word_index) + 1.
Why is 1 added? I can't understand the reason.
To make room for index 0==unknown, so the vocab starts at index 1.
Hi Jason Brownlee,
Which one is better according to you?
Sorry, what do you mean?
Are there any good word or token embedding models for log files (syslogs, application logs, etc.)?
Can you please suggest some links on machine learning applications for log files? e.g.
– Does a log file have sensitive data? e.g. personal data, topology data like IP address etc,
– What types of sensitive data are in a log file
I would recommend training one for your specific dataset.
Where can I get this entire code?
You can copy it from the tutorial, here’s how:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
Hi jason,
Could you give me some insight into this paper, https://www.researchgate.net/publication/327173446, please?
I mean, I have some confusion about how they use the output of the model with NCRF++. They said: "For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30". What do they mean, and how?
Maybe if you have some tutorial on sequence tagging using NCRF++ it would be very useful. Thank you.
This is a common question that I answer here:
https://machinelearningmastery.com/faq/single-faq/can-you-explain-this-research-paper-to-me
Ah, sorry for that, so let me explain it.
TL;DR: the author uses a deep learning model which outputs sigmoid probabilities.
Then the author feeds it to NCRF++; the author said "For converting the instances into embeddings, two features were used namely, sigmoid output from MNN (fe1), character embedding (fe2) of size 30".
Basically what I want to ask is: do you have any articles about how NCRF++ works? Thank you.
Sorry I do not.
Guys, please consider adding a dark mode (theme) to the website.
Thanks for the suggestion.
Hi, I am trying to concatenate two embedding layers in Keras; one will be fixed and the other will be trained.
model = Sequential()
fixed_weights = np.array([word2glove[w] for w in words_in_glove])
e1 = Embedding(input_dim = fixed_weights.shape[0], output_dim = EMBEDDING_DIM, weights = [fixed_weights], input_length = MAX_NEWS_LENGTH, trainable = False)
e2 = Embedding(input_dim = len(word2index) – fixed_weights.shape[0],output_dim = EMBEDDING_DIM, input_length = MAX_NEWS_LENGTH)
# model.add()
model.add(concatenate([e1, e2]))
But this seems unsuccessful and I get the error ('NoneType' object is not subscriptable). Could you guide me on where I am wrong?
Perhaps have 2 inputs to the model, 1 embedding and LSTM for each input, then concat the outputs of the lstms.
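If the goal is to keep the two embeddings side by side along the feature axis, a rough sketch using the functional API (reusing the names from your snippet) is to call each Embedding on an input tensor and concatenate the resulting tensors, not the layer objects:

from tensorflow.keras.layers import Input, Embedding, Concatenate, LSTM, Dense
from tensorflow.keras.models import Model

in_fixed = Input(shape=(MAX_NEWS_LENGTH,))
in_trained = Input(shape=(MAX_NEWS_LENGTH,))

emb_fixed = Embedding(fixed_weights.shape[0], EMBEDDING_DIM,
                      weights=[fixed_weights], trainable=False)(in_fixed)
emb_trained = Embedding(len(word2index) - fixed_weights.shape[0],
                        EMBEDDING_DIM)(in_trained)

merged = Concatenate(axis=-1)([emb_fixed, emb_trained])   # (batch, steps, 2*EMBEDDING_DIM)
x = LSTM(64)(merged)
out = Dense(1, activation='sigmoid')(x)
model = Model(inputs=[in_fixed, in_trained], outputs=out)

Note this requires two integer-encoded inputs, one indexed against the fixed vocabulary and one against the remaining words.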
Hi, I want to write a common Python function for model building which works with a custom embedding layer AND also with GloVe's pre-trained embeddings.
Sounds great, good luck!
Hi Jason,
Thanks for the amazing tutorials. Always helpful.
I have a question. I am trying to extract features from some annotated symptoms (from dialogue). So my sequences are ‘headache extent heavy headache frequency sometimes’ with different lengths each assigned with a time point.
I want to extract features to later use them for time-series classification (so after flattening I would add a dense layer of size 100 for example). The problem is that I do not want to use the same labels here for embedding (in model.fit). I’m afraid this will cause bias when I am using these features for training the main classifier with the same labels assigned to the time points containing these samples.
Is there any way to do the embedding without assigning labels to each sample?
Thank you,
Perhaps you can use a standalone doc2vec technique, e.g. an equivalent of the standalone word2vec methods.
Hi Jason,
A short question regarding the accuracy and the predict method. When I use:
model.predict(pad_sequences([one_hot('poor work', vocab_size)], maxlen=max_length, padding='post'))
I get an output of 0.5076578 {array([[0.5076578]], dtype=float32)}, which by my understanding means that it is neither good nor bad. And if I run model.evaluate I get an accuracy of 100%, but this can't be true, as 'poor work' is in my test set and I get a score of 0.50… and not 0.
How can this be?
Thank you very much
The model may be uncertain for some cases.
Nevertheless, if rounded to a crisp label, accuracy shows a good result.
If you prefer to evaluate the model based on its confidence, you can use logloss or brier skill score:
https://machinelearningmastery.com/probability-metrics-for-imbalanced-classification/
Thanks Jason,
After 2-3 runs I am able to get 100% accuracy, as you said.
Nice.
Hi Jason,
I am trying to predict the output after training with this code. When I look at the outputs, it always predicts [[0.72645265]]. Why is it doing so?
I am predicting using the following model call:
output = model.predict(padded_docs)
Following is the code for getting padded_docs:
t = Tokenizer()
t.fit_on_texts(["good job"])
vocab_size = len(t.word_index) + 1
encoded_docs = t.texts_to_sequences(docs)
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
Or am I going wrong somewhere? I am not exactly sure what to pass into predict().
Looks good, it is predicting the probability of the document belonging to class 1.
E.g. P(class=1) = 70%
More on predicting with Keras models here:
https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
Yeah, it is going well, but not for all cases. When I test on the input "poor job", it predicts the same output "[[0.72645265]]" (and the same for all test cases). I think I am providing the wrong input, which is simply the padded data ([[1,2,0,0]]) every time, not the exact words. How should I form the input for the model.predict() function?
New input data must be prepared in an identical manner as the training dataset – same encoding, vocab, etc.
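A minimal sketch of that preparation (assuming the Tokenizer t was fit on the original training docs, not on the new text, and max_length=4 as in the tutorial):

from tensorflow.keras.preprocessing.sequence import pad_sequences

new_docs = ['poor job', 'good job']
encoded = t.texts_to_sequences(new_docs)                   # same vocabulary as training
padded = pad_sequences(encoded, maxlen=4, padding='post')  # same padding/length as training
print(model.predict(padded))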
I have the same error. Could you please show us the code to predict correctly?
Sorry I don’t understand. What error are you having exactly?
I answer to myself and Tammay:
sample = ["Excellent!"]
encoded_sample = t.texts_to_sequences(sample)
padded_sample = pad_sequences(encoded_sample, maxlen=4, padding='post')
model.predict(padded_sample)
Only the trained words work.
Jason, good blog. In the next one please also show how to evaluate the model, we are noobs 😉
PS: the problem was that we didn't know how to evaluate text with model.predict, because we thought it always gave the same result.
Hi,
The output from the Embedding layer is 3D (ref: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), but in your blog it's written that the output is 2D.
Thanks.
You can see hundreds of examples of evaluating a model on the blog, perhaps start here:
https://machinelearningmastery.com/start-here/#deeplearning
What does the Embedding layer do in a model?
Sorry, I don’t understand your question, can you please elaborate?
I mean, why are we using an Embedding layer in a model?
Embedding layers can be used to learn the relationship between words.
Hi Jason, thanks for your amazing tutorials.
How do you add bias to the embedding?
Thanks in advance
You’re welcome!
What do you mean by “add bias to the embedding”?
I mean when you train the embedding, is it advantageous to have a single bias term also for each learned embedding?
Thanks for your interest!
No.
Thanks for clearing that up, Jason.
You’re welcome.
Hi Jason,
Is there any way that I can convert a 2D input into embeddings? Basically I have an adjacency matrix (2-D array) representing a graph, and training would happen over a batch of graphs as data points. Is there any way that each node of the graph (which is basically represented by a row in the adjacency matrix) can be converted to a more meaningful embedding?
Yes, an autoencoder can be used to convert data into an “embedding”.
Perhaps start here:
https://machinelearningmastery.com/autoencoder-for-classification/
And here:
https://machinelearningmastery.com/lstm-autoencoders/
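A minimal sketch of the idea for adjacency-matrix rows (the sizes are assumed): the bottleneck of a small dense autoencoder becomes the node embedding.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_nodes = 100                                   # assumed number of nodes (row length)
inp = Input(shape=(n_nodes,))
code = Dense(16, activation='relu')(inp)        # 16-dimensional embedding per node
out = Dense(n_nodes, activation='sigmoid')(code)

autoencoder = Model(inp, out)
encoder = Model(inp, code)                      # encoder.predict(adjacency) gives node embeddings
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')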
Thanks, Mr. Jason, please help me.
1. If I have one word in each statement and there is no relation between them, how can I represent the class labels? I really don't know the benefit of class labels here.
2. What is the mathematical model behind the embedding process?
Perhaps try the three approaches here:
https://machinelearningmastery.com/how-to-prepare-categorical-data-for-deep-learning-in-python/
Perhaps check a textbook or a paper for the math behind the embedding layer.
Does pre-trained mean test data?
No. Pre-trained means trained beforehand, usually on a different dataset and with a different algorithm, resulting in a standalone model that can be used as part of a subsequent model (like a neural net).
Hi, I want to concatenate the hidden states with the word embeddings. How can we do this?
Perhaps test a merge or concat layer in keras to see if you can achieve your desired effect?
Thanks. I have tried the following and it's giving a dimension error:
latent_dim=128
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(eng_vocab_size, latent_dim, mask_zero = True)(encoder_inputs)
encoder_lstm = LSTM(latent_dim*4, return_sequences=True, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
state_h=Concatenate()([state_h,enc_emb])
encoder_states = [state_h, state_c]
error is:
ValueError: A Concatenate layer requires inputs with matching shapes except for the concat axis. Got inputs shapes: [(None, 512), (None, None, 128)]
Sorry, I don't have the capacity to debug your code.
Hi Jason, I am trying to understand how we can extract the embedded features. E.g., for each training example, how can I get the textual features of, say, 300 dimensions x 50 sequence length, so that I am not required to use an Embedding layer in the Sequential model and can directly use LSTM and other layers? Any guidance will really help.
Each word is mapped to one vector. Retrieve the vectors for your words directly.
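For example, a minimal sketch (assuming the Embedding is the first layer of the trained model and the Tokenizer is named t):

# pull the learned weight matrix out of the Embedding layer
embedding_matrix = model.layers[0].get_weights()[0]   # shape: (vocab_size, output_dim)

# look up the vector for a single word via the tokenizer's word index
vector = embedding_matrix[t.word_index['work']]

These retrieved vectors can then be fed to the LSTM directly, without an Embedding layer in the model.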
Hello, if I have a document containing 300k sentences and the longest sentence in the document has 451 words, what would my vocab_size and input_length be in the embedding layer? I still don’t quite get it.